Large Language Model Selection for Dynamic LLM-GNN Integration
DOI: https://doi.org/10.13021/jssr2025.5172

Abstract
Graph Neural Networks (GNNs) have achieved state-of-the-art results across various domains; however, designing and modifying optimal GNN architectures remains computationally expensive, requiring manual tuning or tedious optimization. The Prompt-Responsive GNN (PR-GNN) model introduces a novel architecture that leverages real-time LLM feedback to iteratively adjust GNN components based on evolving data or task demands. Given the LLM's central role in this model, selecting the optimal one is essential for maximizing performance. This research proposes an initial framework for comparing four LLMs (ChatGPT-4o, Claude Sonnet 4, Gemini 2.5 Pro, and Grok 3), selected based on current benchmarks. The preliminary methodology used 36 prompts of varying complexity (beginner, intermediate, advanced) to evaluate each LLM's parsing success. All tested LLMs achieved perfect parsing; the study therefore adopted an alternative approach that used public Kaggle datasets with standardized scikit-learn accuracy metrics. Each LLM generated code from prompts derived from the dataset descriptions, and the resulting models were evaluated across multiple real-world domains, including finance, cybersecurity, and healthcare, to minimize domain-specific bias. Initial results revealed that Claude Sonnet 4 (89.15%) and Grok 3 (89.07%) were statistically equivalent top performers, while Gemini 2.5 Pro and ChatGPT-4o trailed at 82.86% and 48.63%, respectively. A subsequent head-to-head assessment on 22 new tasks revealed nuanced differences (65.73% vs. 62.17%), with Grok 3 slightly outperforming Claude Sonnet 4. Additional statistical analysis using 10% trimmed means (86.58% vs. 81.55%) confirmed Grok 3's slight advantage once extreme outliers were excluded. These findings identify Grok 3 as the preferred choice for integration during further development of the PR-GNN model. The next phase will focus on building a scalable pipeline, assessing performance with fine-grained metrics, and validating Grok 3's effectiveness through comparative analysis and robustness testing.
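For concreteness, the following is a minimal Python sketch of the two statistics named in the abstract: the standardized scikit-learn accuracy metric scored per task, and the 10% trimmed mean used to discount extreme outliers. All data values are illustrative placeholders, not the study's results, and the helper name summarize is ours, not the authors'.

import numpy as np
from scipy.stats import trim_mean
from sklearn.metrics import accuracy_score

# Standardized per-task metric: fraction of correct predictions made by
# an LLM-generated model on a held-out test split (placeholder data).
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0]
print(f"example task accuracy: {accuracy_score(y_true, y_pred):.2%}")

# Hypothetical per-task accuracy vectors for two LLMs (NOT the study's data).
grok3_acc = np.array([0.93, 0.90, 0.95, 0.15, 0.91, 0.89, 0.92, 0.88, 0.94, 0.87])
claude_acc = np.array([0.91, 0.87, 0.94, 0.08, 0.89, 0.86, 0.90, 0.84, 0.92, 0.85])

def summarize(name, acc):
    """Report the plain mean and a 10% trimmed mean.

    The trimmed mean drops the lowest and highest 10% of task scores
    before averaging, damping the influence of extreme outliers such
    as a generated model that barely ran.
    """
    print(f"{name}: mean={acc.mean():.2%}, "
          f"10% trimmed={trim_mean(acc, proportiontocut=0.10):.2%}")

summarize("Grok 3", grok3_acc)
summarize("Claude Sonnet 4", claude_acc)

Because the trimmed mean excludes the single catastrophic score in each placeholder vector, it sits noticeably above the plain mean, mirroring the abstract's rationale for reporting both statistics.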
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.