Large Language Model Selection for Dynamic LLM-GNN Integration

Authors

  • Nathan Chu, Thomas Jefferson High School for Science and Technology, Alexandria, VA
  • Shubham Patel, Chantilly High School, Chantilly, VA
  • Aryan Raj, Chantilly High School, Chantilly, VA
  • Anishka Mohanty, Mathematical Sciences Department, Carnegie Mellon University, Pittsburgh, PA
  • Mihai Boicu, Department of Information Sciences and Technology, George Mason University, Fairfax, VA

DOI:

https://doi.org/10.13021/jssr2025.5172

Abstract

Graph Neural Networks (GNNs) have achieved state-of-the-art results across many domains; however, designing optimal GNN architectures remains computationally expensive, requiring manual tuning or tedious optimization. The Prompt-Responsive GNN (PR-GNN) model introduces a novel architecture that leverages real-time LLM feedback to iteratively adjust GNN components as data or task demands evolve. Given the LLM's central role in this model, selecting the optimal one is essential for maximizing performance. This research proposes an initial framework for comparing four LLMs (ChatGPT-4o, Claude Sonnet 4, Gemini 2.5 Pro, and Grok 3), selected based on current benchmarks. The preliminary methodology used 36 prompts of varying complexity (beginner, intermediate, advanced) to evaluate each LLM's parsing success. Because all tested LLMs achieved perfect parsing, the study adopted an alternative approach using public Kaggle datasets with standardized scikit-learn accuracy metrics. From the dataset descriptions, each LLM generated prompts and code for evaluation across multiple real-world domains, including finance, cybersecurity, and healthcare, to minimize domain-specific bias. Initial results revealed that Claude Sonnet 4 (89.15%) and Grok 3 (89.07%) were statistically equivalent top performers, while Gemini 2.5 Pro and ChatGPT-4o trailed at 82.86% and 48.63%, respectively. A subsequent assessment of Grok 3 and Claude Sonnet 4 on 22 new tasks revealed nuanced differences (65.73% vs. 62.17%), with Grok 3 slightly outperforming. Additional statistical analysis using 10% trimmed means (86.58% vs. 81.55%), which excludes extreme outliers, confirmed Grok 3's slight advantage. These findings identify Grok 3 as the preferred choice for integration during further development of the PR-GNN model.
The next phase will focus on building a scalable pipeline, assessing performance using fine-grained metrics, and validating Grok 3’s effectiveness through comparative analysis and robustness testing.
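The 10% trimmed-mean comparison used in the statistical analysis can be sketched as below. This is a minimal illustration only: the per-task accuracy values are hypothetical placeholders, not the study's actual scores, and the helper function is an assumed plain-Python equivalent of a standard trimmed mean (e.g. `scipy.stats.trim_mean`).

```python
def trimmed_mean(scores, proportion=0.10):
    """Mean after discarding the lowest and highest `proportion` of scores.

    Trimming both tails reduces the influence of extreme outliers,
    such as a single task where an LLM's generated code failed outright.
    """
    ordered = sorted(scores)
    k = int(len(ordered) * proportion)  # number of scores cut from each tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


# Hypothetical per-task accuracies for one LLM (fractions, not the study's data).
# One near-total failure (0.12) drags the plain mean down sharply.
scores = [0.95, 0.91, 0.88, 0.90, 0.12, 0.93, 0.89, 0.87, 0.92, 0.99]

plain = sum(scores) / len(scores)   # 0.836 — distorted by the outlier
robust = trimmed_mean(scores)       # 0.90625 — outlier and top score trimmed
```

With ten tasks and a 10% cut, exactly one score is dropped from each tail, so a single catastrophic failure no longer dominates the comparison between models.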

Published

2025-09-25

Section

College of Engineering and Computing: Department of Information Sciences and Technology