A Retrieval-Augmented Generation-Powered Chatbot for Air Force Policy and Logistics Compliance

Authors

  • Vedant Mahawar, Eastside Preparatory School, Kirkland, WA
  • Sanya Bhalla, Thomas Jefferson High School for Science and Technology, Alexandria, VA
  • Rounawk Sinha, Chantilly High School, Greenbriar, VA
  • Jiwoo Lee, Thomas Jefferson High School for Science and Technology, Alexandria, VA
  • Sheryl Zhang, Fairfax High School, Fairfax, VA
  • Lieutenant Colonel John McKee, Air Force CyberWorx, Air Force Academy, CO
  • Mihai Boicu, Department of Information Sciences and Technology, George Mason University, Fairfax, VA
  • Kamaljeet Sanghera, Department of Information Sciences and Technology, George Mason University, Fairfax, VA

DOI:

https://doi.org/10.13021/jssr2025.5178

Abstract

Accelerated innovation by foreign adversaries and recent personnel cuts mandated by the U.S. government have intensified time pressure on U.S. Department of the Air Force (DAF) personnel, highlighting the need to reduce administrative overload. A specific pain point is the more than 11,400 regulatory publications that DAF personnel must interpret and comply with. Previous work has shown the potential of Retrieval-Augmented Generation (RAG) chatbots for complex regulatory documents, and the Air Force Research Laboratory has attempted to implement a chatbot for the U.S. Armed Forces, NIPRGPT. However, this solution is slow, computationally intensive, and prone to hallucinations. This work therefore aims to provide a faster alternative that can answer complex questions about these regulatory publications with specific citations.

 

Because of its large scope and recent updates, DAF Manual 36-2664 was selected for testing with 17 questions that span various difficulty levels and cover plain text, tables, and images. Because the document is structured as bullets, semantic chunking was used to divide it into chunks averaging 63 words. Several bi-encoder Sentence Transformer models were evaluated, with the all-mpnet-base-v2 model achieving the highest mean reciprocal rank (MRR) of 0.418, outperforming the next best model by 0.072. Cross-encoding with the ms-marco-MiniLM-L6-v2 model further improved MRR to 0.528. Then, because of its strong benchmark performance, the Llama-3.1-8B-Instruct small language model (SLM) was integrated into the system, enabling concise, human-like answers to queries. Ten additional questions regarding DAF Manual 36-2664 were formulated, and the system achieved 87.80% of NIPRGPT’s accuracy while responding 18.18x faster, a 15.96x increase in overall utility. Further testing on a different document, DAF Instruction 36-2903, also favored the system: on 12 questions, it achieved 255.56% of NIPRGPT’s accuracy and responded 9.43x faster, a 24.10x overall improvement. These findings suggest that small-scale RAG systems can meet the DAF’s growing need to reduce administrative overload. Future research will explore agentic RAG, in which another SLM selects the best RAG techniques based on document characteristics and system demands.
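
To make the reported metrics concrete, the following minimal Python sketch shows how mean reciprocal rank is conventionally computed and how the overall-utility figures can be reproduced. It rests on two assumptions that the abstract does not state: that each question has a single relevant chunk, and that overall utility is the product of relative accuracy and speedup (a reading that matches the reported 15.96x and 24.10x). The function names and toy chunk IDs are hypothetical.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    ranked_lists: Sequence[Sequence[Hashable]],
    relevant_ids: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the relevant chunk per query (0 if it is not retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id == relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


def overall_utility(relative_accuracy: float, speedup: float) -> float:
    """Assumed definition: relative accuracy multiplied by speedup."""
    return relative_accuracy * speedup


# Toy retrieval results (hypothetical chunk IDs), one ranked list per question.
rankings = [["c3", "c1", "c7"], ["c2", "c9", "c4"], ["c5", "c6", "c8"]]
gold = ["c1", "c2", "c0"]  # the relevant chunk for the third question is never retrieved
mrr = mean_reciprocal_rank(rankings, gold)
print(mrr)  # (1/2 + 1 + 0) / 3 = 0.5

# Figures from the abstract: relative accuracy and speedup versus NIPRGPT.
print(round(overall_utility(0.8780, 18.18), 2))  # 15.96 (DAF Manual 36-2664)
print(round(overall_utility(2.5556, 9.43), 2))   # 24.1  (DAF Instruction 36-2903)
```

Under this reading, a re-ranking step raises MRR only when it moves the relevant chunk closer to the top of each list, which is consistent with the reported improvement from 0.418 to 0.528 after cross-encoding.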

Published

2025-09-25

Issue

Section

College of Engineering and Computing: Department of Information Sciences and Technology