A Retrieval-Augmented Generation-Powered Chatbot for Air Force Policy and Logistics Compliance
DOI:
https://doi.org/10.13021/jssr2025.5178
Abstract
Accelerated innovation by foreign adversaries and recent personnel cuts mandated by the U.S. government have intensified time pressure on U.S. Department of the Air Force (DAF) personnel, highlighting the need to reduce administrative overload. A specific pain point is the more than 11,400 regulatory publications that DAF personnel must interpret and comply with. Prior work has shown the potential of Retrieval-Augmented Generation (RAG) chatbots for complex regulatory documents, and the Air Force Research Laboratory has attempted to implement one such chatbot for the U.S. Armed Forces, NIPRGPT. However, this solution is slow, computationally intensive, and prone to hallucinations. This work therefore aims to provide a faster alternative that can answer complex questions about these regulatory publications with specific citations.
Because of its large scope and recent updates, DAF Manual 36-2664 was selected for testing with 17 questions spanning various difficulty levels and content types (plain text, tables, and images). Because the document is structured as nested bullets, semantic chunking was used to divide it, yielding an average chunk length of 63 words. Several bi-encoder Sentence Transformer models were evaluated, with the all-mpnet-base-v2 model achieving the highest mean reciprocal rank (MRR) of 0.418, outperforming the next best model by 0.072. Reranking with the ms-marco-MiniLM-L6-v2 cross-encoder further improved MRR to 0.528. Then, due to its strong benchmark performance, the Llama-3.1-8B-Instruct small language model (SLM) was integrated into the system, enabling concise, human-like answers to queries. Ten additional questions regarding DAF Manual 36-2664 were formulated, and the system achieved 87.80% of NIPRGPT’s accuracy while responding 18.18x faster, a 15.96x increase in overall utility. Further testing on a different DAF document, DAF Instruction 36-2903, showed similarly favorable results. On 12 questions, the system achieved 255.56% of NIPRGPT’s accuracy and responded 9.43x faster, resulting in a 24.10x overall improvement. These findings suggest that small-scale RAG systems can meet the DAF’s growing need to reduce administrative overload. Future research will explore agentic RAG, where another SLM selects the best RAG techniques based on document characteristics and system demands.
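The two metrics reported above can be illustrated with a minimal Python sketch. It assumes the standard definition of MRR (the mean of 1/rank of the first relevant chunk per question, counting 0 when no relevant chunk is retrieved). It also assumes that "overall utility" is the product of relative accuracy and speedup; the abstract does not state this definition, but it is consistent with the reported figures. The function names and the example ranks are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank over all questions.

    Each entry is the 1-based rank of the first relevant chunk for one
    question, or None if no relevant chunk was retrieved (contributes 0).
    """
    if not first_relevant_ranks:
        raise ValueError("at least one question is required")
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank is not None)
    return total / len(first_relevant_ranks)


def overall_utility(relative_accuracy: float, speedup: float) -> float:
    """Assumed definition: relative accuracy multiplied by speedup."""
    return relative_accuracy * speedup


# Hypothetical ranks for four questions (not data from the study).
example_ranks = [1, 2, None, 4]
print(f"MRR: {mean_reciprocal_rank(example_ranks):.4f}")

# Figures reported in the abstract, expressed as ratios to NIPRGPT.
print(f"DAF Manual 36-2664: {overall_utility(0.8780, 18.18):.2f}x")
print(f"DAF Instruction 36-2903: {overall_utility(2.5556, 9.43):.2f}x")
```

Under the assumed product definition, the last two lines reproduce the 15.96x and 24.10x figures quoted in the abstract.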
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.