Leveraging the Table of Contents to Improve Efficiency in a RAG-Powered Chatbot
DOI:
https://doi.org/10.13021/jssr2025.5163

Abstract
Demand for automated document processing has grown across sectors, drawing attention to Retrieval-Augmented Generation (RAG) systems, which combine large language models with document search to provide answers exclusively from a specific knowledge base. However, RAG over large documents can be inefficient, suffering from high latency and memory usage. This research uses the Table of Contents (TOC) to narrow the search space, examining two techniques: Keyword-First Pre-filtering and TOC Query Routing. Keyword-First Pre-filtering matches query keywords against TOC entries to exclude irrelevant parts of the document, sharply reducing unnecessary computation; when keywords do not match exactly, the technique can fall back on semantic matching. Conversely, TOC Query Routing processes the whole document but uses the TOC to dynamically guide focus to particular sections. To measure impact, 27 natural-language queries relating to the DAFMAN 36-2664 policy document, created by Lt. Col. John McKee, were fed into each model, and accuracy, average latency, and average memory usage were recorded for all models. All reported figures are relative to a near-identical baseline model that used brute-force RAG. The baseline used Apache Tika for text extraction, then parsed and chunked the full document; the chunks were embedded with the all-MiniLM-L6-v2 sentence-transformer model and stored in ChromaDB, and Meta's LLaMA 3 8B model then generated context-aware answers. Keyword-First Pre-filtering reduced latency by 54% and memory usage by 87% while maintaining answer accuracy; TOC Query Routing reduced latency by 32%, also while maintaining answer accuracy.
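A minimal sketch of the Keyword-First Pre-filtering idea described above: query keywords are matched against TOC section titles to select which sections' chunks are searched, with a fallback when no keyword matches exactly. All names here are illustrative, and simple fuzzy string similarity stands in for the paper's semantic matching; this is not the authors' implementation.

```python
import difflib

def keyword_prefilter(query: str, toc: dict[str, list[str]]) -> list[str]:
    """Return TOC section titles likely relevant to the query.

    toc maps each section title to the ids of its document chunks, so the
    returned titles define a reduced search space for retrieval.
    """
    stop = {"the", "a", "an", "of", "in", "to", "for", "and", "is", "how", "what"}
    q_terms = {w.strip("?.,").lower() for w in query.split()} - stop

    # Exact keyword match: keep any section sharing a term with the query.
    hits = [title for title in toc
            if q_terms & ({w.lower() for w in title.split()} - stop)]
    if hits:
        return hits

    # Fallback (stand-in for semantic matching): rank titles by
    # fuzzy string similarity to the query and keep the best one.
    return sorted(toc, key=lambda t: difflib.SequenceMatcher(
        None, query.lower(), t.lower()).ratio(), reverse=True)[:1]

toc = {
    "Leave Accrual": ["c1", "c2"],
    "Uniform Standards": ["c3"],
    "Fitness Assessments": ["c4", "c5"],
}
print(keyword_prefilter("How is leave accrual computed?", toc))
```

Only chunks belonging to the returned sections would then be embedded or retrieved, which is what drives the latency and memory savings reported above.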
These findings suggest that TOC-driven strategies can significantly improve the efficiency of RAG systems without compromising accuracy (with Keyword-First Pre-filtering being especially promising), making them well suited to environments like the Department of the Air Force, where speed and resource constraints are critical. Future work could explore how to limit the search space effectively for queries whose relevant chunks are spread across disparate areas of the document.
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.