QuCo-RAG Improves Dynamic RAG Performance by 14 Points Using Pre-training Corpus Statistics

A joint research team from the University of Illinois Chicago, New York University, and Monash University has introduced QuCo-RAG, a novel approach to Dynamic Retrieval-Augmented Generation (RAG). This method quantifies uncertainty using objective statistics derived from pre-training corpora, moving away from reliance on a large language model's (LLM) internal signals.
QuCo-RAG achieved a 5-14 Exact Match (EM) point improvement on OLMo series models across multi-hop question-answering benchmarks. The method's effectiveness also extended to models with undisclosed pre-training data, including Llama-3, Qwen2.5, GPT-4.1, and GPT-5.
Addressing LLM Hallucination
Dynamic RAG aims to mitigate LLM hallucination by adaptively determining when to retrieve information. Existing methods typically depend on internal model signals such as logits, entropy, or attention. However, LLMs often exhibit poor signal calibration, frequently displaying high confidence in incorrect outputs, a phenomenon termed "confident hallucination." Recent theoretical work by Kalai & Vempala (2024) suggests that even perfectly calibrated models may hallucinate for rare facts to maintain statistical consistency.

QuCo-RAG's core insight is that an LLM's factual knowledge is shaped by its pre-training corpus. The system identifies two key indicators of potential hallucination:
Low-frequency entities: If an entity appears infrequently in the pre-training corpus, the model may struggle to reliably recall information about it.
Zero co-occurrence: If two entities never appear together in the pre-training corpus, any claimed relationship between them by the model is likely a hallucination, as it lacks supporting evidence from the training data.
This approach shifts from assessing subjective internal confidence to objective corpus statistics.

QuCo-RAG Framework and Implementation
QuCo-RAG employs a two-stage detection mechanism to quantify uncertainty:
Pre-Generation Knowledge Assessment
Before an LLM begins generating a response, QuCo-RAG extracts key entities from the input question. It then queries the occurrence frequency of each entity within the 4 trillion-token pre-training corpus. If the average frequency falls below a predefined threshold (e.g., 1,000 times), retrieval is triggered. This stage targets "long-tail knowledge" associated with low-frequency entities.
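In pseudocode, this first stage reduces to a threshold test over corpus frequencies. The sketch below is illustrative only: the `corpus_count` lookup, `should_retrieve_pregen` name, and the toy entity counts are assumptions standing in for the system's actual entity extractor and Infini-gram interface.

```python
# Stage 1 sketch: trigger retrieval when the question's entities are rare
# in the pre-training corpus. `corpus_count` is a hypothetical stand-in
# for an Infini-gram frequency query.

FREQ_THRESHOLD = 1000  # example threshold cited in the article

def should_retrieve_pregen(question_entities, corpus_count):
    """Return True if the average entity frequency falls below the threshold."""
    if not question_entities:
        return False
    counts = [corpus_count(e) for e in question_entities]
    avg_freq = sum(counts) / len(counts)
    return avg_freq < FREQ_THRESHOLD

# Toy corpus statistics (invented numbers for illustration).
toy_counts = {"Eiffel Tower": 250_000, "Jan Karski": 420}
lookup = lambda entity: toy_counts.get(entity, 0)

print(should_retrieve_pregen(["Eiffel Tower"], lookup))  # frequent: no retrieval
print(should_retrieve_pregen(["Jan Karski"], lookup))    # long-tail: retrieve
```

Averaging over all extracted entities means a single rare entity in an otherwise common question can still pull the score below the threshold and trigger retrieval.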
Runtime Claim Verification
During the generation process, QuCo-RAG continuously monitors each generated sentence. A lightweight 0.5B model extracts knowledge triplets (head entity, relation, tail entity). The system then queries the co-occurrence count of the head and tail entities in the pre-training corpus. If the co-occurrence count is zero, retrieval is triggered and the offending sentence is regenerated. This stage identifies instances where the model fabricates relationships not present in the training data.
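The second stage can be sketched as a filter over extracted triplets. Everything below is a toy stand-in: the `flag_unsupported` name, the co-occurrence table, and the example claims are assumptions, not the 0.5B extractor's real output or the Infini-gram API.

```python
# Stage 2 sketch: flag any triplet whose head and tail entities never
# co-occur in the pre-training corpus. `cooccurrence_count` is a
# hypothetical stand-in for an Infini-gram co-occurrence query.

def flag_unsupported(triplets, cooccurrence_count):
    """Return triplets with zero corpus co-occurrence (likely hallucinations)."""
    return [(h, r, t) for (h, r, t) in triplets
            if cooccurrence_count(h, t) == 0]

# Toy co-occurrence table (invented numbers for illustration).
pair_counts = {("Marie Curie", "Warsaw"): 1_800}
count = lambda head, tail: pair_counts.get((head, tail), 0)

claims = [("Marie Curie", "born_in", "Warsaw"),
          ("Marie Curie", "studied_at", "Harvard")]
flagged = flag_unsupported(claims, count)
# Any flagged triplet triggers retrieval and sentence regeneration.
print(flagged)
```

A non-empty result is what triggers retrieval: the supported claim passes, while the fabricated one surfaces as a zero-count pair.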
To enable real-time queries on a 4 trillion-token corpus, QuCo-RAG utilizes the Infini-gram engine, an index system based on suffix arrays that supports millisecond-level frequency and co-occurrence queries. The lightweight triplet extraction model was distilled from GPT-4o-mini and fine-tuned on Qwen2.5-0.5B-Instruct. Analysis of component runtime shows that LLM generation dominates (55-74%), while Infini-gram queries account for 18-31%, indicating moderate overhead.

Experimental Results and Transferability
QuCo-RAG demonstrated consistent performance improvements across various models and datasets.
On the OLMo-2 series (7B, 13B, 32B), QuCo-RAG achieved EM improvements of 5-12 points on multi-hop QA benchmarks such as 2WikiMultihopQA and HotpotQA. For example, OLMo-2-13B saw a 12.0 EM increase on 2WikiMultihopQA. In contrast, internal signal-based methods like FLARE and DRAGIN showed unstable performance, sometimes underperforming simple single-round retrieval (SR-RAG). The OLMo-2 series was chosen for primary experiments due to its publicly available 4 trillion-token pre-training corpus, which allows for accurate calculation of entity statistics.
The research also explored cross-model transferability, addressing scenarios where a model's pre-training data is not public. By using OLMo-2's corpus as a "proxy corpus," QuCo-RAG achieved significant improvements on models like Qwen2.5, Llama-3, GPT-4.1, and GPT-5. For instance, Qwen2.5-32B showed a 14.1 EM improvement on 2WikiMultihopQA, and GPT-5-chat saw an 8.7 EM improvement on the same benchmark.
QuCo-RAG also demonstrated efficiency, averaging only 1.70 retrievals per question, consuming 87 tokens, and making 1.84 LLM calls. This contrasts with methods like FS-RAG and DRAGIN, which consumed 2-4 times more tokens while achieving lower performance.
In the biomedical domain, QuCo-RAG achieved 66.4% accuracy on the PubMedQA benchmark, surpassing Wo-RAG by 11.2 percentage points. Internal signal methods in this specialized domain exhibited either over-retrieval or under-retrieval, leading to inconsistent accuracy.

Implications and Future Directions
The study establishes corpus statistics as an objective alternative to internal model uncertainty signals, offering a more reliable measure of uncertainty. This paradigm shift has implications for AI safety and robustness beyond RAG systems.
Potential applications include selective answering, where models can refuse to answer questions with insufficient evidence, and correctness prediction, where corpus statistics provide evidence-based confidence scores for generated claims.
Corpus statistical analysis can also guide data-centric AI by identifying knowledge blind spots. This signal can inform training data curation, synthetic data filtering, and model editing.
Future research directions include multilingual verification, handling knowledge evolution with timestamped corpora, extending the method to verify events, relationships, and numerical claims, and integrating QuCo-RAG as a self-verification tool in agent systems. The effectiveness of cross-model transfer also raises questions about information-theoretic bounds for hallucination probability given corpus statistics.
