Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models
On the Scoring Functions for RAG-based Conformal Factuality
Yi Chen · Caitlyn Yin · Sukrut Chikodikar · Ramya Vinayak
Keywords: [ conformal prediction ] [ scoring function ] [ conformal factuality ] [ RAG ] [ LLM ]
Large language models (LLMs), despite their effectiveness, often produce hallucinated or non-factual outputs. To mitigate this, conformal factuality frameworks use scoring functions to filter model-generated claims and provide statistical factuality guarantees. This study systematically investigates scoring functions in a retrieval-augmented generation (RAG) setting, where a reference text and a query inform the LLM's responses. We evaluate three scoring methods (non-reference model confidence, reference model confidence, and entailment scores) using empirical factuality, power, and false positive rate as metrics. We also assess the robustness of these scoring functions when the data exchangeability assumption is mildly violated by injecting deliberately hallucinated claims. Our findings show that reference model confidence scores generally outperform the other methods, achieving higher power and better robustness, while entailment-based scoring yields the lowest false positive rate under induced hallucinations. This work highlights the importance of scoring function selection for factuality and robustness in RAG-based conformal frameworks.
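To make the filtering step concrete, below is a minimal sketch of split-conformal calibration for claim filtering. It assumes each sub-claim has already been assigned a score by one of the scoring functions above, and that calibration responses carry per-claim factuality labels; the function names and data layout are hypothetical illustrations, not the paper's implementation.

    import numpy as np

    def calibrate_threshold(cal_responses, alpha=0.1):
        """Split-conformal calibration for claim filtering.

        cal_responses: list over calibration responses; each entry is a list
        of (score, is_factual) pairs, one per sub-claim.
        Returns a threshold tau such that keeping only claims with score > tau
        yields an all-factual output with probability >= 1 - alpha, under
        exchangeability of calibration and test responses.
        """
        # Per-response conformal score: the highest score assigned to any
        # non-factual claim; filtering succeeds exactly when tau exceeds it.
        s = []
        for claims in cal_responses:
            bad = [score for score, ok in claims if not ok]
            s.append(max(bad) if bad else -np.inf)
        n = len(s)
        # Conformal quantile with the finite-sample (n + 1) correction.
        q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        return float(np.quantile(s, q, method="higher"))

    def filter_claims(claims_with_scores, tau):
        """Keep only claims whose score clears the calibrated threshold."""
        return [claim for claim, score in claims_with_scores if score > tau]

Under this construction, swapping in non-reference confidence, reference confidence, or entailment scores changes only the per-claim scores fed to calibrate_threshold, and the injected hallucinations studied here probe how the guarantee degrades when the exchangeability assumption behind the quantile step is mildly violated.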