Retrieval-Augmented Generation (RAG) should make your LLM safer and more factual, yet in production many teams still get burned by subtle hallucinations. As of November 2025, leading APIs like OpenAI’s GPT‑5.x chat completions and Google’s Gemini 2.5 models expose token-level log probabilities (“logprobs”), giving you a powerful signal: how confident the model really was in each piece of its answer. This guide shows how to turn that raw signal into a practical hallucination detector for RAG. You’ll learn how logprobs work, how to wire them into modern APIs, and how to build a lightweight confidence filter that keeps dubious, fabricated answers away from your users.
Evergreen context: why logprobs matter for RAG hallucinations
Hallucinations in RAG systems are especially damaging because users assume retrieved citations guarantee correctness. Research in 2024–2025 on hallucination detection in RAG (e.g., Cleanlab’s RAG benchmarking, Stanford’s legal RAG study, ReDeEP 2025) repeatedly finds that:
- RAG reduces hallucinations but does not eliminate them.
- Heuristic detectors (e.g., pattern-matching on citations) miss many subtle fabrications.
- Uncertainty signals from the model itself (token probabilities, entropy, margin scores) are strong predictors of wrong answers.
At the same time, platform support for logprobs has matured:
- OpenAI: The Chat Completions API (e.g., gpt‑4.1‑2025‑04‑14, gpt‑5.1) supports logprobs and top_logprobs for token-level probabilities, per the 2025 API reference.
- Google Gemini: A July 2025 Google Developers post introduces response_logprobs and logprobs for Gemini 2.5 models on Vertex AI, including a quantitative RAG evaluation example.
- Local / open models: Frameworks like Ollama (v0.12+), vLLM, and others expose token logprobs in their APIs or logs.
This makes it realistic to add logprobs-based hallucination detection to production RAG without exotic infrastructure or extra models.
Logprobs 101: turning token likelihoods into confidence scores
A language model predicts the next token with a probability between 0 and 1. Instead of returning that value directly, most APIs expose its natural logarithm, called the log probability or logprob:
- Probability 1.0 → logprob 0.0 (max confidence)
- Probability 0.5 → logprob ≈ −0.69
- Probability 0.01 → logprob ≈ −4.6
Key implications for hallucination detection:
- Closer to zero = higher confidence. More negative values mean the model was less sure about that token.
- Sequence logprob is the sum of token logprobs; average sequence logprob is that sum divided by token count, which normalizes for length.
- Local dips (a cluster of low-probability tokens) often mark semantic uncertainty, even when the overall answer looks fluent.
“A logprob score closer to 0 indicates higher model confidence in its choice.”
Google Developers Blog, “Unlock Gemini’s reasoning with logprobs,” July 2025
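To make the arithmetic concrete, here is a small worked sketch. The four logprob values are invented for illustration and do not come from a real model response.

```python
import math

# Hypothetical per-token logprobs for a four-token answer (illustrative values).
token_logprobs = [-0.05, -0.10, -2.30, -0.02]

# Sequence logprob: the sum of token logprobs (log of the joint probability).
sequence_logprob = sum(token_logprobs)

# Average logprob: the sum divided by token count, which normalizes for length.
average_logprob_value = sequence_logprob / len(token_logprobs)

# exp(average logprob) is the geometric mean of the token probabilities.
geometric_mean_prob = math.exp(average_logprob_value)

# Local dip: index of the single least confident token.
weakest_index = min(range(len(token_logprobs)), key=token_logprobs.__getitem__)

print(f"sequence={sequence_logprob:.2f} average={average_logprob_value:.4f}")
print(f"geometric mean p={geometric_mean_prob:.3f} weakest token index={weakest_index}")
```

Three of the four tokens are near-certain, so the average still looks healthy. Only the per-token view exposes the dip at the third token, which is why span-level scoring matters.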
For RAG hallucinations, we care about where the model is uncertain, not just whether the entire answer was low-confidence. That means we’ll:
- Compute global confidence (average logprob across the answer) for simple allow/deny.
- Compute span-level confidence to flag specific sentences or facts as dubious.
How to get logprobs from modern LLM APIs
OpenAI Chat Completions (gpt‑4.1, gpt‑5.x)
As of 2025, OpenAI’s Chat Completions endpoint supports logprobs and top_logprobs for most chat models:
```python
from openai import OpenAI

client = OpenAI()

# user_question_with_context: the question plus retrieved context, built elsewhere
completion = client.chat.completions.create(
    model="gpt-4.1-2025-04-14",
    messages=[
        {"role": "system", "content": "You are a RAG answerer. Cite sources."},
        {"role": "user", "content": user_question_with_context},
    ],
    logprobs=True,    # return logprobs for each output token
    top_logprobs=5,   # also return 5 most likely alternatives per token
    max_completion_tokens=512,
    temperature=0.2,
)

choice = completion.choices[0]
# In the Python SDK, logprobs is a typed object; read its fields as attributes
tokens = choice.logprobs.content  # list of per-token entries
```
Exact shapes vary by SDK, but typically each token structure exposes:
- token or text
- logprob (the chosen token’s log probability)
- top_logprobs: a list/dict of alternative tokens and their logprobs
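Because the shape differs between SDK objects and raw JSON, one option is to normalize every token into a plain dict before scoring. The helper below is a sketch: the names `_field` and `normalize_token` are hypothetical, and it assumes only that each token exposes the fields listed above as either attributes or dict keys.

```python
from types import SimpleNamespace

def _field(obj, *names):
    """Return the first field found among `names`, as a dict key or an attribute."""
    for name in names:
        if isinstance(obj, dict):
            if name in obj:
                return obj[name]
        elif hasattr(obj, name):
            return getattr(obj, name)
    return None

def normalize_token(tok):
    """Convert one provider token record into {'token', 'logprob', 'top_logprobs'}."""
    alternatives = _field(tok, "top_logprobs") or []
    if isinstance(alternatives, dict):  # some APIs return {token: logprob}
        alternatives = [{"token": k, "logprob": v} for k, v in alternatives.items()]
    return {
        "token": _field(tok, "token", "text"),
        "logprob": _field(tok, "logprob"),
        "top_logprobs": [
            {"token": _field(a, "token", "text"), "logprob": _field(a, "logprob")}
            for a in alternatives
        ],
    }

# Stand-ins for an SDK object and a raw JSON dict (not real API responses).
sdk_style = SimpleNamespace(token=" Paris", logprob=-0.02, top_logprobs=[])
json_style = {"text": " Paris", "logprob": -0.02}
print(normalize_token(sdk_style) == normalize_token(json_style))
```

Normalizing once at the API boundary keeps the scoring functions free of provider-specific branches.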
Gemini 2.5 on Vertex AI
Google’s 2025 Gemini logprobs guide shows how to enable log probabilities via response_logprobs and logprobs in GenerateContentConfig:
```python
from google import genai
from google.genai.types import GenerateContentConfig

# PROJECT_ID and rag_prompt are assumed to be defined elsewhere
client = genai.Client(vertexai=True, project=PROJECT_ID, location="global")
MODEL_ID = "gemini-2.5-flash"

response = client.models.generate_content(
    model=MODEL_ID,
    contents=rag_prompt,
    config=GenerateContentConfig(
        response_logprobs=True,
        logprobs=5,  # number of alternative tokens to return
    ),
)

logprobs_result = response.candidates[0].logprobs_result
chosen = logprobs_result.chosen_candidates  # token-level logprobs
```
Google’s example explicitly uses this for “quantitative RAG evaluation”, correlating higher average logprobs with better retrieval quality (good vs poor vs no retrieval).
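To reuse one scoring path across providers, the Gemini result can be flattened into the same token/logprob dicts used for OpenAI. This is a sketch under two assumptions: each entry in chosen_candidates exposes `token` and `log_probability` attributes (check your SDK version), and `gemini_tokens_to_dicts` is a hypothetical helper name. The `SimpleNamespace` objects only stand in for a real response.

```python
from types import SimpleNamespace

def gemini_tokens_to_dicts(logprobs_result):
    """Flatten chosen_candidates into [{'token': str, 'logprob': float}, ...]."""
    if logprobs_result is None:
        return []
    return [
        {"token": c.token, "logprob": c.log_probability}
        for c in (logprobs_result.chosen_candidates or [])
    ]

# Stand-in for response.candidates[0].logprobs_result (illustrative values).
fake_result = SimpleNamespace(
    chosen_candidates=[
        SimpleNamespace(token="Paris", log_probability=-0.03),
        SimpleNamespace(token=".", log_probability=-0.40),
    ]
)
gemini_tokens = gemini_tokens_to_dicts(fake_result)
print(gemini_tokens)
```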
Local LLMs, vLLM, and others
For self-hosted RAG, many runtimes now expose logprobs:
- Ollama (v0.12.11 and later) returns logprobs in both its native and OpenAI-compatible APIs.
- vLLM supports prompt_logprobs and output logprobs via its Python and HTTP interfaces.
- Text Generation Inference (TGI) and other inference servers typically expose logits or logprobs in streaming metadata.
Implementation details differ, but the detection logic you’ll implement is the same: iterate over output tokens, aggregate logprobs, and compute scores.
Architecture: adding a logprobs-based hallucination detector to RAG
In a production-ready RAG pipeline with logprobs-based hallucination detection, the key addition is a confidence scoring module that:
- Consumes tokens and logprobs from the LLM.
- Aligns tokens to sentences or spans.
- Computes multiple scores (global, sentence-level, margin/entropy).
- Feeds a simple decision policy (allow, clarify, or block & fallback).
Step-by-step flow
- User query comes in.
- Retriever fetches top-k relevant chunks from a vector DB.
- You build a RAG prompt with instructions, context, and the question.
- You call the LLM with logprobs enabled (logprobs=True, top_logprobs, or Gemini’s response_logprobs).
- The LLM returns answer text + token logprobs.
- Your confidence module parses tokens and computes:
  - Average logprob for the whole answer.
  - Average logprob per sentence/span.
  - Optional: entropy/margin scores per token or span.
- A decision layer applies thresholds:
  - High confidence → return as-is.
  - Medium confidence → add disclaimer or ask user to confirm.
  - Low confidence → return safe fallback or say “I don’t know,” log for review.
Next we’ll get concrete with code and thresholds.
Implementing a simple logprobs-based RAG hallucination detector
1. Request logprobs from your LLM
Using OpenAI’s chat completions as an example (Python SDK, 2025):
```python
from openai import OpenAI
import math

client = OpenAI()

def generate_with_logprobs(context_chunks, question):
    context = "\n\n".join(context_chunks)
    rag_prompt = (
        "You are a RAG assistant. Answer using ONLY the information in the context. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    completion = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",
        messages=[{"role": "user", "content": rag_prompt}],
        logprobs=True,
        top_logprobs=5,
        max_completion_tokens=256,
        temperature=0.2,
    )
    choice = completion.choices[0]
    text = choice.message.content
    # The SDK returns typed objects; convert them to plain dicts so the
    # scoring helpers can index them by key.
    content = choice.logprobs.content if choice.logprobs else None
    token_infos = [
        {
            "token": t.token,
            "logprob": t.logprob,
            "top_logprobs": [
                {"token": alt.token, "logprob": alt.logprob} for alt in t.top_logprobs
            ],
        }
        for t in (content or [])
    ]
    return text, token_infos
```
2. Compute sequence- and span-level confidence
Next, convert per-token logprobs into useful scores. The simplest is the average logprob over the full answer.
```python
def average_logprob(tokens):
    # tokens: list of dicts with 'logprob' or similar
    lp_values = [t["logprob"] for t in tokens if t.get("logprob") is not None]
    if not lp_values:
        return None
    return sum(lp_values) / len(lp_values)

def logprob_to_probability(lp):
    if lp is None:
        return None
    return math.exp(lp)  # convert log p back to p in [0,1]
```
For hallucination detection, you should go one level deeper and score sentences. A simple approach:
- Join tokens to reconstruct the answer text.
- Split the answer into sentences (with a robust sentence splitter).
- Align tokens to sentences by character position.
- Compute average logprob per sentence.
```python
import re

def split_sentences(text):
    # naive but serviceable; swap for spaCy or similar in prod
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_confidences(text, tokens):
    sentences = split_sentences(text)
    confidences = []
    char_index = 0  # end offset in `text` of the last token matched
    token_idx = 0   # next token not yet assigned to a sentence
    for sent in sentences:
        sent_start = text.find(sent, char_index)
        sent_end = sent_start + len(sent)
        sent_tokens = []
        while token_idx < len(tokens):
            tok = tokens[token_idx]
            tok_text = tok["token"]  # provider-specific
            tok_start = text.find(tok_text, char_index)
            if tok_start == -1:
                # token text not found (e.g., a partial-byte token): skip it
                # so it cannot stall alignment of the remaining sentences
                token_idx += 1
                continue
            if tok_start >= sent_end:
                break  # token belongs to a later sentence
            sent_tokens.append(tok)
            char_index = tok_start + len(tok_text)
            token_idx += 1
        avg_lp = average_logprob(sent_tokens)
        confidences.append({"sentence": sent, "avg_logprob": avg_lp})
    return confidences
```
This doesn’t need to be perfect; you mostly want relative differences between sentences. Sentences with much lower average logprobs are prime hallucination suspects.
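If character-offset alignment proves brittle, a different technique is to segment directly on token boundaries. The sketch below is not the approach above: it closes a sentence whenever a token ends in sentence punctuation and is followed by whitespace or the end of the answer. The function name `sentence_scores_by_tokens` and the token dict shape are assumptions, and abbreviations such as "U.S." will still split incorrectly.

```python
def sentence_scores_by_tokens(tokens):
    """tokens: list of {'token': str, 'logprob': float} in generation order."""
    scores, current = [], []

    def flush():
        # Close the current sentence and record its average logprob.
        values = [t["logprob"] for t in current if t.get("logprob") is not None]
        sentence = "".join(t["token"] for t in current).strip()
        if sentence:
            scores.append({
                "sentence": sentence,
                "avg_logprob": sum(values) / len(values) if values else None,
            })
        current.clear()

    for i, tok in enumerate(tokens):
        current.append(tok)
        text = tok["token"]
        next_text = tokens[i + 1]["token"] if i + 1 < len(tokens) else ""
        ends_sentence = text.rstrip().endswith((".", "!", "?"))
        # Requiring whitespace (or end of answer) keeps "2.1" in one sentence.
        at_boundary = not next_text or next_text[:1].isspace() or text[-1:].isspace()
        if ends_sentence and at_boundary:
            flush()
    flush()  # trailing tokens without final punctuation
    return scores

# Illustrative tokens; the logprob values are invented.
demo = [
    {"token": "Paris", "logprob": -0.1}, {"token": " is", "logprob": -0.1},
    {"token": " big", "logprob": -0.1}, {"token": ".", "logprob": -0.1},
    {"token": " It", "logprob": -0.5}, {"token": " has", "logprob": -0.5},
    {"token": " 2", "logprob": -2.0}, {"token": ".", "logprob": -2.0},
    {"token": "1", "logprob": -2.0}, {"token": " million", "logprob": -1.0},
    {"token": " people", "logprob": -0.5}, {"token": ".", "logprob": -0.5},
]
demo_scores = sentence_scores_by_tokens(demo)
for row in demo_scores:
    print(row)
```

Because every token is assigned to exactly one sentence by construction, this variant cannot drop or double-count tokens, at the cost of a cruder sentence splitter.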
3. Define practical thresholds
Thresholds should be tuned per model and domain using held-out data, but you can start with pragmatic defaults. Vellum’s 2024 logprobs guide and Google’s Gemini RAG example both show clear separation between “good retrieval” and “no retrieval” when comparing average logprobs.
| Metric | Typical signal | Example heuristic |
|---|---|---|
| Average logprob (answer) | Higher (closer to 0) when retrieval is good | If < −2.5, downgrade or block answer |
| Average logprob (sentence) | Outlier sentences often dubious | If sentence >0.8 below answer avg, mark as low-confidence |
| Token-level margin | Small gap between top-2 tokens = ambiguity | If many tokens have margin < 0.5, treat span as uncertain |
Here’s how you might implement a simple decision function:
```python
def classify_answer_confidence(text, tokens,
                               low_lp_threshold=-2.5,
                               sentence_lp_gap=0.8):
    avg_lp = average_logprob(tokens)
    if avg_lp is None:
        return {"level": "unknown", "reason": "no_logprobs"}
    sent_scores = sentence_confidences(text, tokens)
    # collect sentences that fall well below the answer-level average
    low_sentences = [
        s for s in sent_scores
        if s["avg_logprob"] is not None
        and (avg_lp - s["avg_logprob"]) >= sentence_lp_gap
    ]
    if avg_lp < low_lp_threshold or len(low_sentences) >= 2:
        level = "low"
    elif low_sentences:
        level = "medium"
    else:
        level = "high"
    return {
        "level": level,
        "answer_avg_logprob": avg_lp,
        "sentence_scores": sent_scores,
        "low_conf_sentences": low_sentences,
    }
```
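The decision function covers the first two rows of the table; the token-level margin row can be scored separately. The sketch below makes several assumptions: the margin is measured in logprob units (the table does not state a unit), each token dict carries a `top_logprobs` list of `{"token", "logprob"}` entries, and the function names are hypothetical. The entropy is computed only over the returned alternatives, renormalized to sum to 1, so it is an approximation of the full-vocabulary entropy.

```python
import math

def token_margin(alt_logprobs):
    """Gap between the two most likely alternatives, in logprob units."""
    ranked = sorted(alt_logprobs, reverse=True)
    return ranked[0] - ranked[1] if len(ranked) >= 2 else None

def topk_entropy(alt_logprobs):
    """Shannon entropy (nats) over the returned alternatives, renormalized."""
    probs = [math.exp(lp) for lp in alt_logprobs]
    total = sum(probs)
    if total == 0:
        return None
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)

def uncertain_fraction(tokens, margin_threshold=0.5):
    """Share of tokens whose top-2 margin falls below the threshold."""
    margins = [
        token_margin([a["logprob"] for a in t.get("top_logprobs", [])])
        for t in tokens
    ]
    margins = [m for m in margins if m is not None]
    if not margins:
        return None
    return sum(m < margin_threshold for m in margins) / len(margins)

# Illustrative tokens: one ambiguous position, one confident position.
demo_tokens = [
    {"token": " 1997", "top_logprobs": [{"token": " 1997", "logprob": -0.7},
                                        {"token": " 1998", "logprob": -0.8}]},
    {"token": " Paris", "top_logprobs": [{"token": " Paris", "logprob": -0.05},
                                         {"token": " Lyon", "logprob": -3.5}]},
]
print(uncertain_fraction(demo_tokens))
```

A small margin on a date, number, or name is a stronger warning sign than the same margin on a connective, since several phrasings of a connective can all be correct.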
4. Wire confidence into your RAG UX
Once you have a confidence classification, integrate it into your response policy:
- High confidence:
  - Return the answer normally, with citations.
  - Optionally log score for monitoring.
- Medium confidence:
  - Return the answer but visually flag low-confidence sentences (e.g., yellow highlight, tooltip “model uncertain here”).
  - Append a soft disclaimer: “Some parts of this answer may be uncertain; please verify critical details.”
  - Log the example for offline review and threshold tuning.
- Low confidence:
  - Either:
    - Answer: “I’m not confident I can answer this from the provided sources” and show retrieved context, or
    - Fall back to a simpler, safe template (e.g., “This information is not available in our docs”).
  - Optionally trigger:
    - A second-pass query with different retrieval settings, or
    - Escalation to human support (for customer-facing apps).
```python
def answer_with_hallucination_guard(context_chunks, question):
    text, tokens = generate_with_logprobs(context_chunks, question)
    conf = classify_answer_confidence(text, tokens)
    if conf["level"] == "low":
        return {
            "answer": "I'm not confident I can answer this from the provided sources.",
            "confidence": conf,
            "status": "fallback",
        }
    elif conf["level"] == "medium":
        return {
            "answer": text,
            "confidence": conf,
            "status": "warn",
        }
    else:
        return {
            "answer": text,
            "confidence": conf,
            "status": "ok",
        }
```
This pattern lets you fail gracefully instead of confidently hallucinating, which is exactly what you want in production RAG.
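For the medium-confidence path, flagged sentences need to reach the UI in some form. As a minimal sketch, the hypothetical helper below prefixes each flagged sentence with a plain-text marker and appends the soft disclaimer; a real front end would more likely emit structured spans for highlighting. It assumes the `low_conf_sentences` list has the shape returned by the classification step.

```python
DISCLAIMER = "Some parts of this answer may be uncertain; please verify critical details."

def annotate_answer(text, low_conf_sentences, marker="[uncertain] "):
    """Prefix each low-confidence sentence in `text` with a visible marker."""
    annotated = text
    for item in low_conf_sentences:
        sentence = item["sentence"]
        # Replace only the first occurrence so a repeated sentence is flagged once.
        annotated = annotated.replace(sentence, marker + sentence, 1)
    if low_conf_sentences:
        annotated += "\n\n" + DISCLAIMER
    return annotated

# Illustrative input; the score is invented.
answer = "Paris is the capital. It has 9 million people."
flagged = [{"sentence": "It has 9 million people.", "avg_logprob": -3.0}]
print(annotate_answer(answer, flagged))
```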
Best practices and pitfalls when using logprobs for hallucination detection
1. Calibrate per model and domain
Logprob scales are model-dependent. A logprob of −2.0 on Gemini 2.5 Flash may not mean the same thing as −2.0 on GPT‑5.1. Before enforcing thresholds:
- Collect a labeled dataset of RAG answers (correct vs hallucinated) from your domain.
- Compute average logprob and span-level scores.
- Plot distributions and choose thresholds that balance false positives/negatives.
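The calibration steps above can be sketched as a simple threshold sweep. This example assumes you already have `(avg_logprob, is_hallucinated)` pairs from a labeled set and uses F1 on the hallucinated class as the selection criterion; both the function name `pick_threshold` and the choice of F1 are illustrative, and the sample data is invented.

```python
def pick_threshold(samples):
    """samples: list of (avg_logprob, is_hallucinated). Returns (threshold, f1).

    An answer is flagged when its average logprob is strictly below the threshold.
    """
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted({lp for lp, _ in samples}):
        tp = sum(1 for lp, bad in samples if lp < threshold and bad)
        fp = sum(1 for lp, bad in samples if lp < threshold and not bad)
        fn = sum(1 for lp, bad in samples if lp >= threshold and bad)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1

# Invented labeled data: (average logprob, was the answer hallucinated?)
labeled = [
    (-0.3, False), (-0.5, False), (-0.9, False), (-1.2, False),
    (-1.5, True), (-1.8, True), (-2.4, True), (-3.1, True),
]
threshold, f1 = pick_threshold(labeled)
print(threshold, f1)
```

Real data will rarely separate this cleanly. If false alarms and missed hallucinations have different costs in your product, swap F1 for a cost-weighted criterion.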
2. Combine logprobs with retrieval signals
Logprobs alone don’t “know” whether a fact is in your corpus; they only quantify model confidence. For RAG hallucinations, you get the best results by combining:
- Retrieval quality: e.g., similarity scores, coverage of entities in retrieved docs.
- Answer–context overlap: lexical/semantic similarity between answer and retrieved passages.
- Logprobs-based confidence: as described above.
For example, you might only trust high-logprob answers when retrieval scores are strong and the answer quotes or closely paraphrases the context.
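One way to express that combined rule is a conjunction of three checks. The sketch below uses word-set overlap as a crude stand-in for answer–context similarity; the function names and all three default thresholds are placeholders to tune on your own data, not recommended values.

```python
import re

def _words(s):
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def lexical_overlap(answer, context):
    """Fraction of distinct answer words that also appear in the context."""
    answer_words = _words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & _words(context)) / len(answer_words)

def should_trust(avg_logprob, retrieval_scores, answer, context,
                 min_logprob=-1.0, min_similarity=0.75, min_overlap=0.6):
    """Trust only when model confidence, retrieval, and grounding all pass."""
    confident = avg_logprob is not None and avg_logprob >= min_logprob
    retrieved = max(retrieval_scores, default=0.0) >= min_similarity
    grounded = lexical_overlap(answer, context) >= min_overlap
    return confident and retrieved and grounded

context = "The capital of France is Paris."
print(should_trust(-0.4, [0.82, 0.79], "Paris is the capital of France.", context))
print(should_trust(-0.4, [0.82, 0.79], "Berlin is large.", context))
```

The second call fails on grounding alone, despite a high logprob: a fluent, confident answer that the retrieved passages do not support.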
3. Beware of stylistic and boilerplate tokens
Tokens like “However,” “In conclusion,” or generic safety phrases often have high logprobs but add no factual value. To avoid inflated scores:
- Exclude leading/trailing boilerplate from scoring windows.
- Weight content words (nouns, numbers, entities) higher in your span-level averages.
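A weighted average is one way to apply both suggestions. In this sketch punctuation gets weight 0, words on a small stoplist get a reduced weight, and everything else (including numbers) gets full weight. The stoplist, the 0.2 weight, and the function names are illustrative assumptions; a production system might use part-of-speech tags or entity recognition instead.

```python
STOPWORDS = {"however", "in", "conclusion", "the", "a", "an", "is", "was", "of", "and", "to"}

def token_weight(token_text, low_weight=0.2):
    """0 for punctuation/whitespace, low_weight for stopwords, 1 for content tokens."""
    word = token_text.strip().lower()
    if not any(ch.isalnum() for ch in word):
        return 0.0
    return low_weight if word in STOPWORDS else 1.0

def weighted_average_logprob(tokens, low_weight=0.2):
    """Average logprob with content tokens weighted above boilerplate."""
    pairs = [
        (token_weight(t["token"], low_weight), t["logprob"])
        for t in tokens if t.get("logprob") is not None
    ]
    total_weight = sum(w for w, _ in pairs)
    if total_weight == 0:
        return None
    return sum(w * lp for w, lp in pairs) / total_weight

# Illustrative tokens: confident boilerplate around an uncertain number.
demo = [
    {"token": "However", "logprob": -0.01}, {"token": ",", "logprob": -0.01},
    {"token": " revenue", "logprob": -1.0}, {"token": " was", "logprob": -0.05},
    {"token": " 42", "logprob": -3.0},
]
plain = sum(t["logprob"] for t in demo) / len(demo)
weighted = weighted_average_logprob(demo)
print(f"plain={plain:.3f} weighted={weighted:.3f}")
```

Here the weighted score is noticeably lower than the plain average because the confident filler no longer dilutes the uncertain number.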
4. Streaming considerations
If you stream outputs to users, you can still track logprobs token-by-token:
- Buffer tokens and logprobs in memory.
- Compute rolling averages as the answer unfolds.
- If the rolling average drops below your low-confidence threshold early, you can:
  - Stop the stream,
  - Switch to a fallback answer, or
  - Append a prominent warning at the end.
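The rolling check can be kept independent of any particular streaming API. The class below is a sketch: `RollingConfidence`, its window size, and its minimum-token guard are assumptions, and you would call `update` with whatever logprob your provider attaches to each streamed token.

```python
from collections import deque

class RollingConfidence:
    """Track a windowed average of token logprobs during streaming."""

    def __init__(self, window=20, threshold=-2.5, min_tokens=5):
        self.values = deque(maxlen=window)  # oldest values drop off automatically
        self.threshold = threshold
        self.min_tokens = min_tokens        # avoid reacting to the first few tokens

    def update(self, logprob):
        """Add one token logprob; return True when the stream should be interrupted."""
        self.values.append(logprob)
        if len(self.values) < self.min_tokens:
            return False
        return sum(self.values) / len(self.values) < self.threshold

# Simulated stream: confident start, then a run of very unlikely tokens.
monitor = RollingConfidence(window=4, threshold=-2.5, min_tokens=3)
stream = [-0.1, -0.2, -0.1, -4.0, -5.0, -6.0]
signals = [monitor.update(lp) for lp in stream]
print(signals)
```

A short window reacts quickly but is noisy. A longer window is steadier but may let a fabricated sentence finish before the alarm fires.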
5. Respect provider limitations
Finally, check each provider’s current limitations:
- Some models (e.g., certain reasoning or vision variants) may not support logprobs.
- Anthropic’s Claude API, as of late 2024 reports, does not expose token logprobs directly; you’d need a different model or host an open model.
- High top_logprobs values add response size; keep them modest (e.g., 3–5) in production.
Conclusion: from “black box” RAG to monitored, trustworthy systems
Token-level logprobs turn your RAG system from a blind generator into a measurable, monitorable service. By exposing the model’s own uncertainty, you can:
- Detect low-confidence, likely hallucinated answers before users see them.
- Flag specific sentences or spans as dubious instead of rejecting whole answers.
- Calibrate thresholds per model and domain using real production traffic.
- Combine confidence with retrieval quality to build robust, production-ready RAG.
To apply this today:
- Enable logprobs in your current LLM API (OpenAI, Gemini, or your local stack).
- Implement the simple scoring and thresholding patterns above.
- Log confidence metrics alongside each RAG response and iterate on thresholds.
As models and APIs continue to evolve through 2025, treating logprobs as a first-class signal in your architecture is one of the simplest, highest-leverage steps you can take to keep RAG hallucinations from undermining your system’s credibility.