Detect RAG Hallucinations with Logprobs: A Practical Guide

Retrieval-Augmented Generation (RAG) should make your LLM safer and more factual, yet in production many teams still get burned by subtle hallucinations. As of November 2025, leading APIs like OpenAI’s GPT‑5.x chat completions and Google’s Gemini 2.5 models expose token-level log probabilities (“logprobs”), giving you a powerful signal: how confident the model really was in each piece of its answer. This guide shows how to turn that raw signal into a practical hallucination detector for RAG. You’ll learn how logprobs work, how to wire them into modern APIs, and how to build a lightweight confidence filter that keeps dubious, fabricated answers away from your users.

Evergreen context: why logprobs matter for RAG hallucinations

Hallucinations in RAG systems are especially damaging because users assume retrieved citations guarantee correctness. Research in 2024–2025 on hallucination detection in RAG (e.g., Cleanlab’s RAG benchmarking, Stanford’s legal RAG study, ReDeEP 2025) repeatedly finds that:

RAG reduces hallucinations but does not eliminate them.
Heuristic detectors (e.g., pattern-matching on citations) miss many subtle fabrications.
Uncertainty signals from the model itself (token probabilities, entropy, margin scores) are strong predictors of wrong answers.

At the same time, platform support for logprobs has matured:

OpenAI: The Chat Completions API (e.g., gpt‑4.1‑2025‑04‑14, gpt‑5.1) supports logprobs and top_logprobs for token-level probabilities, per the 2025 API reference.
Google Gemini: A July 2025 Google Developers post introduces response_logprobs and logprobs for Gemini 2.5 models on Vertex AI, including a quantitative RAG evaluation example.
Local / open models: Frameworks like Ollama (v0.12.11+), vLLM, and others expose token logprobs in their APIs or logs.

This makes it realistic to add logprobs-based hallucination detection to production RAG without exotic infrastructure or extra models.

Logprobs 101: turning token likelihoods into confidence scores

At each generation step, a language model assigns every candidate next token a probability between 0 and 1. Rather than return that value directly, most APIs expose its natural logarithm, called the log probability or logprob:

Probability 1.0 → logprob 0.0 (max confidence)
Probability 0.5 → logprob ≈ −0.69
Probability 0.01 → logprob ≈ −4.6

Key implications for hallucination detection:

Closer to zero = higher confidence. More negative values mean the model was less sure about that token.
Sequence logprob is the sum of token logprobs; average sequence logprob is that sum divided by token count, which normalizes for length.
Local dips (a cluster of low-probability tokens) often mark semantic uncertainty, even when the overall answer looks fluent.
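
To make the three points above concrete, here is a minimal sketch using made-up per-token probabilities (the numbers and the lowest_window helper are illustrative assumptions, not output from any real model). It shows the sum, the length-normalized average, and a simple way to locate a local dip:

```python
import math

# Hypothetical per-token probabilities for a short answer (illustrative only).
token_probs = [0.9, 0.8, 0.05, 0.1, 0.95]

# Natural log of each probability: always <= 0, closer to 0 means more confident.
token_logprobs = [math.log(p) for p in token_probs]

sequence_logprob = sum(token_logprobs)                      # more negative as length grows
average_logprob = sequence_logprob / len(token_logprobs)    # length-normalized
geometric_mean_prob = math.exp(average_logprob)             # back on the 0..1 scale

def lowest_window(logprobs, size=2):
    """Return (start_index, mean_logprob) of the lowest-scoring window of tokens."""
    best_start, best_mean = 0, float("inf")
    for start in range(len(logprobs) - size + 1):
        mean = sum(logprobs[start:start + size]) / size
        if mean < best_mean:
            best_start, best_mean = start, mean
    return best_start, best_mean

dip_start, dip_mean = lowest_window(token_logprobs)
print(f"sequence={sequence_logprob:.2f} average={average_logprob:.2f}")
print(f"lowest 2-token window starts at index {dip_start} (mean {dip_mean:.2f})")
```

Here the overall average looks moderate, but the window covering the third and fourth tokens stands out, which is the kind of local dip worth inspecting.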

“A logprob score closer to 0 indicates higher model confidence in its choice.”
Google Developers Blog, “Unlock Gemini’s reasoning with logprobs,” July 2025

For RAG hallucinations, we care about where the model is uncertain, not just whether the entire answer was low-confidence. That means we’ll:

Compute global confidence (average logprob across the answer) for simple allow/deny.
Compute span-level confidence to flag specific sentences or facts as dubious.

How to get logprobs from modern LLM APIs

OpenAI Chat Completions (gpt‑4.1, gpt‑5.x)

As of 2025, OpenAI’s Chat Completions endpoint supports logprobs and top_logprobs for most chat models:

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4.1-2025-04-14",
    messages=[
        {"role": "system", "content": "You are a RAG answerer. Cite sources."},
        {"role": "user", "content": user_question_with_context},
    ],
    logprobs=True,        # return logprobs for each output token
    top_logprobs=5,       # also return 5 most likely alternatives per token
    max_completion_tokens=512,
    temperature=0.2,
)

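choice = completion.choices[0]
# In the official Python SDK this is a typed object, so use attribute access;
# each entry exposes .token, .logprob, and .top_logprobs.
tokens = choice.logprobs.content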

Exact shapes vary by SDK, but typically each token structure exposes:

token or text
logprob (the chosen token’s log probability)
top_logprobs: a list/dict of alternative tokens and their logprobs
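
Because shapes differ between SDK objects and raw JSON, it can help to normalize each token into a plain dict before scoring. The sketch below is one possible approach; normalize_token is a hypothetical helper, and the sample object is a stand-in built with SimpleNamespace rather than a real API response:

```python
from types import SimpleNamespace

def normalize_token(tok):
    """Convert an SDK token object or a plain dict into a uniform dict."""
    def field(obj, name, default=None):
        # Support both dict-style (raw JSON) and attribute-style (SDK object) access.
        if isinstance(obj, dict):
            return obj.get(name, default)
        return getattr(obj, name, default)

    alternatives = field(tok, "top_logprobs") or []
    return {
        "token": field(tok, "token", ""),
        "logprob": field(tok, "logprob"),
        "top_logprobs": [
            {"token": field(alt, "token", ""), "logprob": field(alt, "logprob")}
            for alt in alternatives
        ],
    }

# Stand-in for one entry of the per-token list (values are made up).
fake_token = SimpleNamespace(
    token=" Paris",
    logprob=-0.02,
    top_logprobs=[
        SimpleNamespace(token=" Paris", logprob=-0.02),
        SimpleNamespace(token=" Lyon", logprob=-4.1),
    ],
)

normalized = normalize_token(fake_token)
print(normalized["token"], normalized["logprob"], len(normalized["top_logprobs"]))
```

Downstream scoring code can then rely on t["token"] and t["logprob"] regardless of which provider produced the response.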

Gemini 2.5 on Vertex AI

Google’s 2025 Gemini logprobs guide shows how to enable log probabilities via response_logprobs and logprobs in GenerateContentConfig:

from google import genai
from google.genai.types import GenerateContentConfig

client = genai.Client(vertexai=True, project=PROJECT_ID, location="global")
MODEL_ID = "gemini-2.5-flash"

response = client.models.generate_content(
    model=MODEL_ID,
    contents=rag_prompt,
    config=GenerateContentConfig(
        response_logprobs=True,
        logprobs=5,  # number of alternative tokens to return
    ),
)

logprobs_result = response.candidates[0].logprobs_result
chosen = logprobs_result.chosen_candidates  # token-level logprobs

Google’s example explicitly uses this for “quantitative RAG evaluation”, correlating higher average logprobs with better retrieval quality (good vs poor vs no retrieval).
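
As a rough sketch of that kind of evaluation, the helper below averages the chosen-token logprobs from a Gemini response. It assumes each entry in chosen_candidates exposes a log_probability attribute, as in the google-genai types at the time of writing; verify this against your SDK version. The sample data is mocked with SimpleNamespace so the snippet runs without credentials:

```python
from types import SimpleNamespace

def gemini_average_logprob(chosen_candidates):
    """Average the chosen-token logprobs; returns None when nothing is available."""
    values = [
        c.log_probability
        for c in chosen_candidates
        if getattr(c, "log_probability", None) is not None
    ]
    if not values:
        return None
    return sum(values) / len(values)

# Mocked stand-in for response.candidates[0].logprobs_result.chosen_candidates.
mock_chosen = [
    SimpleNamespace(token="The", log_probability=-0.1),
    SimpleNamespace(token=" answer", log_probability=-0.3),
    SimpleNamespace(token=" is", log_probability=-2.0),
]

print(gemini_average_logprob(mock_chosen))
```

Comparing this average across prompts with good, poor, and no retrieved context gives a quick first read on whether the signal separates those cases for your data.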

Local LLMs, vLLM, and others

For self-hosted RAG, many runtimes now expose logprobs:

Ollama (v0.12.11 and later) returns logprobs in both its native and OpenAI-compatible APIs.
vLLM supports prompt_logprobs and output logprobs via its Python and HTTP interfaces.
Text Generation Inference (TGI) and other inference servers typically expose logits or logprobs in streaming metadata.

Implementation details differ, but the detection logic you’ll implement is the same: iterate over output tokens, aggregate logprobs, and compute scores.

Architecture: adding a logprobs-based hallucination detector to RAG

Here’s how a production-ready RAG pipeline with logprobs-based hallucination detection fits together:

[Figure: High-level RAG architecture with a logprobs-based hallucination detector. A user query passes through a retriever and vector database to an LLM that returns logprobs; a confidence scoring module computes sequence- and span-level scores; a decision layer then returns the answer, asks for clarification, or falls back to a safe response.]

The key addition is a confidence scoring module that:

Consumes tokens and logprobs from the LLM.
Aligns tokens to sentences or spans.
Computes multiple scores (global, sentence-level, margin/entropy).
Feeds a simple decision policy (allow, clarify, or block & fallback).

Step-by-step flow

User query comes in.
Retriever fetches top-k relevant chunks from a vector DB.
You build a RAG prompt with instructions, context, and the question.
You call the LLM with logprobs enabled (logprobs=True, top_logprobs or Gemini’s response_logprobs).
The LLM returns answer text + token logprobs.
Your confidence module parses tokens and computes:
- Average logprob for the whole answer.
- Average logprob per sentence/span.
- Optional: entropy/margin scores per token or span.
A decision layer applies thresholds:
- High confidence → return as-is.
- Medium confidence → add disclaimer or ask user to confirm.
- Low confidence → return safe fallback or say “I don’t know,” log for review.

Next we’ll get concrete with code and thresholds.

Implementing a simple logprobs-based RAG hallucination detector

1. Request logprobs from your LLM

Using OpenAI’s chat completions as an example (Python SDK, 2025):

from openai import OpenAI
import math

client = OpenAI()

def generate_with_logprobs(context_chunks, question):
    context = "\n\n".join(context_chunks)
    rag_prompt = (
        "You are a RAG assistant. Answer using ONLY the information in the context. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    completion = client.chat.completions.create(
        model="gpt-4.1-2025-04-14",
        messages=[{"role": "user", "content": rag_prompt}],
        logprobs=True,
        top_logprobs=5,
        max_completion_tokens=256,
        temperature=0.2,
    )

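    choice = completion.choices[0]
    text = choice.message.content
    # The SDK returns typed objects, not dicts. Convert them to plain dicts
    # so the helpers below can use t["token"] and t["logprob"].
    token_infos = [
        {
            "token": t.token,
            "logprob": t.logprob,
            "top_logprobs": [
                {"token": alt.token, "logprob": alt.logprob}
                for alt in (t.top_logprobs or [])
            ],
        }
        for t in choice.logprobs.content
    ]

    return text, token_infos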

2. Compute sequence- and span-level confidence

Next, convert per-token logprobs into useful scores. The simplest is the average logprob over the full answer.

def average_logprob(tokens):
    # tokens: list of dicts with 'logprob' or similar
    lp_values = [t["logprob"] for t in tokens if t.get("logprob") is not None]
    if not lp_values:
        return None
    return sum(lp_values) / len(lp_values)

def logprob_to_probability(lp):
    if lp is None:
        return None
    return math.exp(lp)  # convert log p back to p in [0,1]

For hallucination detection, you should go one level deeper and score sentences. A simple approach:

Join tokens to reconstruct the answer text.
Split the answer into sentences (with a robust sentence splitter).
Align tokens to sentences by character position.
Compute average logprob per sentence.

import re

def split_sentences(text):
    # naive but serviceable; swap for spaCy or similar in prod
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

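def sentence_confidences(text, tokens):
    sentences = split_sentences(text)
    confidences = []
    sent_cursor = 0   # where to search for the next sentence in text
    char_index = 0    # where to search for the next token in text
    token_idx = 0

    for sent in sentences:
        # Locate the sentence independently of the token cursor, so a token
        # that spans a sentence boundary cannot make the lookup fail.
        sent_start = text.find(sent, sent_cursor)
        sent_end = sent_start + len(sent)
        sent_cursor = sent_end

        sent_tokens = []
        while token_idx < len(tokens):
            tok = tokens[token_idx]
            tok_text = tok["token"]  # provider-specific
            tok_start = text.find(tok_text, char_index)
            if tok_start == -1:
                # Token text not found (e.g., a partial-byte token): skip it
                # instead of stalling alignment for every later sentence.
                token_idx += 1
                continue
            if tok_start >= sent_end:
                break  # this token belongs to the next sentence
            sent_tokens.append(tok)
            char_index = tok_start + len(tok_text)
            token_idx += 1

        avg_lp = average_logprob(sent_tokens)
        confidences.append({"sentence": sent, "avg_logprob": avg_lp})

    return confidences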

This doesn’t need to be perfect; you mostly want relative differences between sentences. Sentences with much lower average logprobs are prime hallucination suspects.

3. Define practical thresholds

Thresholds should be tuned per model and domain using held-out data, but you can start with pragmatic defaults. Vellum’s 2024 logprobs guide and Google’s Gemini RAG example both show clear separation between “good retrieval” and “no retrieval” when comparing average logprobs.

Metric	Typical signal	Example heuristic
Average logprob (answer)	Higher (closer to 0) when retrieval is good	If < −2.5, downgrade or block answer
Average logprob (sentence)	Outlier sentences often dubious	If a sentence's average is 0.8 or more below the answer average, mark it as low-confidence
Token-level margin	Small gap between top-2 tokens = ambiguity	If many tokens have margin < 0.5, treat span as uncertain
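
The margin and entropy signals in the table can be computed from the alternatives returned via top_logprobs. The following is a minimal sketch under two assumptions: the margin is measured in logprob units (so the 0.5 threshold is a logprob gap), and entropy is computed only over the returned top-k alternatives after renormalizing them, which makes it an approximation of the full-vocabulary entropy. All function names are illustrative:

```python
import math

def token_margin(top_logprobs):
    """Gap between the two most likely alternatives, in logprob units."""
    ranked = sorted(top_logprobs, reverse=True)
    if len(ranked) < 2:
        return None
    return ranked[0] - ranked[1]

def truncated_entropy(top_logprobs):
    """Entropy (in nats) over the returned alternatives only, renormalized to sum to 1."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    if total == 0:
        return None
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertain_fraction(tokens_top_logprobs, margin_threshold=0.5):
    """Share of tokens whose top-2 margin falls below the threshold."""
    margins = [token_margin(alts) for alts in tokens_top_logprobs]
    margins = [m for m in margins if m is not None]
    if not margins:
        return None
    return sum(m < margin_threshold for m in margins) / len(margins)

# Made-up alternatives for two tokens: one clear-cut, one ambiguous.
confident = [math.log(0.97), math.log(0.02), math.log(0.01)]
ambiguous = [math.log(0.45), math.log(0.40), math.log(0.15)]

print(token_margin(confident), token_margin(ambiguous))
print(truncated_entropy(confident), truncated_entropy(ambiguous))
print(uncertain_fraction([confident, ambiguous]))
```

A span where a large share of tokens has a small margin is a candidate for the "uncertain" label, even if the chosen tokens themselves have acceptable logprobs.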

Here’s how you might implement a simple decision function:

def classify_answer_confidence(text, tokens,
                               low_lp_threshold=-2.5,
                               sentence_lp_gap=0.8):
    avg_lp = average_logprob(tokens)
    if avg_lp is None:
        return {"level": "unknown", "reason": "no_logprobs"}

    sent_scores = sentence_confidences(text, tokens)

    # count very low-confidence sentences
    low_sentences = [
        s for s in sent_scores
        if s["avg_logprob"] is not None
        and (avg_lp - s["avg_logprob"]) >= sentence_lp_gap
    ]

    if avg_lp < low_lp_threshold or len(low_sentences) >= 2:
        level = "low"
    elif low_sentences:
        level = "medium"
    else:
        level = "high"

    return {
        "level": level,
        "answer_avg_logprob": avg_lp,
        "sentence_scores": sent_scores,
        "low_conf_sentences": low_sentences,
    }

4. Wire confidence into your RAG UX

Once you have a confidence classification, integrate it into your response policy:

High confidence:
- Return the answer normally, with citations.
- Optionally log score for monitoring.
Medium confidence:
- Return the answer but visually flag low-confidence sentences (e.g., yellow highlight, tooltip “model uncertain here”).
- Append a soft disclaimer: “Some parts of this answer may be uncertain; please verify critical details.”
- Log the example for offline review and threshold tuning.
Low confidence:
- Either:
  - Answer: “I’m not confident I can answer this from the provided sources” and show retrieved context, or
  - Fallback to a simpler, safe template (e.g., “This information is not available in our docs”).
- Optionally trigger:
  - A second-pass query with different retrieval settings, or
  - Escalation to human support (for customer-facing apps).

def answer_with_hallucination_guard(context_chunks, question):
    text, tokens = generate_with_logprobs(context_chunks, question)
    conf = classify_answer_confidence(text, tokens)

    if conf["level"] == "low":
        return {
            "answer": "I'm not confident I can answer this from the provided sources.",
            "confidence": conf,
            "status": "fallback",
        }
    elif conf["level"] == "medium":
        return {
            "answer": text,
            "confidence": conf,
            "status": "warn",
        }
    else:
        return {
            "answer": text,
            "confidence": conf,
            "status": "ok",
        }

This pattern lets you fail gracefully instead of confidently hallucinating, which is exactly what you want in production RAG.

Best practices and pitfalls when using logprobs for hallucination detection

1. Calibrate per model and domain

Logprob scales are model-dependent. A logprob of −2.0 on Gemini 2.5 Flash may not mean the same thing as −2.0 on GPT‑5.1. Before enforcing thresholds:

Collect a labeled dataset of RAG answers (correct vs hallucinated) from your domain.
Compute average logprob and span-level scores.
Plot distributions and choose thresholds that balance false positives/negatives.
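
One simple way to carry out these steps is to sweep candidate thresholds over your labeled set and compare precision and recall for catching hallucinations. The sketch below uses a tiny invented dataset; the function names and the choice of F1 as the selection criterion are assumptions, and you may prefer to weight false negatives more heavily in high-risk domains:

```python
def sweep_thresholds(scores, labels, candidates):
    """Evaluate each threshold: answers scoring below it are flagged as hallucinated.

    scores: average logprob per answer; labels: True if the answer was hallucinated.
    """
    results = []
    for thr in candidates:
        flagged = [s < thr for s in scores]
        tp = sum(f and l for f, l in zip(flagged, labels))
        fp = sum(f and not l for f, l in zip(flagged, labels))
        fn = sum((not f) and l for f, l in zip(flagged, labels))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        results.append({"threshold": thr, "precision": precision,
                        "recall": recall, "f1": f1})
    return results

def best_threshold(results):
    """Pick the threshold with the highest F1 (first one wins on ties)."""
    return max(results, key=lambda r: r["f1"])["threshold"]

# Invented labeled data: average logprob per answer and whether it was hallucinated.
scores = [-0.3, -0.5, -0.9, -1.8, -2.2, -3.1]
labels = [False, False, False, True, False, True]

results = sweep_thresholds(scores, labels, candidates=[-3.0, -2.0, -1.5, -1.0])
for row in results:
    print(row)
print("chosen threshold:", best_threshold(results))
```

With real data you would use hundreds of labeled answers and hold some out to check that the chosen threshold generalizes.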

2. Combine logprobs with retrieval signals

Logprobs alone don’t “know” whether a fact is in your corpus; they only quantify model confidence. For RAG hallucinations, you get best results by combining:

Retrieval quality: e.g., similarity scores, coverage of entities in retrieved docs.
Answer–context overlap: lexical/semantic similarity between answer and retrieved passages.
Logprobs-based confidence: as described above.

For example, you might trust a high-logprob answer only when retrieval scores are strong and the answer quotes or closely paraphrases the context.
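
A minimal sketch of such a combined gate follows. The word-overlap measure is a deliberately crude stand-in for answer-context similarity, and the function names and all three thresholds are hypothetical values you would tune on your own data:

```python
import re

def token_overlap(answer, context):
    """Fraction of distinct answer words that also appear in the context."""
    answer_words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

def trust_answer(avg_logprob, retrieval_score, overlap,
                 min_logprob=-1.0, min_retrieval=0.75, min_overlap=0.6):
    """Trust the answer only when all three signals clear their thresholds."""
    return (avg_logprob >= min_logprob
            and retrieval_score >= min_retrieval
            and overlap >= min_overlap)

context = "The warranty covers parts and labor for 24 months."
grounded = "The warranty covers parts and labor for 24 months."
fabricated = "Refunds are issued within 10 business days."

# Both answers are fluent and (hypothetically) high-confidence, but only one is grounded.
print(trust_answer(-0.2, 0.9, token_overlap(grounded, context)))
print(trust_answer(-0.2, 0.9, token_overlap(fabricated, context)))
```

The second case illustrates why logprobs alone are not enough: a confident, fluent answer can still have no support in the retrieved passages.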

3. Beware of stylistic and boilerplate tokens

Tokens like “However,” “In conclusion,” or generic safety phrases often have high logprobs but add no factual value. To avoid inflated scores:

Exclude leading/trailing boilerplate from scoring windows.
Weight content words (nouns, numbers, entities) higher in your span-level averages.
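
One way to apply both ideas is a weighted average that gives zero weight to filler and extra weight to numbers and capitalized words. The sketch below is an assumption-heavy illustration: the BOILERPLATE set, the weights, and the capitalization heuristic for entities are all placeholders, and a part-of-speech tagger or NER model would do this more reliably:

```python
# Hypothetical filler list; extend it for your domain and language.
BOILERPLATE = {"however", "in", "conclusion", "the", "a", "an",
               "of", "and", "is", "to"}

def token_weight(token_text):
    """Assign a scoring weight to a token based on crude surface features."""
    word = token_text.strip().lower().strip(".,;:!?")
    if not word or word in BOILERPLATE:
        return 0.0   # whitespace, punctuation, or filler: ignore
    if any(ch.isdigit() for ch in word):
        return 2.0   # numbers are frequent sites of fabrication
    if token_text.strip()[:1].isupper():
        return 1.5   # capitalized: rough proxy for a named entity
    return 1.0

def weighted_average_logprob(tokens):
    """Weighted mean of token logprobs; returns None if every weight is zero."""
    weighted = [(token_weight(t["token"]), t["logprob"]) for t in tokens]
    total_weight = sum(w for w, _ in weighted)
    if total_weight == 0:
        return None
    return sum(w * lp for w, lp in weighted) / total_weight

# Made-up tokens: confident filler around one uncertain number.
tokens = [
    {"token": "However", "logprob": -0.01},
    {"token": ",", "logprob": -0.01},
    {"token": " the", "logprob": -0.02},
    {"token": " fee", "logprob": -0.4},
    {"token": " is", "logprob": -0.05},
    {"token": " 42", "logprob": -3.0},
]

plain = sum(t["logprob"] for t in tokens) / len(tokens)
weighted = weighted_average_logprob(tokens)
print(f"plain={plain:.2f} weighted={weighted:.2f}")
```

In this example the plain average hides the shaky number behind confident filler, while the weighted score surfaces it.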

4. Streaming considerations

If you stream outputs to users, you can still track logprobs token-by-token:

Buffer tokens and logprobs in memory.
Compute rolling averages as the answer unfolds.
If the rolling average drops below your low-confidence threshold early, you can:
- Stop the stream,
- Switch to a fallback answer, or
- Append a prominent warning at the end.
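
The buffering and rolling-average steps can be sketched as a small monitor that sits between the provider's stream and the user. RollingConfidenceMonitor and stream_with_guard are hypothetical names, the stream is simulated as (token, logprob) pairs, and the window size and threshold are placeholders to tune:

```python
from collections import deque

class RollingConfidenceMonitor:
    """Tracks a rolling average of token logprobs over a fixed-size window."""

    def __init__(self, window=20, threshold=-2.5, min_tokens=5):
        self.window = deque(maxlen=window)
        self.threshold = threshold
        self.min_tokens = min_tokens  # avoid reacting to the first few tokens

    def update(self, logprob):
        """Record one token's logprob; return True if the stream should stop."""
        self.window.append(logprob)
        if len(self.window) < self.min_tokens:
            return False
        return sum(self.window) / len(self.window) < self.threshold

def stream_with_guard(token_stream, monitor):
    """Emit tokens until the rolling average drops below the threshold."""
    emitted = []
    for token_text, logprob in token_stream:
        if monitor.update(logprob):
            return "".join(emitted), "stopped"
        emitted.append(token_text)
    return "".join(emitted), "completed"

# Simulated stream of (token, logprob) pairs with a run of uncertain tokens.
simulated = [("The", -0.1), (" fee", -0.2), (" is", -0.1), (" about", -3.5),
             (" 42", -4.0), (" dollars", -4.5), (".", -0.1)]

monitor = RollingConfidenceMonitor(window=3, threshold=-2.5, min_tokens=3)
result = stream_with_guard(simulated, monitor)
print(result)
```

When the status is "stopped", the caller can replace the partial text with a fallback answer or append a warning, depending on how much has already been shown to the user.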

5. Respect provider limitations

Finally, check each provider’s current limitations:

Some models (e.g., certain reasoning or vision variants) may not support logprobs.
Anthropic’s Claude API reportedly did not expose token logprobs as of late 2024; if that still holds when you read this, you’d need to use a different provider or host an open model.
High top_logprobs values add response size; keep them modest (e.g., 3–5) in production.

Conclusion: from “black box” RAG to monitored, trustworthy systems

Token-level logprobs turn your RAG system from a blind generator into a measurable, monitorable service. By exposing the model’s own uncertainty, you can:

Detect low-confidence, likely hallucinated answers before users see them.
Flag specific sentences or spans as dubious instead of rejecting whole answers.
Calibrate thresholds per model and domain using real production traffic.
Combine confidence with retrieval quality to build robust, production-ready RAG.

To apply this today:

Enable logprobs in your current LLM API (OpenAI, Gemini, or your local stack).
Implement the simple scoring and thresholding patterns above.
Log confidence metrics alongside each RAG response and iterate on thresholds.

As models and APIs continue to evolve through 2025, treating logprobs as a first-class signal in your architecture is one of the simplest, highest-leverage steps you can take to keep RAG hallucinations from undermining your system’s credibility.
