Geometric Deviation: Detecting LLM Hallucinations Before Generation
New research probes LLM internal representations to detect unanswerable questions before generation begins, using geometric deviation as an unsupervised reliability signal that could reduce hallucinations without retraining.
A new arXiv preprint proposes a novel approach to one of the most stubborn problems in large language models: knowing when not to answer. Titled "Geometric Deviation as an Unsupervised Pre-Generation Reliability Signal: Probing LLM Representations for Answerability," the paper explores whether the geometry of an LLM's internal hidden states can reveal — before a single token is generated — whether the model is likely to hallucinate or produce an unreliable response.
The Core Idea: Listen to the Model Before It Speaks
Most hallucination mitigation strategies operate after generation: post-hoc fact-checking, retrieval verification, self-consistency sampling, or fine-tuned refusal classifiers. These approaches are computationally expensive and reactive. The authors argue that the model itself encodes uncertainty about answerability in its representation space, and that this uncertainty can be detected geometrically — without supervision, labels, or retraining.
The intuition is straightforward. When an LLM processes a well-formed, answerable query, the resulting hidden state aligns with familiar regions of representation space shaped during pretraining. When it encounters something it cannot reliably answer — an unanswerable factual question, a malformed prompt, an out-of-distribution topic — its internal representation deviates from those typical manifolds. Measuring that deviation gives a pre-generation reliability signal.
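To make the setup concrete, the probe operates on representations that any open-weight LLM already exposes. The sketch below is a minimal illustration, not the paper's code: it uses the Hugging Face transformers library to read the final-layer hidden state of the last prompt token before any decoding begins. The model name and the choice of layer and pooling are assumptions made for the example.

```python
# Sketch: extract a pre-generation representation for a prompt.
# Assumptions: any causal LM from Hugging Face transformers; we pool the
# final-layer hidden state of the last prompt token as the query embedding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper's models may differ

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def prompt_embedding(prompt: str) -> torch.Tensor:
    """Return the final-layer hidden state of the last prompt token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, dim]
    last_layer = outputs.hidden_states[-1]
    return last_layer[0, -1, :]  # representation of the final prompt token
```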
Why Pre-Generation Matters
Pre-generation signals are valuable for several reasons:
- Compute efficiency: Aborting or rerouting an unanswerable query before decoding saves the cost of generating tokens that will be discarded or corrected.
- Safety: In high-stakes deployments — medical, legal, journalistic — refusing to answer is often safer than generating plausible fabrication.
- Routing: A reliability signal can trigger retrieval-augmented generation (RAG), tool use, or escalation to a stronger model only when needed.
- Authenticity: For synthetic media pipelines that rely on LLMs for captioning, scripting, or fact assembly, knowing when the model is on shaky ground reduces downstream misinformation risk.
Geometric Deviation as a Probe
The method treats the LLM's hidden state as a point in a high-dimensional representation space and measures how far that point sits from the typical distribution induced by answerable inputs. Concretely, this involves comparing activations against a reference set of "in-distribution" representations and computing distance metrics (likely cosine, Mahalanobis, or density-based measures) that capture deviation from the learned answerability manifold.
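The paper's exact implementation isn't reproduced here, but a minimal sketch of this kind of probe, assuming Mahalanobis distance against a reference set of answerable-prompt embeddings, might look like this:

```python
# Sketch: score a query embedding by its Mahalanobis distance from a
# reference set of "in-distribution" (answerable) prompt embeddings.
# Mahalanobis is one illustrative metric choice, not the paper's exact method.
import numpy as np

class GeometricDeviationProbe:
    def __init__(self, reference_embeddings: np.ndarray, ridge: float = 1e-3):
        # reference_embeddings: [num_prompts, hidden_dim]
        self.mean = reference_embeddings.mean(axis=0)
        cov = np.cov(reference_embeddings, rowvar=False)
        # Ridge term keeps the covariance invertible when num_prompts < hidden_dim.
        cov += ridge * np.eye(cov.shape[0])
        self.precision = np.linalg.inv(cov)

    def score(self, embedding: np.ndarray) -> float:
        """Higher score = larger deviation from the answerable reference set."""
        delta = embedding - self.mean
        return float(np.sqrt(delta @ self.precision @ delta))
```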
Because the approach is unsupervised, it does not require labeled answerable/unanswerable pairs. This is a meaningful practical advantage: labeling answerability at scale is expensive and subjective, and supervised refusal classifiers tend to overfit to the prompt distributions they were trained on. A geometric probe, by contrast, leverages the model's own representational structure.
Connections to Broader Reliability Research
This work sits in a growing line of research that treats LLM internals as a diagnostic surface. Related directions include semantic entropy estimation, activation-based hallucination detectors, linear probes for truthfulness, and uncertainty quantification through embedding-space analysis. What distinguishes the geometric deviation approach is its focus on pre-generation answerability rather than post-hoc factuality, and its commitment to label-free deployment.
For practitioners building production systems, the implications are concrete. A lightweight probe attached to an existing LLM could flag risky queries in real time, route them to RAG or human review, and log them for dataset curation. This is particularly relevant for AI-assisted content authentication systems, where an LLM that confidently fabricates verification details is worse than one that abstains.
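As a hypothetical illustration of that routing pattern (the threshold value and the handler functions are invented placeholders, not the paper's design):

```python
# Sketch: route a query based on the probe's deviation score.
# The threshold needs calibration on held-out traffic for a given model and
# layer; embed_fn, llm_answer, rag_answer, and log_fn are caller-supplied.
DEVIATION_THRESHOLD = 30.0  # example value only

def handle_query(prompt: str, embed_fn, probe, llm_answer, rag_answer, log_fn) -> str:
    deviation = probe.score(embed_fn(prompt))
    if deviation > DEVIATION_THRESHOLD:
        log_fn(prompt, deviation)      # keep flagged prompts for dataset curation
        return rag_answer(prompt)      # escalate to a retrieval-augmented path
    return llm_answer(prompt)          # answer directly when in-distribution
```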
Open Questions
Several questions remain for follow-up work. How does geometric deviation correlate with actual hallucination rates across model scales? Does the signal generalize across architectures (dense transformers, MoE models, multimodal systems)? Can adversarial prompts be crafted to spoof the geometric signature of answerability? And how does the probe interact with instruction-tuned and RLHF-aligned models, whose representations are reshaped by alignment training?
Implications for Synthetic Media and Authenticity
For the digital authenticity community, reliability signals like this matter beyond pure NLP. LLMs increasingly drive metadata generation, provenance descriptions, and AI-assisted moderation. A pre-generation answerability check is a small but meaningful primitive in building synthetic media pipelines that know what they don't know — an essential property for any system claiming to support authenticity rather than erode it.
Stay informed on AI video and digital authenticity. Follow Skrew AI News.