LLM Bias

Name Swaps Expose Hidden Bias in LLM Judgments

A new paper shows that changing only names in prompts can flip LLM verdicts, revealing systematic bias through intervention consistency tests. The findings matter for AI moderation, authenticity review, and automated decision systems.

Editorial Team

20 Mar 2026 — 4 min read

Large language models are increasingly used to evaluate content, rank outputs, moderate user submissions, and even act as automated judges in workflows that touch media trust and authenticity. A new paper, When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making, argues that these systems can produce meaningfully different decisions when a prompt is changed in only one narrow way: the names of the people involved.

The core idea is intervention consistency. If an LLM is making a decision based on the substance of a scenario, then swapping identity markers such as names should not change the verdict unless those names carry genuinely relevant information. When a model’s answer shifts anyway, the paper frames that as evidence of systematic bias rather than harmless variability.

A simple test for a serious problem

The authors focus on a causal-style intervention: hold the scenario constant, then alter only the names attached to the participants. In principle, this should be one of the cleanest robustness checks available for decision systems. If the model changes its ruling, confidence score, or recommendation, that suggests the system is relying on spurious associations learned during training.

This is an important distinction. A lot of LLM reliability work looks at prompt sensitivity in broad terms, where multiple wording changes may affect multiple latent cues at once. By contrast, a name-swap intervention is deliberately narrow. It isolates one attribute and asks whether that attribute improperly influences the outcome.

That makes the work relevant well beyond standard fairness debates. In synthetic media pipelines, language models are increasingly used upstream and downstream: to classify content, route moderation cases, prioritize abuse reports, draft trust-and-safety rationales, and assist human reviewers. If those systems are sensitive to identity signals that should be irrelevant, they can distort authenticity decisions at scale.

Why this matters for AI video and digital authenticity

At first glance, a paper about LLM verdicts may seem adjacent to Skrew AI News rather than central. But the link to digital authenticity is strong. Modern synthetic media systems are not only generators; they are surrounded by decision layers. Those layers determine which videos are flagged as manipulated, which voice clones are escalated for review, which identity claims are treated as credible, and which accounts are seen as risky.

In practice, many of those layers involve LLM-assisted reasoning. A moderation system might summarize a complaint about a suspected deepfake. An authenticity tool might use an LLM to interpret metadata, contextual claims, or user reports. A platform trust team might deploy an LLM-as-judge setup to score appeals or classify policy severity. If a name change can alter the model’s decision, then identity-linked bias may be entering systems that users assume are neutral.

That is especially consequential in fraud, impersonation, and reputational harm cases. Deepfake incidents often involve real people, disputed context, and incomplete evidence. Automated assistants in these workflows need to be stable under minor interventions. Otherwise, the same manipulated clip could receive a different handling path depending on whose name appears in the complaint, caption, or case notes.

From prompt sensitivity to measurable fairness checks

The paper’s contribution is also methodological. Rather than discussing bias in abstract terms, it points toward an operational test that can be built into evaluation suites. Intervention consistency can be measured, benchmarked, and compared across models. That gives developers a concrete way to stress-test systems before deployment.

For teams building authenticity products, this matters because many current benchmarks emphasize accuracy, not invariance. A detector or judge may perform well on aggregate metrics while still producing unstable outcomes across identity-coded inputs. The intervention-consistency framing encourages a richer view of model quality: not just whether a system gets the answer right on average, but whether it gets there for the right reasons.

That has direct implications for enterprise procurement and risk management. Vendors selling AI moderation or authenticity tooling may increasingly need to show not only performance claims, but also evidence that their models are robust to controlled identity interventions. In regulated or high-liability settings, that kind of documentation could become as important as standard benchmark scores.

A governance signal for AI-assisted review systems

The paper also arrives at a moment when LLMs are being used as judges more broadly, from internal quality scoring to customer-facing decision support. That trend creates efficiency, but it also creates a governance problem: automated decisions can inherit hidden priors from training data and present them with a veneer of consistency.

Intervention consistency offers a way to audit that risk. It is not a complete fairness solution, but it is a useful one because it is intuitive, testable, and hard to dismiss. If changing only the name changes the verdict, then something is wrong with the decision process.

For the synthetic media sector, that insight is highly practical. As companies race to deploy AI systems that generate, label, detect, and adjudicate content, the trustworthiness of the surrounding decision stack matters as much as the generator or detector itself. Bias in those supporting models can undermine appeals, moderation quality, and user confidence in digital authenticity claims.

This is the kind of research that deserves attention because it sharpens a hard question for the entire AI ecosystem: when an automated system makes a judgment, is it responding to evidence, or to incidental identity cues embedded in the prompt? For anyone building AI systems around media integrity, that is not an academic edge case. It is a deployment issue.

View Source

Stay informed on AI video and digital authenticity. Follow Skrew AI News.