Anti-Sycophancy Enforcement: Why AI Confidence Uniformity Is the Real Problem
Most general-purpose AI assistants share the same failure mode. Not hallucination — that problem is visible and increasingly addressed. The deeper failure is uniformity: every claim delivered at the same rhetorical temperature, with the same confidence, in the same authoritative voice. A factual restatement of a file’s contents reads identically to a speculative interpretation of what those contents imply. The user receives no signal about which statements to trust and which to scrutinize.
This is what we call confidence sycophancy — and it may be more consequential than the agreement sycophancy the field has spent years studying.
Confidence Sycophancy vs Agreement Sycophancy
The sycophancy literature — Perez et al. (2023), Sharma et al. (2023) — defines the problem primarily as models telling users what they want to hear. Ask a model “Is my business plan good?” and it will validate rather than critique. That is agreement sycophancy: distortion of content to match user preferences.
Confidence sycophancy operates on a different axis. It distorts epistemic framing — presenting all outputs as equally trustworthy regardless of their evidential basis. An AI response that restates verified tool output looks and sounds identical to one that extrapolates beyond available evidence. Both forms erode the user’s capacity for independent judgment, but they require different interventions. Agreement sycophancy needs better training incentives. Confidence sycophancy needs structural enforcement.
The distinction matters even though both phenomena likely share a common root in RLHF: a shared cause does not imply a shared remedy. Sharma et al. (2023) demonstrated that RLHF-trained models exhibit sycophantic behavior shaped by annotator preferences. We suspect the same dynamic drives confidence uniformity. Annotators tend to reward authoritative-sounding responses, and the training process selects for rhetorical uniformity because uniform confidence feels authoritative. The result is a system that has learned to sound equally certain about everything.
The Evidence: Cognitive Surrender at Scale
Shaw & Nave (2026) provided the empirical grounding. Across three preregistered experiments (N = 1,372; 9,593 trials), participants completed adapted Cognitive Reflection Test items with access to AI guidance. The findings:
| Finding | Measure | Implication |
|---|---|---|
| Majority consultation | >50% of trials involved AI consultation | Users consistently delegate cognitive work |
| Accuracy dependence | ~+25pp (AI-Accurate) / ~-15pp (AI-Faulty) | Users follow AI regardless of correctness |
| Effect magnitude | Cohen’s h = 0.81 | Large behavioral effect, not marginal |
| Robustness | Effect held across dispositional moderators | Not limited to low-expertise users |
The critical pattern is not any single metric but the behavioral signature: participants followed AI outputs regardless of whether those outputs were accurate or deliberately faulty. Shaw & Nave term this cognitive surrender — the delegation of epistemic evaluation to AI without maintaining oversight.
Their Tri-System Theory frames AI as System 3: external cognition operating outside the brain. Delegation to System 3 is fine when System 2 maintains oversight. It becomes cognitive surrender when the user cannot distinguish which outputs deserve scrutiny. Uniform confidence removes exactly that signal.
Three Layers, One Architecture
We address this as an enforcement problem, not a training problem. Three layers, each modeled on a mechanism from human social organization:
Layer 1: Conscience — an internal steering rule instructing the AI to tier every claim by epistemic confidence. The rule explains why tiering matters, what to tier (factual assertions, causal explanations, recommendations, predictions), and what to skip (procedural statements, direct tool output). Conscience is necessary but gameable — an AI can comply in form while violating in spirit.
Layer 2: Law — a formal specification defining four confidence tiers with exact formats:
| Tier | Format | Verification Standard |
|---|---|---|
| Established | (no qualifier) | Restates verifiable tool output directly |
| Informed | "Based on [source], …" | Source must exist in session history |
| Speculative | "I suspect [claim] — [basis]" | Basis must be stated with reasoning |
| Creative | "What if [claim] — connecting [A] to [B]" | Must name concepts being connected |
The specification also defines six anti-patterns (including “marker presence without substance” — prepending “Based on” to claims made regardless of any source) and requires a mandatory Unresolved Tensions section before any verification verdict.
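The tier formats in the table above lend themselves to pattern matching. The following is a minimal sketch of how marked claims might be pulled out of a response, assuming one claim per sentence and a naive sentence splitter. The names `Tier`, `TieredClaim`, `TIER_RULES`, and `extractTieredClaims` are hypothetical and the patterns are illustrative; the reference implementation may use different ones.

```typescript
// Hypothetical sketch: names and patterns are assumptions for illustration,
// not the reference implementation's actual API.
type Tier = "informed" | "speculative" | "creative";

interface TieredClaim {
  tier: Tier;
  marker: string; // cited source, stated basis, or connected concepts
  text: string;   // the sentence carrying the tier marker
}

interface TierRule {
  tier: Tier;
  pattern: RegExp;
  marker: (m: RegExpMatchArray) => string;
}

// Strip trailing sentence punctuation from a captured group.
const clean = (s: string): string => s.replace(/[.!?]+$/, "").trim();

// One rule per marked tier. \u2014 is the dash used in the tier formats.
const TIER_RULES: TierRule[] = [
  {
    tier: "informed",
    pattern: /^Based on ([^,]+),\s*(.+)$/,
    marker: (m) => clean(m[1]),
  },
  {
    tier: "speculative",
    pattern: /^I suspect (.+?)\s+\u2014\s+(.+)$/,
    marker: (m) => clean(m[2]),
  },
  {
    tier: "creative",
    pattern: /^What if (.+?)\s+\u2014\s+connecting (.+?) to (.+)$/,
    marker: (m) => `${clean(m[2])} <-> ${clean(m[3])}`,
  },
];

function extractTieredClaims(response: string): TieredClaim[] {
  // Naive split: sentence punctuation followed by whitespace, or newlines.
  const sentences = response
    .split(/[.!?]\s+|\n+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const claims: TieredClaim[] = [];
  for (const sentence of sentences) {
    for (const rule of TIER_RULES) {
      const match = sentence.match(rule.pattern);
      if (match) {
        claims.push({ tier: rule.tier, marker: rule.marker(match), text: sentence });
        break; // first matching tier wins
      }
    }
  }
  return claims;
}
```

Sentences that match no rule are simply not returned, which mirrors the Established tier's "no qualifier" format: in this sketch, an unmarked claim is indistinguishable from an Established one.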
Layer 3: Society — a deterministic external hook that runs after every AI response. It builds an evidence corpus from all non-AI transcript entries, extracts tiered claims via regex, and cross-verifies each Informed claim against the corpus.
No single layer is sufficient. Conscience without law produces inconsistent behavior. Law without enforcement produces format compliance without substance. Enforcement without conscience and law produces an adversarial dynamic where the system optimizes for passing checks rather than being honest.
Asymmetric Control: The Key Insight
Previous approaches to AI honesty rely on the AI self-reporting its confidence — asking the system to mark its own homework. Our enforcement hook checks the homework against an answer key the student did not write.
The AI controls what it claims. It does NOT control the evidence corpus.
The evidence corpus is constructed from tool outputs, user messages, and system messages — entries the AI did not author. The AI can cite “Based on the file contents” but it cannot retroactively insert content into the tool output history to make that citation verify.
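A minimal sketch of that asymmetry, assuming a transcript made of role-tagged entries. The `TranscriptEntry` shape, the role names, and `buildEvidenceCorpus` are hypothetical and only illustrate the idea that AI-authored text never enters the corpus.

```typescript
// Hypothetical transcript shape: role names are assumptions for this sketch.
type Role = "user" | "system" | "tool" | "assistant";

interface TranscriptEntry {
  role: Role;
  content: string;
}

// Build the evidence corpus from every entry the AI did not author.
// Lowercasing here is an assumed normalisation to simplify later matching.
function buildEvidenceCorpus(transcript: readonly TranscriptEntry[]): string[] {
  return transcript
    .filter((entry) => entry.role !== "assistant")
    .map((entry) => entry.content.toLowerCase());
}

// Example: the assistant asserts that a benchmark exists, but that
// assertion never becomes evidence because of who wrote it.
const exampleTranscript: TranscriptEntry[] = [
  { role: "user", content: "Why is the build slow?" },
  { role: "tool", content: "bundle size: 4.2 MB" },
  { role: "assistant", content: "The benchmark report shows a regression." },
];

const exampleCorpus = buildEvidenceCorpus(exampleTranscript);
```

In this example `exampleCorpus` holds the user message and the tool output only, so a later citation of "the benchmark report" has nothing to verify against.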
The cross-verification uses three strategies in sequence for each Informed claim:
- Direct substring match — the cited source text appears verbatim in the corpus
- Significant word overlap — >50% of non-stop-word terms appear in the corpus
- Specific reference patterns — numbers or proper nouns from the citation appear in the corpus
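The three strategies above can be sketched as a single function that tries each in order and reports which one succeeded. This is an illustrative sketch under stated assumptions: the stop-word list, the tokeniser, the pattern used for "numbers or proper nouns", and the name `verifyCitation` are all hypothetical, and the reference implementation may differ on each.

```typescript
// Hypothetical sketch: stop words, tokeniser, and names are assumptions.
const STOP_WORDS = new Set([
  "the", "a", "an", "of", "in", "on", "to", "and", "or", "is", "are",
  "was", "from", "for", "with", "that", "this",
]);

type Strategy = "substring" | "word_overlap" | "specific_reference";

// Lowercase alphanumeric tokens.
function tokens(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9_]+/g) ?? [];
}

function significantWords(text: string): string[] {
  return tokens(text).filter((w) => !STOP_WORDS.has(w));
}

// Returns the first strategy that verifies the cited source, or null.
function verifyCitation(source: string, corpus: readonly string[]): Strategy | null {
  const haystack = corpus.join("\n").toLowerCase();
  const needle = source.trim().toLowerCase();
  if (needle.length === 0) return null;

  // 1. Direct substring match.
  if (haystack.includes(needle)) return "substring";

  // 2. Significant word overlap: more than 50% of non-stop-word terms.
  const corpusTokens = new Set(tokens(haystack));
  const words = significantWords(needle);
  if (words.length > 0) {
    const hits = words.filter((w) => corpusTokens.has(w)).length;
    if (hits / words.length > 0.5) return "word_overlap";
  }

  // 3. Specific references: numbers or capitalised tokens from the citation.
  const refs = (source.match(/\b\d+\b|\b[A-Z][A-Za-z0-9]+\b/g) ?? [])
    .map((r) => r.toLowerCase())
    .filter((r) => !STOP_WORDS.has(r));
  if (refs.some((r) => corpusTokens.has(r))) return "specific_reference";

  return null;
}
```

With a corpus containing the tool output `Read src/config.ts: export const PORT = 8080;`, the citation "src/config.ts" verifies by substring, while "the literature" fails all three strategies. Note how permissive the third strategy is in this sketch: a single shared number is enough to pass.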
When violations are detected, the hook outputs a user-visible enforcement notice:
```
ENFORCEMENT (2/6 checks failed | corpus: 47 refs | receipt: a3f7c2b1...)
- informed_evidence: 1/3 Informed claims cite unverifiable sources: "the literature".
- speculative_basis: 1/2 Speculative claims lack pattern basis.
```
Layer 3 is what prevents Layers 1 and 2 from being gamed. Without it, an AI could adopt tier markers as rhetorical decoration. With it, “Based on the research” fails verification when no specific research appears in the tool output history.
Deterministic Workflows from Non-Deterministic Tools
A design principle runs through every layer of this architecture: creating deterministic outputs wherever possible.
The AI that generates the response is non-deterministic. The same prompt fed to the same model twice will usually produce different text, because temperature, sampling, and position in the context window all introduce variance. This is not a flaw; it is the nature of generative models. But verification cannot inherit that variance. If the system that checks claims is itself non-deterministic, you have not solved the trust problem; you have doubled it.
So we made a deliberate architectural choice: the enforcement layer uses only deterministic operations. Regex extraction. Substring matching. Word overlap counting. SHA-256 hashing. Given the same AI response and the same evidence corpus, the enforcement hook produces the same verdict every time. No model calls. No embeddings. No “let the AI judge itself.” The verification pipeline is a pure function.
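To make the "pure function" property concrete, here is a minimal sketch of a verdict that depends only on its inputs, with a SHA-256 receipt that lets a decision be replayed and compared. It assumes Node.js and its built-in `node:crypto` module; `Verdict`, `Check`, `computeReceipt`, and `verify` are hypothetical names, and how the reference implementation actually derives its receipt is not specified here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical types for this sketch.
interface Verdict {
  failedChecks: string[];
  receipt: string; // hex SHA-256 over the exact inputs
}

type Check = (response: string, corpus: readonly string[]) => boolean;

// Hash the response and every corpus entry. Each part is length-prefixed
// so that different splits of the same text produce different receipts.
function computeReceipt(response: string, corpus: readonly string[]): string {
  const hash = createHash("sha256");
  for (const part of [response, ...corpus]) {
    hash.update(`${Buffer.byteLength(part, "utf8")}:`, "utf8");
    hash.update(part, "utf8");
  }
  return hash.digest("hex");
}

// Pure function: no clock, no randomness, no model calls, no I/O.
// Checks run in sorted name order so the output order is stable too.
function verify(
  response: string,
  corpus: readonly string[],
  checks: Record<string, Check>,
): Verdict {
  const failedChecks = Object.entries(checks)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .filter(([, check]) => !check(response, corpus))
    .map(([name]) => name);
  return { failedChecks, receipt: computeReceipt(response, corpus) };
}
```

Replaying `verify` with the same response, corpus, and checks yields an identical verdict and receipt, which is the property that makes an enforcement decision auditable after the fact. Any change to the corpus changes the receipt.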
This matters more than it might seem. Consider the alternative: using an LLM to evaluate whether another LLM’s citations are accurate. The evaluator LLM introduces its own confidence sycophancy — it might judge citations as valid because they sound plausible, reproducing the exact failure mode the architecture exists to catch. Non-deterministic verification inherits the epistemic problems of the system it monitors.
The principle extends beyond this specific architecture. For a broad class of verification tasks, workflows built on top of non-deterministic tools — LLMs, generative models, stochastic processes — can still produce deterministic outputs at the verification layer. The inputs do not need to be deterministic for the verification to be. A non-deterministic tool generates a candidate; a deterministic process evaluates it. The generation is creative and variable; the evaluation is mechanical and repeatable. This separation is what makes the system auditable. You can replay any enforcement decision, and it will produce the same result — even though the AI response that triggered it could never be reproduced exactly.
The enforcement hook’s determinism is also what makes it resistant to a class of attacks that would defeat model-based verification. An adversary who understands how an LLM-based judge works can craft responses that exploit the judge’s own biases. An adversary who understands how regex extraction and substring matching work faces a narrower problem: either the cited source exists in the corpus or it does not. There is no “convincing” a deterministic existence check — though existence is not accuracy, a limitation explored below.
We traded coverage for reliability. A model-based verifier could catch more — paraphrased citations, semantic misrepresentations, claims phrased in novel ways. But it would catch them sometimes, with confidence scores and thresholds and false positives that shift between runs. The deterministic pipeline catches less, but what it catches, it catches every time.
Reference Implementation and Stress Testing
The reference implementation is ~750 lines of TypeScript with 39 passing tests, MIT-licensed. Integration requires three things: post-response hooks, access to the session transcript, and a way to inject system messages.
```typescript
import { runAllChecks } from './src/praxis-enforcer';

const result = runAllChecks(transcriptLines, lastResponse);
if (result.failedChecks.length > 0) {
  // Display enforcement notice to user
}
```
We stress-tested the architecture with 4,400 cases: 3,000 synthetic conforming/non-conforming responses and 1,400 adversarial cases across seven attack categories. Overall accuracy: 96.2% (4,233/4,400 correctly classified).
Across the seven adversarial categories, which include chain-of-thought laundering, established-tier exploitation, regex evasion, source fabrication, source misrepresentation, and vague sourcing, every category but one achieved a 0% evasion rate against the enforcer. The exception is source misrepresentation, which achieved an 83.5% evasion rate: the system verifies that cited sources exist but cannot detect when their contents are mischaracterized. This confirms the citation-accuracy gap as the architecture’s most important open problem.
The Gaps We Have Not Closed
We want to be direct about what this does not solve.
The Established tier is unverified. Claims with no qualifier are assumed to restate tool output but are not cross-checked. Most AI output consists of unqualified assertions — the enforcement layer operates on the minority of claims carrying explicit tier markers. An AI could drop all markers and every claim becomes “Established” by omission. This is the architecture’s most significant coverage gap.
Source existence is not source accuracy. The hook verifies that cited sources exist in the evidence corpus, not that the AI accurately represents them. Addressing this would require semantic similarity comparison — introducing model-dependent nondeterminism into what is currently a deterministic pipeline.
Regex extraction is brittle. “Drawing from,” “According to,” or “As evidenced by” bypass extraction entirely. This vulnerability grows as models generate more diverse phrasings.
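The brittleness is easy to demonstrate. The sketch below assumes a single illustrative pattern for the Informed tier (the reference implementation's actual pattern may differ) and runs it over four phrasings of the same sourced claim.

```typescript
// Assumed, illustrative pattern for the Informed tier format.
const INFORMED_PATTERN = /^Based on ([^,]+),/;

// Four ways to phrase the same sourced claim.
const phrasings: string[] = [
  "Based on the test log, all 39 tests pass.",
  "According to the test log, all 39 tests pass.",
  "Drawing from the test log, all 39 tests pass.",
  "As evidenced by the test log, all 39 tests pass.",
];

// Only the exact marker is extracted; the rest slip through unchecked.
const extracted = phrasings.filter((p) => INFORMED_PATTERN.test(p));
const bypassed = phrasings.filter((p) => !INFORMED_PATTERN.test(p));

console.log(`extracted: ${extracted.length}, bypassed: ${bypassed.length}`);
```

Only the first phrasing is extracted. The other three carry no recognised marker, so under the tier rules they read as unqualified claims and fall into the unverified Established tier described above.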
No live adversarial evaluation. The gaming strategies tested so far are theoretical, and the adversarial cases were generated automatically. We have not red-teamed this against human adversaries attempting live bypass.
The central empirical question the architecture raises but does not yet answer: does tiering change user behavior? Shaw & Nave demonstrated that uniform confidence leads to cognitive surrender. We have built the infrastructure to restore confidence differentiation. Whether users respond to that differentiation — whether they scrutinize Speculative claims more than Established ones — is what a user study would need to show.
The design principle generalizes beyond this specific implementation: wherever an AI system makes claims, there should be an external mechanism that checks those claims against independently established evidence. Not because AI systems are dishonest, but because self-reporting — in humans or machines — is necessary but not sufficient for accountability.
Conscience, law, and society. All three. Always.
The full paper — “Who Checks the AI? Anti-Sycophancy Through Conscience, Law, and Society” — and the reference implementation (TypeScript, 39 tests passing) are available at github.com/srieg/anti-sycophancy-enforcer.
This architecture was developed as part of the PAI open-source project.