Why a Single AI Response Is Not Enough for an Important Decision
How Model Sycophancy, Hidden Instructions, and Structured Disagreement Changed How I Use AI
This article explains why AI assistants can seem extremely convincing while docilely accepting the way you frame your question. We’ll look at what really shapes this behavior behind the scenes, and why I increasingly prefer organized disagreement over a single, neatly packaged response when the stakes are real.
Level: Intermediate
At a Glance
| Question | Short answer |
|---|---|
| Is the problem limited to hallucinations? | No. The real problem is often that the model follows your framing without flinching. |
| Does “predicting the next word” explain everything? | No. Human feedback tuning, safety rules, system instructions, and all the product mechanics heavily influence the final response. |
| Does a model ever respond without any wrapper? | No. It generally responds through a layer of instructions, rules, tools, and priorities. |
| Is a single model enough for important decisions? | Rarely. A single model means a single training path, a single reasoning style, a single family of blind spots. |
| Do all models look alike? | No. They differ in style, caution, speed, cost, and behavior when tools are involved. |
| What is the practical answer? | Make disagreement a process: multiple initial opinions, anonymous review, synthesis, then a human decision. |
Foreword
Over the past few months, I kept running into the same uncomfortable pattern.
I would ask an AI a question about strategy, pricing, positioning, a technical choice, or even a business opportunity. The response seemed clear, reassuring, intelligent. Then I would slightly rephrase the question — sometimes just by nudging the premise (*) a little — and I would get another response just as convincing, but pointing in the opposite direction.
(*) Premise: the underlying assumption slipped, consciously or not, into the question. It is the postulate on which everything else in the question rests.
That’s when I started to be much more careful.
Because the real danger is not just that AI can be wrong. The real danger is that it can be wrong in a way that feels calm, coherent, and perfectly in line with your initial intuition.
And for an SME, that makes all the difference.
If the question is “which restaurant to try for lunch today?”, no problem. But if the question is “should we launch this offer?”, “should we automate this role?”, “should we respond to this tender in this way?” or “should we trust this supplier?”, then a nice-looking response is not enough.
You need a response that has been tested.
Personal note: I will intentionally simplify some points here. As usual, I prefer to explain first in my own words, then sharpen the precision where it truly matters. The goal of this article is not academic rigor but practical clarity.
Prerequisites
You do not need to be an AI researcher to follow this article.
What helps is simply:
- you already use AI assistants,
- you have noticed they often seem very confident,
- and you are looking to make better decisions, not just get answers faster.
All code snippets below are illustrative. They are intentionally simplified.
Part 1 — The Real Problem: AI Accepts Your Framing Too Easily
In brief
In my own words, sycophancy refers to the model’s tendency to lean toward what you seem to want to hear, instead of resisting, correcting, or reframing when necessary.
And accepting the framing is when the model adopts the assumptions embedded in your question without blinking, never asking whether they hold up.
That sounds abstract, so let’s take a concrete example.
question_a = """
Why should our SME automate first-level customer support with AI this quarter?
"""
question_b = """
Why would automating first-level customer support with AI this quarter be a mistake?
"""
# Same underlying decision.
# Different framing.
# Very often, two equally persuasive responses.
Explanation
- Both prompts are about the same business decision.
- But each one already carries a direction.
- In many cases, the model does not push back enough against that direction.
- Instead, it helps you build a solid case around the premise you provided.
Let’s make the problem even more visible:
Prompt 1:
"Our pricing is probably too low. Why would a 12% increase be the right decision?"
Prompt 2:
"Our pricing is already fragile. Why would a 12% increase be dangerous?"
Explanation
- In Prompt 1, the model often builds a justification for raising prices.
- In Prompt 2, the same model builds a justification against that increase.
- In both cases, the response can feel thoughtful and strategic.
- The confident tone can mask the fact that the model never first asked: is the premise itself sound?
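A cheap discipline against this is to send both framings deliberately and read the pair side by side. Here is a minimal sketch, where `query` is a hypothetical stand-in for a real provider call (the canned responses exist only to make the example self-contained):

```python
# Hypothetical helper: in practice this would call your model provider's API.
def query(model: str, prompt: str) -> str:
    canned = {
        "raise": "A 12% increase is justified by the value you already deliver...",
        "risk": "A 12% increase could trigger churn among price-sensitive clients...",
    }
    key = "risk" if "dangerous" in prompt else "raise"
    return canned[key]

def probe_framing(model: str, framing_a: str, framing_b: str) -> dict:
    """Send both framings of the same decision and return the pair,
    so a human can see whether the model pushed back on either premise."""
    return {
        "framing_a": query(model, framing_a),
        "framing_b": query(model, framing_b),
    }

result = probe_framing(
    "some-model",
    "Why would a 12% increase be the right decision?",
    "Why would a 12% increase be dangerous?",
)
```

If both answers read as equally persuasive, that is your signal: the model is following your framing, not testing the premise.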
This is not just an impression. Research on sycophancy has shown that state-of-the-art assistants displayed this behavior across many open-ended tasks, and that responses aligned with the user’s views were more often preferred; the study also found that optimization from preference models could sacrifice accuracy in favor of sycophancy. (arXiv)
IMPORTANT: if you are using AI to make decisions, the question is not only “is the response correct?” The question is also: “did the model actually challenge my framing, or did it simply play the very eloquent intern?”
Part 2 — Why This Actually Happens
In brief
Yes, the classic explanation is partly true: a language model predicts the most likely tokens.
But that is not the whole story. Not even close.
The final behavior you observe in production typically results from several interacting layers:
flowchart TD
A[Base model] --> B[Fine-tuning]
B --> C[Human feedback optimization / RLHF]
C --> D[System instructions]
D --> E[Product-level safety rules]
E --> F[Tool policies]
F --> G[Response displayed to user]
Explanation
- The base model brings general language and reasoning capability.
- Fine-tuning and human feedback optimization shape how it behaves.
- System instructions and product rules constrain tone, priorities, format, and refusals.
- Tooling layers can further influence what the model is allowed to do or say.
- The response you receive is therefore not “pure raw intelligence”. It is the result of an entire behavioral machinery.
That is why saying “it simply predicts the next word” is too reductive to be useful in practice.
flowchart TD
A["Base model\ngeneral language + reasoning"] --> B["Human feedback tuning\nuseful, pleasant, highly rated outputs"]
B --> C["System instructions\nrole, tone, priorities, limits"]
C --> D["Safety layer\nwhat must be refused or softened"]
D --> E["Product mechanics\ntools, formatting, retrieval rules, workflow"]
E --> F["Final response\nwhat the user actually sees"]
Explanation
- The model may have learned that responses perceived as helpful, reassuring, or smooth tend to be rated more highly.
- This creates a pressure toward pleasant usefulness.
- The problem is that pleasant usefulness and raw truth do not always coincide.
- For professionals, this nuance can be very costly.
The sycophancy study points exactly in this direction: human feedback can encourage responses that align with the user’s beliefs rather than truth, and human preference judgments can favor sycophantic but well-written outputs. Anthropic’s constitution explicitly states that they do not want helpfulness to become obsequiousness — which speaks volumes. (arXiv)
So why does “just ask the AI” not work reliably for important decisions?
Because the assistant is not optimized solely for truth. It is also optimized to be helpful, safe, acceptable, and behaviorally aligned.
And sometimes, those objectives pull in opposite directions.
Part 3 — The Hidden Role of the Machinery Surrounding the Model
In brief
A model almost never responds to you without a wrapper.
It responds through a machinery (what I call “the scaffolding”).
In my own words, this machinery is everything that surrounds the model: system instructions, developer guidelines, safety policies, tool rules, formatting rules, memory, routing, retrieval behavior, and application logic.
This wrapper plays an enormous role.
flowchart LR
REQ([Request]) --> SP["System instructions\n'You are a helpful assistant...'"]
REQ --> RD[Developer guidelines]
REQ --> MU["User message\n'Should we raise our prices?'"]
REQ --> OT[Tools]
REQ --> PS[Safety rules]
RD --> RD1[Concise responses]
RD --> RD2[Avoid harmful content]
RD --> RD3[Tool X for web searches]
OT --> OT1[Search]
OT --> OT2[Calculator]
PS --> PS1[Refuse harmful requests]
PS --> PS2[Defuse sensitive cases]
Explanation
- The user only sees the visible conversation.
- But the model often receives far more instructions than the text you type.
- These instructions can shape tone, priorities, and how disagreement is handled.
- So when we compare “models”, we are often in reality comparing model + all the surrounding machinery.
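To make the wrapper tangible, here is a minimal sketch of how a request is often assembled behind the scenes. The message shape mirrors the widely used chat-completions format; the instruction text itself is invented for illustration:

```python
# Illustrative only: every provider wraps requests differently.
def build_request(user_message: str) -> list[dict]:
    system_instructions = (
        "You are a helpful assistant. Be concise. "
        "Refuse harmful requests. Use the search tool for web lookups."
    )
    return [
        # The user never sees this layer, but the model always does.
        {"role": "system", "content": system_instructions},
        # The only part you actually typed.
        {"role": "user", "content": user_message},
    ]

request = build_request("Should we raise our prices?")
```

Two messages go out; you only wrote one of them. That asymmetry is the whole point of this section.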
A very good public example is Anthropic’s constitution. Anthropic describes it as a detailed description of Claude’s desired values and behaviors, which plays a central role in training; they also specify that it takes authority over expected behavior, and that it is written primarily for Claude itself. In the most recent explanation, Anthropic states that the constitution is used at multiple stages of training and helps Claude navigate trade-offs such as honesty, helpfulness, and sensitive topics. (anthropic.com)
This is exactly what I want to highlight here: the model does not simply “respond”. It responds through an — explicit or implicit — set of behavioral priorities.
Warning: repositories that collect leaked or extracted system instructions can be very useful for understanding the concept, but I would not treat them as gospel. They are unofficial, incomplete, and potentially outdated snapshots. They are interesting for grasping the existence of hidden instruction layers, not for claiming to know with certainty every rule in production. (GitHub)
To go further, consult Anthropic’s Claude’s Constitution and the accompanying note on the new constitution.
Part 4 — Why a Single Model Is a Blind Spot
In brief
A single model is not just a single source of answers.
It is a single training path, a single preference stack, a single safety philosophy, a single default style, and a single set of blind spots.
That is why relying on a single model for an important decision is often a blind spot in itself.
advisors = ["claude", "gpt", "gemini", "mistral"]
question = "Should we launch this new service for SMEs in Q3?"
responses = {model: query(model, question) for model in advisors}
for model, response in responses.items():
    print(f"{model}: {response[:200]}...")
Explanation
- The goal is not for four models to magically surface the truth.
- The goal is for them to bring diversity of thought.
- Different models often notice different risks, assumptions, or opportunities.
- The disagreement itself becomes a source of information.
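One way to exploit that disagreement programmatically is to extract a crude stance from each response and flag when they diverge. This is a toy sketch with canned responses; a real pipeline would use a judge model rather than keyword matching:

```python
def stance(response: str) -> str:
    """Crude keyword-based stance detection; a judge model would do this properly."""
    text = response.lower()
    if "do not launch" in text or "against" in text:
        return "against"
    if "launch" in text:
        return "for"
    return "unclear"

def has_disagreement(responses: dict) -> bool:
    """True when at least two clear, opposing stances appear."""
    stances = {stance(r) for r in responses.values()}
    return len(stances - {"unclear"}) > 1

responses = {
    "model_a": "You should launch: the market window is open.",
    "model_b": "Do not launch yet: support capacity is the bottleneck.",
}
flag = has_disagreement(responses)
```

When the flag fires, you have found exactly the kind of tension worth reading closely.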
This is also the central principle of my own project: a single model carries biases linked to its training, its reasoning style, and its blind spots, while multiple models with distinct perspectives create cognitive diversity, blind spot detection, and structured disagreement.
And that is why the problem goes well beyond simple hallucinations.
Sometimes, the model does not hallucinate in a spectacular way.
Sometimes, it keeps serving you the same family of responses.
And that can be just as misleading.
Part 5 — Not All Models Are Equal — and Above All, They Are Not Alike
In brief
I do not think the right lesson is: “find the best model and use it for everything.”
I think the real lesson is: models are complementary.
Some are faster. Some cost less. Some are more literal. Some are more cautious. Some excel at structured synthesis. Some tend more toward extrapolation. Some are more inclined to reframe the question. Some work better when tools are involved.
This diversity is not a flaw. It is a resource.
flowchart LR
Q([Question]) --> A["Strategist\nAnthropic — Synthesis"]
Q --> B["Skeptic\nOpenAI — Structured critique"]
Q --> C["Realist\nGoogle — Grounding and broad view"]
Q --> D["Challenger\nMistral — Alternative reframing"]
Explanation
- I am simplifying here, intentionally.
- Real benchmarking is more nuanced than fixed role labels.
- But in practice, assigning different thinking roles to different models is extremely useful.
- It pulls the workflow out of the pattern of “asking the same question five times to the same brain”.
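In practice, assigning roles can be as simple as prepending a role brief to the same question. The model names and role texts below are illustrative assumptions, not benchmarked pairings:

```python
# Hypothetical role assignment: adjust to whatever models you actually use.
ROLES = {
    "claude": "Strategist: synthesize trade-offs into a recommendation.",
    "gpt": "Skeptic: attack the strongest argument first.",
    "gemini": "Realist: ground every claim in operational constraints.",
    "mistral": "Challenger: reframe the question before answering it.",
}

def role_prompt(model: str, question: str) -> str:
    """Prepend the model's thinking role so each one approaches
    the same question from a deliberately different angle."""
    return f"{ROLES[model]}\n\nQuestion: {question}"

prompt = role_prompt("gpt", "Should we raise prices?")
```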
Karpathy’s LLM Council does exactly this at a high level: multiple models first respond independently, then review and rank each other, and finally a “president” model synthesizes the final response. The technical notes explicitly state that the anonymized review step exists to prevent models from playing favorites. (GitHub)
Part 6 — LLM Council: Turning Multiple Responses Into a Process
In brief
This is the bridge between diagnosis and solution.
The value does not simply lie in “query more models”. The value is in organizing the disagreement.
Here is the simplified logic:
responses = collect_initial_opinions(models, question)
anonymous = anonymize(responses)
reviews = collect_cross_reviews(models, anonymous)
final_response = president_synthesizes(question, anonymous, reviews)
Explanation
- Initial opinions ensure that responses remain independent.
- Anonymization reduces favoritism and brand bias.
- Cross-review forces confrontation between viewpoints.
- Synthesis by the president transforms the debate into an actionable final output.
This is much better than querying a single model once.
It is also better than querying five models and skimming through their responses.
Because once there is a review step, the process starts asking a much more interesting question:
Which response holds up under scrutiny?
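Here is a self-contained sketch of that loop with stubbed model calls. Everything model-specific (the calls, the review prompts, the president prompt) is replaced by placeholders and is my assumption, not the LLM Council code itself:

```python
import random

def collect_initial_opinions(models, question):
    # Stub: a real implementation queries each provider independently.
    return {m: f"[{m}] opinion on: {question}" for m in models}

def anonymize(responses):
    # Shuffle, then relabel as A, B, C... so reviewers cannot play favorites.
    shuffled = list(responses.values())
    random.shuffle(shuffled)
    labels = [chr(ord("A") + i) for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))

def collect_cross_reviews(models, anonymous):
    # Stub: each model would rank the anonymous responses via a review prompt.
    return {m: sorted(anonymous.keys()) for m in models}

def president_synthesizes(question, anonymous, reviews):
    # Stub: a designated model would merge the debate into one verdict.
    return f"Synthesis for '{question}' from {len(anonymous)} anonymous responses."

models = ["claude", "gpt", "gemini"]
question = "Should we launch this new service for SMEs in Q3?"
responses = collect_initial_opinions(models, question)
anonymous = anonymize(responses)
reviews = collect_cross_reviews(models, anonymous)
final_response = president_synthesizes(question, anonymous, reviews)
```

Each stub is one line away from a real API call; the structure of the loop is what matters.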
Karpathy’s README describes a three-step process: initial opinions, review, final response. The technical notes add that responses are anonymized as A/B/C and ranked by accuracy and relevance before the president’s synthesis. (GitHub)
For details, consult the LLM Council repository and its technical notes.
Part 7 — Why Anonymous Cross-Review Changes Everything
In brief
If you ask the same question to five models and stop there, you have parallel opinions.
That is useful, but not enough.
The real qualitative leap comes when you introduce anonymous peer review.
{
  "Response A": "Response from one model",
  "Response B": "Response from another model",
  "Response C": "Response from a third model"
}
Explanation
- Anonymous labels remove the prestige effect linked to model names.
- Reviewers must react to the content, not the brand.
- Weak responses can no longer hide behind eloquence or reputation.
- The synthesis step becomes far richer as a result.
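A small sketch of that anonymization step, which also keeps a private key so the final report can reveal who wrote what. The function name is mine, not taken from the LLM Council code:

```python
import random

def anonymize_with_key(responses: dict) -> tuple[dict, dict]:
    """Shuffle responses and label them Response A, B, C...;
    keep a private key so the final report can de-anonymize."""
    items = list(responses.items())
    random.shuffle(items)
    labeled, key = {}, {}
    for i, (model, text) in enumerate(items):
        label = f"Response {chr(ord('A') + i)}"
        labeled[label] = text
        key[label] = model
    return labeled, key

labeled, key = anonymize_with_key({
    "claude": "Launch now.",
    "gpt": "Wait a quarter.",
})
```

Reviewers only ever see `labeled`; `key` stays out of every review prompt until the synthesis is done.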
This is where something fundamental happens: the process stops rewarding only “nice responses” and starts rewarding responses that hold up under pressure.
That is a profound shift.
And in my experience, this is exactly what is missing from most common uses of AI.
The LLM Council technical notes explicitly present anonymous peer review as the key innovation, precisely because it prevents models from playing favorites. (GitHub)
IMPORTANT: asking the same question five times is not the method. The method is to make the responses collide.
The best result is not necessarily the “consensus”.
Sometimes, the best result is: “here is the blind spot that nobody had addressed.”
Part 8 — SPAR-Kit: Structured Contradiction as a Discipline
In brief
I appreciate the LLM Council a great deal, but I also think the underlying idea matters more than any particular implementation.
That is why SPAR-Kit is interesting.
It approaches structured disagreement as a genuine methodology: Structured Persona-Argumentation for Reasoning. Its description is crystal clear: it is a methodology for stress-testing decisions through structured disagreement, which can work with humans, AI personas, or mixed configurations. (GitHub)
In my own words, this means it is not a trick.
It is a discipline.
flowchart TD
N[North\nVision / Direction] --> C((Centre\nSynthesis / Moderation))
E[East\nDisruption / Emergence] --> C
S[South\nExecution / Reality] --> C
W[West\nExperience / Precedent] --> C
Explanation
- Different personas create productive tension.
- Tension surfaces what smooth consensus tends to hide.
- The method works with humans alone, with AI alone, or in mixed configurations.
- Which makes it useful not only for prompting, but for designing genuine decision processes.
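To run this with AI personas, each direction can become a short adversarial brief attacking the same decision. The persona texts below are my own loose paraphrase of the four directions, not SPAR-Kit's official wording:

```python
# Hypothetical persona briefs, loosely inspired by the four SPAR-Kit directions.
PERSONAS = {
    "North": "Argue from vision: does this serve the long-term direction?",
    "East": "Argue from disruption: what emerging change breaks this plan?",
    "South": "Argue from execution: what fails on the ground, day one?",
    "West": "Argue from experience: what do precedents and past attempts say?",
}

def spar_prompts(decision: str) -> list[str]:
    """One adversarial prompt per persona, all aimed at the same decision."""
    return [f"{brief}\n\nDecision under test: {decision}"
            for brief in PERSONAS.values()]

prompts = spar_prompts("Launch the new SME offer in Q3")
```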
SPAR-Kit also raises an important point I fully share: isolated reasoning fails in many contexts — the leader alone, the team alone, the person alone facing an AI, or the AI alone. The goal is not balance for its own sake. The goal is to create authentic tension. (GitHub)
Part 9 — My Approach: From Concept to Operational Workflow
In brief
This is the part where I move from “interesting idea” to “usable system”.
In my own project, the goal is not to build a machine that produces truth. The goal is to industrialize contradiction.
The README describes AI Provocateurs as a multi-model deliberation system that sends questions to multiple providers with different thinking angles, performs anonymous peer review, and produces a structured verdict. It also defines two core skills: /deliberate for multi-perspective decision work, and /analyze for in-depth analysis of documents or URLs.
.claude/
  skills/
    deliberate/SKILL.md
    analyze/SKILL.md
  config/
    models.yaml
  scripts/
    llm_call.py
    orchestrate.py
Explanation
- A skill is not simply a prompt.
- In my own words, a skill is a reusable operational behavior.
- It encodes how a task should be accomplished, not just what to respond.
- This is how you move from a clever conversation to a reproducible process.
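For instance, the `models.yaml` referenced in the tree above could look something like this. This is a purely hypothetical schema for illustration, not the actual file from ai_challengers:

```yaml
# Hypothetical schema: the real models.yaml may differ.
president:
  provider: anthropic
  role: synthesis
council:
  - provider: anthropic
    role: strategist
  - provider: openai
    role: skeptic
  - provider: google
    role: realist
  - provider: mistral
    role: challenger
fallback:
  min_models: 2   # degrade gracefully when fewer providers are available
```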
And that makes all the difference.
Because once you define skills, you stop depending on the “lucky prompt”.
You start creating procedures.
/deliberate "Should we launch this new offer for SMEs?"
/deliberate --mode premortem "We are going to update our pricing page"
/analyze --with-qa docs/proposal.md
Explanation
- `/deliberate` is for decisions, trade-offs, challenges, red-teaming, and structured synthesis.
- `/analyze` is for deep reading of documents and URLs in multiple passes.
- One pipeline consists of contradicting a decision.
- The other consists of contradicting a reading.
This separation matters to me.
Because challenging a decision and deeply analyzing a document are related activities, but not the same work.
The README also describes concrete steps I find essential: neutral reframing, model allocation, parallel dispatch, anonymization, cross-review, synthesis by the president, and report generation. It also supports multi-provider execution, fallback behavior when fewer models are available, and a separate analysis pipeline for documents.
Personal note: this is also why the concept of skills matters so much to me. A raw model can be brilliant while remaining unpredictable. A well-designed skill transforms that capability into something more stable, verifiable, and reusable.
And yes, the code is intended to be fully open source under ai_challengers (GitHub).
This matters, because if we are building systems that influence decisions, I strongly prefer showing the mechanics rather than claiming there is magic in a black box.
Part 10 — What This Concretely Changes for an SME
In brief
This is where the whole topic becomes very concrete.
For an SME, a structured contradiction process is useful wherever a bad decision is costly.
For example:
| Business question | What organized disagreement brings |
|---|---|
| Should we launch a new offer? | It forces the benefits, drawbacks, execution risks, and customer confusion to surface within the same process. |
| Should we hire or automate? | It prevents locking into a false binary choice framed too early. |
| Should we raise our prices? | It stress-tests assumptions about margin, customer loss, value perception, and positioning. |
| Which supplier to choose? | It compares promises, risks, dependency, hidden costs, and operational fit. |
| Is this marketing positioning right? | It challenges message clarity, differentiation, and customer interpretation. |
| Is this tender response solid? | It tests gaps, contradictions, weak arguments, and unjustified assumptions. |
This changes the role of AI.
It is no longer just a machine that gives you one answer.
It becomes a system that helps you verify whether a response deserves to survive.
That is a very different posture.
And frankly, I think it is the healthiest one.
IMPORTANT: for a costly decision, do not ask AI for confirmation. Ask it to attack your reasoning from multiple angles.
Part 11 — The Limits, in the Interest of Staying Credible
In brief
A multi-model process is better for many important decisions.
But let us stay clear-eyed.
It is not free. It is not instantaneous. It is not a truth machine. And it does not eliminate prompt injection, hallucinations, or shared failure modes.
That is exactly the right level of frankness on this topic.
Anthropic itself acknowledges that training models is difficult and that actual behavior can deviate from constitutional ideals; it also describes the constitution as a work in progress, a living document. In my own README, I explicitly state that defenses against prompt injection are probabilistic, not cryptographic, and cannot guarantee full protection. (anthropic.com)
def use_multi_model_verdict(question, data):
    verdict = deliberate(question, data)
    if verdict.is_high_stakes:
        require_human_review(verdict)
    if data.is_weak:
        reduce_confidence(verdict)
    return verdict
Explanation
- More models does not eliminate the need for human judgment.
- Weak data still produces weak results.
- Multiple models can still share similar blind spots.
- The process improves stress-testing; it does not abolish uncertainty.
So yes, I firmly believe in structured contradiction.
But I do not believe in outsourcing judgment.
The right target is not certainty.
The right target is a decision process that is less fragile, less self-complacent, and more resistant to lazy agreement.
Conclusion
If I had to reduce this entire article to a single sentence, it would be:
The problem is not just that AI can be wrong. The problem is that it can be wrong while seeming deeply convincing, precisely because it accepted your framing without resistance.
That is why I no longer find the “one prompt, one response, one model” pattern sufficient for serious work.
For brainstorming, fine. For low-stakes writing, fine. For quick exploration, fine.
But for decisions that involve money, time, reputation, or strategic direction, I increasingly want something else:
- multiple initial opinions,
- structured tension,
- anonymous review,
- a synthesis,
- and then a human who remains in charge throughout.
In other words: not blind trust, but organized disagreement.
And I think this shift in posture will increasingly matter for developers, founders, tech leads, and SMEs who use AI seriously.
References
- Sharma et al. — Towards Understanding Sycophancy in Language Models. (arXiv)
- Anthropic — Claude’s Constitution and Claude’s new constitution. (anthropic.com)
- Andrej Karpathy — LLM Council and its technical notes. (GitHub)
- SynthanAI — SPAR-Kit and the STASH / hybrid mode documentation. (GitHub)
- Community-maintained prompt collections, useful only as unofficial illustrations. (GitHub)
- My project: ai_challengers (GitHub)
Stay tuned for new articles and happy coding.