AI Agents

Why AI Agents Fail in Production

The default assumption is that the model failed. In practice, production agent failures trace to context construction — the engineering problem that decides whether an agent works.

When an AI agent gives a wrong answer in production, the default assumption is that the model failed. The team upgrades to a larger model, or switches providers, or fine-tunes on domain-specific data. Sometimes this helps. More often, the same failures persist — because the model was never the problem.

In practice, the vast majority of production agent failures trace to context. The agent lacked the right information at inference time, or had too much undifferentiated information, or had stale information, or had information that conflicted across sources. The model reasoned correctly given what it could see. It just couldn't see enough.

Context construction — assembling the right information into the prompt at the right moment — is the engineering problem that determines whether an agent works. This essay explains why, and provides a practitioner's framework for solving it.

The model is not the bottleneck

The AI industry spends enormous energy on model benchmarks, parameter counts, and reasoning capabilities. These matter. A more capable model reasons more effectively, handles more complex instructions, and recovers more gracefully from ambiguity. Model selection is a real decision with real consequences.

But in production agent deployments, model selection explains a surprisingly small share of quality variance.

Sebastian Raschka's State of LLMs 2025 analysis captures this well: performance progress is increasingly coming from improved tooling and inference-time scaling rather than from training or the core model itself [1]. The models will keep getting better, and that improvement will look dramatic on benchmarks. In production, the gains show up at the margins. The architectural decisions around the model (what information it can access, how that information is assembled, how the agent decides what it needs) explain most of the quality difference between an agent that works and one that doesn't.

Research supports this at the systems level. A JetBrains study presented at NeurIPS 2025 found that agent contexts grow rapidly and become expensive, yet do not deliver significantly better downstream task performance [2]. Pouring more tokens into the context window produces diminishing returns. A separate 2025 study on context length found that even with perfect retrieval, where the model has access to exactly the right information, longer context alone degrades performance [3]. More information is not better information. Better information is better information.

The controllable variable, the one practitioners can actually engineer, is what the model sees at inference time. That is where quality lives.

Why agents hallucinate

The standard explanation for hallucination is that the model "made something up." This framing is technically accurate and practically useless, because it implies the solution is a model that makes things up less often. The actual mechanism is more specific, and understanding it changes how you design systems.

Large language models are trained to be helpful. Reinforcement learning from human feedback optimizes for responses that users rate as useful, complete, and responsive to the question asked. This training creates a measurable bias toward producing an answer even when the available context doesn't support one.

The research is unambiguous. Sharma et al. (2025) studied five generative AI assistants and found that sycophancy — the tendency to produce agreeable, confident responses regardless of factual grounding — is a general behavior, with LLMs preserving face 47% more often than humans in comparable evaluations [4]. In April 2025, OpenAI rolled back a ChatGPT-4o update after users reported the system was excessively agreeable, reverting to an earlier version with more balanced behavior [5]. A 2025 study in npj Digital Medicine found that frontier models showed up to 100% initial compliance with illogical medical requests, prioritizing helpfulness over logical consistency [6].

This tendency is particularly pronounced in smaller and open-source models. Closed labs (Anthropic, OpenAI, Google) have invested heavily in alignment and guardrails to mitigate sycophantic behavior, and their models perform measurably better on these dimensions. Even so, the underlying pressure toward compliance remains.

For agent builders, the implication is direct. An agent with thin context on a topic will fill in the gaps rather than acknowledge uncertainty. Ask it about "Sarah's update from the meeting" and if there is any context about meetings or anyone named Sarah, the model will construct a plausible-sounding answer. It will sound confident. It will often be wrong.

The fix is not a better model. The fix is matching the query with enough relevant context that the model reasons over real information rather than interpolating from fragments.

Not all context is equal

Even when context exists, it carries different levels of reliability that most agent architectures ignore entirely. A useful framing is the lossy spectrum: each layer of transformation between raw data and the context your agent ultimately receives introduces compression, interpretation, and loss. Understanding where on that spectrum your context sits determines how much you should trust it.

Primary data is the system of record. A database field, a raw metric from an API, a source document, a transaction log. The signal-to-noise ratio is usually highest here, and primary data should be the default source when available.

That said, primary data is not always the optimal context for an agent. A raw database export of ten thousand transactions carries high fidelity but low interpretability. An expert's analysis of those transactions — identifying the three patterns that matter — may carry higher signal-to-noise for the agent's purposes, in fewer tokens. Knowing when to feed the agent primary data and when to feed it curated analysis is a product and engineering challenge that deserves deliberate design, not a blanket rule.

Analysis of primary data is one transformation layer removed on the lossy spectrum. A dashboard aggregation, a quarterly report, a summary document generated from primary sources. Analysis introduces compression: summarization loses nuance, aggregation hides outliers, and the author's framing shapes what the reader sees. Each of these is a form of information loss. At the same time, expert analysis can get to the crux of an issue more effectively than a mass of raw data. A well-written incident report often tells the agent more, faster, than the underlying log files. The loss is real, but so is the gain in interpretability.

Commentary on the analysis sits furthest along the lossy spectrum. A Slack message discussing the report. A meeting note referencing the dashboard. An email thread debating the summary's conclusions. Signal-to-noise degrades significantly at this tier. The information may be accurate, partially accurate, outdated, or someone's interpretation of someone else's interpretation. Each hop introduces additional loss, and unlike the structured loss in formal analysis, the loss in commentary is uncontrolled. There is no schema, no methodology, no explicit framing of what was included and what was left out.

An agent that treats all tool call responses as equally reliable will produce confidently wrong answers whenever it reasons over commentary as though it were primary data. The same "fact" — a customer's contract value, a project's deadline, a product's specification — looks different as a CRM field (primary), a QBR slide (analysis), and a Slack message from an account executive (commentary). Three sources, three reliability levels, one entity.
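One way to keep the tiers visible to the agent is to tag every piece of context with its position on the lossy spectrum before it reaches the prompt. The sketch below is illustrative only: the `Tier` enum, the `ContextItem` type, and the source labels are assumptions made for this example, not part of any particular framework.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Position on the lossy spectrum; lower means closer to the source."""
    PRIMARY = 0     # system of record, e.g. a CRM field
    ANALYSIS = 1    # derived from primary data, e.g. a QBR slide
    COMMENTARY = 2  # discussion of the analysis, e.g. a Slack message


@dataclass(frozen=True)
class ContextItem:
    source: str  # hypothetical label for the originating system
    tier: Tier
    claim: str


def most_reliable(items: list[ContextItem]) -> ContextItem:
    """Return the item closest to the system of record."""
    return min(items, key=lambda item: item.tier)


def render(items: list[ContextItem]) -> str:
    """Order items by tier and label each one so the model can weigh it."""
    ordered = sorted(items, key=lambda item: item.tier)
    return "\n".join(f"[{i.tier.name}] {i.source}: {i.claim}" for i in ordered)


# Three sources, three reliability levels, one entity (made-up values).
contract_value = [
    ContextItem("slack", Tier.COMMENTARY, "contract is around 100k"),
    ContextItem("qbr_slide", Tier.ANALYSIS, "contract value 120k"),
    ContextItem("crm", Tier.PRIMARY, "contract value 118,500"),
]

print(most_reliable(contract_value).source)
print(render(contract_value))
```

Labeling does not resolve the disagreement by itself, but it gives the model (and any downstream rule) a signal that the CRM field outranks the Slack message.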

Web search illustrates these challenges with particular clarity. Search is a common context augmentation step, and many agent architectures incorporate it. Practitioners should understand that web search is both incredibly biased and incredibly lossy at the same time. Search results disproportionately return analysis and commentary rather than primary sources — blog posts summarizing research, articles interpreting data, commentary on commentary. The agent ingests these as though they carry the same authority as the underlying source. And practically, most search provider APIs return a URL and a heading — the actual page content requires a separate fetch and extraction step. Without engineering that full pipeline — search, fetch, parse, filter — web search gives the appearance of context enrichment without the substance. These are exactly the kinds of challenges to think about when identifying, building, and curating tools for an agent system.
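The search, fetch, parse, filter pipeline can be sketched with the provider-specific parts injected as functions. Everything here is an assumption for illustration: `search` and `fetch` stand in for whatever search API and HTTP client you use, the result shape (`url`, `title`) is hypothetical, and the length-based filter is a placeholder for a real relevance check.

```python
from html.parser import HTMLParser
from typing import Callable


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def parse(html: str) -> str:
    """Reduce an HTML page to its visible text."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def enrich(
    query: str,
    search: Callable[[str], list[dict]],  # returns [{"url": ..., "title": ...}]
    fetch: Callable[[str], str],          # returns raw HTML for a URL
    min_chars: int = 200,
) -> list[dict]:
    """Run search -> fetch -> parse -> filter and return usable pages."""
    pages = []
    for hit in search(query):
        text = parse(fetch(hit["url"]))
        if len(text) >= min_chars:  # drop pages with too little content
            pages.append({"url": hit["url"], "title": hit["title"], "text": text})
    return pages
```

Stopping after `search` would hand the agent only titles and URLs, which is the appearance of enrichment the paragraph above warns about.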

The architectural implication is that context quality deserves the same rigor as context availability. Having access to information is necessary. Knowing where that information sits on the lossy spectrum — and designing the agent to use the right tier for the right query — is what separates a reliable agent from a confident one.

The solution is tool calling

Every tool an agent can call expands the context available at inference time. This is the single highest-leverage investment in agent quality.

When an agent lacks access to a relevant system, one of two things happens. Either it hallucinates, constructing a plausible-sounding answer from whatever thin context it can find, or it cannot answer that shape of question at all, creating a hard limitation on the agent's usefulness. The first failure erodes trust. The second restricts scope. Both trace to the same root cause: the tool surface is too narrow.

A customer support agent that cannot reach the billing system will hallucinate billing answers or refuse billing questions. A recruiting coordinator that cannot see the ATS will guess at candidate status or go silent. An investor research agent without access to primary data sources will produce analysis built on analysis — or acknowledge that it cannot help.

The model is the same in every case. The information diet is what changes the output.

What is a tool call?

A tool call is a structured interface between the language model and an external system. The model decides it needs information or needs to take an action, emits a structured request conforming to a defined schema, and receives a structured response. The model does not call APIs directly. It expresses intent, and the orchestration layer executes.

This separation matters. The model handles reasoning and intent. The orchestration layer handles authentication, rate limiting, error handling, and response formatting. The tool call is the contract between the two.
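A minimal sketch of that contract, assuming the model emits its request as JSON with `name` and `arguments` keys (a common shape, but an assumption here). The `get_invoice` tool is a hypothetical stub standing in for a real billing integration.

```python
import json
from typing import Any, Callable


def get_invoice(invoice_id: str) -> dict[str, Any]:
    """Hypothetical tool; a stub instead of a real billing API call."""
    return {"invoice_id": invoice_id, "total_cents": 12900, "status": "paid"}


# The orchestration layer owns the mapping from tool names to code.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {"get_invoice": get_invoice}


def execute(raw_request: str) -> dict[str, Any]:
    """Validate the model's structured request, run it, wrap the result."""
    try:
        request = json.loads(raw_request)
        name, args = request["name"], request["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"ok": False, "error": "malformed tool call"}

    tool = TOOLS.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}

    try:
        return {"ok": True, "result": tool(**args)}
    except TypeError as exc:  # wrong or missing arguments
        return {"ok": False, "error": str(exc)}


# The model only expresses intent; execute() does the work.
print(execute('{"name": "get_invoice", "arguments": {"invoice_id": "inv_1"}}'))
```

Authentication, rate limiting, and retries would live inside `execute` or the tool functions, which is why the model never needs credentials.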

What is a tool registry?

A tool registry is the catalog of everything the agent can reach. Each entry typically contains:

A tool name and identifier (often an enum value) that the model uses to invoke it
A natural-language description of what the tool does and — critically — what it does not do. This description is what the model reads when deciding whether to call the tool. A vague or incomplete description produces the same failure as a missing tool, because the model cannot select what it cannot understand.
An input schema defining required and optional parameters with types, constraints, and descriptions
An output schema or description of what the response contains and how to interpret it

The quality of tool descriptions matters enormously. A well-written description — specific about capabilities, explicit about limitations, clear about the shape of data returned — directly improves tool selection accuracy. A poorly written description creates ambiguity that the model resolves by guessing.
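The entry fields listed above map naturally onto a small data structure. This is a sketch under assumptions: the `ToolSpec` and `ToolRegistry` names are invented for illustration, limitations are stored in their own field so they cannot be forgotten, and the input schema follows JSON Schema conventions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolSpec:
    name: str                     # identifier the model emits to invoke the tool
    description: str              # what the tool does
    limitations: str              # what it does not do (hypothetical separate field)
    input_schema: dict[str, Any]  # JSON-Schema-style parameter definition
    output_description: str       # shape of the response and how to read it

    def for_model(self) -> str:
        """Text the model reads when deciding whether to call the tool."""
        return (
            f"{self.name}: {self.description} "
            f"Does not: {self.limitations} "
            f"Returns: {self.output_description}"
        )


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        """Reject duplicate names and entries missing a description or limits."""
        if spec.name in self._tools:
            raise ValueError(f"duplicate tool name: {spec.name}")
        if not spec.description.strip() or not spec.limitations.strip():
            raise ValueError(f"{spec.name}: description and limitations are required")
        self._tools[spec.name] = spec

    def get(self, name: str) -> ToolSpec:
        return self._tools[name]


registry = ToolRegistry()
registry.register(ToolSpec(
    name="get_invoice",
    description="Fetch one invoice, including line items, by invoice ID.",
    limitations="search invoices by customer or modify an invoice.",
    input_schema={
        "type": "object",
        "properties": {"invoice_id": {"type": "string"}},
        "required": ["invoice_id"],
    },
    output_description="invoice_id, status, total_cents, and a list of line items.",
))
print(registry.get("get_invoice").for_model())
```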

The MECE challenge

As the tool surface grows, maintaining mutual exclusivity and collective exhaustiveness across tool definitions becomes increasingly difficult.

Two tools with overlapping descriptions create ambiguity. The model picks one when it should have picked the other, or calls both and either receives conflicting responses or burns through far more tokens than needed to accomplish the task. Gaps between tool definitions create dead zones where a valid query matches nothing and the agent falls back to hallucination or silence.

At ten tools, this is manageable through careful description writing. At fifty, it requires deliberate taxonomy design — grouping tools by domain, establishing naming conventions, and testing for overlap. At hundreds, it becomes a first-class engineering problem on par with API design. Tools need clear boundaries, non-overlapping descriptions, and explicit scoping of what each tool does not cover.
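Testing for overlap and gaps can start with something crude. The sketch below flags description pairs with high token overlap (Jaccard similarity) and sample queries that match no description at all. The stopword list, the 0.5 threshold, and the tool names are assumptions chosen for the example; an embedding-based comparison would catch paraphrases that word overlap misses.

```python
import re
from itertools import combinations

STOPWORDS = {"a", "an", "the", "for", "of", "to", "and", "or", "by", "in", "from"}


def tokens(text: str) -> set[str]:
    """Lowercased word set with common stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def overlapping_pairs(descriptions: dict[str, str], threshold: float = 0.5):
    """Mutual-exclusivity check: pairs whose descriptions look alike."""
    flagged = []
    for (n1, d1), (n2, d2) in combinations(sorted(descriptions.items()), 2):
        score = jaccard(d1, d2)
        if score >= threshold:
            flagged.append((n1, n2, round(score, 2)))
    return flagged


def uncovered(queries: list[str], descriptions: dict[str, str]) -> list[str]:
    """Exhaustiveness check: sample queries sharing no token with any tool."""
    return [
        q for q in queries
        if not any(tokens(q) & tokens(d) for d in descriptions.values())
    ]


descriptions = {
    "get_invoice": "Fetch a single invoice by invoice id from billing",
    "lookup_invoice": "Fetch an invoice from billing by invoice id",
    "get_ticket": "Fetch a support ticket by ticket id",
}
print(overlapping_pairs(descriptions))
print(uncovered(["refund status for order 42", "show invoice 17"], descriptions))
```

Run as part of CI, a check like this turns MECE from a review-time judgment into a failing test when a new tool collides with an existing one.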

How to scale it

The principle — more tools means better context means better output — runs into a scaling wall. As the tool surface grows, you face a design decision about how the agent discovers and selects tools. Three patterns exist, each with tradeoffs.

Pattern 1: Full registry in context. Load every tool definition into the prompt upfront. The agent sees everything and selects what it needs.

This is the simplest approach and works well at small scale. The model has full visibility into its capabilities, and tool selection requires no intermediate steps. The cost is that token consumption scales linearly with tool count. Anthropic's internal data shows a five-server MCP setup consuming approximately 55,000 tokens in tool definitions before the conversation starts [7]. At fifty or more tools, definitions alone can consume 70,000+ tokens. Accuracy also degrades: tool selection errors increase as the model encounters similarly named tools in a crowded context window.

Pattern 2: Tools-available endpoint. Expose a lightweight list_tools or tools_available call that returns tool names and one-line descriptions. The agent queries this first, selects what it needs, then loads full definitions only for the selected tools.

This dramatically reduces upfront token cost and preserves the agent's agency over tool selection. The tradeoff is an additional round-trip and a dependency on short descriptions being sufficiently informative. The agent must infer from a one-line summary whether a tool is relevant, and short descriptions can be ambiguous. This pattern requires careful description writing and testing.
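A sketch of the two-step flow, assuming an in-memory registry. The `list_tools` name comes from the pattern description; `load_definitions`, the `summary` field, and the tool entries are hypothetical. Character counts stand in for token counts, which depend on the tokenizer.

```python
import json

# Hypothetical full definitions; real ones are usually much longer.
FULL_DEFINITIONS = {
    "get_invoice": {
        "summary": "Fetch one invoice with line items by invoice ID.",
        "description": (
            "Fetch one invoice, including line items, taxes, and payment "
            "status. Does not search by customer or modify invoices."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
    "get_ticket": {
        "summary": "Fetch one support ticket and its comments by ticket ID.",
        "description": (
            "Fetch one support ticket with its full comment thread. "
            "Does not create, close, or reassign tickets."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}


def list_tools() -> list[dict[str, str]]:
    """Step 1: lightweight listing of names and one-line summaries."""
    return [
        {"name": name, "summary": spec["summary"]}
        for name, spec in sorted(FULL_DEFINITIONS.items())
    ]


def load_definitions(names: list[str]) -> dict[str, dict]:
    """Step 2: full definitions only for the tools the agent selected."""
    unknown = [n for n in names if n not in FULL_DEFINITIONS]
    if unknown:
        raise KeyError(f"unknown tools: {unknown}")
    return {n: FULL_DEFINITIONS[n] for n in names}


listing = list_tools()
selected = load_definitions(["get_invoice"])
print(len(json.dumps(listing)), "chars listed up front")
print(len(json.dumps(FULL_DEFINITIONS)), "chars for the full registry")
```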

Pattern 3: Tool search tree (dynamic discovery). The agent has access to a search tool that queries the registry on demand. Tools are loaded into context only when the search returns them as relevant. Anthropic's Tool Search Tool is the reference implementation of this pattern, supporting regex and BM25 search across tool names, descriptions, and argument schemas [7].

The gains are significant. In Anthropic's benchmarks, context consumption dropped approximately 85%. Accuracy improved meaningfully — Opus 4 went from 49% to 74%, and Opus 4.5 from 79.5% to 88.1% on MCP evaluations with Tool Search enabled [7]. The pattern scales to hundreds or thousands of tools.

The tradeoffs are real. Search adds latency before tool invocation. The agent must infer the right search terms, and retrieval accuracy is imperfect — independent testing showed 56% accuracy with regex and 64% with BM25 at 4,000+ tools [8].
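To make the retrieval step concrete, here is a small stand-alone sketch of regex and BM25 search over tool names and descriptions. It is not Anthropic's implementation or API; it is a textbook BM25 scorer with conventional default parameters (`k1=1.5`, `b=0.75`) over a hypothetical three-tool registry.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def regex_search(pattern: str, docs: dict[str, str]) -> list[str]:
    """Names of tools whose name or description matches the pattern."""
    return sorted(
        name for name, text in docs.items()
        if re.search(pattern, f"{name} {text}", re.IGNORECASE)
    )


def bm25_search(query: str, docs: dict[str, str], top_k: int = 3,
                k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Rank tools by BM25 score against the query; drop zero-score tools."""
    if not docs:
        return []
    tokenized = {name: tokenize(f"{name} {text}") for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))

    scores: dict[str, float] = {}
    for name, toks in tokenized.items():
        counts = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            f = counts[term]
            if f == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avg_len))
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


docs = {
    "get_invoice": "Fetch an invoice with line items from the billing system",
    "get_ticket": "Fetch a support ticket and its comments",
    "candidate_status": "Look up a candidate's stage in the applicant tracking system",
}
print(bm25_search("billing invoice line items", docs))
print(regex_search(r"ticket|candidate", docs))
```

Both searches depend on the agent choosing terms that appear in the registry text, which is one reason the measured retrieval accuracy above is well short of perfect.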

There is also a more fundamental limitation: the circular bootstrap problem. The agent will only search for a tool if it recognizes it needs a capability it doesn't currently have. If the agent has no awareness that a capability exists, because the tool was deferred and the agent has never encountered a query that would trigger it, it will not search for it. You don't know what you don't know. Anthropic addresses this partially by allowing practitioners to keep a set of tools eagerly loaded (always visible to the model) while deferring the rest, but the bootstrap problem remains an open design challenge for fully dynamic discovery at scale.

Our recommendation

For most production agent deployments, a hybrid approach works best.

Keep a small set of high-frequency tools loaded eagerly: the five or six tools the agent uses on more than half of all queries. These are the tools the agent should always know about, with zero discovery latency. Defer everything else behind a search layer. As the tool surface grows past roughly twenty tools, dynamic discovery stops being optional and becomes structural.
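The eager/deferred split can be derived from usage logs rather than guessed. This sketch assumes a log where each entry is the set of tools used to answer one query; the log format, function name, and tool names are hypothetical, and the thresholds mirror the rule of thumb above (more than half of queries, at most six tools).

```python
from collections import Counter


def split_tools(query_log: list[set[str]], all_tools: set[str],
                min_share: float = 0.5, max_eager: int = 6):
    """Return (eager, deferred) tool lists based on per-query usage share."""
    total = len(query_log)
    usage = Counter(tool for used in query_log for tool in used)
    # Tools used on more than min_share of queries, most used first.
    frequent = [t for t, n in usage.most_common() if total and n / total > min_share]
    eager = frequent[:max_eager]
    deferred = sorted(all_tools - set(eager))
    return eager, deferred


query_log = [
    {"get_invoice"},
    {"get_invoice", "get_ticket"},
    {"get_invoice"},
    {"get_ticket", "issue_refund"},
]
all_tools = {"get_invoice", "get_ticket", "issue_refund", "candidate_status"}

eager, deferred = split_tools(query_log, all_tools)
print("eager:", eager)
print("deferred:", deferred)
```

Recomputing the split periodically keeps the eager set aligned with how the agent is actually used, and tools that never appear in the log remain reachable through search.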

Invest in tool description quality with the same rigor you invest in API documentation. Descriptions are the interface between the model and your infrastructure. Ambiguous descriptions produce ambiguous tool selection. Test for MECE violations systematically — overlap and gaps in your tool definitions are the most common source of selection errors at scale.

Treat tool integration as a first-class engineering discipline. Every new tool is context the agent did not have yesterday. Every missing tool is a hallucination waiting to happen.

Other considerations

Tool depth vs. tool breadth. Connecting to a billing system is breadth. Exposing line items rather than invoice totals is depth. Both matter independently. An integration that exposes only top-level objects will still produce hallucinated answers on detail-level questions. When evaluating the tool surface, ask two questions: can the agent reach this system at all (breadth), and can it reach the specific data it needs within that system (depth)?

Freshness architecture. Different data sources change at different rates, and the tool layer needs to encode this. A product catalog changes weekly. An order status changes hourly. An account balance changes in real time. An agent that caches a real-time data source at an hourly cadence will produce stale answers during the intervals — and stale answers that look current are worse than missing answers, because the user has no signal that the information is outdated.
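One way to encode change rates in the tool layer is a per-source time-to-live, with the age of each value returned alongside it so staleness is never invisible. The TTL values follow the examples above; the class name, source names, and the injectable clock are assumptions for this sketch.

```python
import time
from typing import Any, Callable

# Seconds a cached value may be reused; 0 means always refetch.
TTL_SECONDS = {
    "product_catalog": 7 * 24 * 3600,  # changes weekly
    "order_status": 3600,              # changes hourly
    "account_balance": 0,              # changes in real time
}


class FreshnessCache:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[tuple[str, str], tuple[float, Any]] = {}

    def get(self, source: str, key: str, fetch: Callable[[str], Any]) -> dict:
        """Return the value plus its age, refetching once the TTL has passed."""
        now = self._clock()
        cached = self._entries.get((source, key))
        if cached is not None and now - cached[0] < TTL_SECONDS[source]:
            fetched_at, value = cached
        else:
            fetched_at, value = now, fetch(key)
            self._entries[(source, key)] = (fetched_at, value)
        return {"value": value, "age_seconds": now - fetched_at}
```

Passing `age_seconds` through to the prompt lets the agent say "as of an hour ago" instead of presenting a cached value as current.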

Conflicting context across tools. The same entity often means different things in different systems. "Customer" in the CRM, the billing system, and the support tool may reference three different objects with three different identifiers. An agent that pulls from all three and merges them will produce a confident, wrong synthesis. Entity resolution at the integration layer (or at minimum, designating a system of record per entity type) mitigates this. The agent should know which source is authoritative when sources disagree.
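The minimum version, a designated system of record per field, can be sketched as a lookup table plus a resolver that surfaces disagreements instead of merging them. The field names, source names, and values are hypothetical.

```python
from typing import Any

# Hypothetical mapping from field to its authoritative system.
SYSTEM_OF_RECORD = {
    "contract_value": "billing",
    "contact_email": "crm",
    "open_tickets": "support",
}


def resolve(field_name: str, candidates: dict[str, Any]) -> dict[str, Any]:
    """Pick the authoritative value and report sources that disagree.

    candidates maps source name -> value reported by that source.
    """
    authoritative = SYSTEM_OF_RECORD.get(field_name)
    if authoritative not in candidates:
        # No authoritative answer available: return nothing rather than guess.
        return {"value": None, "source": None, "conflicts": dict(candidates)}
    value = candidates[authoritative]
    conflicts = {
        source: other for source, other in candidates.items()
        if source != authoritative and other != value
    }
    return {"value": value, "source": authoritative, "conflicts": conflicts}


print(resolve("contract_value", {"crm": 120000, "billing": 118500, "support": 118500}))
```

Returning the conflicts alongside the chosen value lets the agent mention the discrepancy when it matters, rather than silently averaging three systems into one wrong answer.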

Closing

Models will keep getting better. That is someone else's problem, and they are working on it.

The context architecture — what information the agent can access, how it discovers tools, how it distinguishes primary data from commentary, and how it assembles the right information at inference time — is the practitioner's problem.

That is where production quality lives.

References

[1] Raschka, S. (2025). "The State of LLMs 2025: Progress, Progress, and Predictions." Sebastian Raschka's AI Magazine, December 2025. https://magazine.sebastianraschka.com/p/state-of-llms-2025
[2] Lindenbauer, D. et al. (2025). "Cutting Through the Noise: Smarter Context Management for LLM-Powered Agents." Presented at Deep Learning 4 Code Workshop, NeurIPS 2025, San Diego, December 2025. https://blog.jetbrains.com/research/2025/12/efficient-context-management/
[3] "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval." arXiv, October 2025. https://arxiv.org/html/2510.0538
[4] Sharma, M. et al. (2025). "Towards Understanding Sycophancy in Language Models." Published 2025; cited in Institute for Public Relations, "The Hidden Risk of AI Sycophancy in the Workplace," October 2025. https://instituteforpr.org/the-hidden-risk-of-ai-sycophancy-in-the-workplace/
[5] OpenAI. (2025). ChatGPT-4o model rollback due to sycophantic behavior, April 2025. Reported widely; referenced in Sharma et al. and Perez et al.
[6] "When helpfulness backfires: LLMs and the risk of false medical information due to sycophantic behavior." npj Digital Medicine, Volume 8, Article 605, October 2025. https://www.nature.com/articles/s41746-025-02008-z
[7] Wu, B. et al. (2025). "Introducing Advanced Tool Use on the Claude Developer Platform." Anthropic Engineering Blog, November 2025. https://www.anthropic.com/engineering/advanced-tool-use
[8] Independent evaluation of Anthropic Tool Search Tool accuracy across 4,027 tools. Arcade.dev, 2025. https://blog.arcade.dev/anthropic-tool-search-claude-mcp-runtime