2026-04-13 · Ryan Bolden · Part of: The 11 Things That Will Break Your AI in Production

The prompt cliff nobody warns you about

For three months, our voice agent worked beautifully. Patients called, the agent answered, appointments got scheduled, questions got answered, emergencies got routed to the on-call provider. The team was focused on other things. When an edge case appeared — a patient asking about something the agent handled awkwardly — someone would add a line to the prompt. "When asked about X, respond with Y." Small additions. Reasonable additions. Each one made the agent handle that specific case better.

The prompt grew from 10,000 characters to 38,000 characters. Nobody tracked this. Nobody had a policy about prompt size. It grew the way codebases grow — one reasonable commit at a time, until the total exceeds what any individual addition would suggest.

Then one day — not gradually, not with warning signs — the agent broke. Tool calls started failing. Instructions were ignored. The agent began generating responses that had no relationship to the conversation it was having. A patient asked about rescheduling and the agent responded with information about insurance verification. Another patient got a response in a language neither side of the call had been speaking.

No code had changed. No model had been updated. No infrastructure had been modified. The prompt had simply crossed a threshold.

This is the prompt cliff. It is not in any documentation. It is not in any vendor's best practices guide. But every team that has shipped a production AI agent with an evolving prompt will hit it.

Here is what happens at the architectural level. Large language models process prompts by distributing attention across all tokens. When the prompt is short, the model can give meaningful attention to every instruction. When the prompt is long, the model's attention becomes diluted. Instructions compete with each other. The model resolves conflicts by choosing the statistically most likely response given the full context — which, with 38,000 characters of sometimes-contradictory instructions, is essentially random.

The failure mode is not graceful degradation. It is a phase transition. Below the threshold, the model follows instructions reliably. Above the threshold, the model's behavior becomes unpredictable. The cliff is sharp because attention is not a linear resource — there is a point where the model's ability to prioritize collapses, and when it collapses, everything breaks at once.
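The dilution effect can be illustrated with a toy numerical sketch. This is not a model of any specific LLM's attention mechanism — it just shows how, under a softmax, the weight on one critical instruction collapses as it competes with more and more filler of nearly equal relevance:

```python
import math

def softmax(scores):
    """Standard softmax: exponentiate and normalize to a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def critical_attention(n_filler, critical_score=2.0, filler_score=1.0):
    """Toy model: one critical instruction competing with n_filler tokens
    of slightly lower relevance. Returns the attention mass the critical
    instruction receives after normalization."""
    weights = softmax([critical_score] + [filler_score] * n_filler)
    return weights[0]

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} competing tokens -> {critical_attention(n):.4f}")
```

The critical instruction's share falls by roughly an order of magnitude for every order of magnitude of filler: even a clearly higher-scored instruction gets a vanishing slice of attention once the prompt is large enough.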

The fix was radical reduction. We cut the prompt from 38,000 characters to under 8,000. Not by removing features — by relocating them. Every piece of static knowledge — office hours, provider lists, service descriptions, insurance information — moved out of the prompt and into tool responses. The model no longer carries this information in its context window. It asks for it when it needs it.
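The "relocate, don't remove" pattern can be sketched as follows. All names here (`TOOLS`, `call_tool`, the sample data) are illustrative, not our production API — the point is only that static knowledge lives behind tools, while the system prompt shrinks to role and policy:

```python
# Static knowledge lives in tool handlers, not in the prompt.
# The agent retrieves it on demand instead of carrying it every turn.
TOOLS = {
    "get_office_hours": lambda: {"mon-fri": "8am-5pm", "sat": "9am-12pm"},
    "list_providers": lambda: ["Dr. Patel", "Dr. Nguyen"],
}

SYSTEM_PROMPT = (
    "You are a scheduling assistant. When you need office hours, "
    "provider lists, or insurance details, call the matching tool "
    "instead of answering from memory."
)  # A few hundred characters of role and policy — not 38,000 of facts.

def call_tool(name):
    """Dispatch a model-requested tool call and return its payload."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name]()
```

The design choice is that the prompt describes *when* to reach for knowledge, and the tool layer owns the knowledge itself — so updating office hours is a data change, not a prompt edit.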

The result: the agent handles more complexity than before, not less. It knows everything it knew at 38,000 characters. But instead of holding all of that knowledge simultaneously and trying to prioritize across it, it retrieves the specific information relevant to the current conversation turn.

This is the counterintuitive lesson. Making the prompt shorter made the agent smarter. Not because shorter prompts are inherently better, but because shorter prompts let the model focus its attention on what matters right now, and tools let it retrieve what it needs when it needs it.

We now have a strict prompt budget. Every addition to the prompt must justify its presence in the attention window. If it can be a tool response instead, it becomes a tool response. If it can be a verification check instead of an instruction, it becomes a verification check. The prompt is the most expensive real estate in the system. Every token in it competes with every other token for the model's attention. We treat it accordingly.
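A budget like this is easy to enforce mechanically. A minimal sketch of a gate that could run in CI, assuming the 8,000-character ceiling described above (the function name and threshold are illustrative):

```python
# Prompt budget gate: fail the build if the assembled prompt exceeds
# the budget, so growth is a deliberate decision rather than drift.
PROMPT_BUDGET_CHARS = 8_000

def check_prompt_budget(prompt: str, budget: int = PROMPT_BUDGET_CHARS):
    """Return (ok, usage) so a CI step can report usage and fail over-budget builds."""
    usage = len(prompt)
    return usage <= budget, usage

ok, used = check_prompt_budget("You are a scheduling assistant...")
print(f"prompt uses {used}/{PROMPT_BUDGET_CHARS} chars, ok={ok}")
```

Counting characters is the crudest possible proxy — a token-based count against the model's actual tokenizer would be tighter — but even this catches the silent, months-long growth that caused the cliff.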

If your AI agent's prompt has been growing for months and nobody is tracking the size — or if you have noticed the agent getting "dumber" without any obvious cause — you may be approaching the cliff. The fix is not to tune the prompt. The fix is to fundamentally rethink what belongs in the prompt versus what belongs in the tool layer.

This is one piece of a larger framework we built and operate in production. The full picture — and how it applies to your business — is in the playbook.

We specialize in healthcare because it is the hardest vertical — strict HIPAA regulation, PHI handling, BAA chains, and zero tolerance for failure. If we can build it for healthcare, we can build it for any industry. We work across verticals.

Written by Ryan Bolden · Founder, Riscent · ryan@riscent.com