The 11 Things That Will Break Your AI in Production
Eleven hard lessons from shipping a HIPAA-compliant voice agent, patient portal, and CRM to production. Real failures, real fixes, no theory.
The demo worked. Production did not.
There is a pattern in healthcare AI that plays out the same way almost every time. The team builds a proof of concept. It works in the demo. The investor is impressed. The clinical partner nods. Everyone agrees: "Ship it."

Then real patients start using it. And the system breaks in ways nobody anticipated. Not because the technology is wrong, but because nobody tested it against the chaos of real human behavior.

We know this pattern because we lived it. We built a HIPAA-compliant voice agent, patient portal, and CRM that serve real patients in production. Over the course of twelve months, we documented every failure, every production incident, every 3 AM alert. Eleven of those failures were severe enough that they permanently changed how we architect systems.
The hallucination that no one catches in testing
Here is a failure mode that will not appear in any test suite built by a team that has not shipped to production. Your AI agent tells a patient their appointment is confirmed. The patient says thank you and hangs up. But the agent never called the booking tool. It hallucinated the confirmation. The patient shows up to an appointment that does not exist. In behavioral health, that patient might have been in crisis. They waited a week for that appointment.

This is not a rare edge case. It happened to us. And we discovered that the fix is not better prompting, not a longer system message, not a higher-quality model. The fix is structural, and it lives at a layer most teams have not thought about yet: the seam between what the model says and what its tools actually did. We learned how to make natural, humanlike agent behavior actually work. It does not come from long prompts or prompt engineering alone. It comes from knowing how to develop and orchestrate tools, which, done right, radically reduces prompt token usage: models perform better with dynamic retrieval than with static knowledge baked into instructions.
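The exact fix lives in the playbook, but the shape of a structural guard is sketchable. Here is a minimal, hypothetical Python sketch (ToolLedger, CLAIM_PATTERNS, and verify_reply are illustrative names, not our production implementation): the agent's reply is checked against a ledger of tool calls that actually succeeded, and any claimed outcome without a matching call is blocked before it reaches the patient.

```python
# Hypothetical sketch: verify claimed outcomes against actual tool calls
# before speaking. None of these names come from the production system.
import re
from dataclasses import dataclass, field

# Outcomes the agent may only claim if the matching tool call succeeded.
CLAIM_PATTERNS = {
    "book_appointment": re.compile(
        r"\b(appointment|visit)\b.*\b(confirmed|booked)\b", re.I
    ),
}

@dataclass
class ToolLedger:
    """Records every tool call that actually ran and succeeded this turn."""
    succeeded: set = field(default_factory=set)

    def record(self, tool_name: str, ok: bool) -> None:
        if ok:
            self.succeeded.add(tool_name)

def verify_reply(reply: str, ledger: ToolLedger) -> str:
    """Pass the reply through only if every claimed outcome is backed by a tool call."""
    for tool, pattern in CLAIM_PATTERNS.items():
        if pattern.search(reply) and tool not in ledger.succeeded:
            # Hallucinated confirmation: substitute a safe response and
            # re-drive the agent toward actually calling the tool.
            return ("Let me double-check that booking went through. "
                    "One moment while I confirm it in the system.")
    return reply
```

The guard sits between the model and text-to-speech, so a hallucinated confirmation dies in middleware instead of in a patient's ear.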
Your test suite is lying to you
We had a 97.5% pass rate on our tool-calling test suite. We were confident. First real multi-turn conversation with a patient: the agent hallucinated. The reason is subtle and nearly universal. Single-turn tests verify that the model can call a tool when asked. They do not verify that the model can maintain context across six turns of conversation, recover when the patient changes their mind mid-sentence, handle an interruption, or avoid repeating an action it already performed. Multi-turn testing is a different discipline entirely. Most teams are not doing it. We know because we were not doing it — until production taught us the difference at the cost of patient trust.
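To make the difference concrete, here is a hedged sketch of what a multi-turn test looks like. The `agent` fixture, `run_conversation` helper, and `convo` record are hypothetical harness pieces, not a published framework; the point is the assertions, which no single-turn test can make.

```python
# Hypothetical multi-turn test: run_conversation drives the real agent
# loop (model + tools) through scripted patient turns.
def test_midstream_reschedule_books_exactly_once(agent):
    convo = run_conversation(agent, turns=[
        "Hi, I need to book an appointment with Dr. Alvarez.",
        "Thursday at 2pm works.",
        "Actually, wait. Can we do Friday morning instead?",
        "Yes, Friday at 9am. Please confirm.",
    ])
    bookings = [c for c in convo.tool_calls if c.name == "book_appointment"]

    # One booking, at the slot the patient finally settled on.
    assert len(bookings) == 1, "agent must not double-book after a change of mind"
    assert "friday" in bookings[0].args["slot"].lower()

    # And the agent must never confirm in prose what it never did in tools.
    if "confirmed" in convo.final_reply.lower():
        assert bookings, "agent confirmed a booking it never made"
```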
The prompt cliff nobody warns you about
Over three months our voice agent prompt grew from 10,000 to 38,000 characters. Small additions. A new instruction here. A guardrail there. Another edge case. Performance did not degrade gradually. It fell off a cliff.

One day the agent was fine. The next day, without a single code change, tool calls started failing, instructions were ignored, and the agent began generating responses that had no relationship to the conversation. There is a threshold in prompt size where large language models do not get slightly worse. They break. We learned exactly where that threshold is, what causes it, and how to architect a system that stays well below it while handling more complexity, not less. The answer is counterintuitive: it involves making prompts dramatically shorter, not longer.
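One cheap defense against the slow creep is mechanical: put the prompt under a budget that CI enforces, so the thirty-eighth small addition fails the build instead of the agent. A sketch, with an illustrative 12,000-character budget and an assumed prompts/ directory; our measured threshold is in the playbook, not here.

```python
# Hypothetical CI guard: the budget and directory layout are assumptions.
import pathlib

PROMPT_BUDGET_CHARS = 12_000          # illustrative budget, not the measured cliff
PROMPT_DIR = pathlib.Path("prompts")  # assumed location of system prompts

def test_system_prompts_stay_under_budget():
    for prompt_file in sorted(PROMPT_DIR.glob("*.txt")):
        size = len(prompt_file.read_text(encoding="utf-8"))
        assert size <= PROMPT_BUDGET_CHARS, (
            f"{prompt_file.name} is {size:,} chars, over the "
            f"{PROMPT_BUDGET_CHARS:,} budget. Move static knowledge into "
            "a retrieval tool instead of the prompt."
        )
```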
The database write that poisoned every patient
A single INSERT statement with a missing column poisoned a shared database session. Every subsequent request, for every patient, across every tenant, used the corrupted connection. The platform went down for hours. This is not a hypothetical. It is a known failure pattern in multi-tenant systems that share ORM sessions across requests, and it is invisible in development because development does not run shared sessions under concurrent load. If you are building a multi-tenant healthcare platform and you have not explicitly designed your session management for failure isolation, you are sitting on a ticking time bomb. We have the architecture pattern that prevents it. We also have the scar tissue from the night it went off.
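The article does not name its stack, but the pattern is language-agnostic; here it is sketched with SQLAlchemy, where shared-session poisoning is a well-documented trap. Every request gets its own session, and a failed write rolls back before the connection returns to the pool.

```python
# Per-request session isolation with SQLAlchemy (the ORM is an assumption
# for illustration). The connection URL is a placeholder.
from contextlib import contextmanager
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(
    "postgresql+psycopg2://app@db/platform",  # placeholder URL
    pool_pre_ping=True,  # evict dead connections before handing them out
)
SessionLocal = sessionmaker(bind=engine, autoflush=False)

@contextmanager
def request_session():
    """One session per request: never module-level, never shared."""
    session = SessionLocal()
    try:
        yield session
        session.commit()
    except Exception:
        # A failed INSERT poisons only this session, not the next request's.
        session.rollback()
        raise
    finally:
        session.close()  # return a clean connection to the pool
```

The anti-pattern is a module-level session shared by every request: one bad flush, and every later query raises on the same corrupted transaction.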
When your security is a checkbox, not an architecture
"We signed a BAA with OpenAI." That is not HIPAA compliance. That is one document in a chain that requires row-level security, audit logging, encryption at rest, minimum-necessary data flows, and infrastructure-level access controls. We have seen systems where the AI agent can access any patient's records regardless of who is asking. Systems where the system prompt — which contains tenant-specific configuration — can be extracted with a single prompt injection. Systems where there is no audit trail of what the AI said to a patient. Each of these is a HIPAA violation. Each of these was a failure we encountered, diagnosed, and built permanent prevention for.
The uncomfortable math on AI voice costs
Voice AI pricing looks simple until you do the math at scale. A three-minute call at current per-second rates costs more than most teams budget for. Multiply by 1,700 calls a month. Now add failed calls that retry, long holds, and bilingual conversations that run longer than monolingual ones. We have run the real numbers on production voice costs: not projections, not estimates, but actual invoices across thousands of real patient calls. The unit economics work, but only if you architect for them from day one. Most teams discover the cost problem after they have locked into a pricing model that cannot absorb it.
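The shape of that math, as a sketch. Every rate and multiplier below is a placeholder chosen to illustrate the model, not a number from our invoices; only the 1,700 calls a month comes from the text above.

```python
# Back-of-envelope cost model for production voice AI. All rates are
# placeholder assumptions, not actual invoice data.
def monthly_voice_cost(
    calls_per_month: int,
    avg_call_minutes: float,
    rate_per_minute: float,           # blended STT + LLM + TTS + telephony rate
    retry_fraction: float = 0.06,     # assumed share of failed calls that redial
    bilingual_fraction: float = 0.25, # assumed share of bilingual calls
    bilingual_overrun: float = 1.35,  # assumed: bilingual calls run ~35% longer
) -> float:
    base_minutes = calls_per_month * avg_call_minutes
    retry_minutes = base_minutes * retry_fraction
    bilingual_extra = base_minutes * bilingual_fraction * (bilingual_overrun - 1)
    return (base_minutes + retry_minutes + bilingual_extra) * rate_per_minute

# Example: 1,700 three-minute calls at an assumed $0.10/minute blended rate.
print(f"${monthly_voice_cost(1_700, 3.0, 0.10):,.2f} / month")
```

Run it against your own rate card before you sign anything: the multipliers, not the headline per-minute rate, are what break budgets.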
88% of AI proofs-of-concept never reach production (IDC, 2025). 95% of healthcare AI pilots fail to deliver measurable ROI (MIT NANDA, 2025). These eleven lessons are why our system is in the 5% that works.

We are the only consulting practice that has shipped a full HIPAA-compliant healthcare AI platform to production, solo, and documented every failure along the way. The full playbook with exact fixes is how we ensure our clients skip the failures and go straight to production. Contact us.
We specialize in healthcare — the hardest vertical for AI, with HIPAA regulation, PHI handling, and zero tolerance for error. If we can ship it in healthcare, we can ship it anywhere. We work across industries.
We reply within 24 hours. No pitch deck. No discovery phase. Just whether we can help.