Your test suite is lying to you
We had a 97.5% pass rate on our tool-calling test suite. Two hundred test cases. Function calling, parameter extraction, response formatting — all of it tested and passing. We were confident enough to ship.
In the first real multi-turn conversation with a patient, the agent hallucinated.
Not on the first turn. The first turn was perfect — the agent greeted the patient, identified their intent, and asked the right follow-up question. The second turn was fine too. By the fourth turn, the patient had changed their mind about the appointment time, and the agent — instead of updating the request — called the scheduling tool with the original time while telling the patient it had used the new time. Both the agent and the patient thought the appointment was at 3 PM. The calendar said 2 PM.
Our 97.5% pass rate meant nothing. And the reason is subtle enough that most teams will not catch it until they have the same experience.
Single-turn tests verify capability. Can the model call a tool when given a prompt? Can it extract the right parameters from user input? Can it format a response correctly? These are necessary tests. They are not sufficient tests.
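To make that concrete, a single-turn case in our suite amounted to something like the sketch below. The agent interface, tool name, and parameter names here are placeholders rather than our actual stack; the shape is the point: one prompt in, one expected tool call out, no carried state.

```python
# A minimal single-turn test: one prompt, one expected tool call.
# `agent`, `respond`, and the tool/parameter names are hypothetical
# stand-ins for whatever interface your stack exposes.

def test_extracts_appointment_time(agent):
    reply = agent.respond("I'd like to book a checkup next Tuesday at 2 PM.")

    # Capability checks: right tool, right parameters.
    assert reply.tool_call.name == "schedule_appointment"
    assert reply.tool_call.arguments["time"] == "14:00"
    # No history, no follow-up turns, no state. The test ends here.
```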
Multi-turn tests verify reliability. Can the model maintain context across six turns of conversation? Can it handle a patient who says "actually, make that 3 PM instead" on turn four without losing the provider selection from turn two? Can it recover when the patient interrupts with an unrelated question ("oh wait, do you take Blue Cross?") and then return to the scheduling flow? Can it avoid repeating an action it already performed when the conversation loops back to a similar topic?
These are fundamentally different testing challenges. Single-turn tests are stateless. Multi-turn tests require the test harness to maintain conversation context, simulate realistic patient behavior patterns — including interruptions, corrections, tangents, and emotional states — and verify that the model's internal state remains consistent across the entire interaction.
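Here is a rough sketch of what that harness has to look like, assuming a hypothetical agent object that keeps its own conversation history and exposes a log of the tool calls it has made. The names are illustrative; the structure is what matters: the harness drives the whole dialogue, then asserts on the agent's end state rather than on any single reply.

```python
from dataclasses import dataclass, field

# Hypothetical multi-turn harness. The agent interface (`respond`,
# `tool_calls`, `tool_log`) and the patient script are assumptions made
# for illustration, not a real library.

@dataclass
class Turn:
    patient_says: str
    must_not_call: set = field(default_factory=set)  # tools that must not fire on this turn

def run_conversation(agent, turns):
    for turn in turns:
        reply = agent.respond(turn.patient_says)      # agent maintains its own history
        fired = {call.name for call in reply.tool_calls}
        assert not (fired & turn.must_not_call), f"premature or repeated tool call: {fired}"
    return agent

def test_correction_on_turn_four(agent):
    run_conversation(agent, [
        Turn("Hi, I need to see Dr. Alvarez this week.",
             must_not_call={"schedule_appointment"}),  # nothing to book yet
        Turn("Thursday works."),
        Turn("2 PM, please."),
        Turn("Actually, make that 3 PM instead."),     # the correction
    ])

    # State checks, not wording checks: the booking must reflect the
    # corrected time and keep the provider chosen earlier in the dialogue.
    booked = agent.tool_log.last("schedule_appointment")
    assert booked.arguments["time"] == "15:00"
    assert booked.arguments["provider"] == "Dr. Alvarez"
```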
Most teams are not doing multi-turn testing. We know this because we were not doing it either, and we thought we were thorough. Our test suite was comprehensive by any standard metric. It just tested the wrong thing.
The gap between single-turn and multi-turn testing is where patients get hurt. A model that passes 97.5% of single-turn tests might fail 30% of multi-turn conversations in ways that are invisible to the test suite.
After our production failure, we rebuilt our entire testing approach. We now test with simulated conversations that run 8 to 12 turns and include at least two patient corrections, one interruption, and one ambiguous statement. We test with conversations in both English and Spanish. We test with conversations that start about scheduling and end up being about billing. We test with patients who are frustrated, patients who are confused, and patients who give one-word answers.
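One way to keep that matrix manageable is to encode each requirement as a scenario spec and let a simulated-patient generator fill in the dialogue. The fields below are our own illustration of that idea, not a standard schema.

```python
from dataclasses import dataclass

# Illustrative scenario spec for generated multi-turn test conversations.
# The field names and defaults are an assumption about one possible
# design, not a published format.

@dataclass
class ConversationScenario:
    language: str                  # "en" or "es"
    persona: str                   # "frustrated", "confused", "terse", ...
    starting_topic: str            # e.g. "scheduling"
    drift_topic: str | None        # e.g. "billing", or None for no topic change
    min_turns: int = 8
    max_turns: int = 12
    corrections: int = 2           # patient changes a previously given answer
    interruptions: int = 1         # unrelated question mid-flow ("do you take Blue Cross?")
    ambiguous_statements: int = 1  # e.g. "sometime in the afternoon, I guess"

SCENARIOS = [
    ConversationScenario("en", "frustrated", "scheduling", "billing"),
    ConversationScenario("es", "terse", "scheduling", None),
    ConversationScenario("en", "confused", "billing", "scheduling"),
]
```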
Our pass rate dropped from 97.5% to 76% on the first run of the new test suite. That 21.5-point gap was the risk we had been shipping with.
If you are building AI that talks to patients — or any AI that handles multi-turn interactions with real humans — and your test suite only runs single-turn evaluations, you are testing whether your car can start. You are not testing whether it can navigate traffic. Those are not the same thing, and the difference shows up the first time a real human behaves like a real human.
This is one piece of a larger framework we built and operate in production. The full picture — and how it applies to your business — is in the playbook.
We specialize in healthcare because it is the hardest vertical — strict HIPAA regulation, PHI handling, BAA chains, and zero tolerance for failure. If we can build it for healthcare, we can build it for any industry. We work across verticals.