I caught my own software lying to me
Editor's note: TaloeMed now runs 12 quality gates on every note. This article walks through the original seven-gate TrustGate phase that landed first. The five additional gates that came later — entity cross-referencing, ICD-10 verification, drug normalization, documentation quality scoring, and ontological coherence — extend the same validation discipline.
A patient walks in. Bilateral nasal obstruction, headache, getting worse over two weeks. I record our conversation, and about 40 seconds after I hit stop, there's a full SOAP note on my screen. ICD-10 codes, differentials, management plan. Reads beautifully.
Except the Assessment says "deviated nasal septum" and I never examined the septum. The patient didn't mention deviation either. The AI saw "nasal obstruction" in the transcript and filled in the statistically likely cause. It invented a clinical finding and wrote it up like I'd dictated it.
That was the moment I realised generating the note is the easy part. The hard part is making sure the note isn't making things up. Language models produce fluent text. They have no idea when they've fabricated something. In a blog post, that's an annoyance. In a clinical document, it could end a career.
One validation check doesn't cut it
When I first started testing AI-generated notes in my own OPD, about 80% were good. Usable, accurate, properly structured. The other 20% had quiet problems: a finding the patient never reported, a drug interaction the system missed, a PubMed citation that linked to a paper about a completely different condition. Three different failure modes, and a single check can't catch all of them.
So I built seven independent gates as the first phase. Every note passed through all of them before any doctor saw it. We called it TrustGate™. Six of those seven were pure code — deterministic logic, no AI involved, no per-note cost. The seventh used AI, but deliberately a different model from the one that generated the note. You don't let a student grade their own exam.
Gate 1 — Structural validation: Is the note actually complete?
Sounds basic. It isn't.
Language models truncate. Network hiccup, context fills up, and you get a detailed Subjective and Objective followed by an Assessment that says "likely sinusitis." Full stop. That's it. For a patient with two weeks of worsening headache and bilateral obstruction, that's not documentation. That's a guess someone typed in a hurry.
Without Gate 1, truncated notes slip through looking like real notes. A doctor glancing quickly might not notice the Plan section is empty. Gate 1 does notice. Every section present, every section with actual content, or the note gets flagged before it reaches the screen.
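A minimal sketch of what a structural check like this could look like. The section names, the `check_structure` function, and the five-word minimum are all assumptions made for illustration, not TrustGate's actual rules.

```python
# Hypothetical section list and threshold, chosen for illustration only.
REQUIRED_SECTIONS = ("subjective", "objective", "assessment", "plan")
MIN_WORDS = 5


def check_structure(note: dict[str, str], min_words: int = MIN_WORDS) -> list[str]:
    """Return one problem string per missing, empty, or suspiciously short section."""
    problems = []
    for section in REQUIRED_SECTIONS:
        text = note.get(section, "").strip()
        word_count = len(text.split())
        if not text:
            problems.append(f"{section}: missing or empty")
        elif word_count < min_words:
            problems.append(f"{section}: only {word_count} word(s)")
    return problems


# A truncated note: detailed S and O, a two-word Assessment, an empty Plan.
truncated_note = {
    "subjective": "Two weeks of worsening bilateral nasal obstruction with headache.",
    "objective": "Anterior rhinoscopy shows congested mucosa on both sides.",
    "assessment": "likely sinusitis.",
    "plan": "",
}
problems = check_structure(truncated_note)
```

An empty list means the note is structurally complete. Anything else gets flagged before the note reaches the screen.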
Gate 2 — Complexity linting: Does the depth match the case?
A patient with chronic rhinosinusitis, bilateral nasal polyps, and two prior FESS procedures needs a detailed note. If the Assessment comes back as two sentences, something went wrong with the generation.
Gate 2 measures section depth against case complexity. Straightforward acute otitis media? A brief Assessment is fine. Recurrent cholesteatoma with facial nerve involvement? Two sentences won't do. Without this gate, you get notes that look complete but are clinically thin for the presentation. The kind of note that looks fine until it doesn't, usually during a complication or a medicolegal query.
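One way to picture "depth against complexity" is a minimum Assessment length that grows with the case. The sketch below is an assumption throughout: the complexity signals, the weights, and the word counts are invented to show the shape of the check, not how Gate 2 actually scores a case.

```python
def required_assessment_words(diagnoses: int, prior_procedures: int, red_flags: int) -> int:
    """Hypothetical rule: a 10-word floor, plus 15 words per complexity point."""
    complexity = max(diagnoses - 1, 0) + prior_procedures + 2 * red_flags
    return 10 + 15 * complexity


def is_clinically_thin(assessment: str, *, diagnoses: int,
                       prior_procedures: int = 0, red_flags: int = 0) -> bool:
    """True when the Assessment is shorter than the case complexity calls for."""
    needed = required_assessment_words(diagnoses, prior_procedures, red_flags)
    return len(assessment.split()) < needed


# The same 12-word Assessment, judged against two different cases.
short_assessment = " ".join(["word"] * 12)
simple_case = is_clinically_thin(short_assessment, diagnoses=1)
complex_case = is_clinically_thin(short_assessment, diagnoses=2, prior_procedures=2)
```

Word count is a crude proxy for depth. It is used here only because it is deterministic and cheap, which is the point of a code-only gate.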
Gate 3 — Case agreement: Does this match what we know?
I've built a knowledge base of 376+ validated ENT cases. All Indian clinical context, Indian drug names, Indian presentations. Gate 3 checks every generated note against similar cases in that database.
Here's a real scenario: the AI writes up sudden unilateral hearing loss and pins the Assessment on cerumen impaction. Wax. Sure, wax is common. But sudden unilateral loss needs urgent workup for sensorineural causes. You can't lead with wax when the presentation screams SNHL. Gate 3 catches this because it knows what similar cases actually look like. Without it, the AI defaults to whatever is statistically most frequent, which is not the same as clinically most important.
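To make the idea concrete, here is a toy sketch of a case-agreement check. It assumes each case can be reduced to a set of presenting features plus a leading diagnosis, and it uses Jaccard overlap as the similarity measure. Both are illustrative choices, and the four cases are invented for the example rather than drawn from the real knowledge base.

```python
from collections import Counter


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two feature sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expected_diagnoses(features: set[str], cases: list[tuple[set[str], str]],
                       k: int = 3) -> Counter:
    """Count the leading diagnoses among the k most similar known cases."""
    ranked = sorted(cases, key=lambda case: jaccard(features, case[0]), reverse=True)
    return Counter(diagnosis for _, diagnosis in ranked[:k])


def disagrees(note_diagnosis: str, features: set[str],
              cases: list[tuple[set[str], str]], k: int = 3) -> bool:
    """Flag the note when its leading diagnosis appears in none of the similar cases."""
    return note_diagnosis not in expected_diagnoses(features, cases, k)


# Toy knowledge base, invented for illustration.
toy_cases = [
    ({"sudden onset", "unilateral", "hearing loss", "tinnitus"}, "sudden SNHL"),
    ({"sudden onset", "unilateral", "hearing loss"}, "sudden SNHL"),
    ({"gradual onset", "hearing loss", "ear fullness"}, "cerumen impaction"),
    ({"sudden onset", "hearing loss", "vertigo"}, "sudden SNHL"),
]
presentation = {"sudden onset", "unilateral", "hearing loss"}
wax_flagged = disagrees("cerumen impaction", presentation, toy_cases)
```

The check compares against cases that resemble this presentation, not against the most common diagnosis overall, which is the difference between "statistically frequent" and "clinically important".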
Gate 4 — Confidence scoring: How much should you trust this note?
Every note gets a ConfidenceIndex™ score from 0 to 100. The score has to be honest. A clean 10-minute consultation with good audio and clear history deserves a high number. A 90-second recording where the patient was talking to a family member while the kid was screaming in the background? That score should be low, and the doctor should know it.
Gate 4 looks at three things: how clean the transcript was, how much clinical ground the conversation covered, and how solid the reasoning chain is. Bad audio drops the score. Missing history drops it again. When you see a ConfidenceIndex™ of 45, you know to read every line carefully. When you see 92, you can review faster. Without this, every note looks equally confident, which is a lie.
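A sketch of how three inputs could combine into a single 0 to 100 score. The 40/30/30 weighting and the assumption that each input arrives as a fraction between 0 and 1 are placeholders for illustration. The actual ConfidenceIndex™ formula is not described here.

```python
# Hypothetical weights that sum to 100, so the result lands on a 0-100 scale.
WEIGHTS = {"transcript_quality": 40, "coverage": 30, "reasoning": 30}


def confidence_index(transcript_quality: float, coverage: float, reasoning: float) -> int:
    """Weighted sum of three component scores, each expected in [0, 1]."""
    parts = {
        "transcript_quality": transcript_quality,
        "coverage": coverage,
        "reasoning": reasoning,
    }
    for name, value in parts.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return round(sum(WEIGHTS[name] * value for name, value in parts.items()))


clean_consult = confidence_index(0.9, 0.9, 0.9)
noisy_consult = confidence_index(0.2, 0.9, 0.9)  # same case, bad audio
```

Whatever the real weights are, the property that matters is monotonicity: worse audio or thinner history can only pull the number down.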
Gate 5 — Error feedback loop: Your corrections teach the system
Here's what happens without a feedback loop: the AI suggests amoxicillin-clavulanate for every ear infection. You change it to ofloxacin drops because the patient has a perforation. Tomorrow, same thing. Next week, same thing. The system never learns.
Gate 5 is DocLoop™. When I edit a note — change a diagnosis, remove a fabricated finding, adjust a drug — that correction enters the system with a structured error category. Not a free-text comment. A categorised correction. If I've changed the same drug suggestion three times this week, the system starts flagging it for my review on future notes. It won't suppress the suggestion entirely (another doctor might actually want it), but it tells me: "You've overridden this before. Heads up."
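The "three times this week" rule can be sketched as a per-doctor count over a sliding window. The `Correction` record, its field names, and the category string are hypothetical. Only the threshold of three and the one-week window come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Correction:
    """One structured correction. Field names are illustrative."""
    doctor_id: str
    category: str        # e.g. "drug_changed", "finding_removed"
    original: str        # what the system suggested
    made_at: datetime


def should_flag(corrections: list[Correction], doctor_id: str, suggestion: str,
                now: datetime, threshold: int = 3,
                window: timedelta = timedelta(days=7)) -> bool:
    """True when this doctor has overridden this suggestion often enough recently."""
    recent = sum(
        1 for c in corrections
        if c.doctor_id == doctor_id
        and c.original == suggestion
        and timedelta(0) <= now - c.made_at <= window
    )
    return recent >= threshold
```

Counting per doctor is what keeps the suggestion alive for everyone else: the flag is a heads-up for the person who keeps overriding it, not a global suppression.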
Gate 6 — Citation verification: Are the references real?
This one makes me angry every time. Language models fabricate PubMed references. A plausible PMID. A plausible-sounding title. A real journal name. Completely made up.
I once saw a generated note cite a "2023 study in the Indian Journal of Otolaryngology" with a specific PMID. The paper did not exist. The PMID belonged to a dermatology paper from 2019. If that citation landed in a patient record and someone pulled it up during a medicolegal review, the doctor looks like they either fabricated evidence or didn't bother checking. Neither is good.
Gate 6 takes every PMID in the note, verifies the paper exists, and checks that it actually supports the claim being made. No verification, no citation in the final document.
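A rough sketch of the existence half of this check. The lookup is passed in as a function so the example stays self-contained. In practice it might be backed by a service such as NCBI's E-utilities, but that wiring, the function names, and the title-match rule are all assumptions. Checking whether a paper actually supports a claim is a harder problem that this sketch does not attempt.

```python
import re
from typing import Callable, Optional

# Matches "PMID: 12345678" or "PMID 12345678" (up to 8 digits).
PMID_PATTERN = re.compile(r"PMID:?\s*(\d{1,8})")


def extract_pmids(text: str) -> list[str]:
    """Pull every PMID-looking number out of the note text."""
    return PMID_PATTERN.findall(text)


def _normalise(title: str) -> str:
    return " ".join(title.lower().split())


def verify_citation(pmid: str, cited_title: str,
                    lookup_title: Callable[[str], Optional[str]]) -> bool:
    """True only if the PMID exists and its recorded title matches the cited one."""
    actual = lookup_title(pmid)   # hypothetical lookup; returns None if not found
    if actual is None:
        return False
    return _normalise(actual) == _normalise(cited_title)


def keep_verified(citations: list[tuple[str, str]],
                  lookup_title: Callable[[str], Optional[str]]) -> list[tuple[str, str]]:
    """Drop every (pmid, title) pair that fails verification."""
    return [c for c in citations if verify_citation(c[0], c[1], lookup_title)]
```

A title mismatch catches the case described above, where the PMID is real but belongs to an unrelated paper.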
Gate 7 — HalluciGuard™: Trace every claim to the transcript
This is the only one of the seven gates that uses AI, and it runs on a completely separate model from the one that wrote the note. On purpose.
HalluciGuard™ goes through every clinical statement in the note and traces it back to the original transcript. "Patient reports 2 weeks of bilateral nasal obstruction" — did the patient actually say that, or something close enough? "Examination revealed septal deviation to the left" — did the doctor mention this during the recording, or did the AI infer it from the chief complaint?
Claims that can't be traced to the transcript get flagged. Not deleted. Flagged. Maybe the AI made a reasonable inference. Maybe it invented something. The doctor sees the flag and decides. That distinction matters: the system catches potential fabrications, but it doesn't override clinical judgement.
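The flag-but-never-delete behaviour can be sketched independently of whichever model does the judging. Here the second model is represented by an injected `is_supported` function. That name, the `ClaimReview` record, and the assumption that the note has already been split into individual claims are all illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ClaimReview:
    claim: str
    flagged: bool   # True means "could not be traced to the transcript"


def trace_claims(claims: list[str], transcript: str,
                 is_supported: Callable[[str, str], bool]) -> list[ClaimReview]:
    """Review every claim against the transcript.

    `is_supported` stands in for the separate verifier model (hypothetical).
    Every claim is returned, flagged or not, so nothing is silently removed.
    """
    return [
        ClaimReview(claim=claim, flagged=not is_supported(claim, transcript))
        for claim in claims
    ]
```

The output always has the same length as the input. The doctor sees every claim, and the flags only mark where to look harder.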
Why most of those original gates didn't use AI
This was a deliberate choice. AI inference costs money, adds latency, and has its own error rate. You can't validate an AI note with more AI alone and call that safe. It's turtles all the way down.
Six gates are pure code. Pattern matching, database lookups, structural checks. They run in under 200 milliseconds, cost nothing per note, and can't hallucinate because they're not language models. They're if-statements and SQL queries. Boring, reliable, fast.
Only hallucination detection genuinely needs language understanding — you need a model to read the transcript and the note and compare them semantically. Everything else is factual. A missing SOAP section is a fact. A non-existent PMID is a fact. Code checks facts better than any model does.
You can't skip the gates
TrustGate™ runs on every single note. There's no bypass, no "quick mode," no checkbox to disable validation because you're in a hurry. The note appears on your screen only after every gate has run. And even then, every field is editable. Accept it, reject it, change one word — entirely your call.
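"No bypass" is easy to express in code: the runner takes no skip flag and never stops early. The sketch below assumes each gate is a function that takes a note and returns a list of flags. That interface, and treating a crashed gate as a flag rather than a pass, are design assumptions for illustration.

```python
from typing import Callable

# Hypothetical gate interface: note in, list of human-readable flags out.
Gate = Callable[[dict], list[str]]


def run_all_gates(note: dict, gates: dict[str, Gate]) -> dict[str, list[str]]:
    """Run every gate on the note and return a report keyed by gate name.

    There is deliberately no option to skip a gate or to stop at the first flag.
    A gate that raises is reported as a flag, so a crash never looks like a pass.
    """
    report: dict[str, list[str]] = {}
    for name, gate in gates.items():
        try:
            report[name] = gate(note)
        except Exception as exc:
            report[name] = [f"gate failed to run: {exc}"]
    return report
```

The report carries flags, not verdicts. Accepting, rejecting, or editing the note stays with the person reading it.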
The doctor always has the final say. That's not a disclaimer — that's the architecture.