How an ambient AI clinical scribe actually works — and what the marketing leaves out

Every AI-scribe vendor's pitch starts at the output: a beautifully structured SOAP note appearing on your screen seconds after the consult. Almost none of them explain what happens between the microphone and the note. Here's the actual pipeline.

Stage one: speech-to-text

The microphone picks up audio. A speech-to-text model — usually Whisper or a Whisper derivative, sometimes a clinical-trained variant — converts it to text. Quality varies massively by accent, ambient noise, and the model's training data. Australian-accent transcription was meaningfully worse than US-accent transcription until about 18 months ago. It's now roughly equivalent.
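One common way to put a number on "quality varies" is word error rate (WER): the word-level edit distance between a human reference transcript and the model's output, divided by the reference length. The sketch below is a minimal illustration, not any vendor's evaluation method; the function name and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of six.
wer = word_error_rate(
    "patient reports chest pain on exertion",
    "patient reports chest pain on exhaustion",
)
```

Comparing WER on recordings grouped by accent or noise level is one way to check whether a given scribe performs equally well across a clinic's actual patient population.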

Stage two: speaker diarisation

The text is segmented by speaker — typically clinician vs patient. This is harder than it sounds in a small room with one microphone. Most failures we see in clinical AI happen here: a long patient utterance gets attributed to the clinician, or a clinician's note-to-self gets dropped into the patient's history.
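To make the attribution problem concrete, here is a minimal sketch of one way to label transcript segments: give each segment to the speaker whose diarisation turns overlap it most, and mark it "uncertain" when no speaker clearly dominates. The `Segment` and `Turn` types, the `assign_speakers` function, and the 0.6 threshold are all assumptions made for illustration, not a description of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float
    text: str


@dataclass
class Turn:
    start: float  # seconds
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, min_share=0.6):
    """Label each segment with its dominant speaker, or 'uncertain'/'unknown'."""
    labelled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        duration = seg.end - seg.start
        if not totals or duration <= 0:
            labelled.append((seg.text, "unknown"))
            continue
        speaker, best = max(totals.items(), key=lambda kv: kv[1])
        # Assumed threshold: below it, do not guess who was speaking.
        label = speaker if best / duration >= min_share else "uncertain"
        labelled.append((seg.text, label))
    return labelled


segments = [Segment(0.0, 4.0, "Any chest pain?"), Segment(4.0, 10.0, "Only when I climb stairs")]
turns = [Turn(0.0, 4.2, "clinician"), Turn(4.2, 7.0, "patient"), Turn(7.0, 10.0, "clinician")]
labels = assign_speakers(segments, turns)
```

In this made-up example the second segment is split almost evenly between the two speakers, so it comes back as "uncertain" rather than being silently attributed to the clinician. Surfacing that uncertainty, instead of forcing a label, is the design choice the sketch is meant to show.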

Stage three: structured extraction

An LLM (usually GPT-4-class or Claude-class) reads the diarised transcript and extracts the structured note. Subjective. Objective. Assessment. Plan. Differential. Recommendations. This is where the model can hallucinate clinical content. Retrieval-augmented generation reduces the problem substantially but does not eliminate it.
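One simple guard against hallucinated content, sketched here under stated assumptions, is to require every extracted item to carry a verbatim quote from the transcript and to flag any item whose quote cannot be found. This is a swapped-in illustration of a grounding check, not retrieval-augmented generation and not any vendor's method; `NoteItem`, `ungrounded_items`, and the sample transcript are hypothetical.

```python
from dataclasses import dataclass

# Section names taken from the list in the text above.
SECTIONS = (
    "subjective", "objective", "assessment",
    "plan", "differential", "recommendations",
)


@dataclass
class NoteItem:
    section: str
    text: str
    evidence: str  # transcript quote the model claims supports this item


def normalise(s: str) -> str:
    """Lower-case and collapse whitespace so quotes match across line breaks."""
    return " ".join(s.lower().split())


def ungrounded_items(items, transcript: str):
    """Return the items whose evidence quote does not appear in the transcript."""
    haystack = normalise(transcript)
    flagged = []
    for item in items:
        if item.section not in SECTIONS:
            raise ValueError(f"unknown section: {item.section!r}")
        quote = normalise(item.evidence)
        if not quote or quote not in haystack:
            flagged.append(item)
    return flagged


transcript = (
    "Patient: I've had a dry cough for two weeks.\n"
    "Clinician: Any fever?\n"
    "Patient: No fever."
)
items = [
    NoteItem("subjective", "Dry cough, two weeks", "dry cough for two weeks"),
    NoteItem("subjective", "Night sweats", "night sweats every evening"),
]
flagged = ungrounded_items(items, transcript)
```

Here the "Night sweats" item is flagged because nothing in the transcript supports it. A verbatim-quote check is deliberately strict: it will also flag legitimate paraphrases, which is usually the safer direction to fail in for clinical notes.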

Stage four: clinical validation

The structured note is checked against the clinical knowledge base. Citations are inserted (NICE, RACGP, BMJ Best Practice). The note is rendered in the clinician's preferred template (SOAP, BIRP, HPI). Anything that can't be validated is flagged for clinician review.
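The split between "validated" and "flagged for review" can be sketched as follows. This is a toy under loud assumptions: the knowledge base is a plain dictionary from keyword to a placeholder source label, and matching is by substring, which says nothing about whether a statement is clinically correct. The function name, keywords, and `SOURCE-A`/`SOURCE-B` labels are invented for the example and do not refer to real citations.

```python
def validate_note(statements, knowledge_base):
    """Split statements into cited ones and ones that need clinician review.

    knowledge_base maps a lower-case keyword to a source label (hypothetical).
    """
    validated, needs_review = [], []
    for statement in statements:
        lowered = statement.lower()
        citations = sorted(
            label for keyword, label in knowledge_base.items()
            if keyword in lowered
        )
        if citations:
            joined = "; ".join(citations)
            validated.append(f"{statement} [{joined}]")
        else:
            # Nothing in the knowledge base supports it: never drop it silently.
            needs_review.append(statement)
    return validated, needs_review


knowledge_base = {
    "asthma": "SOURCE-A (placeholder)",
    "hypertension": "SOURCE-B (placeholder)",
}
statements = ["Continue current asthma inhaler", "Review in two weeks"]
validated, needs_review = validate_note(statements, knowledge_base)
```

The point of the sketch is the second list: anything the checker cannot support goes to the clinician rather than being removed or passed through unmarked.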

What the marketing leaves out

Failure modes. What happens when the patient's accent isn't well represented in the training data. What happens when two speakers talk over each other for 30 seconds. What happens when the LLM is asked to summarise a 45-minute consult on a complex patient with five conditions. Different scribes handle these failure modes differently, and that's where the meaningful differences between vendors actually live.
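As an illustration of handling one of these cases, the sketch below finds stretches where turns from different speakers overlap in time, so that crosstalk can be flagged for review instead of being transcribed as if it were clean. The tuple layout, the function name, and the one-second minimum are assumptions for the example.

```python
def crosstalk_intervals(turns, min_duration=1.0):
    """Return (start, end) spans where two different speakers overlap.

    turns: list of (start_seconds, end_seconds, speaker) tuples.
    """
    ordered = sorted(turns)  # sort by start time
    found = []
    for i, (start_a, end_a, speaker_a) in enumerate(ordered):
        for start_b, end_b, speaker_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start: no later turn can overlap this one
            if speaker_a != speaker_b:
                overlap_end = min(end_a, end_b)
                if overlap_end - start_b >= min_duration:
                    found.append((start_b, overlap_end))
    return found


turns = [
    (0.0, 10.0, "clinician"),
    (8.0, 40.0, "patient"),
    (12.0, 42.0, "clinician"),
    (50.0, 55.0, "patient"),
]
spans = crosstalk_intervals(turns)
```

In this made-up consult the function reports a short overlap from 8 to 10 seconds and a long one from 12 to 40 seconds. A scribe could mark the note sections derived from the long span as low confidence.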

Built around the failure modes, not the happy path. See how MedMETs handles them.