takes a five-minute exchange and returns eight clean sections. Decisions. Action items. Risks. Open questions. Every section reads like it was written by somebody who was paying attention.
Read the underlying transcript, though, and you find that two of those sections were inferred from a single ambiguous sentence, one was invented entirely, and three were pattern-matched from the model's prior on what a meeting summary should contain. Confident, formatted, structurally indistinguishable from a summary of a meeting where those things actually happened.
This is not a hallucination problem in the usual sense. The model is not making up a fact about the world. It is making up a fact about the meeting. And the failure mode is not visible in the output. It is just confident-sounding text that the reader cannot easily verify against the source.
There is a name for this failure mode in another field, and it is older than language models. It is what happens when you do estimation without identification.
This text just isn’t a brand new summarization benchmark. It’s an argument for a design sample that I’ve not seen handled because the central design constraint in AI engineering literature: deal with LLM-generated summaries as structured claims over a supply, require every declare to declare its help class, and constrain overview levels to allow them to solely weaken unsupported claims quite than make the output smoother. I’ll stroll by what that appears like in apply, what it produces, and the place it breaks.
The missing step
Causal inference is the analytical tradition that formalizes the difference between identifying a quantity and estimating one. Identification is the argument that the data you have can support the claim you want to make. Estimation is the procedure that produces a number once identification is settled. The order is not negotiable. You cannot estimate a treatment effect you have not first argued is identifiable from your observational data, because the resulting number is meaningless. It looks like an effect. It is not an effect.
Practitioners who work in observational settings spend a substantial fraction of their time on identification. They draw causal graphs. They argue about confounders. They distinguish between what the data can support and what the data cannot. The estimation step, when it finally comes, is often the easy part.
Now consider what an LLM summarizer does. It receives a transcript. It produces structured claims about the content of that transcript: decisions made, commitments accepted, risks raised, next steps assigned. Each claim is, in a real sense, an estimate of a latent quantity. The decision was made or it was not. The commitment was accepted or it was not. The summary is asserting a value for each of those quantities.
There is no identification step. The model does not ask whether the transcript contains enough evidence to support the claim. It produces the claim because the format requires one.
LLM summarization behaves like observational analysis, but it is typically deployed without anything resembling an identification step.
The AI engineering literature has not been silent on the underlying problem. Hallucination detection, calibrated uncertainty, selective prediction and abstention, RAG grounding, citation verification, factual consistency, and claim verification: each of these is a serious line of work, and each addresses a real layer of the failure. What they have in common is that they treat fabrication as a model behavior to be measured, scored, or suppressed after the fact.
Identification is a different layer. It does not score the output for trustworthiness. It changes what the model is allowed to say in the first place by requiring every claim to declare what it is and where it came from. The two layers are complementary. A pipeline that does identification well still benefits from calibration and grounding work downstream. A pipeline that does only the downstream work is filtering output that should never have been produced in the form it was produced.
What identification looks like for a transcript
Identification in observational data is a question about what the data can support. Identification for a transcript is the same question, narrowed to a specific source. Given this transcript, what can be observed directly, what can be inferred with stated assumptions, and what cannot be supported at all?
That is the whole move. Every claim a summarizer produces should declare which of those three categories it belongs to. Observed claims point to a specific span of the transcript and assert nothing beyond what that span says. Inferred claims declare the assumption being made and the evidence the inference is bridging. Recommendations declare that they are the model's suggestion, not the participants' decision.

A summarizer that cannot place a claim into one of those categories has no business producing the claim. The right output in that case is not a smoother claim. It is no claim.
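To make the three categories concrete, here is a minimal sketch of what a claim object might look like. The `Claim` and `Support` types, field names, and `is_identifiable` check are illustrative assumptions, not the actual schema from the repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Support(Enum):
    OBSERVED = "observed"              # points at a specific transcript span
    INFERRED = "inferred"              # bridges evidence with a stated assumption
    RECOMMENDATION = "recommendation"  # the model's suggestion, not the participants' decision

@dataclass
class Claim:
    text: str
    support: Support
    evidence_span: Optional[Tuple[int, int]] = None  # character offsets into the transcript
    assumption: Optional[str] = None                 # required when support is INFERRED

    def is_identifiable(self) -> bool:
        """A claim must carry the evidence its category demands."""
        if self.support is Support.OBSERVED:
            return self.evidence_span is not None
        if self.support is Support.INFERRED:
            return self.evidence_span is not None and self.assumption is not None
        return True  # a recommendation only needs to declare itself as one
```

The point of the structure is that "no claim" becomes a checkable condition: a claim that cannot satisfy `is_identifiable` is simply not emitted.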
This is uncomfortable for the consumer of summaries, because it means many sections will be empty when the underlying conversation was thin. That discomfort is the point. It is information. It tells the reader that the meeting did not, in fact, produce eight sections of substance, regardless of what the summarizer wanted to write.
A pipeline that enforces the discipline
The architecture follows from the framing. Three LLM stages and a deterministic renderer.

Image by Author
The first stage extracts structured facts from the transcript. Speaker turns, explicit commitments, explicit decisions, explicit quantities. This stage is deliberately conservative. It is allowed to miss things. It is not allowed to invent them.
The second stage synthesizes those facts into claim objects across eight sections. Each claim carries a label: observed, inferred, or recommendation. Each claim carries a pointer to the evidence in the extracted facts. Synthesis is where the analytical work happens, and it is also where the model is most likely to drift.
The third stage audits. This is the stage that does the identification work, and the constraint on it is the part of the design that matters most.
The audit stage cannot rewrite the analysis into something smoother. It cannot add a better-sounding recommendation. It cannot invent missing context.

It is given a bounded set of operations and forbidden from doing anything else. It can delete a claim. It can downgrade a claim from observed to inferred, or from inferred to recommendation. It can move a claim to a more appropriate section. It can replace a claim with an explicit insufficient-evidence placeholder. It can collapse an entire section when nothing in it survives review.

Anything not on this list is forbidden, including writing better claims.
Image by Author
The replace_with_insufficient_evidence operation deserves its own line. It is the system literally typing a placeholder into the output where a confident claim used to be. That is identification work made operational. The reader sees, in prose, exactly where the synthesis stage produced a claim the source could not support.
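A sketch of what enforcing that bounded operation set might look like, assuming claims are plain dicts. The operation names mirror the ones described above, but the function and its signature are illustrative, not the repository's actual API.

```python
# Illustrative whitelist for the audit stage: every allowed operation
# weakens, moves, or removes a claim -- none can strengthen or rewrite one.
WEAKEN = {"observed": "inferred", "inferred": "recommendation"}
ALLOWED_OPS = {"delete", "downgrade", "move_section",
               "replace_with_insufficient_evidence", "collapse_section"}

def apply_audit_op(claims, op, idx=None, section=None):
    """Apply one audit operation; anything outside the whitelist raises."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"forbidden audit operation: {op}")
    claims = [dict(c) for c in claims]  # never mutate the synthesis output in place
    if op == "delete":
        del claims[idx]
    elif op == "downgrade":
        claims[idx]["support"] = WEAKEN[claims[idx]["support"]]  # one step weaker only
    elif op == "move_section":
        claims[idx]["section"] = section
    elif op == "replace_with_insufficient_evidence":
        claims[idx] = {"text": "[insufficient evidence in transcript]",
                       "support": "placeholder",
                       "section": claims[idx]["section"]}
    elif op == "collapse_section":
        claims = [c for c in claims if c["section"] != section]
    return claims
```

The useful property is that the constraint lives in code rather than in the prompt: an audit model that tries to "improve" a claim has no operation through which to do it.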
Why the asymmetry matters. A reviewer that is allowed to improve the analysis becomes another source of the same problem the system is trying to solve. A reviewer that is only allowed to weaken or remove can only fail in one direction: by being too cautious. That is a tolerable failure mode. The opposite is not.
What the design produces, and what it refuses to produce
This is not a benchmark. It is a small fixture-based stress test designed to check whether the architecture produces the behavior it was built to produce. Three transcripts are not enough to make general claims about LLM summarization. They are enough to check whether a specific design choice has the consequences the design predicted.
The fixtures are: a decision meeting in which a pricing model was chosen among three real alternatives, a working session that surfaced a measurement problem without resolving it, and a thin two-person sync that contained almost no decision content.
What did not happen. Across the three runs, the pipeline produced zero fabricated commitments and zero ungrounded quantities. That is what the architecture is designed to make harder. A claim cannot survive the pipeline if it does not have a pointer to evidence, and the audit stage cannot manufacture evidence to keep a claim alive. The result is not a guarantee. The deterministic renderer is the only stage that offers guarantees. Extraction, synthesis, and audit are still LLM calls and can still fail. The point is that the architecture pushes their failures toward removal rather than toward fabrication, and the fixtures are consistent with that.
What did happen. The result I find more interesting is the abstention rate.

Across the three fixture transcripts, the share of empty section slots rose from 17% to 58%.
Across all three fixtures: 0 fabricated commitments, 0 ungrounded quantities.
Image by Author
On the rich decision meeting, the pipeline left 17 percent of section slots empty or replaced with the insufficient-evidence placeholder. On the working session, the figure rose to 25 percent. On the thin sync, it reached 58 percent. The system produced roughly three and a half times as many empty sections when the input signal was thin compared to when it was rich.
That is the behavior the design is trying to produce. A summarizer that fills the same eight sections regardless of input is not summarizing. It is producing output that conforms to a template. The template is doing the work, and the model is the cosmetic finish.

A summarizer that abstains in proportion to the thinness of the input is doing something different. It is treating the transcript as a source whose content varies, and it is letting that variation show up in the output. The empty sections are not failures of the model. They are the model declining to say what the source does not support.
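The abstention rate itself is a one-line computation over the rendered sections. The section names and the example counts below are illustrative, not the fixture data.

```python
def abstention_rate(sections: dict) -> float:
    """Share of section slots left empty (or holding only the placeholder)."""
    empty = sum(1 for claims in sections.values() if not claims)
    return empty / len(sections)

# A hypothetical thin sync: seven of the eight section slots abstain.
thin_sync = {name: [] for name in ["decisions", "action_items", "risks",
                                   "open_questions", "commitments",
                                   "quantities", "context", "next_steps"]}
thin_sync["open_questions"] = ["Who owns the dashboard migration? (observed)"]
print(abstention_rate(thin_sync))  # 7 of 8 slots empty -> 0.875
```

Tracking this number per transcript is what lets you check the property that matters: abstention should rise as the input signal thins.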

Excerpts from the decision-meeting fixture, with the explicit labels surfaced inline.
Image by Author
Reading the result. The labels are not decoration. They change what the reader does with the output. An observed claim invites verification against the transcript. An inferred claim invites scrutiny of the assumption that produced it. An insufficient-evidence placeholder invites the reader to either look at the source themselves or accept that the meeting did not, in fact, produce a claim of that kind.
The objection from the consumer
There is an argument that empty sections are a usability problem. The reader expected a summary. The reader got a partial summary with explicit gaps. The reader has to do more work.
That objection deserves a direct answer. The reader who got a fluent eight-section summary of a five-minute exchange was already doing more work, just invisibly. They were going to read the summary, act on it, and at some point discover that two of the action items were never actually agreed to and one of the risks was never raised. The cost of that discovery is high. It is paid in misallocated meetings, missed commitments, and the gradual erosion of trust in the tooling.
Honest emptiness pushes the cost forward. The reader sees the gap immediately and can decide how to handle it. Open the transcript. Ask a participant. Treat the meeting as inconclusive. Each of those is a better response than acting on a confident summary generated from a confidence the source never earned.
This is the same trade observational analysts make when they refuse to report a point estimate without identification. The consumer would prefer a number. The analyst declines. The decision the consumer makes from no number is, on average, better than the decision they would have built from a number the data could not support.
Generalizing the pattern
The architecture transfers. Any LLM workflow that produces structured claims from a source can be reframed as observational analysis and given an identification layer.

Document review for legal discovery. Patient note summarization. Customer call analysis. Code review summaries. Each of these is currently deployed as a one-shot generation problem, with a model producing structured output from a source and the consumer trusting the result. Each of them has a version of the same failure mode the meeting summarizer has, and each can be made more auditable with a similar architecture: an extraction stage that is conservative about what it pulls from the source, a synthesis stage that produces labeled claims with evidence pointers, and an audit stage that is forbidden from adding or strengthening anything. The implementation and the risk profile differ across these domains. The pattern transfers. The specifics do not.
The labels and the evidence pointers are not optional features. They are the identification step made operational. A claim without a label is not identifiable. A claim without an evidence pointer cannot be audited. The audit stage's monotonic-weakening constraint is what prevents identification work from being undone by a model that wants to produce smoother output.
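That requirement can also be enforced mechanically between synthesis and audit. This gate is a sketch, assuming claims are plain dicts with hypothetical `support` and `evidence` keys; the labels mirror the observed/inferred/recommendation categories above.

```python
def identification_gate(claims):
    """Split claims into identifiable and unidentifiable.

    A claim must carry a support label, and observed or inferred claims
    must also carry an evidence pointer; otherwise the claim never
    reaches the audit stage, let alone the reader.
    """
    passed, rejected = [], []
    for c in claims:
        label = c.get("support")
        needs_pointer = label in {"observed", "inferred"}
        ok = label is not None and (not needs_pointer or c.get("evidence") is not None)
        (passed if ok else rejected).append(c)
    return passed, rejected
```

Rejections from this gate are worth logging: a synthesis stage that frequently emits unlabeled or unpointered claims is the drift the audit stage exists to contain.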
What this means for the people building these systems
Calibrated uncertainty estimates are valuable. Hallucination benchmarks are valuable. Grounding and citation work are valuable. None of them substitute for the discipline of refusing to produce a claim that the source does not support.
That discipline is missing from many LLM systems partly for cultural reasons. The field grew out of machine learning, where the goal of a model is to produce an output for every input. The notion that the right output is sometimes no output is not foreign to the literature, but it is foreign to the default disposition of a generative model trained to fill in what comes next. It is, however, native to observational analysis, where the right answer to many questions is that the data cannot support an answer.
So the techniques for making LLM analytical systems trustworthy may not come primarily from within the LLM literature. They may come from disciplines that have already worked out what it means to do honest analysis under conditions where the source is the binding constraint. Causal inference is one of those disciplines. Survey methodology is another. Forensic accounting is another.

The people who already know how to refuse to estimate without identification have an unusually good vantage point on what is wrong with current LLM analytical tooling, and what to do about it.
Causal inference taught a generation of practitioners not to estimate what they have not first identified. LLM summarizers make the same mistake, just in prose instead of numbers. The fix is not just a better model. The fix is to put back the step that observational analysis never let go of, and to enforce it with an architecture that cannot be talked out of doing the right thing.
A few closing pitfalls
- Treating the labels as cosmetic. If the labels are not enforced upstream, they are decoration. They need to be assigned at synthesis with a pointer to evidence and audited downstream against that pointer. A synthesis stage that produces a label without an evidence pointer is not doing identification work. It is producing a category that looks like identification.
- Letting the audit stage be helpful. This is the easy mistake. A reviewer that can add a recommendation, supply missing context, or rewrite a sloppy claim feels useful. It is also exactly the failure mode the synthesis stage already has, just dressed up as quality control. Constrain the audit to a fixed set of weakening operations. Anything else is the system arguing with itself.
- Confusing abstention with low quality. A summarizer that returns mostly empty sections on a thin meeting is not failing. A summarizer that returns confident eight-section output on the same thin meeting is failing, just invisibly. The way to evaluate these systems is not summary completeness; it is whether the abstention rate scales with the signal in the source.
- Reasoning from three fixtures to general claims. Three transcripts are enough to check whether a design choice produces the behavior it was built to produce. They are not enough to make claims about LLM summarization in general. If you build a version of this, you will want your own fixture set and your own definition of what counts as the right level of abstention for your use case.
The asymmetry that matters
A pipeline that can only weaken its outputs has a single failure mode: it can be too cautious. A pipeline that can strengthen its outputs has every failure mode the literature has been documenting for the last several years.

Choosing the first kind over the second is not a technical decision. It is a decision about what the system is for. If the system is for producing fluent text, the second kind wins on every metric. If the system is for producing claims a reader can audit before acting, only the first kind is defensible.

Most current tooling is built for the first purpose and deployed as if it were built for the second. Treating that gap as a methodological problem rather than a model-quality problem is what changes the available remedies.
Repository, evaluation harness, and example outputs are available on GitHub. The full notebook walks one transcript through every stage and runs the eval harness across all three fixtures.
Staff Data Scientist focused on causal inference, experimentation, and decision science. I write about turning ambiguous business questions into decision-ready analysis.

