TL;DR
- RAG retrieved the correct document. The LLM still contradicted it. That is the failure this system catches.
- Five failure patterns: numeric contradictions, fake citations, negation flips, answer drift, confident-but-ungrounded responses.
- Three healing strategies repair bad answers in place before users see them.
- No external APIs, no LLM judge, no embeddings model: pure Python under 50ms.
- 70 tests; every production failure mode I found has a named assertion.
The model was lying (why I built this)
I'm building a RAG-powered assistant for EmiTechLogic, my tech education platform. The goal is simple: a learner asks a question, the system pulls from my tutorials and articles, and answers based on that content. The LLM output should not be generic. It should reflect my content, my explanations, what I have actually written.
Before putting that in front of real learners, I needed to test it properly.
What I found was not what I expected. The retrieval was working fine. The right document was coming back. But the LLM was producing answers that directly contradicted what it had just retrieved. No errors, no crashes. Just a confident, fluent answer that was factually wrong.
I started researching how common this failure is in production RAG systems. The more I looked, the more I found. This is not a rare edge case or a bug you can patch. It is a structural property of how RAG works.
The model reads the right document and still generates something different. The reasons are not fully understood: attention drift, training biases, conflicting signals in context. What matters practically is that it happens often, it is not predictable, and most systems have nothing to catch it before the user sees it.
Here is what makes it more dangerous than standard LLM hallucination. With a plain LLM, a wrong answer is at least plausibly uncertain. The user knows the model is working from training data and might be wrong. With RAG, the model read the correct source and still contradicted it. The user has every reason to trust the answer. The system looks like it is doing exactly what it was designed to do.
The model is not just failing; it is lying with a straight face. It produces fluent, authoritative responses that look exactly like the truth, right up until the moment they break your system.
I spent months researching documented production failures, reproducing them in code, and building a system to catch them before they reach users. This article is the result of that work.
All results are from real runs of the system on Python 3.12, CPU-only, no GPU, except where explicitly noted as calculated from known inputs.
Full code: https://github.com/Emmimal/hallucination-detector/
Before anything else, here is what the system produces
============================= test session starts =============================
collected 70 items
TestConfidenceScorer 5 passed
TestFaithfulnessScorer 5 passed
TestContradictionDetector 7 passed
TestEntityHallucinationDetector 5 passed
TestAnswerDriftMonitor 6 passed
TestHallucinationDetector 24 passed
TestQualityScore 18 passed
============================= 70 passed =============================
70 tests. Every named failure mode I have encountered has an assertion. That number is the point of this article, not a footnote at the end.
Where most RAG systems fail
Most RAG tutorials stop at: retrieve documents, stuff them into a prompt, call the model.
That works until it doesn't.
The whole promise of retrieval-augmented generation is grounding. Give the model real documents and it will use them. In practice, RAG creates a failure mode that is more dangerous than vanilla hallucination, not less.
This is not about conflicting retrieved documents. That is a separate problem I covered in Your RAG System Retrieves the Right Data — But Still Produces Wrong Answers. This is about a model that retrieved exactly the right document and still answered incorrectly.
This happens for reasons still not fully understood: attention mechanisms drifting to irrelevant tokens, training biases toward certain phrasings, the model averaging across conflicting signals in context. What matters for production is that it happens often, it is not predictable, and the only reliable way to catch it is a check on the final answer before it leaves your system.
When I was going through documented production failures, five patterns kept showing up.
The first is confident wrong answers. The model uses phrases like "definitely" or "clearly stated" while asserting something that has no basis in the retrieved source. The assertive language is the problem; it removes any signal that the answer might be wrong.
The second is factual contradictions. The context says 14 days, the answer says 30. The context says annual billing, the answer says monthly. The source was there. The model just ignored it.
The third is hallucinated entities. Person names, paper citations, organization names that do not appear anywhere in the retrieved documents. The model invents them and presents them as fact.
The fourth is answer drift. The same question gets a different answer over time. This one is silent: no error, no flag, nothing. It usually gets caught in a financial audit or a user complaint, not by the system itself.
The fifth is what I call confident but unfaithful. The model sounds certain throughout the answer but most of what it says cannot be traced back to any retrieved source. High confidence, low grounding. That combination is the most dangerous pattern I found.
Most detection frameworks flag these and hand them back to your application code. None of them fix it. That is the gap this system closes.
The architecture: detect, score, heal, route
retrieve(query)
→ generate(query, chunks)
→ detector.inspect(query, chunks, answer)
→ QualityScore.compute(report)
→ healer.heal(...)
→ ACCEPT / HEALED_ACCEPT / FALLBACK / DISCARD

I wanted this to run inside a normal FastAPI request without adding external dependencies or blowing the latency budget. No API calls, no embeddings model, no LLM judge. The entire inspect() call runs under 50ms with spaCy, under 10ms on the regex fallback. That was the constraint I designed around.
Check 1: Confidence scoring
In an ideal world, I would use logprobs to see how sure the model is about its tokens. But in production, most APIs do not make these easy to get or aggregate.
I needed a poor man's logprobs. I built the ConfidenceScorer to look for linguistic overconfidence: assertion markers like "definitely" or "guaranteed" weighed against uncertainty signals like "might" or "I think". Simple word counting, normalized by answer length. It sounds too simple to work, but it is surprisingly effective at catching the model when it is bluffing.
def score(self, answer: str) -> float:
    al = answer.lower()
    words = len(answer.split()) or 1
    high = sum(len(p.findall(al)) for p in self._HIGH_RE)
    low = sum(len(p.findall(al)) for p in self._LOW_RE)
    return max(0.0, min(1.0,
        0.5 + min(high / (words / 10), 1.0) * 0.5
            - min(low / (words / 10), 1.0) * 0.5
    ))
This check fires when confidence exceeds 0.75 AND faithfulness falls below 0.50. That combination is what I kept seeing in the failures I researched: the model sounding completely certain while most of what it says cannot be traced back to the retrieved source. High confidence masks the problem. That is what makes it hard to catch.
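The combined rule itself is one line. A minimal sketch of how the two scores could feed the flag; the parameter names here are illustrative, not the library's exact API:

```python
# Hedged sketch: combining the two scores into the confident-but-unfaithful flag.
# Threshold defaults follow the values described in the text.
def confident_but_unfaithful(confidence: float, faithfulness: float,
                             conf_threshold: float = 0.75,
                             faith_threshold: float = 0.50) -> bool:
    """Flag answers that sound certain but are poorly grounded."""
    return confidence > conf_threshold and faithfulness < faith_threshold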
Check 2: Faithfulness scoring
The FaithfulnessScorer splits the answer into factual claim sentences, then checks what fraction of each claim's content words appear in the combined context.
def _claim_grounded(self, claim: str, context_lower: str) -> bool:
    kw = _key_words(claim)
    if not kw:
        return True
    return sum(1 for w in kw if w in context_lower) / len(kw) >= self.overlap_threshold
A claim is grounded if at least 40% of its keywords appear in the source. Score = grounded claims / total claims.
I settled on 40% because it gives enough room for natural paraphrasing without letting fabrication through. If you are working in legal or medical contexts, start at 0.70; paraphrasing itself is a risk there. Questions get a free pass at 1.0 since they do not assert anything.
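The `_key_words` helper is not shown in the excerpt above. A minimal version might look like this; the stopword list and length cutoff are assumptions, not the repo's exact implementation:

```python
import re

# Hedged sketch of a keyword extractor like the _key_words helper used above.
# Stopword list and the len > 2 cutoff are illustrative assumptions.
_STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "to",
              "in", "on", "for", "and", "or", "that", "this", "with"}

def _key_words(sentence: str) -> list[str]:
    """Lowercase content words, with stopwords and very short tokens removed."""
    tokens = re.findall(r"[a-z0-9$%.]+", sentence.lower())
    return [t for t in tokens if t not in _STOPWORDS and len(t) > 2]
```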
Check 3: Contradiction detection
When I was building this check, I needed to decide what actually counts as a contradiction. I narrowed it down to three patterns that showed up most often in the failures I looked at.
The first is numeric contradictions. The answer uses a number that is not in any retrieved chunk, and the same topic in the context has a different number nearby. This is the simplest case and the most common.
The second is negation flips. The context says "does not support X", the answer says "supports X". I ended up matching eight negation patterns bidirectionally (doesn't, can't, never, no, isn't, won't, don't, didn't) against their positive equivalents; a minimal sketch of the pairing follows below. Getting this right took more iteration than I expected.
The third is temporal contradictions. Same unit, different value, same topic. Context says 14 days, answer says 30 days. That is it.
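Here is the negation pairing sketched minimally. The pattern pairs below are illustrative examples, not the repo's actual eight-pattern list:

```python
# Hedged sketch of bidirectional negation-flip checking.
# The real detector uses eight negation patterns; these pairs are illustrative.
NEGATION_PAIRS = [
    ("does not support", "supports"),
    ("cannot be cancelled", "can be cancelled"),
    ("never expires", "expires"),
    ("is not refundable", "is refundable"),
]

def negation_flip(answer: str, context: str) -> bool:
    a, c = answer.lower(), context.lower()
    for negative, positive in NEGATION_PAIRS:
        # Context negates while the answer asserts, or the reverse.
        if (negative in c and positive in a and negative not in a) or \
           (negative in a and positive in c and negative not in c):
            return True
    return False
```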
One implementation detail that caused me problems early on: the number extractor has to exclude letter-prefixed identifiers:
def _extract_numbers(text: str) -> set[str]:
    # SKU-441 → excluded (a letter+hyphen prefix means it is a label)
    # 5-7 days → preserved (digit+hyphen is a numeric range)
    # $49.99 → preserved
    return set(re.findall(
        r'(?
Without this rule, every product code containing a number triggers a false positive. SKU-441 contains 441; that is a label, not a price.
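The pattern itself is truncated in the listing above. A hedged reconstruction of a pattern with the behavior described (exclude letter-hyphen prefixes, keep ranges, prices, and percentages); this is not the repo's exact regex:

```python
import re

# Hedged reconstruction of a number extractor with the behavior described above.
# Negative lookbehinds reject digits that belong to letter-prefixed identifiers.
_NUM_RE = re.compile(r'(?<!\d)(?<![A-Za-z])(?<![A-Za-z]-)\$?\d+(?:[.,]\d+)?%?')

def extract_numbers(text: str) -> set[str]:
    """Pull prices, percentages, and plain numbers; skip letter-prefixed IDs."""
    return set(_NUM_RE.findall(text))

print(extract_numbers("SKU-441 ships in 5-7 days for $49.99"))
# contains '5', '7' and '$49.99'; 441 from SKU-441 is excluded
```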
Check 4: Entity hallucination detection
This check extracts named entities from the answer (people, organizations, citations) and verifies that each appears in at least one retrieved context chunk.
The shift from regex to spaCy NER was not optional. Here is what the two approaches produce on the same answer:
Reply: "The seminal work was revealed by Dr. James Harrison and
Dr. Wei Liu in arXiv:2204.09876, at DeepMind Analysis Institute."
Context: "Current research present transformer fashions obtain 94% accuracy on NER duties."
───────────────────────────────────────────────────────────────────
REGEX NER (v1) — false positives on noun phrases
───────────────────────────────────────────────────────────────────
Flagged as hallucinated:
Dr. James Harrison (particular person) ✓ appropriate
Dr. Wei Liu (particular person) ✓ appropriate
arXiv:2204.09876 (quotation) ✓ appropriate
Scaling Named (particular person) ✗ FALSE POSITIVE
Entity Recognition (particular person) ✗ FALSE POSITIVE
NER Duties (org) ✗ FALSE POSITIVE
───────────────────────────────────────────────────────────────────
spaCy en_core_web_sm (manufacturing) — clear output
───────────────────────────────────────────────────────────────────
Flagged as hallucinated:
Dr. James Harrison (particular person) ✓ appropriate
Dr. Wei Liu (particular person) ✓ appropriate
arXiv:2204.09876 (quotation) ✓ appropriate
DeepMind Analysis Institute (org) ✓ appropriate
In v1, my regex fallback was a bit too aggressive—it flagged phrases like ‘Named Entity Recognition’ as particular person names simply due to the capitalization. Upgrading to spaCy’s statistical mannequin fastened this. It truly understands that ‘Scaling Named’ is a syntactic pair, not a human being. This single change killed the false positives that have been making Verify 4 nearly unusable in my earlier checks.
Check 5: The part nobody monitors
This was the last check I built and the one I almost cut.
I was not sure drift detection would be worth the complexity. Three months into testing, it caught a pricing endpoint silently returning a different price after a retrieval index rebuild. None of the other four checks fired. That was enough to keep it.
Drift is not about whether a single answer is correct. It is about whether your system is behaving consistently over time.
The AnswerDriftMonitor stores lightweight fingerprints of every answer per question in SQLite:
def _fingerprint(self, answer: str) -> dict:
    numbers = sorted(set(re.findall(r'\b\d+(?:\.\d+)?(?:%|k|m)?\b', answer)))
    words = answer.lower().split()
    key_words = [w for w in words if len(w) > 5][:20]
    pos = sum(1 for w in words if w in self._POS)
    neg = sum(1 for w in words if w in self._NEG)
    return {
        "numbers": numbers,
        "key_words": key_words,
        "polarity": "positive" if pos > neg else ("negative" if neg > pos else "neutral"),
        "length_bucket": len(answer) // 100,
    }
I did not want to store the full answer text in the database. That gets big fast and creates privacy surface area I did not need. So the fingerprint stores only what is necessary to detect meaningful change: the numbers in the answer, the top 20 content words, the overall polarity, and a length bucket. That is enough to catch real drift without the database growing unbounded.
On each new answer, the monitor compares it against the last 10 responses for that question. If the distance goes above 0.35, that is drift. If average similarity drops below 0.65, that is drift too. Both thresholds came from testing, not from theory.
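One plausible way to turn two fingerprints into a distance is a weighted mix of set overlap and field mismatch. A sketch under that assumption; the weights and the Jaccard choice are mine, not the repo's documented formula:

```python
# Hedged sketch of a fingerprint distance; weights are illustrative assumptions.
def fingerprint_distance(fp_a: dict, fp_b: dict) -> float:
    def jaccard_dist(a, b):
        a, b = set(a), set(b)
        if not a and not b:
            return 0.0
        return 1.0 - len(a & b) / len(a | b)

    d_numbers = jaccard_dist(fp_a["numbers"], fp_b["numbers"])
    d_words = jaccard_dist(fp_a["key_words"], fp_b["key_words"])
    d_polarity = 0.0 if fp_a["polarity"] == fp_b["polarity"] else 1.0
    d_length = min(abs(fp_a["length_bucket"] - fp_b["length_bucket"]) / 5, 1.0)
    # Numbers dominate: a changed figure is the strongest drift signal.
    return 0.4 * d_numbers + 0.3 * d_words + 0.2 * d_polarity + 0.1 * d_length

# Drift fires if the new answer's distance to recent history exceeds 0.35,
# or if average similarity (1 - distance) over the last 10 drops below 0.65.
```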

The important production detail: this uses SQLite, not an in-memory dictionary. An in-memory structure resets on every process restart. In a real deployment with rolling restarts, drift detection effectively never fires. SQLite persists across deploys. You are watching for degradation over days, not minutes.
# The test that caught a real mistake during staging
def test_persistence_across_instances(self, tmp_path):
    db_file = str(tmp_path / "drift.db")
    mon1 = AnswerDriftMonitor(db_path=db_file)
    for _ in range(5):
        mon1.record(question, stable_answer)
    mon2 = AnswerDriftMonitor(db_path=db_file)  # fresh instance, same file
    detected, delta = mon2.record(question, drifted_answer)
    assert detected  # history from mon1 is still there
This test exists because my staging environment was restarting every 30 minutes and the drift monitor had been blind the whole time.
The self-healing layer
The HallucinationHealer attempts one of three deterministic fix strategies, then re-inspects the result. If the healed answer still fails re-inspection, it serves a safe decline instead of delivering a wrong answer.
Healing priority order:

Strategy A: Contradiction patch
Strategy A is a direct fix: if the wrong number is in the answer, swap it for the right one from the context. It sounds simple, but billing cycle normalization was a nightmare.
The issue was the order of operations. If I ran the patterns in the wrong sequence, I would get messy output like 'yearly subscription' because the noun was replaced before the adjective could be adjusted. To fix this, I made the system detect the overall direction first (annual vs. monthly) and then apply a single ordered pass. It handles specific patterns before falling back to bare adjectives. This solved the grammar drift that was making the 'self-healing' part of the system look broken.
Earlier than: "The Professional plan prices $10 per 30 days, billed month-to-month.
You'll be able to cancel your month-to-month subscription at any time."
Context: "The Professional plan prices $120 per 12 months, billed yearly."
After: "The Professional plan prices $120 per 12 months, billed yearly.
You'll be able to cancel your annual subscription at any time."
Adjustments logged:
— Changed '$10' → '$120'
— Normalized billing: 'bper monthb' → 'per 12 months'
— Normalized billing: 'bmonthly subscriptionb' → 'annual subscription'
— Normalized billing: 'bbilled monthlyb' → 'billed yearly'
— Confidence recalibrated: 0.50 → 0.65 (contradiction_patch)
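A minimal sketch of that ordered pass, assuming the direction (annual vs. monthly) has already been decided from the context. The pattern list and function name are illustrative, not the repo's code:

```python
import re

# Hedged sketch of the ordered normalization pass described above. Multi-word
# phrases are rewritten before bare adjectives so "monthly subscription" never
# half-mutates into "yearly subscription".
_MONTHLY_TO_ANNUAL = [
    (r"\bbilled monthly\b", "billed annually"),
    (r"\bmonthly subscription\b", "annual subscription"),
    (r"\bper month\b", "per year"),
    (r"\bmonthly\b", "annually"),
]

def normalize_billing(answer: str, direction: str = "annual") -> str:
    if direction != "annual":
        return answer  # the mirror-image annual→monthly pass is omitted here
    for pattern, replacement in _MONTHLY_TO_ANNUAL:
        answer = re.sub(pattern, replacement, answer, flags=re.IGNORECASE)
    return answer
```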
Strategy B: Entity scrub
Strategy B is simpler. If the answer contains hallucinated entities, I remove the sentences that contain them. The part I thought about carefully was what to tell the user. Silently deleting sentences felt wrong; the user would get a shorter answer with no explanation. So I added a transparency note whenever something gets removed, so they know the answer was trimmed and why.
for sent in sentences:
    if any(name.lower() in sent.lower() for name in fake_names):
        removed.append(sent)
    else:
        clean.append(sent)
if removed:
    result += (
        " Note: specific names or references could not be verified "
        "in the source documents and have been omitted."
    )
If every sentence in the answer contains a hallucinated entity, scrubbing produces nothing. In that case the healer falls through to a safe decline rather than returning an empty response.
Strategy C: Grounding rewrite
When faithfulness is below 0.30, I rebuild the answer from scratch using the top-ranked context sentences by keyword overlap with the question. The prefix matters here. I did not want to use something like "Based on available information" because that tells the user nothing about where the answer is actually coming from. So the prefix is chosen based on what the context actually contains:
if re.search(r'\$\d+|\d+%|\d+\s*(day|month|year)', combined_lower):
    prefix = "According to the provided data:"
elif any(w in combined_lower for w in ("policy", "guideline", "procedure", "rule")):
    prefix = "Per the source documentation:"
else:
    prefix = "The source indicates that:"
"Based on available information" tells the user nothing about where the information came from. These three prefixes do.
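The rewrite itself, sketched minimally: rank context sentences by keyword overlap with the question and rebuild from the best matches. The sentence splitter and the top-3 cutoff are assumptions, not the repo's exact logic:

```python
import re

# Hedged sketch of the grounding rewrite: rank context sentences by keyword
# overlap with the question and rebuild the answer from the top matches.
def grounding_rewrite(question: str, context_chunks: list[str], prefix: str) -> str:
    q_words = {w for w in re.findall(r"[a-z0-9]+", question.lower()) if len(w) > 3}
    sentences = []
    for chunk in context_chunks:
        sentences.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip())
    ranked = sorted(
        sentences,
        key=lambda s: len(q_words & set(re.findall(r"[a-z0-9]+", s.lower()))),
        reverse=True,
    )
    return f"{prefix} " + " ".join(ranked[:3])
```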
Confidence recalibration by strategy
After healing, confidence is recalibrated based on what was done, not re-run blindly on the healed text:
| Strategy | Recalibration | Rationale |
|---|---|---|
| contradiction_patch | original + 0.15, capped at 0.80 | Deterministic fix from a verified source earns higher confidence |
| entity_scrub | original × 0.85 | Removed bad sentences; the remaining text is still the model's own output |
| grounding_rewrite | Re-run ConfidenceScorer on the healed text | The hedging prefix does the work: "According to…" scores lower naturally |
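Expressed as code, the table above amounts to a small dispatch; a sketch, with the grounding_rewrite branch assumed to reuse the ConfidenceScorer shown earlier:

```python
# Hedged sketch of confidence recalibration by strategy, following the table above.
def recalibrate(strategy: str, original_confidence: float,
                healed_answer: str, scorer) -> float:
    if strategy == "contradiction_patch":
        return min(original_confidence + 0.15, 0.80)
    if strategy == "entity_scrub":
        return original_confidence * 0.85
    if strategy == "grounding_rewrite":
        return scorer.score(healed_answer)  # the hedged prefix lowers it naturally
    return original_confidence
```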
Healing outcomes across all five scenarios
Scenario 1: Confident lie (30 days → 14 days)
Initial: CRITICAL → contradiction_patch → Final: LOW
Confidence: 1.00 → 0.80
Scenario 2: Hallucinated citation (Dr. James Harrison, arXiv:2204.09876)
The model invented two researchers and a paper citation. None of them appear anywhere in the retrieved context.
Expected outcome: grounding_rewrite rebuilds the answer from context; a faithfulness of 0.00 fires the first priority check before entities are considered.
Scenario 3: Billing contradiction ($10/month → $120/year)
Initial: CRITICAL → contradiction_patch → Final: LOW
Confidence: 0.50 → 0.65
Scenario 4: Answer drift (SKU-441 price diverged)
Initial: CRITICAL → grounding_rewrite → Final: LOW
Confidence: 0.50 → 0.50
Scenario 5: Clean answer
Initial: LOW → no_healing_needed → Final: LOW
Confidence: 0.62 → 0.62 (unchanged)
Two things are worth noting. Scenario 2 triggers grounding_rewrite, not entity_scrub, because faithfulness was 0.00, which fires the first priority check before entities are considered. Scenario 4 is CRITICAL rather than MEDIUM because the drifted answer also contains a numeric contradiction ($39.99 vs $49.99), so drift and contradiction together push it to CRITICAL. These are real outputs from the demo, not illustrative summaries.
Running it yourself
Clone the repo and run all five scenarios:
git clone https://github.com/Emmimal/hallucination-detector.git
cd hallucination-detector
Run a single scenario with python demo.py --scenario 3. Here is what Scenario 3, the billing contradiction, produces end to end:
── DETECT ──────────────────────────────────────────────────────
Question : How much does the Pro plan cost?
Risk : CRITICAL
Confidence: 0.50
Faithful : 0.50
Contradict: True — Numeric contradiction: answer uses '10' but
context shows '120' near 'pro'
Fake names: []
Drift : False (delta=0.00)
Triggered : ['contradiction']
Latency : 11.9ms
── SCORE ───────────────────────────────────────────────────────
Score : 0.40 → HEALED_ACCEPT
Components:
faithful 0.20 / 0.40
consistent 0.00 / 0.30
confidence 0.10 / 0.20
latency 0.10 / 0.10
── HEAL ────────────────────────────────────────────────────────
Strategy : contradiction_patch
Initial risk : CRITICAL → Final risk : LOW
Before: The Pro plan costs $10 per month, billed monthly.
You can cancel your monthly subscription at any time.
After: The Pro plan costs $120 per year, billed annually.
You can cancel your annual subscription at any time.
Changes:
— Replaced '$10' → '$120'
— Normalized billing: '\bper month\b' → 'per year'
— Normalized billing: '\bmonthly subscription\b' → 'annual subscription'
— Normalized billing: '\bbilled monthly\b' → 'billed annually'
— Confidence recalibrated: 0.50 → 0.65 (contradiction_patch)
CRITICAL risk in, LOW risk out. The wrong answer is fixed in place, the billing cycle language is normalized throughout, and every change is logged. The user gets a corrected answer. You get a full record of exactly what was wrong and what was changed.
Quality scoring and delivery routing
Pass/fail is not enough for a real deployment. You need to know not just whether an answer failed, but how badly, and what to do about it.
QualityScore computes a weighted composite that routes every answer to one of four delivery tiers:
final_score = 0.40 × faithfulness
            + 0.30 × consistency (0.0 if a contradiction is found)
            + 0.20 × confidence (calibrated against the faithfulness level)
            + 0.10 × latency_score (non-linear penalty curve)
            − 0.20 × drift_penalty (explicit deduction, applied last)
| Routing | Condition | What you ship |
|---|---|---|
| ACCEPT | score ≥ 0.75, no healing | Original answer |
| HEALED_ACCEPT | healing applied, re-inspection passed | Healed answer |
| FALLBACK | score < 0.50, not healed | Retry or decline |
| DISCARD | healing served a safe decline | Safe decline message |
Log every HEALED_ACCEPT separately from ACCEPT. They are your signal for what the model consistently gets wrong.
Latency penalty: full marks under 20ms (pure Python, regex NER), linear decay from 0.10 to 0.05 across the 20–50ms band (spaCy running), steep decay toward 0.00 at 200ms. The break at 50ms reflects a real production constraint: that is where spaCy NER starts showing up in a typical FastAPI latency budget.
Drift deduction: applied last. Drift is a sentinel for retrieval pipeline health. A currently-grounded answer from a degrading pipeline should still route to fallback, because past inconsistency predicts future unreliability. The -0.20 is applied after all other components so it can push any answer below the threshold regardless of current faithfulness.
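Put together, the composite and the routing look roughly like this. A sketch that follows the formula and table above; the exact latency curve and the behavior of un-healed scores between 0.50 and 0.75 are simplifying assumptions:

```python
# Hedged sketch of the composite score and routing described above.
def latency_score(ms: float) -> float:
    """0..1 latency component: full marks under 20ms, ~0 at 200ms."""
    if ms <= 20:
        return 1.0
    if ms <= 50:
        return 1.0 - 0.5 * (ms - 20) / 30      # linear decay across 20-50ms
    return max(0.0, 0.5 * (1 - (ms - 50) / 150))  # steep decay toward 200ms

def quality_score(faithfulness: float, consistency: float, confidence: float,
                  ms: float, drifted: bool) -> float:
    score = (0.40 * faithfulness + 0.30 * consistency
             + 0.20 * confidence + 0.10 * latency_score(ms))
    if drifted:
        score = max(0.0, score - 0.20)          # drift deduction applied last
    return score

def route(score: float, healed: bool, safe_decline: bool) -> str:
    # Un-healed answers between 0.50 and 0.75 are treated as FALLBACK here;
    # the repo's behavior for that band is not documented in this article.
    if safe_decline:
        return "DISCARD"
    if healed:
        return "HEALED_ACCEPT"
    return "ACCEPT" if score >= 0.75 else "FALLBACK"
```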
Performance characteristics
Measured on Python 3.12, CPU only, no GPU:
| Operation | Latency | Notes |
|---|---|---|
| Confidence scoring | < 1ms | Regex pattern matching |
| Faithfulness scoring | ~2ms | Keyword overlap calculation |
| Contradiction detection | ~1ms | Regex + number extraction |
| Entity detection (spaCy) | ~45ms | en_core_web_sm NER |
| Entity detection (regex) | < 1ms | Fallback path, no spaCy required |
| Drift record + check | ~3ms | SQLite write + similarity query |
| Full inspect(), regex NER | < 10ms | Pure Python path |
| Full inspect(), spaCy NER | < 50ms | Production path |
If you need sub-10ms end to end, the regex NER fallback is a one-line config change. You trade some entity detection precision for latency. For most customer-facing deployments the spaCy path at under 50ms adds no perceptible delay.
The tests: 70 cases, not demos
Every named production failure has an assertion. Here is what those 70 tests cover:
TestConfidenceScorer (5 tests)
— assertive answer scores high
— hedged answer scores low
— score bounded 0.0–1.0 across all inputs
— empty string handled
— unicode answer handled
TestFaithfulnessScorer (5 tests)
— grounded answer scores ≥ 0.80
— fabricated answer scores ≤ 0.50 with ungrounded list
— empty answer returns a perfect score
— question-only answer excluded from claims
— strict threshold config produces lower scores
TestContradictionDetector (7 tests)
— numeric contradiction detected with reason
— matching numbers pass cleanly
— negation flip detected
— temporal contradiction detected
— clean answer passes
— empty context handled
— empty answer handled
TestEntityHallucinationDetector (5 tests)
— fabricated person flagged
— fabricated citation flagged
— entity present in context not flagged
— empty answer returns an empty list
— common discourse words not false-positived
TestAnswerDriftMonitor (6 tests)
— no drift on first answer
— no drift on consistent answers
— drift detected after meaningful change
— different questions do not interfere
— persistence across instances (SQLite, new object, same file)
— clear history resets to zero
TestHallucinationDetector (24 tests)
— five production scenarios with correct risk levels
— risk CRITICAL on contradiction
— risk LOW on clean answer
— empty answer, empty context, very long answer, unicode, single word
— HallucinationBlocked exception carries the full report
— strict config triggers confident_but_unfaithful
— stats tracking and reset
— 20-thread concurrent inspect() with zero errors
— ainspect() returns a correct report
— ainspect() detects hallucination correctly
— asyncio.gather() with 10 concurrent ainspect() calls
— stats reports the NER backend correctly
TestQualityScore (18 tests)
— ACCEPT on high score
— FALLBACK on low score
— HEALED_ACCEPT when healing applied
— DISCARD when safe decline served
— latency full marks under 20ms
— latency reduced at 35ms, above the 50ms floor
— latency steep penalty at 60ms, floored at zero
— drift subtracts exactly 0.20, floored at zero
— contradiction_patch boosts confidence, caps at 0.80
— entity_scrub reduces confidence by a factor of 0.85
— grounding_rewrite: hedged scores lower than assertive
— to_dict includes the drift_penalty field
Two of these tests exist because of real mistakes I made during development. The thread safety test runs 20 concurrent inspect() calls because I initially had a race condition that only showed up under load, not in normal single-call testing. The SQLite persistence test creates a fresh monitor instance pointing at the same database file because my staging environment was restarting every 30 minutes, and I discovered the drift monitor had been completely blind the whole time. An in-memory dictionary resets on restart. SQLite does not. Both tests are there because the bugs already happened once.
============================= 70 passed =============================
Honest limits and design decisions
Knowing what a system does not catch is as important as knowing what it does. Every threshold in DetectorConfig is a deliberate starting point, not an arbitrary number.
Why a 0.75 confidence threshold? Below this, most answers contain enough natural hedging to avoid false positives. Above it, high assertiveness combined with low faithfulness is the pattern I saw most often in the failures I researched. Tune it down to 0.60 for high-stakes domains where earlier flagging is worth the extra review load.
Why a 0.40 faithfulness overlap? That is the minimum required to tolerate natural paraphrasing without falsely flagging grounded answers that use different wording. Legal and medical deployments should start at 0.70; in those domains, paraphrase is itself a risk, not a tolerance.
Why a 0.35 drift threshold? Empirically tuned on a small query set. A tighter threshold (0.20) fires too early during normal prompt variation. A looser threshold (0.50) misses real degradation. Your correct value depends on how much natural variation your LLM produces for the same question.
What this will not catch:
Confident, consistent hallucinations. If the model always says "30 days" and the context also says "30 days," all checks pass. This system assumes the retrieved context is correct. It cannot detect bad retrieval, only answers that deviate from or contradict what was retrieved.
Inventive paraphrase that changes meaning. At 40% keyword overlap, a carefully phrased fabrication can technically pass the faithfulness check. The threshold is a dial; tune it on labeled samples from your domain.
Negation with stemming mismatches. The negation detector checks for "can be cancelled", not "can cancel". A sentence like "you can cancel" technically slips through. Stemming before pattern matching closes this gap and is on the roadmap for v4.
Drift as a trailing indicator. The drift monitor requires at least three prior answers before it fires. Some bad answers will be served before detection. It tells you when to investigate. It does not prevent the first few failures after a pipeline change.
Installation and usage
pip install spacy
python -m spacy download en_core_web_sm
No additional pip dependencies beyond spaCy. SQLite ships with Python's standard library.
Basic usage:
from hallucination_detector import (
    HallucinationDetector, HallucinationHealer,
    DetectorConfig, QualityScore
)

config = DetectorConfig(db_path="drift.db", log_flagged=True)
detector = HallucinationDetector(config)
healer = HallucinationHealer(detector)

# Inspect every LLM answer before delivery
report = detector.inspect(question, context_chunks, llm_answer)
score = QualityScore.compute(report)
if score.routing == "accept":
    return llm_answer

# Attempt healing
result = healer.heal(question, context_chunks, llm_answer, report)
score = QualityScore.compute(report, healing_result=result)
if score.routing == "healed_accept":
    return result.healed_answer
return fallback_response
Async (FastAPI):
report = await detector.ainspect(question, context_chunks, llm_answer)
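Wired into an endpoint, that line sits between your retrieval/generation calls and the routing decision. A sketch of one possible FastAPI route; retrieve(), generate(), the /ask path, and FALLBACK_MESSAGE are placeholders for your own code, not part of this library:

```python
from fastapi import FastAPI
from hallucination_detector import (
    DetectorConfig, HallucinationDetector, HallucinationHealer, QualityScore
)

app = FastAPI()
detector = HallucinationDetector(DetectorConfig(db_path="drift.db"))
healer = HallucinationHealer(detector)
FALLBACK_MESSAGE = "I can't answer that reliably from the available sources."

def retrieve(question: str) -> list[str]:
    ...  # your retriever goes here

def generate(question: str, chunks: list[str]) -> str:
    ...  # your LLM call goes here

@app.post("/ask")
async def ask(question: str):
    context_chunks = retrieve(question)
    llm_answer = generate(question, context_chunks)

    report = await detector.ainspect(question, context_chunks, llm_answer)
    score = QualityScore.compute(report)
    if score.routing == "accept":
        return {"answer": llm_answer}

    result = healer.heal(question, context_chunks, llm_answer, report)
    score = QualityScore.compute(report, healing_result=result)
    if score.routing == "healed_accept":
        return {"answer": result.healed_answer}
    return {"answer": FALLBACK_MESSAGE}
```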
Structured JSON logging:
from hallucination_detector import configure_logging
import logging

configure_logging(level=logging.WARNING)
# Every flagged response emits a structured JSON WARNING with the full report
Blocking on critical risk:
if report.is_hallucinating:
    raise HallucinationBlocked(report)
# HallucinationBlocked.report carries the full dict for your monitoring layer
Strict mode for legal or medical contexts:
config = DetectorConfig(
    faithfulness_threshold=0.70,          # up from 0.50
    faithfulness_overlap_threshold=0.70,  # up from 0.40
    confidence_threshold=0.60,            # down from 0.75, flag earlier
    drift_threshold=0.25,                 # down from 0.35, more sensitive
    db_path="drift_production.db",
    log_flagged=True,
)
What's next
Three things are on my list for the next version. The first is surfacing healing changes to the user directly; right now corrections happen silently, which feels wrong in domains where users need to know the model was wrong. The second is aggregating drift signals across questions rather than per question, so I can detect when an entire document store starts degrading rather than catching it one question at a time. The third is a calibration harness that generates precision/recall curves from real traffic, so threshold tuning does not have to be done by hand.
Closing
I built this because I needed it. If you are building a RAG system that learners will actually rely on, you cannot afford to ship answers you have not inspected. The model will retrieve the right document and still generate something wrong. That is not a bug you can fix at the model level. It is a property of how these systems work.
The 70 tests are not proof this system is perfect. They are proof that I understand exactly what it catches and what it does not, and that every failure pattern I found during research now has a named assertion.
retrieve() → generate() → inspect() → score() → heal() → ship
The model will hallucinate. The retrieval will fail.
The question is whether you catch it before your users do.
The full source code: https://github.com/Emmimal/hallucination-detector/
References
- Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33, 9459–9474. https://arxiv.org/abs/2005.11401
- Honovich, O., Aharoni, R., Herzig, J., Taitelbaum, H., Kukliansy, D., Cohen, V., Scialom, T., Szpektor, I., Hassidim, A., & Matias, Y. (2022). TRUE: Re-evaluating Factual Consistency Evaluation. Proceedings of NAACL 2022, 3905–3920. https://arxiv.org/abs/2204.04991
- Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W., Koh, P. W., … & Hajishirzi, H. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Proceedings of EMNLP 2023, 12076–12100. https://arxiv.org/abs/2305.14251
- Manakul, P., Liusie, A., & Gales, M. J. F. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Proceedings of EMNLP 2023. https://arxiv.org/abs/2303.08896
- Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python. Zenodo. https://doi.org/10.5281/zenodo.1212303
- Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., … & Wang, H. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997. https://arxiv.org/abs/2312.10997
- Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint arXiv:2309.15217. https://arxiv.org/abs/2309.15217
Disclosure
I am an independent AI researcher and the founder of EmiTechLogic (emitechlogic.com). This project was built as part of my research into RAG system failures while developing a RAG-powered assistant for EmiTechLogic. The failure patterns described in this article were researched from documented production failures in the field and reproduced in code. The code was written and tested locally in Python 3.12 on Windows using PyCharm. All libraries used are open source with permissive licenses (MIT). The spaCy en_core_web_sm model is distributed under the MIT License by Explosion AI. I have no financial relationship with any library or tool mentioned. GitHub repository: https://github.com/Emmimal/hallucination-detector. I am sharing this work to document a pattern that costs real teams real time, not to sell a product or service.

