Reconstructing the Table of Contents a PDF Forgot to Ship, So RAG Can Scope by Section

This is the document parsing companion in Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. It extends Article 5 (document parsing) on one table: toc_df, the document's section structure, which Article 5 fills from the PDF's native outline (PyMuPDF's doc.get_toc) when there is one. This part is about the case where there isn't, reconstructing that structure from what the document still shows on the page.

*Where this companion sits: it extends Article 5 (document parsing), inside Part II (the four bricks), reconstructing the table of contents when the PDF ships none. Image by author*

Open NIST FIPS 202, the SHA-3 standard (a US Government work, public domain, see the NIST copyright statement), and turn to page seven. There's a clean table of contents: section titles on the left, page numbers on the right. Now open the same file in any PDF viewer and look at the bookmarks pane. Empty. The contents page is ink on a page, not structure the machine can use. The author wrote a perfectly good table of contents, and the file shipped without exposing it.

Article 5 (document parsing) and Article 5B (the relational data model) leaned on doc.get_toc(), the PDF's native outline, to fill toc_df. It's exact when it exists. It often doesn't. Plenty of real documents (papers exported straight from LaTeX, contracts printed to PDF, government standards) carry a printed contents page but no outline. For those, toc_df comes back empty, even though the document is telling you its structure in plain sight on page seven.

That structure is not a nicety. Retrieval scopes by section (Article 7). The chunker cuts on heading boundaries (Article 5B). Summarization walks the document section by section. Every one of those steps reads toc_df. When it's empty, retrieval falls back to scanning every page, the chunker splits on blind page breaks, and the answer loses the document's own structure. So the question this article answers is narrow and practical: when the file ships no outline but prints a contents page, how do you turn that page back into a toc_df?

One thing up front, because it's easy to conflate. This is about documents that have a contents page. A document with no contents page at all (a paper that just opens with "1. Introduction", a five-page memo, an export that stripped every heading) is a different problem. Recovering a skeleton from the body of an unstructured document is summarization, a separate intent that builds the map from the chunks rather than reading one off a page. Here we only ever read a contents page the document already has.

1. Two halves: read the entries, then find their real pages

It helps to separate two things a contents page gives you. The first is a list of sections with titles and a hierarchy: what the document is about, in what order. The second is a map from each section to where it physically starts in the file. The native outline hands you both for free. Reading a printed contents page hands you the first directly, but the second only as printed labels, which are not physical pages. The two halves have different failure modes, so the rest of this article keeps them separate: first read the entries, then align them to physical pages.

In: a PDF whose doc.get_toc() returns nothing but that prints a contents page. Out: a toc_df with the same shape Article 5B defined (level, title, start_page, end_page, breadcrumb), so everything downstream keeps working unchanged.

The contents page comes in two flavours, with links and without, and they cost different amounts to read.

2. Three cases, by ascending cost

*The cascade tries each case in turn and stops at the first that yields a usable TOC. Image by author.*

Each case has a detection step and an extraction step, and falls through to the next when it fails or returns too little.

Case 1, native outline. Handled in Article 5 by build_toc_df. Free, exact, hierarchical. When it works there is nothing to do. We recap it only to set the cost baseline.
Case 2, contents page with links. No outline, but an early page lists titles as links pointing inside the file. The link target is the physical page, so this case skips the alignment problem entirely.
Case 3, contents page without links. A page that looks like a printed contents (titles, dot leaders, right-aligned page numbers) but carries no links. The page numbers it prints are labels in the document's own numbering, not physical pages, so this case needs the alignment step.

All of this lives in a module of its own, separate from the native path so Article 5 stays readable. The entry point is reconstruct_toc_df.

3. Follow the links

Case 2 is the lucky one. Some documents have no outline but do ship a clickable contents page. The NIST Cybersecurity Framework is one: page two lists every section as a link that jumps into the document. PyMuPDF exposes these links per page, and each internal link carries its target page directly.

In: the PDF (links are not in line_df, so this reader opens the file). Out: entries with a title and the physical target page, already resolved.

The detection is a density check: a page with five or more internal links is a navigation page, not a body page with the odd footnote link. The extraction joins each link's clickable rectangle back to the text under it, then strips the leaders and the trailing page label.

import fitz   # PyMuPDF

def extract_toc_from_links(pdf_path, min_links=5):
    """The contents page is the page carrying the most internal links."""
    # clean() and text_under_rect() are helpers not shown here: the first strips
    # leaders and the trailing page label, the second reads the text under a rect.
    doc = fitz.open(pdf_path)
    best = []
    for page in doc:
        entries = []
        for link in page.get_links():
            if link["kind"] != fitz.LINK_GOTO:        # internal jump only
                continue
            label = clean(text_under_rect(page, link["from"]))
            if label:
                entries.append({"title": label,
                                "start_page": link["page"] + 1,  # 0-based target -> 1-based page
                                "level": 1})
        if len(entries) >= min_links and len(entries) > len(best):
            best = entries                            # richest link page wins
    doc.close()
    return best
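The extractor relies on a helper that strips the leaders and the trailing page label from the text under each link, and the article does not show it. A minimal sketch of what such a helper might look like follows; the name clean_link_label and the exact pattern are assumptions, not the module's code. It only drops a trailing number when dot leaders precede it, so a title such as "Section 2" is left alone.

```python
import re

# Assumed shape of the tail to remove: two or more dots (or ellipsis characters),
# possibly spaced out, followed by an optional printed page label of up to 4 digits.
_LEADER_TAIL = re.compile(r"\s*(?:[.…]\s*){2,}\d{0,4}$")

def clean_link_label(text):
    """Collapse whitespace, then drop dot leaders and the trailing page label."""
    text = " ".join(text.split())          # newlines and runs of spaces -> one space
    return _LEADER_TAIL.sub("", text).strip()

print(clean_link_label("1.  Introduction ........ 1"))
print(clean_link_label("Appendix A . . . . . 42"))
```

A link rectangle that covers only whitespace comes back as an empty string, which the `if label:` test in the extractor then skips.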

Run it on the Framework and the recovered contents are clean:

*Every title resolved to a real page, no LLM, no guesswork. Image by author*

Put the detector's output next to the page it read and you can check it by eye. The Framework's contents page lists each section, then a List of Figures and a List of Tables; the detector recovers all three groups, titles and target pages matching line for line.

*Left, the document's own contents page; right, what the detector returns. Image by author*

This is the case to hope for. It's deterministic, it's exact, and the page mapping is solved by the document itself. The catch is that most documents that lack a native outline also lack clickable links, which takes us to the harder case.

4. Read the printed contents page, then find its real pages

Case 3 is the common one: a printed table of contents with no links behind it, a page headed "Contents" or "Table of contents", a column of titles, a column of page numbers, often joined by dot leaders. FIPS 202 has exactly this. A human reads it at a glance. Parsing it has two distinct steps, and the second is the one people skip.

4.1 Detecting and reading the contents page

First, find the contents page. The signal that actually separates a contents page from prose is dot-leader density: several lines of the shape Some title .......... 42. A keyword like "contents" raises confidence but is not required, and on its own is a weak signal (a sentence can say "table of contents"). The reader works on line_df alone, so it's engine-agnostic.

In: line_df. Out: entries with a title and a displayed_page, the page number as printed on the line.

import re
# "Introduction ......... 12"             "Introduction       12"
# The \s* before the leaders keeps a stray " ." out of the captured title.
DOTTED   = re.compile(r"^(.*?\S)\s*[.…](?:[.…\s]){2,}(\d{1,3})$")
TRAILING = re.compile(r"^(.{2,70}?\S)\s{2,}(\d{1,3})$")

def extract_toc_from_contents(line_df):
    # find_contents_pages(), lines_of() and infer_level() are helpers not shown here.
    entries = []
    for page in find_contents_pages(line_df):    # pages dense in dot leaders
        for line in lines_of(line_df, page):
            m = DOTTED.match(line) or TRAILING.match(line)
            if m:
                title, label = m.group(1).strip(), int(m.group(2))
                entries.append({"title": title,
                                "displayed_page": label,      # printed label
                                "level": infer_level(title)}) # "2.3.1" -> 3
    return entries
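The reader calls two helpers the article does not show: one that finds pages dense in dot leaders, and one that infers the level from the section numbering. The sketch below is one plausible reading of them. It takes pages as a plain dict of page number to lines instead of line_df, and the threshold of five leader lines is a guess; both are assumptions made for the example.

```python
import re

LEADER_LINE = re.compile(r"\S\s*[.…](?:[.…\s]){2,}\d{1,3}$")    # "Title ..... 12"
NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\.?(?:\s|$)")           # "2.3.1 Title"

def find_contents_pages(lines_by_page, min_leader_lines=5):
    """Pages on which at least min_leader_lines lines end in leaders + a number."""
    return [page for page, lines in sorted(lines_by_page.items())
            if sum(1 for ln in lines
                   if LEADER_LINE.search(ln.strip())) >= min_leader_lines]

def infer_level(title):
    """Depth of the leading section number: '2.3.1 Foo' -> 3, unnumbered -> 1."""
    m = NUMBERING.match(title)
    return m.group(1).count(".") + 1 if m else 1

pages = {
    1: ["Cover"],
    2: ["Contents", "1 Intro ...... 1", "2 Scope ...... 2",
        "2.1 Terms ..... 3", "3 Method ..... 5", "4 Results .... 9"],
    3: ["1 Intro", "Body text follows here."],
}
print(find_contents_pages(pages))
print(infer_level("2.3.1 Keccak"))
```

Counting leader lines per page rather than looking for the word "contents" follows the point made above: the keyword alone is a weak signal, while the density of leader lines is what sets the page apart from prose.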

4.2 The label is not the page

Here is the subtlety. The contents page says Introduction .... 1. Page 1 of the file is the cover, not the introduction. Front matter (a cover, a foreword and the contents page itself) sits in front, so the printed label and the physical page live in different numbering spaces. Open the file to the physical page that the label names and you land several pages early, every time.

So a printed page number is a label, and it goes into displayed_page. Mapping it to the physical start_page is a second step. The cheap version assumes one constant offset: physical = displayed + shift. To find the shift, sample a handful of titles and try every plausible offset, keeping the one under which the most titles actually appear on their shifted page.

def infer_page_shift(line_df, entries, max_shift=40):
    """Best constant offset: physical_page = displayed_label + shift."""
    # text_of(), pages() and norm() are helpers not shown here; page text is
    # assumed to be normalised the same way as the titles.
    page_text = {p: text_of(line_df, p) for p in pages(line_df)}
    sample = [(e["displayed_page"], norm(e["title"])) for e in entries][:20]
    best_shift, best_score = 0, 0          # stay at shift 0 when nothing matches
    for shift in range(-max_shift, max_shift + 1):
        hits = sum(1 for label, title in sample
                   if title in page_text.get(label + shift, ""))
        if hits > best_score:              # most titles land where predicted
            best_score, best_shift = hits, shift
    return best_shift

*Printed labels 1, 2, 4, 7 map to physical pages 4, 5, 7, 10 once the front-matter shift is found. Image by author*
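To see the offset search in isolation, here is a self-contained sketch that reuses the toy numbers from the figure. It swaps line_df for a plain dict of physical page number to page text, and the function name infer_shift and the sample titles are made up for the example. The contents page is deliberately included in the text: it produces one spurious hit for several shifts, but only the true shift lands all four titles.

```python
def infer_shift(page_text, entries, max_shift=40):
    """Constant offset under which the most titles appear on label + shift."""
    best_shift, best_hits = 0, 0
    for shift in range(-max_shift, max_shift + 1):
        hits = sum(
            1 for e in entries
            if e["title"].lower()
            in page_text.get(e["displayed_page"] + shift, "").lower()
        )
        if hits > best_hits:               # strictly better, so ties keep the first
            best_shift, best_hits = shift, hits
    return best_shift

# Toy document: cover, foreword, contents page, then the body.
page_text = {
    1: "Cover",
    2: "Foreword",
    3: "Contents Introduction 1 Scope 2 Method 4 Results 7",
    4: "Introduction ...",
    5: "Scope ...",
    7: "Method ...",
    10: "Results ...",
}
entries = [
    {"title": "Introduction", "displayed_page": 1},
    {"title": "Scope", "displayed_page": 2},
    {"title": "Method", "displayed_page": 4},
    {"title": "Results", "displayed_page": 7},
]
shift = infer_shift(page_text, entries)
aligned = [e["displayed_page"] + shift for e in entries]
print(shift, aligned)
```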

The same thing happens on a real document. FIPS 202 prints its contents page on physical pages 7 and 8, and its body numbering starts well after the front matter. Run the detection and the alignment on it and the inferred shift comes out at +8: the introduction the contents page calls page 1 actually starts on physical page 9.

*Eight pages of front matter, so every printed label lands eight pages later in the file. Image by author*

Side by side with the page it read, the two columns are the whole point. The label column reproduces what the contents page prints; the page column is where each section actually starts in the file.

*Left, the document's own contents page; right, what the detector returns, label and physical page. Image by author*

A constant shift covers the common case. When numbering restarts partway through (an appendix that resets to 1, inserted plates), the offset is not constant, and the fallback is content matching: locate each title's real page by fuzzy-matching its text against the body, keeping the pages monotonically non-decreasing. align_toc_df runs the shift first and falls back to content matching, so Case 3 hands the same physical start_page downstream as Case 2.
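A minimal sketch of the content-matching fallback, using difflib.SequenceMatcher from the standard library as a stand-in for whatever fuzzy matcher the module actually uses. The function name, the 0.8 threshold and the dict-of-lines input are all assumptions. The contents pages are assumed to have been excluded from the search, and the monotonic constraint is enforced by never searching before the previous entry's page.

```python
from difflib import SequenceMatcher

def match_titles_to_pages(entries, lines_by_page, threshold=0.8):
    """Give each entry the first page, at or after the previous match, that
    has a line similar enough to its title. Unmatched entries get None."""
    pages = sorted(lines_by_page)
    floor = pages[0] if pages else 0       # earliest page still allowed
    out = []
    for e in entries:
        target = e["title"].lower()
        found = None
        for p in pages:
            if p < floor:
                continue                   # keeps pages non-decreasing
            score = max((SequenceMatcher(None, target, ln.lower().strip()).ratio()
                         for ln in lines_by_page[p]), default=0.0)
            if score >= threshold:
                found = p
                break
        if found is not None:
            floor = found
        out.append({**e, "start_page": found})
    return out

# The appendix restarts its printed numbering at 1, so no single shift fits.
body = {
    4: ["1  Introduction", "Some body text."],
    8: ["2 Method", "More body text."],
    12: ["APPENDIX A TEST VECTORS", "Vector data."],
}
toc = [
    {"title": "1 Introduction", "displayed_page": 1},
    {"title": "2 Method", "displayed_page": 5},
    {"title": "Appendix A Test Vectors", "displayed_page": 1},
]
result = match_titles_to_pages(toc, body)
print([r["start_page"] for r in result])
```

Entries that find no match are returned with start_page set to None instead of a guessed page, which leaves the decision about them to the caller.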

When the printed contents page is too irregular for the patterns (a two-column layout, titles that wrap, leaders rendered as ragged whitespace), the LLM extractor takes over with a typed schema, reading the first pages and returning the same entry shape. That is a tool of last resort for this case, not the default, because a clean printed contents page is cheap to read and the LLM is not. The LLM here still only reads the contents page; it never invents a structure for a document that has none.

5. The LLM disposes, it doesn't detect

Both detection methods are heuristics, and heuristics make mistakes: a link rectangle that swept up two titles, a contents line the patterns split wrong, a numbering that looks off. The reflex with an LLM is to hand it the whole document and ask for a TOC. That's the most expensive and least auditable option. The better division of labour is the inverse: the heuristic proposes a TOC, and the LLM only checks whether it holds together.

from pydantic import BaseModel

class TocCoherenceVerdict(BaseModel):       # typed structured output
    is_coherent: bool
    issues: list[str]

SYSTEM = ("A heuristic already proposed this TOC. Do NOT detect structure. "
          "Judge only: is the numbering consistent (no unexplained skips), "
          "are the page numbers non-decreasing, does the hierarchy form a "
          "sensible tree?")

def check_toc_coherence(toc_df):
    # One line per entry: "[start_page] <indent by level>title".
    # llm_parse() is the series' LLM helper, not shown here.
    view = "\n".join(f"[{r.start_page}] {'  ' * (r.level - 1)}{r.title}"
                     for r in toc_df.itertuples())
    return llm_parse(input=[{"role": "system", "content": SYSTEM},
                            {"role": "user", "content": view}],
                     text_format=TocCoherenceVerdict, label="toc.coherence")

This is faster, cheaper, and more auditable than full-LLM extraction, and it degrades gracefully: if the LLM is unavailable, the heuristic TOC is still usable with a confidence penalty.
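One way to wire that graceful degradation, sketched with hypothetical names. Here checker stands for any callable that returns a verdict carrying is_coherent and issues (check_toc_coherence would fit), and the confidence values 1.0, 0.5 and 0.7 are placeholders, not figures from the article.

```python
from types import SimpleNamespace

def confidence_after_check(toc_rows, checker, unavailable_penalty=0.7):
    """Return (confidence, issues) for a heuristic TOC.

    Any exception raised by checker is treated as "LLM unavailable": the TOC
    is kept, but with a reduced confidence and a note explaining why.
    """
    try:
        verdict = checker(toc_rows)
    except Exception as exc:               # timeout, quota, network error...
        return unavailable_penalty, [f"coherence check skipped: {exc}"]
    if verdict.is_coherent:
        return 1.0, []
    return 0.5, list(verdict.issues)

# Stub checkers standing in for the real LLM call.
def ok_checker(rows):
    return SimpleNamespace(is_coherent=True, issues=[])

def down_checker(rows):
    raise TimeoutError("llm down")

rows = [{"title": "1 Introduction", "start_page": 9}]
print(confidence_after_check(rows, ok_checker))
print(confidence_after_check(rows, down_checker))
```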

6. One uniform toc_df, whatever fired

The point of the cascade is that downstream code never learns which case ran. Whether the TOC came from links, a printed contents page or the LLM, it leaves through the same canonicaliser and arrives as the toc_df Article 5B defined, with two columns added: displayed_page (the printed label, for audit) and source (which method fired).

DETECTORS = {"links":         extract_toc_from_links,     # Case 2
             "contents_text": extract_toc_from_contents,  # Case 3
             "llm":           extract_toc_by_llm}         # hard layout

def reconstruct_toc_df(pdf_path):
    # Simplified: extract_toc_from_contents and the alignment step read line_df,
    # so a line_df built from pdf_path has to be passed to them (plumbing not shown).
    for method in ("links", "contents_text", "llm"):    # ascending cost
        entries = DETECTORS[method](pdf_path)
        if not entries:
            continue                                     # fall through
        toc_df = canonicalize(entries, source=method)   # one shape out
        if method == "contents_text":
            toc_df = align_to_physical_pages(toc_df)     # label -> page
        return toc_df
    return empty_toc_df()       # no contents page -> summarization's job
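What the canonicaliser has to produce can be sketched without pandas. The version below is an assumption about its behaviour, not the article's implementation. It works on a list of dicts and fills end_page with the page before the next entry at the same or a shallower level, never less than start_page. It builds breadcrumb from the chain of ancestor titles. The name canonicalize_entries, the last_page argument and the " > " separator are illustrative.

```python
def canonicalize_entries(entries, source, last_page):
    """Rows with level, title, start_page, end_page, breadcrumb, displayed_page
    and source. Entries must already be in document order."""
    rows, trail = [], []                   # trail: titles of the open ancestors
    for i, e in enumerate(entries):
        level = e.get("level", 1)
        trail = trail[:level - 1] + [e["title"]]
        # The section runs until the next entry at the same or a shallower level.
        next_start = next((n["start_page"] for n in entries[i + 1:]
                           if n.get("level", 1) <= level), last_page + 1)
        rows.append({
            "level": level,
            "title": e["title"],
            "start_page": e["start_page"],
            "end_page": max(e["start_page"], next_start - 1),
            "breadcrumb": " > ".join(trail),
            "displayed_page": e.get("displayed_page"),   # printed label, if any
            "source": source,
        })
    return rows

rows = canonicalize_entries(
    [{"title": "1 Introduction", "level": 1, "start_page": 9, "displayed_page": 1},
     {"title": "1.1 Scope", "level": 2, "start_page": 9, "displayed_page": 1},
     {"title": "2 Method", "level": 1, "start_page": 12, "displayed_page": 4}],
    source="contents_text", last_page=20)
for r in rows:
    print(r["start_page"], r["end_page"], r["breadcrumb"])
```

Because every detector's entries pass through the same function, the link reader simply leaves displayed_page empty while the text reader fills it, and downstream code sees identical columns either way.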

Calling reconstruct_toc_df is one import and one line. The returned frame is the same toc_df Article 5B defined, plus the displayed_page column and a source column that records which case fired.

# NIST FIPS 202 prints a contents page but ships no native outline:
# Case 3 fires (contents_text), the label-to-page alignment runs, source="contents_text".

toc_df = reconstruct_toc_df("data/nist/NIST.FIPS.202.pdf")

toc_df.head()              # title, level, start_page, end_page, displayed_page, source
toc_df["source"].iloc[0]   # "links" | "contents_text" | "llm"  -- which case fired

Run it across the two worked examples and the cascade routes each to the cheapest method that works, while the caller sees one toc_df every time.

*Links for the linked contents page, text patterns for the printed one. Image by author*

7. How well does it work?

It's worth checking the reconstruction against ground truth. Take documents that do carry a native outline, hide it, run the contents-page methods, and score the result against the native TOC. scripts/eval_toc_vs_native.py does this: recall (native entries recovered), precision (reconstructed entries that are real), and the share of matched entries whose start page lands within one page of the native one.
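The three scores can be sketched as follows. This is not the contents of scripts/eval_toc_vs_native.py, which the article does not show. In particular, matching entries by exact title after lowercasing and whitespace collapsing is an assumption made to keep the example short, and it merges entries that share a title.

```python
def score_toc(native, rebuilt, page_tolerance=1):
    """native, rebuilt: lists of {"title", "start_page"}. Returns recall,
    precision and the share of matched entries within page_tolerance pages."""
    def norm(title):
        return " ".join(title.lower().split())

    native_pages = {norm(e["title"]): e["start_page"] for e in native}
    rebuilt_pages = {norm(e["title"]): e["start_page"] for e in rebuilt}
    matched = native_pages.keys() & rebuilt_pages.keys()
    close = sum(1 for t in matched
                if abs(native_pages[t] - rebuilt_pages[t]) <= page_tolerance)
    return {
        "recall": len(matched) / len(native_pages) if native_pages else 0.0,
        "precision": len(matched) / len(rebuilt_pages) if rebuilt_pages else 0.0,
        "page_within_tolerance": close / len(matched) if matched else 0.0,
    }

native = [{"title": "Introduction", "start_page": 9},
          {"title": "Scope", "start_page": 10},
          {"title": "Method", "start_page": 12},
          {"title": "Results", "start_page": 15}]
rebuilt = [{"title": "introduction", "start_page": 9},
           {"title": "Scope", "start_page": 11},      # one page off: still counted
           {"title": "Method", "start_page": 15},     # three pages off: a miss
           {"title": "Contents", "start_page": 7}]    # not in the native outline
scores = score_toc(native, rebuilt)
print(scores)
```

Keeping page accuracy separate from recall and precision matters here because the two halves fail independently: a reader can recover every title and still place them on the wrong pages.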

*The link reader is near-exact (the link target is authoritative); the text-pattern reader is softer, since reading a printed page and aligning labels is genuinely harder. Image by author*

The link case is near-exact because the link target is authoritative; the text case is softer because reading a printed page and aligning labels is genuinely harder. Notice that the link reader's recall swings with the document (86% on SP 800-30r1, 45% on SP 800-207, where many entries are not links), while its precision stays high: what it does recover, it places correctly. Neither method is magic, and the coherence check is there to catch the misses.

Conclusion

A PDF with no native outline is not a dead end as long as it prints its own contents page. Case 1 reads the outline the file ships. Case 2 follows clickable links and gets the physical page for free. Case 3 reads the printed contents page, then does the step most people skip, mapping the printed label to the real page. The cascade tries them cheapest first and stops at the first that works, the LLM checks coherence rather than doing the detection, and everything leaves as the same toc_df. A document that prints no contents page at all is a different problem, summarization, which builds the structure from the body. Article 7 (retrieval) picks that toc_df back up to scope answers by section.

Earlier in the series:

Document Intelligence: series intro. What the series builds, brick by brick, and in what order.
Baseline Enterprise RAG, from PDF to highlighted answer. The four-brick pipeline end to end: PDF in, highlighted answer out.
Embeddings Aren't Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity wins (synonyms, typos, paraphrase), where it predictably breaks (unknown words, negation, term-vs-answer relevance), and how to use it anyway.
Rerankers Aren't Magic Either: When the Cross-Encoder Layer Is Worth the Cost. What a cross-encoder adds over bi-encoder embeddings, measured, and when it's worth the latency.
RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size sweeps and finetuning optimize the wrong thing; route by question type instead.
From regex to vision models: which RAG technique fits which problem. Two axes, document complexity and question control, that pick the technique for each case.
10 common RAG mistakes we keep seeing in production. Ten production mistakes, organized brick by brick, with the fix for each.
Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing brick: the document's nature, signals, and summary.
Stop returning flat text from a PDF: the relational shape RAG needs. The second half of the parsing brick: the relational tables every downstream brick reads.
