Stop Returning Flat Text from a PDF: The Relational Shape RAG Needs

This article is part of the parsing brick of Enterprise Document Intelligence, a series that builds an enterprise RAG system from four bricks: parsing, question parsing, retrieval, and generation. Parsing comes first, and this is the second of its two parts. The previous part turned a PDF into line_df, one row per line of text on the page. This one covers the rest of the model: the full set of tables a parser should emit, what each one holds, and how they link together, so the table on page 14 keeps its columns and the renewal fee stays attached to its label. The other three bricks, and the highlighted answer at the end, all read these tables, never the raw PDF.

*where this article sits in the series: Article 5, the data-model half of the parsing brick, inside Part II (the four bricks) – Image by author*

RAG tutorials start the same way: text = extract_text(pdf). That single line is where the PDF problems begin.

You build a RAG pipeline. It works on a few clean documents. Then a customer sends you a real contract: 30 pages, with a Schedule of Costs table on page 14. The user asks “what’s the renewal fee?” and the model returns the wrong number.

The team says: “the model can’t read tables.”

The model reads tables fine. The problem is upstream. Your parser walked the table cell by cell and joined the cells into one long string. The column structure is gone. The link between a label and its number is gone. Your model is asked to guess which number is the renewal fee. Sometimes it guesses right. Often it doesn’t.

*The same four rows, joined cell-by-cell into one chunk. EUR 200 one-time, Late fee 75: who pairs with what? – Image by author*

The parser didn’t fail. It gave you what you asked for. You asked for the wrong thing.

A good PDF parser doesn’t extract text. It models the document as a relational set of tables. One PDF in, one table per kind of thing out (seven or eight today, and more as new needs show up).

toc_df: the sections, as the author wrote them.
page_df and line_df: the body. Every page. Every line.
image_df: every figure on every page.
span_df: bold, italic, color, font size. Every span of every line.
object_registry: every figure caption, every table caption, every annex.
cross_ref_df: every “see Figure 2”, every “see Table 4”, every “see Annex B”.
parsing_summary: tells you if the PDF is born-digital, scanned, or mixed. Tells you if the OCR is good or bad.

Retrieval reads these tables. Generation reads these tables. Highlighting reads these tables. You open the PDF once. After that, you only work with tables.

This article covers each table in detail, then runs parse_pdf side by side on two very different PDFs to show that the same columns cover both. The previous article (“Beyond extract_text: the two layers of a PDF that drive RAG quality”) covers the upstream side: the declared signals the parser reads first and the page-level classification it runs before any line gets a number.

*how each table is produced: line_df, parsing_summary, toc_df and image_df come straight from the parse; page_df, span_df, object_registry and cross_ref_df are derived from line_df – Image by author*

1. One table per entity

Everything we’ve extracted gets returned as a dictionary of tables plus a parsing summary, one table per entity of the document model.

The _df naming convention makes granularity readable from the name itself. The diagram at the top of this article shows how each table is produced. Four come straight from the parse: line_df (the text lines), parsing_summary (the doc-level synthesis), toc_df (the native outline, via doc.get_toc), and image_df (via page.get_image_info). The other four are derived from line_df: page_df aggregates it by page, while span_df, object_registry, and cross_ref_df are extracted from its lines. How the tables then join one another is a separate question, taken up in section 2.

1.1. toc_df: table of contents

TOCs are everywhere in enterprise documents. Contracts, reports, policies, employee manuals, regulatory filings: almost all of them ship with a declared section structure, and that structure is the cheapest semantic signal you can hand a retriever.

The catch: it isn’t always native. Sometimes it’s only typographic (bold headings, numbered sections, indented subheadings) and has to be reconstructed from line_df + span_df.

We focus here on the native case (the common one for born-digital LaTeX, Word, and InDesign exports); reconstructing a TOC from typography when bookmarks are absent is its own topic, sketched by an adaptive parser and treated in full in a dedicated follow-up.

*declared outline with `parent_idx` and `breadcrumb`; empty when no native bookmarks – Image by author*

How to build it: build_toc_df(doc) calls doc.get_toc(simple=False) (one entry per bookmark, with the destination dict attached) and walks the result to compute parent_idx, breadcrumb, end_page, and start_y. Run on the Attention paper, you get the 22 entries shown above: three levels of headings, native bookmarks, no reconstruction needed.

The implicit end_page convention: TOCs mark where sections begin, almost never where they end. build_toc_df materializes the end as a column anyway: for each row, end_page is the start_page of the next entry at the same level or shallower (the next peer or ancestor), with total_pages as the fallback for the last section. Look at Conclusion on the Attention paper: start_page=10, end_page=15. The document only has 15 pages, so the last section absorbs everything to the document’s end. The convention keeps a one-page overlap by design (a section’s end_page is its successor’s start_page, not successor.start_page - 1), which makes the generation brick’s next-page peek (a strong completeness signal that catches truncated lists at section boundaries) a single lookup rather than a runtime scan.
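The end_page rule fits in a few lines. What follows is a minimal sketch, not the series’ build_toc_df: it assumes each outline entry is a plain dict with hypothetical keys level and start_page, listed in outline order, and the toy outline is invented for the example.

```python
def add_end_pages(entries, total_pages):
    """Attach end_page to each outline entry (dicts with 'level' and 'start_page').

    end_page is the start_page of the next entry at the same level or
    shallower; entries with no later peer or ancestor fall back to total_pages.
    """
    result = []
    for i, entry in enumerate(entries):
        end_page = total_pages  # fallback when no later peer or ancestor exists
        for later in entries[i + 1:]:
            if later["level"] <= entry["level"]:
                # Deliberate one-page overlap: no "- 1" here.
                end_page = later["start_page"]
                break
        result.append({**entry, "end_page": end_page})
    return result


# Toy outline (invented): two top-level sections, the first with two children.
toy_outline = [
    {"title": "A", "level": 1, "start_page": 1},
    {"title": "A.1", "level": 2, "start_page": 2},
    {"title": "A.2", "level": 2, "start_page": 4},
    {"title": "B", "level": 1, "start_page": 6},
]
toy_toc = add_end_pages(toy_outline, total_pages=9)
for row in toy_toc:
    print(row["title"], row["start_page"], row["end_page"])
```

In the toy outline, A.2 ends on page 6, the page where B starts, so the two sections share that page. That shared page is the overlap the next-page peek relies on.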

The start_y column, for reference: Each bookmark in a PDF outline carries a destination Point(x, y) on its target page, not just a page number. build_toc_df exposes the y as start_y (raw value as returned by fitz). It pins each section header to a precise position inside start_page, which is what enables line-level resolution: the same (target_page, target_y) → line join used for native links in section 1.6. Same coordinate-orientation caveat: 720 on the Attention paper (LaTeX, bottom-up) and 72 on NIST CSF (Acrobat, top-down) both point at the top of the page, just from opposite origins. We store the raw value; callers normalize when they need to land on a specific line.

start_page and end_page are page-level anchors. Line-level anchors (start_line, end_line) are the natural refinement: they let downstream stages pinpoint a section to the exact line in line_df, and they enable TOC offset detection when the document has front matter inserted after the TOC was generated (the whole TOC drifts by 1 or 2 pages, a real-world failure mode). The full treatment lives in a dedicated bonus article on TOC anchoring and validation; for now, toc_df stops at page-level granularity (with start_y as the bonus column for callers ready to resolve to a line).

The role: toc_df is the cheapest semantic signal in the whole pipeline. Each entry names a section: knowing that lines 100–150 belong to “3.5 Positional Encoding” tells the retriever and the LLM what those lines are about, before any embedding is computed. Embeddings give you topical proximity; the TOC gives you the document’s own structural meaning of each region, declared by the author, not inferred. The breadcrumb extends this with hierarchical context: a chunk gets stamped with “Methods > 3.5 Positional Encoding”, giving the language model section-level grounding without inflating the chunk text. end_page is what lets the generation brick peek one page past a retrieved section and detect truncated answers with no vision pass. When the document has a native TOC, all of this is free.

Watch out: TOC entries can point to pages that don’t exist (a corrupt or truncated export). Validate 0 <= page_num < n_pages before recording a row, or a section anchor lands nowhere and the page-range join from section 2 silently returns empty.

1.2. line_df: line granularity

The source of truth for text content. Every line of the PDF, with its position and dominant typographic style.

*one row per text line with bbox, typography, render mode, `column_position` – Image by author*

How to build it: fitz_pdf_to_line_df(pdf_path) walks every text block of every page and emits one row per line. assign_column_positions(line_df) then annotates each row with single / left / right / multi. Run on data/paper/1706.03762v7.pdf, the Attention Is All You Need paper (Vaswani et al. 2017; arXiv non-exclusive distribution license, declared on the arXiv abstract page). Here is page 4 of the paper (the two-column Figure 2 region):

*rows 1-2 are Figure 2’s twin captions in opposite columns – Image by author*

The role: line_df is the unified per-element manifest of the document. Text lines first, but the same row structure also carries image placeholders and table placeholders: each visible content element on a page is one row, with its own bbox, column_position, and a content_type flag (text, image, table). Text-specific fields (font, render_mode) are NaN for non-text rows; the rich image and table metadata lives in image_df and the table extractor’s output, joined back via (page_num, line_num). The result is that a single sorted query against line_df.page_num returns every element on a page in reading order, regardless of its type. Downstream stages don’t have to join three tables to know what’s on a page.

Watch out: on multi-GB or thousand-page PDFs, holding every line (and image) in memory at once is a problem. A lightweight mode that skips line_df and image_df for endpoints needing only parsing_summary (classification, the doc-level summary) keeps those cheap; gate the full parse at ingestion time for the rest.

The screenshot below is from Enterprise Document Intelligence, the desktop app I’m building. The Text panel on the right is line_df made visible: the page’s native text, line by line, parsed once and read straight from the table, next to the original page it came from.

*line_df made visible: the page’s native text read straight from the table, beside the original page – Image by author*

1.3. page_df: page granularity

Per-page synthesis. Classification, flags, aggregated metrics.

*per-page synthesis: `page_type`, additive flags, char counts, `n_columns` – Image by author*

How to build it: build_page_df(line_df) groups line_df by page_num. detect_columns_per_page(line_df) computes n_columns and the result is merged in.

What else fits here: build_page_df is the right home for any per-page signal you can aggregate from line_df on the same pass. Beyond the core triplet, simple aggregations land here for free: n_lines (page density), native_chars versus ocr_chars (a fast scanned-or-native verdict, no classifier needed), n_fonts and font-size spread (a rough structure indicator that separates heading-heavy pages from plain prose), image_coverage_ratio (a join with image_df). The columns that need a downstream pass wait: page_type is produced by classify_page (covered in the previous article) and parsing_method / context_structured are produced by an adaptive cascade that escalates to a heavier parser when fitz is not enough.

Run on the Attention paper:

*cheap aggregations next to the core triplet on the Attention paper – Image by author*
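To make the “same pass” idea concrete, here is a minimal sketch of a per-page aggregation with pandas. It is not the series’ build_page_df: the column names text, font_name, and font_size are assumptions about line_df, the toy rows are invented, and signals that need extra inputs (native_chars versus ocr_chars, image_coverage_ratio) are left out.

```python
import pandas as pd


def build_page_df_sketch(line_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate a few cheap per-page signals from line_df in one groupby."""
    work = line_df.assign(char_len=line_df["text"].str.len())
    page_df = (
        work.groupby("page_num")
        .agg(
            n_lines=("line_num", "count"),
            n_chars=("char_len", "sum"),
            n_fonts=("font_name", "nunique"),
            min_font_size=("font_size", "min"),
            max_font_size=("font_size", "max"),
        )
        .reset_index()
    )
    # A wide spread hints at headings mixed with body text.
    page_df["font_size_spread"] = page_df["max_font_size"] - page_df["min_font_size"]
    return page_df


# Toy line_df (invented): a heading plus a body line on page 1, one line on page 2.
toy_line_df = pd.DataFrame(
    [
        {"page_num": 1, "line_num": 1, "text": "Title", "font_name": "A-Bold", "font_size": 14.0},
        {"page_num": 1, "line_num": 2, "text": "body text", "font_name": "A", "font_size": 10.0},
        {"page_num": 2, "line_num": 1, "text": "more", "font_name": "A", "font_size": 10.0},
    ]
)
toy_page_df = build_page_df_sketch(toy_line_df)
print(toy_page_df)
```

Every column here comes out of a single groupby on page_num, which is why such signals cost almost nothing once line_df exists.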

The role: page_df is where extraction is anchored. Every parser, every OCR run, every classifier operates page by page; page_df is the table that records what each page is and how it should be handled. The page is also a semantic unit on its own: roughly one or two ideas per page in academic papers, one clause per page in contracts, one sub-topic per page in technical reports. Small enough to be focused, large enough to carry context. That’s why retrieval typically defaults to page-level chunks in a minimal RAG pipeline and why most downstream coordination keys off page_num. When you query “what’s page 5 about”, page_df is the row that answers; when you query “all scanned pages with bad OCR”, page_df is what you filter.

Watch out: store page_width and page_height per row, never once per document. Letter and A4 mix in technical publishing, and a landscape page is often inserted for a wide table; a single document-level page size makes every bbox-derived metric (column detection, full-page-image coverage) drift on the odd-sized pages.

1.4. image_df: image granularity

One row per embedded image.

How to build it: The parser walks every page and calls page.get_image_info(), which returns each embedded image with its displayed bounding box and intrinsic dimensions. The Attention paper has three:

*3 images: page 3 Figure 1, page 4 Figure 2’s two panels – Image by author*

Describing the image content: So far image_df only locates each image: a bounding box, a size, a content hash. It says nothing about what the image shows, and a bounding box is not retrievable. A chart or a diagram holds no extractable text, so OCR and layout-based parsers leave that part empty: to them the region is invisible. To make the figure searchable we run a vision LLM over each extracted image and store a short description alongside the row, for example “a line chart of commodity prices since 2022” or “the Transformer architecture, an encoder of N stacked layers”. That description is text, so retrieval can match it. A companion piece on vision-LLM enrichment walks through this step in full.

*Each extracted image gets a one-sentence description, which is text retrieval can match – Image by author*

1.5. object_registry: cross-reference TARGETS

A cross-reference has two sides. The target is where a named object lives in the document: the line “Figure 2: The Transformer model architecture” on page 3, the line “Table 1: BLEU scores” on page 8. The source is a body-text mention pointing at the target: “as shown in Figure 2”, “see Table 1”. object_registry captures the target side, one row per caption. The next subsection (section 1.6) captures the source side. Resolving sources to target pages, so a retrieved chunk that mentions “see Table 1” also pulls the page where Table 1 lives, is a follow-up cross-reference pass that consumes both tables.

*captions for named objects, one row per target, `(object_type, object_id)` is the join key – Image by author*

How to build it: Detection uses regex patterns ANCHORED at the start of a line (a real caption starts there, a body-text mention doesn’t); build_object_registry walks line_df, matches each line against the patterns, and keeps the first hit for each (object_type, object_id) pair:

import re
import pandas as pd

# Caption patterns, anchored at the start of a line.
OBJECT_PATTERNS = [
    (re.compile(r"^\s*(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"^\s*Table\s+(\d+)\b",             re.IGNORECASE), "table"),
    (re.compile(r"^\s*(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]

def build_object_registry(line_df: pd.DataFrame) -> pd.DataFrame:
    """Returns one row per (object_type, object_id), first match wins."""

Run on the Attention paper, the builder lands one row per named object, with the caption line as the anchor:

*5 figures and 4 tables on the Attention paper, each with its caption anchor – Image by author*
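The article shows only the builder’s signature, so here is one way the body could look. This is a hedged sketch under assumptions, not the series’ implementation: line_df is assumed to carry page_num, line_num, and text columns, the output column names and the normalize_id helper are hypothetical, and the toy lines are invented.

```python
import re

import pandas as pd

# Same vocabulary as the article's caption patterns, anchored at line start.
CAPTION_PATTERNS = [
    (re.compile(r"^\s*(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"^\s*Table\s+(\d+)\b", re.IGNORECASE), "table"),
    (re.compile(r"^\s*(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]


def normalize_id(raw: str):
    """'2' -> 2, 'b' -> 'B' (hypothetical normalization for the join key)."""
    return int(raw) if raw.isdigit() else raw.upper()


def build_object_registry_sketch(line_df: pd.DataFrame) -> pd.DataFrame:
    """One row per (object_type, object_id); the first matching line wins."""
    seen = set()
    rows = []
    ordered = line_df.sort_values(["page_num", "line_num"])
    for line in ordered.itertuples(index=False):
        for pattern, object_type in CAPTION_PATTERNS:
            match = pattern.match(line.text)
            if match is None:
                continue
            key = (object_type, normalize_id(match.group(1)))
            if key not in seen:
                seen.add(key)
                rows.append(
                    {
                        "object_type": key[0],
                        "object_id": key[1],
                        "page_num": line.page_num,
                        "line_num": line.line_num,
                        "caption": line.text.strip(),
                    }
                )
            break  # a line matches at most one object type
    columns = ["object_type", "object_id", "page_num", "line_num", "caption"]
    return pd.DataFrame(rows, columns=columns)


toy_lines = pd.DataFrame(
    [
        {"page_num": 3, "line_num": 10, "text": "Figure 1: A toy diagram."},
        {"page_num": 4, "line_num": 2, "text": "as shown in Figure 1, the encoder"},
        {"page_num": 8, "line_num": 5, "text": "Table 1: Toy scores."},
        {"page_num": 9, "line_num": 1, "text": "Figure 1 is discussed again here."},
        {"page_num": 12, "line_num": 1, "text": "Annex b: Glossary"},
    ]
)
toy_registry = build_object_registry_sketch(toy_lines)
print(toy_registry)
```

The page 9 line shows the limit of anchoring alone: a body sentence that happens to start with “Figure 1” matches the pattern too, and only the first-hit rule keeps it out of the registry.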

1.6. cross_ref_df: cross-reference SOURCES

The symmetric half of object_registry. Each row is one body-text mention of a named object: “as shown in Figure 2” on page 4, “refer to Table 1” on page 7, “see Annex B for details” on page 12. Every such mention is a source that, when resolved, jumps to a page recorded in object_registry.

Same pattern as the TOC, two methods can produce these rows: native PDF links (the deterministic source, when the document carries them) and text-pattern matching on line_df (the general fallback, what build_cross_ref_df ships). Method 1 is exact but partial. Method 2 is approximate but complete.

Method 1, native PDF links: A PDF can carry its own clickable cross-references. fitz.Page.get_links() returns one entry per link rectangle, with the target encoded as a (target_page, to.x, to.y) triple for an internal jump or a URI for an external one:

import fitz
doc = fitz.open("data/nist/NIST.CSWP.29.pdf")
for page in doc:
    for ln in page.get_links():
        tgt_page = ln.get("page")
        tgt_pt   = ln.get("to")        # Point(x, y) on the target page
        print(page.number + 1, ln.get("kind"), tgt_page, tgt_pt, ln.get("uri"))

The interesting bit is to.y. Knowing only the target page tells you where in the document the link lands but not what it points at; the y coordinate pins the line within that page. We split the destination into two scalar columns, tgt_page and tgt_y, and resolve the target line by finding the row in line_df whose y0 is closest to tgt_y on tgt_page.

Two practical caveats here:

PDF generators differ on y orientation. LaTeX returns bottom-up, Acrobat returns top-down. The normalizer tries both and keeps the closer match.
tgt_y may sit between two lines. We round to the nearest one.
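Both caveats can be handled in one small function. The sketch below is an illustration under assumptions, not the series’ normalizer: it assumes line_df exposes page_num and a top-down y0 per line, that the caller supplies the target page’s height, and that tgt_page uses the same numbering as line_df.page_num. The function name and the toy rows are invented.

```python
import pandas as pd


def resolve_landing_line(line_df, tgt_page, tgt_y, page_height):
    """Return the line_df row closest to a link destination, or None.

    Tries tgt_y as given (top-down) and flipped (page_height - tgt_y,
    bottom-up), then keeps whichever reading lands closer to a line.
    """
    on_page = line_df[line_df["page_num"] == tgt_page]
    if on_page.empty:
        return None
    best_idx, best_dist = None, float("inf")
    for candidate_y in (tgt_y, page_height - tgt_y):
        dist = (on_page["y0"] - candidate_y).abs()
        idx = dist.idxmin()  # nearest line: this is the "rounding" step
        if dist.loc[idx] < best_dist:
            best_idx, best_dist = idx, dist.loc[idx]
    return on_page.loc[best_idx]


toy_line_df = pd.DataFrame(
    [
        {"page_num": 5, "line_num": 1, "y0": 72.0, "text": "1 Introduction"},
        {"page_num": 5, "line_num": 2, "y0": 400.0, "text": "body line"},
        {"page_num": 5, "line_num": 3, "y0": 700.0, "text": "footer line"},
    ]
)
# 720 read bottom-up and 75 read top-down both land on the header line.
from_bottom_up = resolve_landing_line(toy_line_df, tgt_page=5, tgt_y=720.0, page_height=792.0)
from_top_down = resolve_landing_line(toy_line_df, tgt_page=5, tgt_y=75.0, page_height=792.0)
print(from_bottom_up["text"], "|", from_top_down["text"])
```

Trying both orientations is a heuristic: on a page with lines near both candidate positions, the closer match can be the wrong one, so a tie-breaker based on the producing software may be worth adding.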

The payoff: once we know the landing line, we can join (target_page, landing_text) against toc_df and recover the section index directly. No regex, no text matching against breadcrumbs. The native link tells us exactly which toc_idx we landed in.

*4 NIST TOC links pointing at section starts, joinable to `toc_df` – Image by author*

The same pipeline on the Attention paper turns up a different kind of link: citations that resolve to bibliography entries rather than TOC section starts.

*3 Attention-paper citations resolved to bibliography lines via `landing_text` – Image by author*

Coverage is the catch. The two demo PDFs show the same pattern:

Attention paper: 95 internal links, all citations jumping to bibliography entries, plus 18 external URIs (github, arxiv). Zero native links for body-text mentions like “as shown in Figure 2”.

NIST Cybersecurity Framework 2.0 (CSWP-29; US Government work, public domain in the US, see NIST copyright statement): 47 internal links, all TOC entries and the list of figures pointing at section starts, plus 56 external URIs. Same story: no body-text figure or table mentions are linked.

Enterprise documents are usually worse, with no native links at all (scans, screenshots, exports from tools that drop link metadata). So native links are an excellent signal when present (deterministic, resolvable to a toc_idx when the target is a section header) but never cover the full set of cross-references a document carries.

Method 2, text-pattern matching: Detection uses the same vocabulary as OBJECT_PATTERNS, but UNANCHORED so the regex matches anywhere within a line; caption lines are excluded so the line that DEFINES Figure 2 isn’t also counted as a mention of it.

*one row per body-text mention, joinable back to `object_registry` – Image by author*

The patterns and the builder’s signature:

import re
import pandas as pd

# Mention patterns: same vocabulary as the captions, but unanchored.
REFERENCE_PATTERNS = [
    (re.compile(r"\b(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"\bTable\s+(\d+)\b",             re.IGNORECASE), "table"),
    (re.compile(r"\b(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]

def build_cross_ref_df(line_df: pd.DataFrame) -> pd.DataFrame:
    """One row per body-text mention, with ~30 chars of context."""

Run on the Attention paper, every body-text mention of a figure or table lands as a row, joinable back to object_registry:

*6 of 13 mentions; Figure 2 appears three times across pages 4-5 – Image by author*

Run on the demo PDFs, the Attention paper has 13 body-text mentions covering 6 unique objects (Figure 1, Figure 2, Table 1–4): some figures are referenced multiple times, which is exactly what the source-side table is meant to capture.
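As with the registry, only the signature is shown, so the following is a sketch of one possible body rather than the shipped build_cross_ref_df. Assumptions: line_df has page_num, line_num, and text columns; caption lines are excluded by looking up their (page_num, line_num) in object_registry, which is one plausible way to do it; the extra object_registry argument, the output column names, and the toy data are all hypothetical.

```python
import re

import pandas as pd

# Same vocabulary as the caption patterns, but unanchored (\b instead of ^).
MENTION_PATTERNS = [
    (re.compile(r"\b(?:Figure|Fig\.?)\s+(\d+)\b", re.IGNORECASE), "figure"),
    (re.compile(r"\bTable\s+(\d+)\b", re.IGNORECASE), "table"),
    (re.compile(r"\b(?:Annex|Appendix)\s+([A-Z0-9]+)\b", re.IGNORECASE), "annex"),
]


def build_cross_ref_df_sketch(line_df, object_registry, context_chars=30):
    """One row per body-text mention; caption lines are skipped."""
    caption_lines = set(
        zip(object_registry["page_num"].tolist(), object_registry["line_num"].tolist())
    )
    pad = context_chars // 2
    rows = []
    for line in line_df.itertuples(index=False):
        if (line.page_num, line.line_num) in caption_lines:
            continue  # this line defines the object, it does not mention it
        for pattern, ref_type in MENTION_PATTERNS:
            for match in pattern.finditer(line.text):
                raw = match.group(1)
                start = max(0, match.start() - pad)
                rows.append(
                    {
                        "page_num": line.page_num,
                        "line_num": line.line_num,
                        "ref_type": ref_type,
                        "ref_id": int(raw) if raw.isdigit() else raw.upper(),
                        "context": line.text[start : match.end() + pad],
                    }
                )
    columns = ["page_num", "line_num", "ref_type", "ref_id", "context"]
    return pd.DataFrame(rows, columns=columns)


toy_lines = pd.DataFrame(
    [
        {"page_num": 3, "line_num": 10, "text": "Figure 1: A toy diagram."},
        {"page_num": 4, "line_num": 2, "text": "as shown in Figure 1 and Table 2, see Fig. 1"},
        {"page_num": 5, "line_num": 1, "text": "see Annex B for details"},
    ]
)
toy_registry = pd.DataFrame(
    [{"object_type": "figure", "object_id": 1, "page_num": 3, "line_num": 10}]
)
toy_refs = build_cross_ref_df_sketch(toy_lines, toy_registry)
print(toy_refs[["page_num", "ref_type", "ref_id"]])
```

finditer is what lets one line contribute several rows, so a sentence that cites Figure 1 twice produces two mentions.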

NIST CSF 2.0 has 13 mentions (7 figure references, 5 annex references, 1 table reference) covering 10 unique objects (5 figures, 4 annexes, 1 table). The mismatch with NIST’s object_registry (6 figures + 3 annexes + 2 tables) is informative:

one annex is mentioned in the body without an anchored caption in the document (the regex catches a reference whose target lives outside the parsed text)
one registered figure and one registered table are never referenced

Both are real-world signals worth surfacing to a downstream cross-reference resolver.
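Both signals fall out of a set difference on the join key. The sketch below is illustrative only: it assumes cross_ref_df exposes (ref_type, ref_id) and object_registry exposes (object_type, object_id) with matching value conventions; the function name and the toy rows are invented and do not reproduce the NIST counts.

```python
import pandas as pd


def _sort_key(key):
    # ids may mix integers (figures, tables) and letters (annexes)
    return (key[0], str(key[1]))


def compare_mentions_to_registry(cross_ref_df, object_registry):
    """Return (dangling mentions, never-referenced objects) as sorted key lists."""
    mentioned = set(
        zip(cross_ref_df["ref_type"].tolist(), cross_ref_df["ref_id"].tolist())
    )
    registered = set(
        zip(object_registry["object_type"].tolist(), object_registry["object_id"].tolist())
    )
    dangling = sorted(mentioned - registered, key=_sort_key)
    unreferenced = sorted(registered - mentioned, key=_sort_key)
    return dangling, unreferenced


toy_refs = pd.DataFrame(
    [
        {"ref_type": "figure", "ref_id": 1},
        {"ref_type": "figure", "ref_id": 1},
        {"ref_type": "annex", "ref_id": "C"},
    ]
)
toy_registry = pd.DataFrame(
    [
        {"object_type": "figure", "object_id": 1},
        {"object_type": "table", "object_id": 2},
    ]
)
dangling, unreferenced = compare_mentions_to_registry(toy_refs, toy_registry)
print("mentioned but not registered:", dangling)
print("registered but never mentioned:", unreferenced)
```

A resolver could log the first list as unresolved references and the second as objects no body text points to.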

1.7. span_df: sub-line granularity (optional)

The line is sometimes too coarse. A line can mix bold and non-bold text (a defined term in a contract). A line in a research paper can include an inline equation in italic alongside prose. A line in an amendment can have the original text in black and the amendment in red.

from pydantic import BaseModel

class Span(BaseModel):
    # Identification & ordering
    pdf_hash: str
    page_num: int
    line_num: int
    span_id:  int

    # What it says, where it sits
    text: str
    bbox: tuple[float, float, float, float]

    # Typography signals
    font_name: str
    font_size: float
    is_bold:   bool
    is_italic: bool
    color_rgb: tuple[int, int, int]

A span_df is more granular than line_df. On the Attention paper the ratio is 3,480 spans for 1,048 lines, about 3.3× heavier. The cost only pays off for stages that inspect typography:

Heading detection: A line in a larger font, possibly bold, is probably a heading. A TOC reconstruction pass uses this when native bookmarks are absent.
Listing detection: A bold span starting a paragraph is often the marker of an enumeration item.
Defined terms in contracts: Bold or italicized terms in legal documents are often defined elsewhere; capturing them at parse time enables glossary linking later.
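The typography fields of Span (is_bold, is_italic, color_rgb) can be derived from what PyMuPDF reports per span. A minimal sketch, assuming the builder reads spans from page.get_text("dict"), where each span carries an integer flags bit field (bit 1 for italic, bit 4 for bold) and an integer sRGB color; the helper name decode_span_style is hypothetical and the series’ own builder is not shown here.

```python
# Bit values of the "flags" field in a PyMuPDF text span.
FLAG_ITALIC = 1 << 1  # 2
FLAG_BOLD = 1 << 4    # 16


def decode_span_style(flags: int, color: int) -> dict:
    """Map a span's raw flags and sRGB color integer to Span-style fields."""
    return {
        "is_bold": bool(flags & FLAG_BOLD),
        "is_italic": bool(flags & FLAG_ITALIC),
        # color is packed as 0xRRGGBB
        "color_rgb": ((color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF),
    }


# Toy inputs: a bold serifed span in red, and an italic span in black.
bold_red = decode_span_style(flags=FLAG_BOLD | 4, color=0xFF0000)
plain_italic = decode_span_style(flags=FLAG_ITALIC, color=0)
print(bold_red)
print(plain_italic)
```

Some PDFs signal bold only through the font name (a “-Bold” suffix) rather than the flag, so a production builder would likely check font_name as well.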

How to build it: Default behaviour: parse_pdf(...) returns span_df empty. The downstream stages that need it call a dedicated builder on the same PDF:


paper = parse_pdf(paper_pdf)
paper["span_df"] = build_span_df(paper_pdf)   # 3,480 rows on the Attention paper

Keeping the spans behind an explicit call avoids paying their cost on every parse for stages that only need line_df. Run on the Attention paper:

*rows 1-5 body text, rows 6-7 the section header in bold; `is_bold` keys the TOC reconstructor – Image by author*

1.8. parsing_summary: technical synthesis

A single JSON-serializable dictionary per document. It answers at a glance: “is this PDF scanned?”, “does it need OCR?”, “what extraction strategy should the next stage use?” And one more, the semantic one downstream bricks read: “what kind of document is this and what is it about?”

The dict is organised in five zones. The first four are deterministic, built by the parser without an LLM call. The fifth, semantic, carries the document type plus a short LLM-written summary that the question parser injects into its system prompt.

{
  "pdf_hash": "abc123...",
  "n_pages": 87,
  "pdf_version": "1.7",
  "source_software": "word_export",
  "creator_raw": "Microsoft Word 2019",
  "producer_raw": "Microsoft Word for Microsoft 365",
  "content_type": "scanned_with_ocr",
  "is_scanned": true,
  "has_text_layer": true,
  "ocr_quality": "good",
  "page_type_counts": {"scanned_ocr_good": 80, "native": 5, "empty": 2},
  "scanned_page_ratio": 0.92,
  "has_toc": true,
  "n_toc_entries": 24,
  "n_named_objects": 11,
  "is_encrypted": false,
  "has_form_fields": false,
  "recommended_strategy": "use_existing_ocr",
  "needs_reocr": false,
  "pages_needing_ocr": [],
  "doc_type": "annual_report",
  "typical_fields": ["fiscal_year", "revenue", "net_income", "auditor"],
  "summary": "87-page annual report for fiscal year 2023. Covers revenue, net income, and auditor's notes across operating segments. Standard sections: Letter to Shareholders, MD&A, Financial Statements, Notes."
}

The distinction between source_software (from metadata) and content_type (inferred from content) matters. The two can diverge: a PDF whose Producer is “Microsoft Word” but whose content is 100% scanned means somebody pasted images into a Word document and exported. That’s useful information; don’t overwrite one with the other.

The semantic zone follows the same rule on a different axis. doc_type is a coarse family (resume, contract, academic_paper, invoice, memo, annual_report, …) derived from filename + first-page text. Deterministic, no LLM. typical_fields is the per-doc_type table of field names a question about this kind of document is most likely to target; a resume gets [name, email, phone, experience, …], a contract gets [policyholder, premium, deductible, …]. summary is the only LLM-derived value in the dict: three to four factual sentences naming the document type, the main subject, and the fields it carries. One LLM call at parsing time, cached forever, injected into the question parser’s system prompt so “what’s the name?” on a CV no longer returns not found. The companion article on what to read before any line gets a number (“Beyond extract_text”) walks through the full design of that summary.

2. The relational model: how the tables link

Producing the tables is one thing; linking them is another. Once the tables exist, the keys they share turn eight separate DataFrames into one queryable model, and almost every link resolves back to line_df, the per-line source of truth.

*how the tables join: line_df at the centre, each table linked by its shared key – Image by author*

A few links carry most of the weight:

toc_df → line_df. A TOC entry knows its start_page (and start_y), so from any section you jump straight to the lines that belong to it. “Summarize section 3.5” becomes a page-range filter on line_df, no search required.
image_df ↔︎ line_df. An image occupies a position on the page, so it has a line slot in line_df. That line’s text is empty at first, since an image carries no extractable text. Optionally, a vision pass reads the image and writes a short description back into that text cell, so retrieval can match “the architecture diagram” later. The link is what makes that enrichment incremental: fill it when you need it, leave it empty when you don’t.
cross_ref_df → its target. A body-text mention resolves to wherever the target lives. “see Figure 2” resolves to object_registry on (ref_type, ref_id); “see section 2.3” resolves to a toc_df entry. The table fills in as references are matched, so resolution runs lazily, mention by mention.
page_df, span_df, object_registry anchor to line_df on page_num or (page_num, line_num), the same join every downstream brick relies on.

Concretely, common questions collapse into one or two filters:

“Summarize section 3.5.” Look up its start_page and end_page in toc_df, then line_df[line_df.page_num.between(start, end)]. No embedding, no keyword search, just the section’s lines.
“What are the totals?” on the invoice from section 3.2 → line_df[line_df.column_position == "right"]. The column the parser detected is now a query.
“What does Figure 2 show?” object_registry resolves the caption to its page and line; line_df returns the caption text; and if a vision pass has filled the image’s slot, you get the description too.
“Where is Table 1 referenced?” cross_ref_df[(cross_ref_df.ref_type == "table") & (cross_ref_df.ref_id == 1)] lists every mention with its (page_num, line_num), joined back to toc_df to name the section each one sits in.

Each is a filter or a join on tables already in memory, never a re-parse.

This is what the joins buy you downstream. Retrieval pulls a section from toc_df, expands it to its lines in line_df, and widens to the figures it mentions via object_registry; generation reads those lines; highlighting renders citations back onto the page by (page_num, line_num). The whole pipeline becomes a sequence of cheap joins on one parse, instead of re-reading the PDF at every step. How these joins become concrete SQL primary keys, foreign keys, and indexes is the storage layer’s job, beyond this article’s scope.

3. parse_pdf on two real PDFs, side by side

parse_pdf is the single entry point that calls every helper above and returns the full set of linked tables in one go. Run it on two very different PDFs and the output structure is identical: same keys, comparable shapes.

3.1. parse_pdf side-by-side on two real PDFs

Running both calls and laying the two returned dicts side by side shows that the keys hold up, with per-cell tallies that reflect each document’s shape:

*same keys, same shape across documents, with per-cell tallies – Image by author*

A LaTeX research paper and the NIST Cybersecurity Framework 2.0 (CSWP-29, US government work, public domain). Two very different documents: one has 15 pages of math notation in a NeurIPS-style two-column layout, the other 32 pages of policy text mixing single and two-column sections. Same parse_pdf call, same keys, every column comparable. The Attention paper drops a useful surprise along the way: this arXiv version carries 22 native TOC entries, contrary to the common assumption that arXiv strips bookmarks.

The PDF is opened once with fitz, every helper consumes the same document state, and the file is closed before returning. No reopening, no redownload from S3, no inconsistency between two helpers seeing different page versions. From here, retrieval, generation, and annotation never touch the PDF again. They query the dict.

3.2. column_position in motion (an bill)

Invoices are the canonical case for column_position: line gadgets run down the left column (descriptions), costs and totals stack down the proper column. We choose a one-page fictional bill (information/invoices/invoice_01.pdf, openly-licensed, generated for the collection) so the format is sincere two-column billing as a substitute of a analysis paper’s determine caption.

*each line boxed by the column the parser assigned: blue = left (descriptions), green = right (amounts and totals) – Image by author*

Look at the source page first. Each line is boxed by the column the parser gave it: blue for the left (descriptions), green for the right (amounts and totals). assign_column_positions picks up that split cleanly:

*header line on the left at x0 ≈ 54, totals stack on the right at x0 ≈ 391-514, a line item splits a description on the left and quantity + price on the right at the same y0 – Image by author*

The header line sits in the left column at x0 = 54. Below the items table, the totals stack on the right: “TOTAL DUE:” at x0 ≈ 391, the amount $2,027.56 at x0 ≈ 497. The line item at y0 = 397.13 shows the split clearly: the description “Staff training” sits at x0 = 54 (left), the quantity 0.5 and unit price $197.58 sit at x0 ≈ 343 and x0 ≈ 395 (right). Downstream, asking for the amounts and totals becomes a one-line query against line_df: line_df[line_df["column_position"] == "right"].

No vision pass, no bbox arithmetic. Just a column filter on a structured table.
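
To make the filter concrete, here is a small sketch that rebuilds a few of the invoice lines quoted above and labels them with a fixed x0 threshold. The threshold rule, the split_x value of 300, the "left"/"right" labels, the header text, and the y0 values other than 397.13 are all assumptions for illustration; the article does not show how assign_column_positions actually decides, so assign_by_threshold is only a stand-in.

```python
import pandas as pd

# A few lines from the invoice example. x0 values come from the text;
# the header text and the y0 values other than 397.13 are placeholders.
line_df = pd.DataFrame({
    "text": ["Invoice header", "Staff training", "0.5", "$197.58", "TOTAL DUE:", "$2,027.56"],
    "x0": [54.0, 54.0, 343.0, 395.0, 391.0, 497.0],
    "y0": [60.0, 397.13, 397.13, 397.13, 520.0, 520.0],
})


def assign_by_threshold(df, split_x=300.0):
    """Hypothetical stand-in for assign_column_positions: split on a fixed x0."""
    out = df.copy()
    out["column_position"] = out["x0"].map(lambda x: "right" if x >= split_x else "left")
    return out


line_df = assign_by_threshold(line_df)

# The one-line query: everything the parser placed in the right column.
right_lines = line_df[line_df["column_position"] == "right"]

# The line item at y0 = 397.13 splits across both columns.
same_row = line_df[line_df["y0"] == 397.13]
```

A fixed threshold is the simplest possible rule; a real parser would more likely derive the split from the page's own x0 distribution, but the downstream query stays the same either way.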

3.3. Two PDFs, same parser, same shape

Two very different documents, the same parser, directly comparable structured outputs:

*same parser, two PDFs; lines, columns, TOC entries, named objects all queryable – Image by author*

What this would have looked like with a naive get_text() parser: a string per document, no way to tell which lines were OCR’d and which were native, no idea where each figure caption sits, no separation between the left and right halves of a two-column page. The retrieval and generation stages would have been built on sand.

4. Save once, reload forever

Parsing is the most expensive brick in the pipeline. Question parsing, retrieval, and generation each cost one LLM call; parsing reads bytes and resolves layout. With PyMuPDF it stays cheap (sub-second on a small paper). With heavier engines (Azure Layout, Tesseract, vision-LLM fallback), the same PDF can take 30 seconds to several minutes per run. Without a cache, three iterations on a downstream prompt mean three OCR runs. There is no reason for that.

The fix is path-driven. Each PDF writes its parsed tables to a mirror folder under the output directory, matching the source path exactly. From the PDF path alone, every downstream step (retrieval, generation, annotation) knows where the cache lives.
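
The path mapping can be sketched with pathlib alone. The function name cache_dir_for, the root folder names data and output, and the choice to name the twin folder after the PDF without its extension are assumptions for this example, not the series' actual code.

```python
from pathlib import Path


def cache_dir_for(pdf_path, data_root="data", output_root="output"):
    """Map data/<sub>/<name>.pdf to output/<sub>/<name>/ (assumed layout)."""
    # relative_to raises ValueError if the PDF is not under data_root,
    # which prevents caches from landing outside the mirror tree.
    relative = Path(pdf_path).relative_to(data_root)
    # Drop the .pdf suffix so the twin is a folder named after the document.
    return Path(output_root) / relative.with_suffix("")


cache_dir = cache_dir_for("data/invoices/invoice_01.pdf")
```

Because the mapping is a pure function of the PDF path, no step needs a lookup table or a database to find its inputs.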

*every PDF in `data/` has a twin folder in `output/` carrying its parsed tables – Image by author*

The relational tables go to .xlsx (one file per table, opens with a double-click), parsing_summary to JSON. Excel is enough at this stage: pandas round-trips cleanly, and each table stays inspectable in any spreadsheet tool. A production storage layer swaps in SQLite (foreign keys, joins across documents, append-on-update), but the downstream bricks consume DataFrames either way.

save_parsed writes the folder; load_parsed returns the same dict, or None if the cache is missing. The calling pattern is a short cache check:


parsed = load_parsed(pdf_path)
if parsed is None:
    parsed = parse_pdf(pdf_path)
    save_parsed(pdf_path, parsed)
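
For readers who want to see the round trip, here is one possible shape for the two helpers. It is a sketch under stated assumptions: the names save_parsed_sketch and load_parsed_sketch are hypothetical, they take the cache folder directly instead of the PDF path, CSV stands in for the article's .xlsx files so the example runs without an Excel engine, and the n_pages key in the summary is invented for the demo.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def save_parsed_sketch(cache_dir, parsed):
    """Write each DataFrame to <name>.csv and anything else to <name>.json."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for name, value in parsed.items():
        if isinstance(value, pd.DataFrame):
            value.to_csv(cache_dir / f"{name}.csv", index=False)
        else:
            (cache_dir / f"{name}.json").write_text(json.dumps(value))


def load_parsed_sketch(cache_dir):
    """Return the cached dict, or None when the folder does not exist."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return None
    parsed = {}
    for path in sorted(cache_dir.iterdir()):
        if path.suffix == ".csv":
            parsed[path.stem] = pd.read_csv(path)
        elif path.suffix == ".json":
            parsed[path.stem] = json.loads(path.read_text())
    return parsed


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "invoices" / "invoice_01"
    missing = load_parsed_sketch(cache_dir)  # nothing cached yet
    parsed = {
        "line_df": pd.DataFrame({
            "page_num": [1, 1],
            "line_num": [1, 2],
            "text": ["TOTAL DUE:", "$2,027.56"],
        }),
        "parsing_summary": {"n_pages": 1},  # hypothetical summary key
    }
    save_parsed_sketch(cache_dir, parsed)
    reloaded = load_parsed_sketch(cache_dir)
```

Plain-text formats lose dtype information that a typed store would keep (dates and leading-zero identifiers in particular), which is one reason a production layer would move to SQLite.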

The downstream bricks follow suit. Question parsing writes its ParsedQuestion to questions//parsed_question.json, retrieval saves retrieved_pages.xlsx, generation saves answer.json. Every step is fully recoverable from disk, and every step can be replayed without touching the LLM again. When you tweak a generation prompt, you’re not paying for parsing or retrieval to re-run.

5. Conclusion

A good RAG parser doesn’t extract text. It turns an unstructured PDF into a relational model of the document: a set of linked tables, joined by shared identifiers (page_num, line_num, (ref_type, ref_id)), each carrying one entity. Retrieval, generation, and annotation never re-read the PDF afterwards; they query DataFrames. Saving the parse once and reloading it forever turns a 30-second-per-question latency into a per-corpus one-shot cost.

One PDF in, a relational set of tables out, no flat string. Every downstream tool the team wires onto the parser (keyword search, embedding similarity, section retrieval, citation rendering, audit log, change tracking) reads from these tables rather than from the original bytes. The PDF is opened once, at ingest. After that, everything is SQL or pandas. That property is what makes the parsing brick worth the engineering investment: the cost is paid once per document, and every iteration on the rest of the pipeline runs against a stable, queryable artefact.

This article is part of the Enterprise Document Intelligence series. The minimal RAG pipeline shows the relational tables in use end-to-end on a real PDF.

6. Sources and further reading

The parser this article describes follows the same architecture as Docling (Auer et al., Docling Technical Report, IBM Research 2024): layout detection, TableFormer, reading order. Borderless table extraction uses the model from Smock et al. (PubTables-1M / Table Transformer, CVPR 2022). The page-class taxonomy is built on the same baseline as Pfitzmann et al. (DocLayNet, KDD 2022). The article adds a render-mode detection pass (native / scanned / mixed) with OCR-quality scoring on top. The parser produces a relational set of tables (line_df, page_df, image_df, toc_df, object_registry, cross_ref_df, span_df, plus a parsing_summary dict); retrieval, generation, and annotation downstream do not read the PDF again, they query DataFrames.

Same direction as the article:

Auer et al., Docling Technical Report, IBM Research 2024 (arXiv:2408.09869). Reference architecture for the pipeline this article describes: layout detection, TableFormer, reading order, unified document representation.
Smock, Pesala, Abraham, PubTables-1M / Table Transformer (TATR), CVPR 2022 (arXiv:2110.00061). Vision-based table detection and structure recognition; the model behind most modern table parsers.
Pfitzmann et al., DocLayNet, KDD 2022 (arXiv:2206.01062). Empirical baseline for the page-class taxonomy and layout detection benchmarks.
Lo et al., PaperMage, EMNLP 2023 demos. Maps to the indexing-vs-reading split (parsing for retrieval is not parsing for answer generation).

Different angle, different context:

Faysse et al., ColPali: Efficient Document Retrieval with Vision Language Models, 2024 (arXiv:2407.01449). Vision-language retrieval on the page image. The context is retrieval where the page image is the artefact, with no parsing-into-tables step. This article uses bounding-box-anchored DataFrames as the foundation instead.
Wang et al., DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding, JPMorgan 2024 (arXiv:2401.00908). Layout-aware LLM that reads the PDF directly without an explicit relational parsing brick. Same family of approach as ColPali; different from this article’s queryable relational artefact.
Kim et al., OCR-free Document Understanding Transformer (Donut), ECCV 2022 (arXiv:2111.15664). End-to-end OCR-free document understanding; a useful contrast with the OCR-quality-scoring pass this article adds on top of the render-mode detection.
