Baseline Enterprise RAG, From PDF to Highlighted Answer

The fastest way to understand what RAG is is to build the smallest version that actually works, run it on a real document, and look closely at what just happened.

That’s this article. A few hundred lines of Python (no vector database, no framework, no agents) running on the Attention Is All You Need paper (Vaswani et al. 2017; arXiv non-exclusive distribution license, declared on the arXiv abstract page), returning a sourced answer with the exact source lines highlighted on the page.

Then we walk back through each block and ask the question it naturally raises. Each question is what a later article develops.

The minimal pipeline is the smallest amount of code that respects the four bricks and produces a verifiable answer. Each later article adds a capability the team needs after a specific failure on real documents, not because the architecture needed more layers.

1. What we’re building

The pipeline has four bricks (Part II goes into each one in detail) plus a final, optional rendering step. Each brick says what it takes in and what it gives back; what we pass from one brick to the next is what we save.

Document parsing takes a PDF path and returns line_df (one row per text line, with page_num, line_num, text, and the bounding box) plus page_df. The minimal version holds both in memory; bigger systems persist them (Article 23 covers when to move to a database).
Question parsing turns the user’s question into a ParsedQuestion carrying the normalized question plus a short list of checked keywords. It stays narrow on purpose: no retrieval logic here, no question embedding.
Retrieval consumes the ParsedQuestion and emits top-k page numbers (and, when needed, the matching line numbers within those pages). Limiting the handoff to page numbers keeps it small; the next step rebuilds the filtered lines from line_df on the spot. The question embedding lives in this brick because it depends on the corpus index.
Generation brings together the question, line_df, and the retrieved page numbers, and produces an AnswerWithEvidence: a typed JSON carrying the answer, the evidence span (start_page, start_line, end_page, end_line), a confidence, a justification, the exact quotes from the source, and any caveats. The full JSON is worth saving for evaluation, audit, and replay.
PDF annotation is optional. Given the source PDF and the evidence span, it writes an annotated PDF with rectangles drawn around the cited lines. A CLI tool, a batch job, or an API client can skip it; the answer with citations is already complete after generation.

The first four are the four bricks (Article 5 develops document parsing, Article 6 question parsing, Article 7 retrieval, Article 8 generation). PDF annotation is the rendering step, not a brick in itself.

*The baseline RAG pipeline, end to end – Image by author*

A PDF and a question go in. Each brick turns its input into something more structured: document parsing turns the PDF into rows, question parsing turns the question into search-ready keywords, retrieval cuts the rows down to a few page numbers, generation produces a typed answer, and PDF annotation draws the cited lines back onto the source. What comes out is not a chatbot bubble. It’s a sourced JSON answer plus an annotated PDF you can open and check.

The dependencies are minimal:

pymupdf parses PDFs into text plus position data; the bounding boxes it returns are what we use to highlight the answer back on the source page.
openai is the LLM client; via base_url the same library serves Azure, OpenRouter, Ollama, or any compatible endpoint.
pandas holds the document as a DataFrame, the format every parsing and retrieval step uses.
pydantic defines the answer schema that forces structured JSON with citations.

No vector database, no orchestration framework, no specialized RAG library. Later articles look at when these libraries’ helpers become useful, and when they get in the way of seeing what’s going on.

“For a 15-page paper, the LLM can read the whole thing. Why bother with retrieval?” Fair point on this one document. We use the paper to teach the method, not to save tokens on these 15 pages. The objection often points to the Needle in a Haystack benchmark (Kamradt, 2023), where frontier models score near-perfectly retrieving a single verbatim sentence from a 1M-token context.

That benchmark is research, not practice. A needle is one isolated, verbatim fact, while enterprise questions aggregate (“every contract whose deductible exceeds €5,000”), compare (“clause 12 across these three policies”), or summarize across many passages. None of those is a single sentence to find.

Two more practical reasons keep retrieval in the loop. Enterprise documents are often long:

a 300-page insurance contract,
a 500-page regulatory filing,
a multi-volume technical specification.

Sending the whole thing to the LLM costs real money on every question, every rerun, every user, and dilutes its attention across irrelevant pages.

And the same question runs across hundreds or thousands of documents at once:

“find every contract that excludes earthquake damage”,
“summarize this year’s regulatory changes across all filings”.

At that scale, “throw it all in” stops being a strategy. Retrieval is what makes the pipeline survive both moves: from one short paper to one long contract, and from one document to a whole corpus.

2. The four bricks, and a PDF highlight

Each step declares its inputs and outputs, and the steps are independent. The output of step N is the input of step N+1, saved as a named DataFrame so any step can be re-run on its own against the saved output of the previous one. In the AI-coding era, an assistant told to “fix retrieval” can quietly modify the question parser when it should have stayed untouched. Independent modules are how you work confidently on one piece without breaking the rest.

The setup chunks below load them alongside the OpenAI client.

Every brick that talks to a model needs a configured client. The series uses OpenAI’s Python SDK; any provider that exposes an OpenAI-compatible endpoint (Azure OpenAI, vLLM, llama.cpp’s --api-server, …) drops in by changing base_url and the model name.

import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI(
    api_key=os.getenv("API_KEY"),
    base_url=os.getenv("BASE_URL"),
)
model_chat = os.getenv("MODEL_CHAT", "gpt-4.1")
model_embed = os.getenv("MODEL_EMBED", "text-embedding-3-small")

2.1 Document parsing

We extract every text line of the PDF together with its position on the page. The output is a DataFrame where each row is one line, with page_num, line_num, the text itself, and the four bounding-box coordinates x0, y0, x1, y1.

In: a PDF path.

Out: line_df (one row per text line, with page_num, line_num, text, and the bounding box) plus a page_df we’ll build in section 2.3.

The bounding boxes matter: they’re what we use to draw highlights on the source PDF at the end.

import fitz  # PyMuPDF
import pandas as pd

def fitz_pdf_to_line_df(file_path):
    """Return one row per text line: page_num, line_num, text, bounding box."""
    doc = fitz.open(file_path)
    data = []
    for page_num in range(len(doc)):
        page = doc[page_num]
        blocks = page.get_text("dict").get("blocks", [])
        line_num = 0
        for block in blocks:
            # Block type 0 is text; skip image blocks.
            if block.get("type") != 0:
                continue
            for line in block.get("lines", []):
                spans = line.get("spans", [])
                if not spans:
                    continue
                text = "".join(s["text"] for s in spans)
                # The union of the span boxes is the bounding box of the line.
                rect = fitz.Rect(spans[0]["bbox"])
                for span in spans[1:]:
                    rect |= fitz.Rect(span["bbox"])
                data.append({
                    "page_num": page_num + 1,  # 1-based, as shown to the LLM
                    "line_num": line_num + 1,
                    "text": text,
                    "x0": float(rect.x0), "y0": float(rect.y0),
                    "x1": float(rect.x1), "y1": float(rect.y1),
                })
                line_num += 1
    return pd.DataFrame(data)

Running line_df = fitz_pdf_to_line_df(pdf_path) on the Attention paper returns 1048 lines across 15 pages.

*First 5 rows of line_df with page, line number, text, and bounding box – Image by author*

The paper, turned into rows. Each line is one row, with its text and the four numbers that locate it on the page. The x0, y0, x1, y1 columns don’t mean much yet; in section 2.5 they’re what we use to draw rectangles on the source PDF, exactly over the lines the model cited.

This DataFrame, line_df, is the core data structure of the rest of the series. Article 5 introduces a richer relational model around it (line_df, chunk_df, toc_df, page_df, image_df).

What this parser doesn’t do: detect tables (Table 1 page 4, Table 3 page 9 flatten into plain lines), reconstruct headings, footnotes, cross-references, or handle multi-column layouts. None of this matters for the question we ask here. For other questions on the same paper, it will. Article 5 covers parsing in full.

2.2 Question parsing

Before the question goes to retrieval, we run it through a tiny LLM call. The goal is to extract the keywords most useful for searching the document: short phrases the document is likely to use, not necessarily the literal words of the question.

In: a text question.

Out: a ParsedQuestion holding the normalized question and a short list of checked keywords.

This step doesn’t know about retrieval. It doesn’t compute the question embedding either. That one is tied to the corpus index and lives in section 2.3. Keep that line clean and you can swap the embedding model or add a hybrid retriever tomorrow without touching question parsing.

Why bother on a minimal pipeline? Two reasons:

You can explain why retrieval picked what it picked. When the system answers wrong, we can see whether the keywords were off (question-parsing problem) or the right keywords landed on the wrong page (retrieval problem). Without question parsing, retrieval is a black box.
The question is a real input, just like the document. Section 2.1 parsed the document into line_df. This subsection parses the question into ParsedQuestionMinimal. Both inputs need to be parsed before they hit the search step. Article 6 builds the richer brick (parse_question, with answer shape, scope filters, decomposition, …).

On the question “What are the options mentioned for positional encoding?”, the call parsed_question = get_keywords_from_question(question, client=client) returns parsed_question.keywords = ['positional encoding'].

question = "What are the options mentioned for positional encoding?"
parsed_question = get_keywords_from_question(question, client=client)
print(parsed_question.keywords)

['positional encoding']

The LLM produces a single, literal phrase like ['positional encoding']. That’s deliberate. An earlier draft of this prompt asked for “3 to 5 short keywords useful for searching”, and the LLM happily filled the quota with paraphrases (positional encoding options, types of positional encoding, transformer positional encoding). None of those are written in the document. Only positional encoding is. Substring matching is strict: a single missing word kills the match. The minimal version asks the LLM to do less (extract the literal noun phrase, drop the question framing) and trusts the next block to do the rest.

What this minimal version doesn’t do:

detect an answer_shape (Q&A vs summarization)
decompose compound questions
pull from a domain glossary
attach retrieval hints

All covered in Article 6, under the richer parse_question brick. Here we keep two fields, corrected_question and keywords, the smallest version that makes the brick visible.

Note: overriding the system prompt. get_keywords_from_question exposes the system prompt as a kwarg with KEYWORDS_PROMPT as default. To test a variant (different domain, stricter rules, extra examples), pass system_prompt=... at the call site. No edit to the function. Same pattern for every LLM helper in docintel (llm_answer_with_evidence exposes both system_prompt and user_template). Below: the same call, run twice on a contract-style question. First with the research-paper default, which stays generic. Then with a contract-domain prompt, which picks up insurance vocabulary like exclusions, deductible.


demo_question = "Are earthquakes excluded from coverage?"

# Default: research-paper prompt.
parsed_question_default = get_keywords_from_question(demo_question, client=client)
print("Default (research-paper):", parsed_question_default.keywords)

# Override: insurance / legal contract prompt.
contract_prompt = (
    "Extract 1 to 3 short keywords from the user question for searching an "
    "insurance contract or legal policy. Prefer literal terms the contract is "
    "likely to use: clauses, exclusions, named perils, deductibles, caps. Drop "
    "question framing words. Output 1 to 3 keywords."
)
parsed_question_contract = get_keywords_from_question(
    demo_question, system_prompt=contract_prompt, client=client,
)
print("Contract prompt:         ", parsed_question_contract.keywords)

Default (research-paper): ['earthquakes', 'coverage']
Contract prompt:          ['earthquakes', 'exclusions', 'coverage']

2.3 Retrieval

Sending all 1048 lines to the LLM works on a paper this size but doesn’t scale and dilutes the model’s attention. We cut the document down to the few pages most likely to contain the answer.

In: the checked keywords (and/or the normalized question, depending on the method) from section 2.2.

Out: the top-k page numbers, plus optionally the matching line numbers within those pages.

The question embedding is computed here, not in section 2.2, because an embedding only makes sense relative to the index it was built on. Same logic for any hybrid scoring or BM25 statistics.

The standard answer in 2024 RAG tutorials is embeddings: turn each page into a vector, score by cosine similarity. Article 2 is devoted to them. For the minimal version, we deliberately don’t, for one reason.

Embeddings are opaque. Cosine similarity returns a number like 0.7798 and asks the user to trust that “page 6 is relevant to the question”. Show that score to a domain expert, a product owner, or a manager: nobody understands what 0.78 means, or why it’s better than 0.65. Developers may argue they understand it (“dot product of normalized vectors”). They understand the math, not the relevance. Asked why this specific page scored 0.7798 against this specific question, they shrug and point at the model.

In an enterprise context, retrieval is the step users question the most. Why did the system look at this page and not that one? You have to explain it. So the minimal version uses something we can read with our own eyes: keyword matching. Section 2.2 pulled the keywords; we score each page by how many of those keywords appear in it, and keep the top three.

Where we search vs what we return: both pages here. Real retrieval has two levels. The anchor is where the keyword or embedding actually hits (a line, a sentence). The context is what we hand to generation (the lines around it, the page). We search small, we return big. Here we use the page for both. That works on an academic paper where each page is roughly one idea. Article 7 separates the two levels for long contracts, multi-column reports, table-heavy documents.

page_df = build_page_df(line_df) collapses the 1048 lines into 15 pages, one row per page.

*First 5 rows of page_df, one row per page with the full text concatenated – Image by author*

2.3.a Embeddings + cosine similarity

Embed every page (one call per page), embed the question, compute cosine similarity, keep the top-k. The output: a number like 0.7798 per page. Look at the scores below: can you tell why a page made the top three? Could you explain the ranking to a domain expert? That’s the opaque-score problem the article opens with.

*Top three pages by cosine similarity. Precise scores, opaque ranking – Image by author*

Three numbers, all very close to each other (0.7843, 0.7798, 0.7728). Can you say why page 9 beats page 6? The text preview makes the problem obvious: page 9 is the Variations on the Transformer architecture table, page 5 is about output values and concatenation, page 6 is the Maximum path lengths table. The page that actually answers the question, section 3.5 Positional Encoding, sits on page 6 and ranks last in the top three. The unrelated page 5 ranks second. The scores look precise, but the ranking has no story behind it: there is no token to point at, no phrase to defend, just a dot product on two black-box vectors. Embeddings work in many cases, and Article 2 unpacks where this score comes from. But the score itself never becomes interpretable, and for the rest of this article we use a retriever you can read with your own eyes.
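To make the mechanics concrete, here is a minimal sketch of the ranking step in plain Python. The three-dimensional vectors are made up for illustration (real page embeddings come from the embedding model and have hundreds of dimensions), and cosine_similarity and top_k_pages are hypothetical helper names, not the series code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_pages(question_vec, page_vecs, k=3):
    """Rank pages by cosine similarity to the question, highest first."""
    scored = [
        (page_num, cosine_similarity(question_vec, vec))
        for page_num, vec in page_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Made-up vectors, for illustration only.
question_vec = [1.0, 0.0, 0.0]
page_vecs = {
    5: [0.9, 0.1, 0.0],
    6: [0.8, 0.2, 0.1],
    9: [1.0, 0.05, 0.0],
    12: [0.0, 1.0, 0.0],
}
for page_num, score in top_k_pages(question_vec, page_vecs):
    print(page_num, round(score, 4))
```

The ranking comes out, but nothing in the printed numbers says which words of the page drove it.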

2.3.b Keyword matching

For each page, count how many of parsed_question.keywords appear in it (case-insensitive substring match). Drop pages with zero matches; keep the top-k by match count. The output table below carries the exact matched_keywords per page, so anyone can read it and see why a page was picked.

retrieve_pages(page_df, line_df, parsed_question.keywords, top_k=3) returns the top three pages by keyword count plus the filtered lines: 314 lines kept from pages 6, 8, and 9.

*Top three keyword-matched pages, with the matched terms shown per page – Image by author*

Three pages, ranked by match count, with the exact matches laid out. Pages 6, 8, and 9 each contain the literal phrase positional encoding; page 6 holds Section 3.5 Positional Encoding with the actual answer. Anyone reading the table can verify the result by hand: search the source for positional encoding and you’ll find these three pages.

Two design choices:

Drop pages with zero matches. A retrieval that says “nothing matches” is more useful than one that pads with three random pages. The schema’s null path (next subsection) handles the empty case cleanly.
We don’t break ties. When pages tie at the same match count, the order is whatever pandas’ nlargest returns. The downstream LLM sees the lines from all tied pages in document order and decides.

From 1048 lines to 314, and we know the right material is in there.
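The matching rule described above fits in a few lines. What follows is a minimal sketch in plain Python rather than the series’ retrieve_pages: the function names, the dict-of-pages input, and the page texts are all assumptions made for illustration. Ties here keep input order because Python’s sort is stable, which may differ from what pandas’ nlargest does.

```python
def match_keywords(page_text, keywords):
    """Return the keywords that occur in the page (case-insensitive substring)."""
    lowered = page_text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]

def retrieve_pages_by_keywords(pages, keywords, top_k=3):
    """pages: {page_num: page_text}. Returns [(page_num, matched_keywords), ...]."""
    scored = []
    for page_num, text in pages.items():
        matched = match_keywords(text, keywords)
        if matched:  # drop pages with zero matches
            scored.append((page_num, matched))
    # Highest match count first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda item: len(item[1]), reverse=True)
    return scored[:top_k]

# Toy page texts, invented for illustration (not the paper's exact wording).
pages = {
    5: "Multi-head attention concatenates the output values.",
    6: "3.5 Positional Encoding. We add positional encodings to the embeddings.",
    8: "We apply dropout to the sums of the embeddings and the positional encodings.",
}
print(retrieve_pages_by_keywords(pages, ["positional encoding"]))
print(retrieve_pages_by_keywords(pages, ["types of positional encoding"]))
```

The second call returns an empty list: the paraphrase never appears verbatim, which is the strictness of substring matching in action.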

import numpy as np

def cosine_sim_matrix(query_vec, doc_matrix):
    # Normalize both sides so the dot product is the cosine similarity.
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q

def retrieve_pages(page_df, line_df, question, top_k=3):
    # get_embedding is assumed to be defined elsewhere in the setup.
    q_vec = np.asarray(get_embedding(question), dtype=np.float32)
    doc_matrix = np.vstack(page_df["embedding"].values)
    sims = cosine_sim_matrix(q_vec, doc_matrix)

    scored = page_df.copy()
    scored["similarity"] = sims
    retrieved_pages_df = scored.nlargest(top_k, "similarity")

    # Keep only the lines that belong to the retrieved pages.
    kept_pages = retrieved_pages_df["page_num"].tolist()
    filtered_line_df = line_df[line_df["page_num"].isin(kept_pages)]
    return retrieved_pages_df, filtered_line_df

Note: the “split into individual words” trap. A natural reflex when the multi-word phrases don’t match: split them and search for the individual tokens. Below we expand every keyword into its words, deduplicate, then re-run retrieval. We get matches, and we also get false positives, because words like encoding, transformer, network appear all over the document in unrelated contexts.

Now every page in the top three matches several tokens, but look at which tokens. Words like encoding and transformer cover most of the paper. Pages about layer encoding or encoder stacks look as relevant as the page that actually answers the question. Splitting trades one failure (zero matches) for another (false positives). Article 7 covers the real fixes (synonym expansion through a dictionary, hybrid scoring); for now, keep the phrase whole.
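The trade-off is easy to reproduce on toy data. The sketch below assumes hypothetical helpers (split_keywords, pages_matching) and invented page texts; it only illustrates how splitting a phrase widens the match set.

```python
def split_keywords(keywords):
    """Expand multi-word keywords into unique single words, preserving order."""
    seen, words = set(), []
    for kw in keywords:
        for word in kw.lower().split():
            if word not in seen:
                seen.add(word)
                words.append(word)
    return words

def pages_matching(pages, terms):
    """Page numbers whose text contains at least one term (case-insensitive)."""
    return [
        num for num, text in pages.items()
        if any(term.lower() in text.lower() for term in terms)
    ]

# Toy page texts, invented for illustration.
pages = {
    3: "The encoder maps an input sequence; each layer has two sub-layers.",
    6: "3.5 Positional Encoding. We use sine and cosine functions.",
    7: "Byte-pair encoding was used for the vocabulary.",
}
phrase = ["positional encoding"]
print(pages_matching(pages, phrase))                  # whole phrase
print(pages_matching(pages, split_keywords(phrase)))  # split into words
```

With the whole phrase only the relevant page matches; once split, the word encoding also pulls in the byte-pair page, which has nothing to do with the question.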

2.3.c A harder question: where each retriever breaks

Same pipeline, a different question. We ask about the value of epsilon used in label smoothing. The answer is on page 8 of the paper, written as ε_ls = 0.1 (Greek letter ε, never the English word epsilon). Watch what each retriever does.

question_2 = "What is the value of epsilon used in label smoothing?"
parsed_question_2 = get_keywords_from_question(question_2, client=client)
print("Keywords:", parsed_question_2.keywords)

Keywords: ['epsilon', 'label smoothing']

Two failures of different shapes:

Embeddings rank pages by topical proximity. The right page (page 8, where ε_ls = 0.1 lives) may or may not be in the top three. Pages dense in math notation come up even when they’re unrelated.
Keywords are blind to symbols. The LLM emits epsilon, label smoothing, etc. The document writes the Greek letter ε. Substring match returns zero on anything that mentions epsilon by symbol only. The page that contains the answer is invisible to the keyword retriever.

Section 4.4 picks this up as the bridge to Article 2 (Embeddings handle synonyms and surface variation) and Article 6 (richer Question Parsing pulls in alternatives like the Greek letter).
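As a preview of the question-parsing side of that fix, here is a small sketch of keyword expansion through an alias table. The table, the function name expand_with_aliases, and the page text are hypothetical; a real system would draw aliases from a domain glossary rather than a hard-coded dict.

```python
# Hypothetical alias table; a real one would come from a domain glossary.
SYMBOL_ALIASES = {
    "epsilon": ["ε", "ϵ"],
    "alpha": ["α"],
}

def expand_with_aliases(keywords, aliases=SYMBOL_ALIASES):
    """Append known symbol spellings after each keyword, without duplicates."""
    expanded = []
    for kw in keywords:
        for candidate in [kw, *aliases.get(kw.lower(), [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

# Toy text standing in for the relevant passage on page 8.
page_8 = "Label Smoothing: we employed label smoothing of value ε_ls = 0.1."
keywords = ["epsilon"]

found_plain = any(kw.lower() in page_8.lower() for kw in keywords)
found_expanded = any(
    kw.lower() in page_8.lower() for kw in expand_with_aliases(keywords)
)
print(found_plain, found_expanded)
```

The plain keyword misses the page and the expanded list finds it. The cost is the same as with any expansion: a bare ε also matches every other formula that uses the symbol.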

2.4 Generation

We send the retrieved lines to the LLM with the question, formatted as a tab-separated block where page_num and line_num sit next to each line. That format gives the LLM the exact coordinates it needs to cite.

In: the original question, line_df, and the retrieved page numbers from section 2.3.

Out: an AnswerWithEvidence, a structured JSON with the answer, the evidence span (start_page_num, start_line_num, end_page_num, end_line_num), a confidence, a justification, the exact quotes, and any caveats.

from pydantic import BaseModel, Field

class AnswerWithEvidence(BaseModel):
    answer: str = Field(...)

    # Evidence span; all four are None when the model finds no supporting lines.
    start_page_num: int | None
    start_line_num: int | None
    end_page_num: int | None
    end_line_num: int | None

    confidence: float = Field(..., ge=0.0, le=1.0)
    justification: str = Field(...)

    quotes: list[str] = Field(default_factory=list)
    caveats: list[str] = Field(default_factory=list)

The raw JSON is worth saving in production: justification, quotes, caveats, and confidence all feed evaluation, audit, and replay, well beyond the answer field a chat UI shows.

We serialize the filtered lines into a TSV with header page_num\tline_num\ttext, one row per line. The LLM sees the exact coordinates next to each text fragment so it can cite by (page_num, line_num) in its answer.
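A minimal sketch of that serialization, assuming the lines arrive as plain dicts (for example from line_df.to_dict("records")). The helper name lines_to_tsv and the way the sample sentence is split over two lines are illustrative assumptions, not the series code.

```python
def lines_to_tsv(lines):
    """lines: iterable of dicts with page_num, line_num, text. Returns a TSV string."""
    rows = ["page_num\tline_num\ttext"]
    for line in lines:
        # Collapse tabs and newlines inside the text so the columns stay aligned.
        text = " ".join(str(line["text"]).split())
        rows.append(f"{line['page_num']}\t{line['line_num']}\t{text}")
    return "\n".join(rows)

# Illustrative rows; the line break position is invented.
sample = [
    {"page_num": 6, "line_num": 31, "text": "There are many choices of positional"},
    {"page_num": 6, "line_num": 32, "text": "encodings, learned and fixed [9]."},
]
print(lines_to_tsv(sample))
```

Collapsing inner whitespace is a design choice of this sketch: a stray tab inside a line would otherwise shift the text into a fourth column and confuse the coordinates.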

This is what makes the answer grounded: the schema forces the model to fill in (start_page, start_line, end_page, end_line), a verbatim quote, and caveats if anything is uncertain. No prose, only a typed object with citations.

We call answer = llm_answer_with_evidence(question, filtered_line_df, client=client) and get back an AnswerWithEvidence instance, rendered below as a styled JSON image so the field labels stay legible.

def llm_answer_with_evidence(question, filtered_text_prompt):
    # Structured output: the SDK validates the reply against AnswerWithEvidence.
    resp = client.responses.parse(
        model=model_chat,
        input=[
            {
                "role": "system",
                "content": (
                    "Answer using ONLY the provided lines. "
                    "Return JSON only."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Lines:\n{filtered_text_prompt}\n\n"
                    f"Question:\n{question}\n\n"
                    "Pick a contiguous evidence span."
                ),
            },
        ],
        text_format=AnswerWithEvidence,
        store=False,
    )
    # Raw JSON string; passage_lines_df_from_answer parses it with json.loads.
    return resp.output_text

We call answer = llm_answer_with_evidence(question, filtered_line_df, client=client) and get back an AnswerWithEvidence instance.

{
  "answer": "The options for positional encoding mentioned are learned positional embeddings and fixed positional encodings (specifically, using sine and cosine functions of different frequencies).",
  "start_page_num": 6,
  "start_line_num": 31,
  "end_page_num": 6,
  "end_line_num": 32,
  "confidence": 0.98,
  "justification": "Lines 31–32 explicitly state: 'There are many choices of positional encodings, learned and fixed [9].' Additionally, further lines detail the sinusoidal encoding as the fixed choice, and Table 3 row (E) discusses using learned embeddings instead.",
  "quotes": [
    "There are many choices of positional encodings, learned and fixed [9]."
  ],
  "caveats": [
    "Further details about the specific implementation of learned embeddings are only touched on elsewhere, but both options are mentioned here."
  ],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "learned positional embeddings",
    "fixed positional encodings",
    "sinusoidal positional encoding"
  ]
}

Three things happened that matter:

The answer is correct. Both options identified, paraphrased correctly.
The evidence span (page 6, lines 26-44) points to a specific region. Not “somewhere on page 6”. Exact lines.
The model couldn’t have hallucinated a citation: it only saw lines from the retrieved pages, and the schema forced a real (page, line) range we can verify.

If the model can’t fill the schema, null fields are allowed and caveats records why. Article 8 develops the schema into a much richer form with per-brick feedback fields; Article 23 builds the storage architecture around it.
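“A range we can verify” can be taken literally. The sketch below is one way to check an answer against the lines the model was given; check_evidence is a hypothetical helper, the answer is assumed to be a plain dict, and the lines are assumed to be in document order.

```python
def check_evidence(answer, lines):
    """answer: dict with the span fields and quotes.
    lines: dicts with page_num, line_num, text, in document order.
    Returns a list of problems; an empty list means the citation checks out."""
    span = (
        answer.get("start_page_num"), answer.get("start_line_num"),
        answer.get("end_page_num"), answer.get("end_line_num"),
    )
    if all(v is None for v in span):
        return []  # null path: the model declined to cite
    if any(v is None for v in span):
        return ["evidence span is only partially filled"]
    start, end = (span[0], span[1]), (span[2], span[3])
    if start > end:
        return ["span start comes after span end"]
    # Tuple comparison orders by page first, then by line.
    cited = [ln for ln in lines if start <= (ln["page_num"], ln["line_num"]) <= end]
    if not cited:
        return ["span matches no retrieved line"]
    # Normalize whitespace so a quote that straddles a line break still matches.
    passage = " ".join(" ".join(ln["text"].split()) for ln in cited)
    problems = []
    for quote in answer.get("quotes", []):
        if " ".join(quote.split()) not in passage:
            problems.append(f"quote not found in cited lines: {quote[:40]!r}")
    return problems
```

A check like this does not prove the answer is right. It only confirms that the cited lines exist among the retrieved ones and that the quotes really appear in them, which is the part a reviewer would otherwise verify by hand.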

Sanity check. On a paper this short we can also send the entire line_df to the LLM with no retrieval and check that the answer matches. Reassuring here, won’t scale to large documents.

{
  "answer": "The options mentioned for positional encoding are sinusoidal positional encodings (using sine and cosine functions of different frequencies) and learned positional embeddings.",
  "start_page_num": 6,
  "start_line_num": 27,
  "end_page_num": 6,
  "end_line_num": 41,
  "confidence": 0.99,
  "justification": "Lines 6:27-6:41 describe adding 'positional encodings' to the input embeddings, specify the sinusoidal method, and mention experimenting with learned positional embeddings, stating both options were tried and produced nearly identical results.",
  "quotes": [
    "Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add 'positional encodings' to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension dmodel as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9]. In this work, we use sine and cosine functions of different frequencies: ... We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training."
  ],
  "caveats": [
    "Exact mathematical formulas for sinusoidal encoding are present here, but full details for learned embeddings are not. Table 3 row (E) and further details may expand on results but are not needed for the options question."
  ],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "sinusoidal positional encoding",
    "learned positional embeddings",
    "sine and cosine functions",
    "relative or absolute position"
  ]
}

2.5 PDF annotation on the source PDF

Now the satisfying part. We use the evidence span to draw rectangles directly on the source PDF.

In: the source PDF and the evidence span from the AnswerWithEvidence.

Out: an annotated PDF with rectangles drawn around the cited lines.

Optional. A CLI tool, a batch job, or an API may skip it; the answer with citations is already complete after section 2.4.

Three calls do the work:

passage_lines_df_from_answer(line_df, answer) rebuilds the cited-line DataFrame from the evidence span.
passage_bbox_by_page(passage_df) groups bounding boxes per page.
draw_passage_rectangles(pdf_path, bboxes_df, out_pdf_path) writes the annotated PDF.

*One bounding box per cited page, wrapping every cited line on that page – Image by author*

*PDF annotation in three steps: expand the span, union per page, draw rectangles – Image by author*

import json

def passage_lines_df_from_answer(line_df, answer_json):
    """Rebuild the cited lines from the evidence span in the answer JSON."""
    a = json.loads(answer_json)
    sp, sl = a["start_page_num"], a["start_line_num"]
    ep, el = a["end_page_num"], a["end_line_num"]
    if sp is None:
        return line_df.iloc[0:0]  # null path: no evidence, empty DataFrame
    # Pages strictly inside the span keep all lines; the first and last
    # page are trimmed at start_line and end_line.
    mask = (
        line_df["page_num"].between(sp, ep)
        & ((line_df["page_num"] != sp) | (line_df["line_num"] >= sl))
        & ((line_df["page_num"] != ep) | (line_df["line_num"] <= el))
    )
    return line_df.loc[mask].copy()

def passage_bbox_by_page(passage_df):
    """One wrapping rectangle per page: min of the top-left, max of the bottom-right."""
    return passage_df.groupby("page_num", as_index=False).agg(
        x0=("x0", "min"), y0=("y0", "min"),
        x1=("x1", "max"), y1=("y1", "max"))

def draw_passage_rectangles(pdf_path, bboxes_df, out_path):
    doc = fitz.open(pdf_path)
    for _, r in bboxes_df.iterrows():
        page = doc[int(r["page_num"]) - 1]  # page_num is 1-based
        page.add_rect_annot(fitz.Rect(r["x0"], r["y0"], r["x1"], r["y1"]))
    doc.save(out_path)

*Attention paper page 6 with cited paragraph highlighted, next to question and answer – Image by author*

The passage really is where the answer comes from. The purple box wraps the Positional Encoding paragraph: the sentence that introduces the choice (“we use sine and cosine functions of different frequencies”) and the two-line formula directly below it. The reader can move from the chat answer to the citation to the source paragraph without leaving the same screen. That’s the whole point.

Why a box around the whole paragraph and not the exact words? Because we worked at the line granularity: line_df carries one bounding box per text line, the LLM cites a (start_line, end_line) span, and passage_bbox_by_page collapses every line in that span into one wrapping rectangle. If you want to draw the box around the exact words sin(pos / 10000^(2i/d_model)) instead of the whole paragraph, the process is the same. Just change the granularity. Replace line_df with a word-level word_df (PyMuPDF’s page.get_text("words") gives you a bounding box per word), make the schema cite (start_word, end_word), and passage_bbox_by_page already does the right thing. Same four-brick pipeline, finer scope.
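To give a feel for the word-level variant, here is a minimal sketch of the union step on word tuples. The tuples follow the layout PyMuPDF returns from page.get_text("words"), namely (x0, y0, x1, y1, text, block_no, line_no, word_no); the helper name union_word_boxes, the positional start_word and end_word indices, and the coordinates are assumptions for illustration.

```python
def union_word_boxes(words, start_word, end_word):
    """words: tuples shaped like PyMuPDF's get_text("words") output:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    Returns the rectangle (x0, y0, x1, y1) wrapping words[start_word..end_word],
    both ends inclusive, or None when the span is empty."""
    selected = words[start_word:end_word + 1]
    if not selected:
        return None
    return (
        min(w[0] for w in selected), min(w[1] for w in selected),
        max(w[2] for w in selected), max(w[3] for w in selected),
    )

# Made-up coordinates for four words on one line.
words = [
    (72.0, 100.0, 90.0, 112.0, "we", 0, 0, 0),
    (93.0, 100.0, 110.0, 112.0, "use", 0, 0, 1),
    (113.0, 100.0, 135.0, 112.0, "sine", 0, 0, 2),
    (138.0, 100.0, 155.0, 112.0, "and", 0, 0, 3),
]
print(union_word_boxes(words, 1, 2))
```

One caveat with this sketch: a span that runs over several lines would produce a single rectangle covering the full width of all of them, so grouping by line before taking the union is probably what you want in practice.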

3. Chaining the bricks, and testing the pipeline

3.1 The whole pipeline as one function

The bricks chain into a single call. Feed in a PDF and a question; get back a typed answer with line citations, and optionally an annotated PDF.

In: a PDF path and a text question (plus an optional top_k and an optional output PDF path).

Out: an AnswerWithEvidence, and (if annotate_pdf is given) an annotated PDF on disk.

Inside, pdf_qa_baseline chains document parsing → question parsing → retrieval → generation → PDF annotation. What crosses the retrieval → generation boundary is just the page numbers; the filtered line_df is rebuilt inside generation.

def pdf_qa_baseline(
    pdf_path: str,
    question: str,
    top_k: int = 3,
    annotate_pdf: str | None = None,
):
    # 1. Parsing
    line_df = fitz_pdf_to_line_df(pdf_path)

    # 2. Retrieval
    page_df = embed_page_df(build_page_df(line_df))
    _, filtered = retrieve_pages(page_df, line_df, question, top_k)

    # 3. Generation
    answer = llm_answer_with_evidence(question, filtered)

    # 4. Optional highlighting on the source PDF
    if annotate_pdf is not None:
        passage = passage_lines_df_from_answer(line_df, answer)
        bboxes = passage_bbox_by_page(passage)
        draw_passage_rectangles(pdf_path, bboxes, annotate_pdf)

    return answer

{
  "answer": "The options mentioned for positional encoding are learned and fixed positional encodings, specifically sinusoidal positional encodings (using sine and cosine functions of different frequencies) and learned positional embeddings.",
  "start_page_num": 6,
  "start_line_num": 31,
  "end_page_num": 6,
  "end_line_num": 41,
  "confidence": 0.99,
  "justification": "Lines 31-41 discuss the choices for positional encodings, stating that there are many choices including learned and fixed encodings. It then explains the use of sine and cosine functions (sinusoidal encoding) and notes that learned positional embeddings were also experimented with.",
  "quotes": [
    "There are many choices of positional encodings, learned and fixed [9].",
    "In this work, we use sine and cosine functions of different frequencies: ...",
    "We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E))."
  ],
  "caveats": [],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "positional encodings",
    "learned",
    "fixed",
    "sinusoidal",
    "sine and cosine functions",
    "learned positional embeddings"
  ]
}

This is the API of the article. Later articles build a sister function ask_corpus(question, corpus, ...) for archive-scale work: same contract (typed answer with citations), different scope (filter the corpus first, then run document-level work on the matching documents).

3.2 Try it on a different document

Drop in any PDF you have around: a paper from your own field, a contract, a report from work. Here we pick the World Bank’s April 2026 Commodity Markets Outlook (World Bank publication, April 2026 issue; CC BY 3.0 IGO, as declared on the World Bank Open Knowledge Repository publication page for this issue): a 69-page report on energy, agriculture, and fertilizer markets, far from a research paper in tone and structure.

Same four bricks, same default prompts, same retrieve_pages, same schema. Nothing about the pipeline changes for a new document.

We start with a question whose answer lives deep in the report, in the metals chapter rather than the Executive Summary: the outlook for aluminum prices in 2026.

We call pdf_qa_baseline end-to-end: pass the CMO PDF, the aluminum question, top_k=3, and an annotate_pdf path so the pipeline also writes the highlighted source. The returned answer_cmo_al is the same AnswerWithEvidence shape we saw on the Attention paper.

{
  "answer": "Aluminum prices are projected to rise by about 22 percent in 2026 (y/y) to reach an all-time high—about 21 percent higher than their January 2026 projections—supported by tight supply conditions and solid demand growth. Prices are expected to decline by about 6 percent in 2027 as supply conditions gradually ease.",
  "start_page_num": 45,
  "start_line_num": 32,
  "end_page_num": 45,
  "end_line_num": 43,
  "confidence": 0.98,
  "justification": "The selected span explicitly provides the projected percentage increase for aluminum prices in 2026, the context for these movements, and the outlook for 2027. It also mentions the record-high level forecast and factors driving the price.",
  "quotes": [
    "Aluminum prices are projected to rise by about 22 percent in 2026 (y/y) to reach an all-time high—about 21 percent higher than their January 2026 projections—supported by tight supply conditions and solid demand growth (table 1).",
    "Prices are expected to decline by about 6 percent in 2027 as supply conditions gradually ease."
  ],
  "caveats": [],
  "complete_answer_found": true,
  "context_structured": true,
  "llm_discovered_keywords": [
    "all-time high",
    "tight supply conditions",
    "solid demand growth"
  ]
}

The composite view places the highlighted source page next to the question and the answer, so the citation can be checked at a glance:

A harder question on the same report. What if we ask about something the report mentions only in passing? We try the AI-related electricity demand question, whose answer the World Bank developed only in an “Upside risk” sidebar on page 31.

Same call shape, harder question: pdf_qa_baseline(pdf_path=pdf_path_cmo, question=question_cmo_ai, top_k=3, ...). The pipeline must decide whether the retrieved pages actually carry the AI-electricity figure or whether to flag the answer as not found.

{
  "answer": "The provided lines mention that faster-than-anticipated expansion of AI-related data centers could boost demand for certain metals like aluminum and copper, but do not quantify the contribution of AI-related data centers to global electricity demand growth.",
  "start_page_num": 47,
  "start_line_num": 39,
  "end_page_num": 47,
  "end_line_num": 40,
  "confidence": 0.8,
  "justification": "The only mention of AI-related data centers is in relation to demand for metals, not electricity demand. There is no quantitative estimate or percentage given for their impact on global electricity demand growth.",
  "quotes": [
    "Also, faster-than-antici-\npated expansion of AI-related data centers could \nboost demand for aluminum and copper, driving \nprices higher."
  ],
  "caveats": [
    "No specific figures or direct statements about global electricity demand growth caused by AI-related data centers were found in the provided lines."
  ],
  "complete_answer_found": false,
  "context_structured": true,
  "llm_discovered_keywords": [
    "AI-related data centers",
    "electricity demand growth",
    "boost demand for aluminum and copper"
  ]
}

*CMO page 47, null-path response: the schema refused to fabricate when the answer wasn’t there – Image by author*

But how can we be sure the answer really doesn’t exist in the document? Strictly, we can’t, at least not from this null path alone. What the schema says is “the LLM didn’t find the answer in the lines it was shown”, which is a different claim from “the answer is not in the document”. The Upside-risk sidebar on page 31 of the same CMO report does quantify the figure (the World Bank cites the IEA’s 8% projection of global electricity demand growth from 2024 to 2030). The default keyword pipeline pulled page 47 and nearby pages instead, where the report’s prose discusses AI’s effect on metal demand. Proving absence would require either running the LLM on every page, or a retrieval method that surfaces sidebar text and short reference mentions. That’s exactly what Article 7 (Retrieval) develops; for the minimal version, “I didn’t find it in the top three pages” is what we report.
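The gap between “not found in the lines shown” and “not in the document” can be made visible with a cheap cross-check: scan every page for the question keywords and list the pages that match but were never passed to the LLM. The sketch below is an assumption-laden illustration, not part of the article's pipeline: unseen_pages_with_hits is a hypothetical helper, and the page texts are invented placeholders rather than quotes from the report.

```python
def unseen_pages_with_hits(page_texts, keywords, retrieved_pages):
    """Return {page_num: matched keywords} for pages the LLM never saw.

    page_texts: {page_num: text}; keywords: list of strings;
    retrieved_pages: page numbers that were passed to generation.
    """
    shown = set(retrieved_pages)
    hits = {}
    for page_num, text in sorted(page_texts.items()):
        if page_num in shown:
            continue
        lowered = text.lower()
        matched = [kw for kw in keywords if kw.lower() in lowered]
        if matched:
            hits[page_num] = matched
    return hits

# Invented page texts, for illustration only.
page_texts = {
    31: "Upside risk: data centers and electricity demand.",
    46: "Base metals outlook.",
    47: "AI-related data centers could lift metal demand.",
    48: "Precious metals outlook.",
}
unseen = unseen_pages_with_hits(
    page_texts,
    keywords=["data centers", "electricity demand"],
    retrieved_pages=[46, 47, 48],
)
print(unseen)  # {31: ['data centers', 'electricity demand']}
```

A non-empty result does not prove the answer exists, but it does mean a “not found” reply should be reported as “not found in the top-k pages” rather than read as absence.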

3.3 More questions in one table

A small batch of four questions on the same two documents, all results in one table. Read the table for patterns, not for every cell.

Numeric value: learning rate of the base Transformer. Specific number, expected page 7 (section 5.3 on Adam optimizer).
No answer in document: chemical composition of seawater. The schema’s null path should fire; both retrievers will pull random-looking pages.
Different topic on CMO: outlook for urea prices. Same pipeline on the fertilizer section of the World Bank report, far from the AI sidebar.
Compound question: d_k and d_v in the Transformer. Two values asked at once. Also tests the table-parsing limit (the values live in Table 1 page 4, parsed as flat lines).

def run_pipeline_test(
    question: str,
    line_df_in: pd.DataFrame,
    page_df_in: pd.DataFrame,
    page_df_emb_in: pd.DataFrame,
    top_k: int = 3,
    client=client,
) -> dict:
    """Run both retrievers + generation on one question; return a summary dict."""
    parsed_q = get_keywords_from_question(question, client=client)
    retrieved_emb_df, _ = retrieve_pages_by_similarity(
        page_df_emb_in, line_df_in, question, top_k=top_k, client=client,
    )
    retrieved_kw_df, filtered_lines_kw = retrieve_pages(
        page_df_in, line_df_in, parsed_q.keywords, top_k=top_k,
    )
    # If keyword retrieval finds nothing, fall back to the whole document so generation
    # still runs (small PDFs only: would not scale to a real corpus).
    lines_for_generation = (
        filtered_lines_kw if len(filtered_lines_kw) > 0 else line_df_in
    )
    answer = llm_answer_with_evidence(
        question, lines_for_generation, client=client,
    )
    return {
        "question": question,
        "keywords": parsed_q.keywords,
        "emb_top3": retrieved_emb_df["page_num"].tolist(),
        "kw_top3": (
            retrieved_kw_df["page_num"].tolist()
            if len(retrieved_kw_df) > 0 else "(no kw match)"
        ),
        "answer_excerpt": (answer.answer[:80] + ("..." if len(answer.answer) > 80 else "")),
        "cite_page": answer.start_page_num,
    }

*Same pipeline on four questions: two succeed, one refuses cleanly, one trips on table parsing – Image by author*

Read the table left-to-right per row. Four patterns to take away:

Keywords beat embeddings on the learning rate row. The base Transformer’s training schedule is on page 7 (section 5.3, Optimizer). Embeddings rank pages 8/9/10; page 7 is not in the top three. The keyword retriever finds page 7 directly via the literal phrase learning rate. Same lesson as the epsilon row in section 2.3.c: when the question depends on a precise term the document prints verbatim, keywords are the better tool.
Both retrievers fail on the seawater row, and the failure is visible. The PDF has nothing to say about seawater. The keyword column shows (no kw match) outright, with no false ‘top-3 pages’ that look plausible. The schema then returns a null answer with a caveat. A clean ‘I don’t know’ is the system’s most valuable behavior on out-of-scope questions.
Both retrievers work on the urea row. The CMO has a fertilizer section; embeddings and keywords both bring back page 42, and generation cites it correctly. Cross-domain pipelines work as long as the question’s vocabulary lands on the document.
The d_k and d_v compound row exposes the table-parsing limit. The two values live in Table 1, page 4 of the Transformer paper, where each row lists d_model, h, d_k, d_v, and so on. Our parser flattened the table into plain lines, so a model asked for two cells side by side has to reassemble the row from text alone. Keywords retrieve page 4 (the literal word d_k appears there), but the citation often points to one value while the other is paraphrased. The fix is structural: parse tables as tables, not as lines. That’s Article 5 (parsing) and Article 6 (compound-question decomposition) doing their job.

4. The questions each block raises

What this minimal system does well:

A real, verifiable answer. A structured object with the answer, the page, the lines, the quote. The user can check the citation in seconds.
“Not found” handled cleanly. When the answer isn’t in the retrieved lines, the schema allows null fields and the caveats field says why. No fabrication.
The answer linked to the source. The highlighted PDF closes the loop between the LLM’s claim and the document. This is what separates a useful RAG system from a chatbot that happens to read documents.
Easy to follow. Each function does one thing. No hidden state, no framework magic. When something goes wrong, debugging is reading the code.
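Because the answer carries both quotes and a line span, the citation check a user does by eye can also be sketched in code. The helper below is hypothetical (unverified_quotes is not a function from the article) and makes two assumptions: a quote counts as verified only if it appears verbatim after whitespace normalization, and words hyphenated across a line break are rejoined. The first quote and the cited lines are adapted from the CMO example; the second quote is invented to show a failure.

```python
import re

def normalize(text):
    """Lowercase, rejoin words hyphenated across line breaks, collapse whitespace."""
    text = re.sub(r"-\s*\n\s*", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def unverified_quotes(quotes, cited_lines):
    """Return the quotes that do not appear verbatim in the cited lines."""
    haystack = normalize("\n".join(cited_lines))
    return [q for q in quotes if normalize(q) not in haystack]

cited_lines = [
    "Also, faster-than-antici-",
    "pated expansion of AI-related data centers could",
    "boost demand for aluminum and copper, driving",
    "prices higher.",
]
quotes = [
    "faster-than-anticipated expansion of AI-related data centers",
    "AI data centers will double electricity demand",  # invented, not in the lines
]
failed = unverified_quotes(quotes, cited_lines)
print(failed)  # ['AI data centers will double electricity demand']
```

This stays deliberately strict: a quote the model shortened with "..." or a genuine compound word split at a line end would be flagged, which is a reasonable default when the goal is to catch paraphrased citations.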

Now look at the same system again. Each block hides assumptions worth questioning.

4.1 Document parsing: we just read lines

We extracted text line by line. That’s reasonable for an academic paper, but look at what we threw away: section structure, headings, table layouts, figures, footnotes, cross-references. Page 4 of this paper contains Table 1 with the per-layer complexities. We parsed each of its rows as plain lines, losing the table structure entirely. Page 9 contains Table 3, the ablation study. Same problem.

For a question like “What are the options for positional encoding?” this doesn’t matter. The answer is in continuous prose. For a question like “What is the per-layer complexity of self-attention?” it suddenly does, because the answer lives in a table cell that our parser flattened into noise.

That’s the topic of Article 5: Parsing. Documents have structure. Ignoring it is the single largest source of downstream failure.

4.2 Question parsing: we asked for keywords, but only keywords

Our question-parsing step extracts a flat list of keywords. That works on a clean question against an academic paper. It starts to break down as soon as questions get harder.

Three things this minimal version doesn’t do.

It doesn’t detect intent. “Summarize chapter 3”, “Translate this clause into French”, “Compare X and Y” each call for a different downstream pipeline. A single keywords field can’t carry that signal.

It doesn’t decompose compound questions. “What are the exclusions and the deductible?” parsed as a flat keyword list pollutes the retrieval (the keywords for “exclusions” and “deductible” pull in two different scopes that interfere). Article 6 walks through how to detect compound questions, decide whether to decompose, and route the sub-questions independently.

It doesn’t detect an expected answer shape. “What is the premium amount?” wants a number with a currency. “What are the obligations?” wants a list. “Compare the two policies” wants a table. The minimal version treats every answer as free text. Article 6 introduces the expected_answer_shape field that drives the generation template downstream.

That’s the topic of Article 6: Question Parsing. The same brick, much richer JSON.
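To make “much richer JSON” concrete, here is one possible shape for the richer object, written as a standard-library dataclass so it runs without dependencies (the article's own schema uses Pydantic). Only keywords and expected_answer_shape are names taken from the text; RichParsedQuestion, intent, sub_questions, and the allowed values are illustrative assumptions, not the series' actual schema.

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class RichParsedQuestion:
    """Hypothetical richer ParsedQuestion; most field names are assumptions."""
    question: str
    keywords: list[str] = field(default_factory=list)
    # Which downstream pipeline the question calls for.
    intent: Literal["lookup", "summarize", "translate", "compare"] = "lookup"
    # Filled only when the question bundles several independent asks.
    sub_questions: list[str] = field(default_factory=list)
    # Drives the generation template: free text, a number, a list, or a table.
    expected_answer_shape: Literal["free_text", "number", "list", "table"] = "free_text"

    @property
    def is_compound(self) -> bool:
        return len(self.sub_questions) > 1

parsed = RichParsedQuestion(
    question="What are the exclusions and the deductible?",
    keywords=["exclusions", "deductible"],
    sub_questions=["What are the exclusions?", "What is the deductible?"],
)
print(parsed.is_compound, parsed.intent)
```

With a flag like is_compound, a router can send each sub-question through retrieval on its own instead of mixing both keyword sets into one query.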

4.3 Chunking: we aggregated by page

We chose pages as the unit of retrieval. Why pages? Why not paragraphs, or sections, or fixed-size chunks of 512 tokens like every standard RAG tutorial recommends?

The answer is that page-level aggregation happens to work for this paper because pages roughly align with semantic units. On a contract, on a legal text, on a technical manual with numbered clauses, pages are arbitrary cuts and you’d want clause-level or section-level chunks instead. The “right” chunking depends on the document and the question, not on a default value.
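As a rough illustration of clause-level chunking, the sketch below groups lines by the numbered clause they fall under instead of by page. It assumes clauses start with a number such as "1." or "2.3" at the beginning of a line, which real contracts do not always respect; chunk_by_clause and the sample lines are invented for the example.

```python
import re

# Assumption: a clause heading starts with "1.", "2.3", etc. at line start.
CLAUSE_START = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")

def chunk_by_clause(lines):
    """Group (page_num, text) lines into chunks keyed by clause number.

    Lines before the first numbered clause go into a "preamble" chunk.
    """
    chunks = {}
    current = "preamble"
    for page_num, text in lines:
        match = CLAUSE_START.match(text)
        if match:
            current = match.group(1)
        chunks.setdefault(current, []).append((page_num, text))
    return chunks

# Invented contract lines, for illustration only.
lines = [
    (1, "SERVICE AGREEMENT"),
    (1, "1. Definitions"),
    (1, "In this agreement, the following terms apply."),
    (1, "2. Exclusions"),
    (2, "The policy does not cover flood damage."),
    (2, "3. Deductible"),
]
chunks = chunk_by_clause(lines)
print(list(chunks))   # ['preamble', '1', '2', '3']
print(chunks["2"])    # clause 2 spans pages 1 and 2
```

Clause 2 starts on page 1 and continues on page 2, so a page-level unit would split it; the clause-level unit keeps it whole.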

The temptation, when a fixed-size approach starts failing, is to grid-search over chunk sizes and overlaps. That’s the machine learning reflex. It’s the wrong frame for what is actually a structural decision. Article 3: RAG Is Not Machine Learning, and the Six-Month Mistake of Treating It Like One makes that case in full.

4.4 Retrieval: keyword matching is transparent, but blind to vocabulary

Our retrieval just worked. Page 6 came back with the matched keyword, ahead of the rest, and the Positional Encoding section is on page 6. Anyone can look at the match table and see why. That’s the trade we made: the simplest possible retrieval, completely auditable.

The trade has a cost. Keyword matching is blind whenever the question’s vocabulary doesn’t match the document’s. Three failure modes show up immediately on the same paper.

Symbol vs word. Ask “What is the value of epsilon used in label smoothing?” The keywords from question parsing are likely something like ["epsilon", "label smoothing"]. The actual answer (ε_ls = 0.1) sits on page 8, but the document writes it as the Greek letter ε, never the English word “epsilon”. The substring check returns zero on the symbol-only page; only the literal phrase label smoothing lands on page 8.

Synonym mismatch. Ask “How does the model know the order of words in a sentence?” The keywords might be ["word order", "sentence order"]. The document calls this positional encoding. None of the question’s keywords appear on page 6. The retriever picks pages that happen to mention “order” or “sentence” in passing, none of which contain the answer.

Paraphrase. Ask “What attention mechanism does the encoder use?” The document says self-attention and Multi-Head Attention, never the phrase “attention mechanism the encoder uses”. The keywords pulled from the question, even after expansion, may or may not include the document’s exact phrasing. When they do, retrieval works. When they don’t, it silently degrades.
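The symbol-vs-word failure is easy to reproduce in isolation. The sketch below assumes the keyword retriever counts case-insensitive substring matches per page, in the spirit of the substring check mentioned above; keyword_hits, expand, and the SYMBOL_ALIASES glossary are hypothetical names, and the page snippets are paraphrased stand-ins rather than exact lines from the paper.

```python
def keyword_hits(page_text, keywords):
    """Count how many keywords appear in the page as case-insensitive substrings."""
    lowered = page_text.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered)

# Hypothetical glossary mapping a spelled-out name to the symbol the paper prints.
SYMBOL_ALIASES = {"epsilon": ["ε"]}

def expand(keywords):
    """Append known symbol aliases to the keyword list."""
    out = list(keywords)
    for kw in keywords:
        out.extend(SYMBOL_ALIASES.get(kw, []))
    return out

# Paraphrased snippets, for illustration only.
symbol_only = "We employed a smoothing value of ε_ls = 0.1."
with_phrase = "Label smoothing: we employed label smoothing of value ε_ls = 0.1."
keywords = ["epsilon", "label smoothing"]

print(keyword_hits(symbol_only, keywords))          # 0: "epsilon" never appears
print(keyword_hits(with_phrase, keywords))          # 1: only "label smoothing" matches
print(keyword_hits(symbol_only, expand(keywords)))  # 1: the alias "ε" now matches
```

The alias table is the smallest version of the glossary idea: the matching logic stays auditable, and only the vocabulary fed into it grows.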

The first two failures are so common that the rest of the series spends two articles on them.

Article 6: Question Parsing turns the keyword extraction into a much richer step that draws from a domain glossary, expands synonyms, and includes likely document phrasings rather than the question’s literal words.
Article 2: Embeddings introduces vector representations that match across surface vocabulary: where embeddings shine (synonyms, paraphrase, misspellings, cross-lingual matching), where they quietly fail (negation, exact values, internal acronyms, polysemic terms), and how to combine them with keyword matching for the best of both worlds.
Articles 7 and 9 put the resulting hybrid retrieval into a real document index.

The right answer is to combine, not pick a winner. The two methods fail on almost opposite cases: embeddings stumble when the question depends on a precise symbol, named term, or exact value; keywords stumble when the asker’s vocabulary doesn’t literally appear in the document. Running both retrievers, taking the union of their candidates, and (optionally) re-ranking with a cross-encoder is the standard hybrid recipe. Article 2 develops it; Articles 7 and 9 wire it into a corpus.
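The “union of candidates” step can be sketched in a few lines. This is one simple way to merge two ranked page lists (interleave, then drop duplicates), not necessarily the merge the later articles use, and it leaves out the optional cross-encoder re-ranking. union_candidates is a hypothetical helper; the first call reuses the page numbers reported for the learning-rate row, while the second uses invented rankings.

```python
from itertools import zip_longest

def union_candidates(emb_pages, kw_pages):
    """Interleave two ranked page lists and drop duplicates, keeping first rank."""
    merged, seen = [], set()
    for pair in zip_longest(emb_pages, kw_pages):
        for page in pair:
            if page is not None and page not in seen:
                seen.add(page)
                merged.append(page)
    return merged

# Learning-rate row: embeddings rank 8/9/10, keywords find page 7.
print(union_candidates([8, 9, 10], [7]))          # [8, 7, 9, 10]
# Invented rankings with an overlap on the first page.
print(union_candidates([42, 41, 43], [42, 44]))   # [42, 41, 44, 43]
```

On the learning-rate row the union recovers page 7, which the embedding retriever alone missed, at the cost of a longer candidate list for generation to read.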

The minimal version stays single-retriever because it teaches the right reflex first: the retriever must be auditable. Keyword matching makes that reflex concrete (you can see exactly which words landed on which page). Once that reflex is in place, embeddings become a controlled addition rather than an opaque default, and combining the two becomes a deliberate engineering choice rather than a trend.

4.5 Generation: we asked for sources, and we got them

This is the block that worked best, almost too easily. We defined a Pydantic schema with start_page_num, start_line_num, end_page_num, end_line_num, confidence, justification, quotes, and caveats, and the model filled it in correctly.

How much more can we ask? A structured comparison for comparative questions, a list of conflicts if the document contradicts itself, multiple citations from multiple parts of the document, a confidence breakdown per claim. Yes to all of the above. The generation step is far more controllable than most teams realize. Article 8: Generation as Controlled Execution explores this in depth.

5. The shape of what comes next

This minimal pipeline is the spine of everything that follows. Each part of the series goes deep on one of the questions raised above.

The mistakes that kill most projects come from getting the wrong picture of one of these blocks: RAG isn’t ML (Article 3), embeddings aren’t magic (Article 2), not all RAG problems look the same (Article 4). That’s Part I.

Each brick then gets its own deep dive: document parsing, question parsing, retrieval, generation. That’s Part II, the four bricks.

Once the blocks are solid, we recombine them for cases that look like production: long documents, justification and absence handling, table-of-contents-driven retrieval, listing questions, structured extraction, the composite pipeline. That’s Part III.

Then we change scale. From one document to many. From a single paper to an archive of hundreds or thousands of documents. The architecture changes substantially. That’s Part IV.

Finally, what it takes to operate the system in production: evaluation, cost and monitoring, security and compliance, the architecture of the codebase itself. That’s Part V.

The blocks don’t change. Their internals do.

A few framing notes:

The four bricks (Part II) are the conceptual core. Most of the rest of the series is about doing each one better. Part III and Part IV are recombinations: the same four ideas at different scales and for different question types.
The series scope is enterprise documents. Contracts, technical specifications, regulatory filings, internal procedures: all carry structure (TOC, sections, tables) and bounded vocabulary (industry jargon, expert terms). RAG works on these corpora because of that structure, not heroic embedding tricks. Documents with no structure (novels, long unstructured transcripts) and questions that require intent rather than locating a passage are out of scope; Article 4 returns to where the line falls.
Code is illustrative, not production-ready. What you’ve read works on a real PDF, but lacks the error handling, validation, caching, cost controls, monitoring, and security a production system needs. Each gets its own article.

Here’s the explicit map from this minimal system to the rest of the series:

PDF parsing throws away structure → Article 5, Article 10
Question parsing needs more than keywords (intent, decomposition, expected answer shape) → Article 6
Chunking strategy isn’t a hyperparameter → Article 3
Question vocabulary doesn’t match document terms → Article 2, Article 6
Retrieval picks the wrong page → Article 7, Article 9
Model paraphrases its citation → Article 8, Article 21
“Not found” needs nuance → Article 4
Compound, listing, comparison, summarization questions → Article 6, Articles 11-13
Multi-document corpus → Part IV (Articles 15-20)
Production, evaluation, security, architecture → Part V (Articles 21-25)

You can read this

6. Conclusion

A hundred lines of Python and a Pydantic schema are enough to ship a working RAG system on a real PDF. What makes the system trustworthy is not the line count: it’s the structured answer with line-level citations, the schema’s null path that refuses to fabricate, and the PDF highlight that ties every claim back to its source. The four bricks (parsing, question parsing, retrieval, generation) are the conceptual core; everything that follows in the series is about doing each one better.

The minimal version is a baseline, not a destination. The next article tackles the misunderstanding that wrecks the most RAG projects: that RAG is a machine learning problem. It’s not.

7. Sources and further reading

The structured-output-with-citations framing this article uses for AnswerWithEvidence is the same direction as Bohnet et al. (Attributed Question Answering, 2022). The full production-grade equivalent of this kind of pipeline shows up in Anthropic’s Contextual Retrieval (Sept 2024), which Article 9 will preview. The term RAG itself comes from Lewis et al. (2020). Volume 3 (Agentic Bricks) returns to the agentic upgrade path on top of the four bricks defined here.

Same direction as the article:

Bohnet et al., Attributed Question Answering, 2022 (arXiv:2212.08037). Structured output with citations as the trust mechanism; the closest published idea behind the AnswerWithEvidence schema.
Anthropic, Contextual Retrieval (Sept 2024 engineering post). Production-grade “minimal but ship-ready” pipeline; lands on hybrid retrieval + reranking. Article 9 picks up where this one stops.
Asai et al., Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection, ICLR 2024 (arXiv:2310.11511). Same trust-via-structure direction. Self-reflection tokens flag when retrieval helped and when a claim is grounded.
Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020 (arXiv:2005.11401). The paper that named RAG.

Different angle, different context:

Karpukhin et al., Dense Passage Retrieval for Open-Domain QA, EMNLP 2020 (arXiv:2004.04906). Dense retrieval as the production primitive; most “minimal RAG” tutorials descend from this. This article uses keyword matching instead (defended in Article 2).
Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023 (arXiv:2210.03629). Founding paper of agentic RAG. The context is general-purpose tool-picking at runtime. Volume 3 (Agentic Bricks) develops this line on top of the four bricks defined here.
Lee et al., Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?, 2024 (arXiv:2406.13121). The long-context-replaces-retrieval approach. Empirical data on where this works and where it breaks; this article assumes long-context doesn’t replace structured retrieval on enterprise PDFs.
