Context Engineering for RAG : The 4 Typed Inputs Behind Each RAG Reply

companion to Enterprise Doc Intelligence, a collection whose stance is that enterprise RAG amplifies the professional, it doesn’t change them. The structure follows from that: 4 bricks (doc parsing, query parsing, retrieval, era), every emitting typed items that converge on one LLM name. The business now calls that follow context engineering. Scope right here is the single-document case; corpus, dialog, and tool-call extensions are follow-up work.

*the place this text sits within the collection: Article 7bis (context engineering), the reframing companion to the 4 bricks – Picture by writer*

📓 Runnable notebooks are on GitHub: doc-intel/notebooks-vol1.

*The general public companion-code repo at doc-intel/notebooks-vol1 – Picture by writer*

By the point the 4 bricks of a single-document RAG are constructed, the meeting is settled. Parsing produces relational tables. Query parsing produces a typed ParsedQuestion. Retrieval produces a filtered subset of strains, plus an audit of the way it picked them. Era produces a Pydantic reply with cited proof. The entire thing converges on one LLM name, with a set system immediate and a consumer content material assembled from upstream items.

That pipeline has a reputation now. In June 2025 Tobi Lütke tweeted that “immediate engineering” was the fallacious body, and proposed “context engineering” as a substitute: “the artwork of offering all of the context for the duty to be plausibly solvable by the LLM.” Andrej Karpathy endorsed it per week later as “the fragile artwork and science of filling the context window with simply the correct info for the following step.” Inside months the time period was on the duvet of an O’Reilly ebook and structured right into a taxonomy by LangChain.

What follows reads the single-document RAG pipeline via that lens. Every brick emits typed items; the meeting stage threads them into the LLM name; the system immediate stays fastened for caching. Naming the follow doesn’t change the structure. It modifications what to name it when an auditor asks how the system works, and it tells the reader that the structure is the one manufacturing groups converged on in 2025.

1. The title, and what it covers

Immediate engineering used to imply two associated issues. Tuning the wording of 1 immediate to coax higher behaviour, and writing instance photographs so the mannequin knew what good output seemed like. Each are slim. They concern one block of textual content despatched to 1 name.

Context engineering covers every little thing that lands within the mannequin’s context window for one name:

The system immediate (the position, the foundations, the examples).
The retrieved paperwork or rows.
Dialog historical past when there’s one.
Software definitions and their outputs.
Reminiscence, scratchpads, agent state.
Structured metadata concerning the doc, the corpus, the challenge.
The precise consumer enter.

In a long-running agent that calls the mannequin dozens of occasions, the immediate is one among six or eight slots. The remaining comes from someplace upstream: a retriever, a software, a reminiscence retailer, a profile lookup. The self-discipline shifts from “what ought to I write within the immediate” to “what ought to I assemble within the context, the place does each bit come from, and the way do I preserve the meeting secure throughout calls.”

That’s engineering work. It appears to be like like software program structure: typed objects, contracts between parts, audit trails, caching. The 2025 time period is overdue, as a result of the follow was already there within the working manufacturing programs. Lütke and Karpathy named what groups had been already doing.

The collection occurs to have accomplished it from the beginning, brick by brick. The subsequent sections stroll via what every brick contributes to a single-document RAG payload, then via the 4 typed items that land within the LLM name and the code that produces each. The corpus, dialog, and tool-call circumstances come up on the finish as out-of-scope work, with tips that could the place within the collection they are going to be addressed.

*Seven typed bricks feeding the LLM’s context window, grouped by supply: query, paperwork, infrastructure. – Picture by writer*

2. Each brick emits typed context

The 4 bricks emit typed context channels that converge on the meeting band on prime, the place PromptContext, the fastened system immediate, and the consumer template mix earlier than the LLM name. – Picture by writer

The schema above is the recap of what the collection shipped. Every brick is a typed-context emitter. The names on the bins are the precise fields of the particular Pydantic courses and DataFrames the code produces.

Parsing emits relational tables and one synthesis dict. line_df carries one row per line with bbox. page_df carries one row per web page with kind and column depend. toc_df carries the table-of-contents entries with begin web page and depth. image_df carries embedded pictures with phash and metadata. parsing_summary is the doc-level synthesis: doc_type, n_pages, typical_fields, abstract, plus the mechanics fields. The retrieval brick consumes the per-row tables. The query parsing brick consumes the semantic subset of parsing_summary through DocContext.

Query parsing emits a ParsedQuestion. Its fields usually are not free-form. key phrases is a brief record of content material noun phrases for retrieval. intent is a literal label from a set enum that drives form dispatch in era. structural_hints.pages_hint carries pinned pages when the consumer stated “on web page 3”. answer_shape carries the anticipated output form (textual content, quantity, date, record, desk, handle) for the era schema lookup. Every discipline is consumed by a distinct downstream brick. None of them are handed as uncooked strings to the LLM.

Retrieval emits a filtered DataFrame and an audit dict. filtered_line_df is the subset of line_df the era brick sees. anchor_pages is the web page IDs that had been stored and why. The retrieval_audit carries the tactic that gained (key phrase, TOC, LLM arbiter), the LLM TOC reasoning when relevant, and the chosen sections. The filtered body is what the LLM reads. The audit is what an auditor reads.

Era is a client, not an emitter. It takes the query, the filtered strains, the PromptContext, and the reply schema. It calls the LLM. It returns a Pydantic typed reply. The dashed border on the Era field alerts that position.

The violet “PROMPT ASSEMBLY” zone on the correct is the place context engineering occurs as code. The collection implements it through three primitives:

A PromptContext(BaseModel) aggregator with one discipline per upstream context supply: doc_context, future corpus_context, future project_context.
A hard and fast MODULE_SYSTEM_PROMPT on the module stage for every brick that calls the LLM.
A MODULE_USER_TEMPLATE with named placeholders the brick fills through str.format(...).

Article 1 (the minimal four-brick RAG) launched the bricks as a stream. Article 6A (the query parsing thesis) made the query parser typed. Article 8A (the typed era contract) makes the era schema typed. This text reads the identical 4 bricks via the lens of “what context does each contribute, how do they attain the LLM name with out polluting one another.” Identical code, totally different lens.

3. The 4 typed items of a single-document payload

What lands within the LLM name for a single-document RAG is 4 items, every produced by a distinct piece of code, every with a distinct cost-and-cache profile. This part walks the 4 within the order they seem within the consumer content material the LLM reads.

3.1 The fastened system immediate

The primary piece is the system message. The position description, the foundations, the examples. It doesn’t change throughout calls. The collection writes it as a Python fixed on the module stage, then exposes it as a kwarg with a default so a caller can override per area with out forking:

PARSE_QUESTION_SYSTEM_PROMPT = (
    "You extract content material noun phrases from the consumer's query..."
)

def parse_question(query, *,
                   system_prompt: str = PARSE_QUESTION_SYSTEM_PROMPT,
                   user_template: str = PARSE_QUESTION_USER_TEMPLATE,
                   context: PromptContext | None = None):
    ...

Two operational penalties. The immediate is cacheable by the LLM supplier, as a result of it doesn’t change throughout calls on the identical mannequin. Cached enter prices roughly ten occasions lower than recent enter on the suppliers that publish a tariff. And the immediate is auditable, as a result of it lives at a secure Python image an auditor can grep, model, and diff between releases.

3.2 The retrieved strains, filtered by the dispatcher

The second piece is the strains the LLM really reads. The dispatcher consumes ParsedQuestion.key phrases and structural_hints, picks a way (key phrase, TOC, LLM arbiter), and returns the filtered body plus the audit. The consumer content material will get the filtered body; the audit lives on disk for the operator to examine later:

retrieved, filtered_line_df, audit = dispatch_page_retrieval(
    query, line_df, page_df,
    toc_df=toc_df, key phrases=key phrases,
    top_k=5, use_toc=True,
)

What ships to the LLM in consumer content material is the filtered body, not the entire doc. A 200-page contract turns into ten pages of related strains. The consumer content material stays underneath a number of thousand tokens. The audit explains why every web page made it in, so a caller can problem the choice with out re-running the decision.

3.3 The doc-context block, compact JSON

The third piece is the doc-level synthesis: doc kind, web page depend, typical fields, abstract. It lands within the consumer content material as a compact JSON object so the LLM can scope ambiguous wording in opposition to the doc’s nature. The collection implements it as a way on each context-carrying Pydantic class. DocContext.as_prompt_json() builds the smallest JSON that also names the 4 fields; null and empty values are dropped:

class DocContext(BaseModel):
    doc_type: str | None = None
    n_pages: int | None = None
    typical_fields: record[str] = []
    abstract: str | None = None

    def as_prompt_json(self) -> str:
        payload = {okay: v for okay, v in self.model_dump().gadgets()
                   if v shouldn't be None and v != []}
        return json.dumps(payload, separators=(",", ":"))

Measured on a CV with doc_type="resume", n_pages=1, and 4 typical fields, the payload is underneath 200 characters. On an unknown doc the place each discipline is null or empty, the payload is the empty object {} and the bloc is omitted solely from the consumer content material. The identical sample applies to the reserved corpus-context and project-context slots when later articles activate them.

3.4 The `PromptContext` aggregator that wraps the three above

The fourth piece is the aggregator. Every LLM-calling brick takes one optionally available context: PromptContext kwarg. The aggregator carries the doc-context in its personal typed slot right this moment, with reserved slots for the corpus-context and project-context the follow-up articles will activate. The helper render_context_block(context) walks the non-null fields and emits one labelled JSON bloc per layer on the head of the consumer content material:

class PromptContext(BaseModel):
    doc_context:     DocContext | None = None
    # corpus_context:  CorpusContext  | None = None  # reserved
    # project_context: ProjectContext | None = None  # reserved

Every LLM brick takes one optionally available context: PromptContext kwarg. The helper render_context_block(context) walks every non-null discipline, renders its compact JSON, and emits one labelled bloc per layer. Including a brand new layer means uncommenting one discipline, including two strains within the helper, and each brick picks the brand new layer free of charge. The signature is secure throughout releases.

4. What modifications in follow

Naming the follow modifications three operational issues, even with the code unchanged.

Audit. When the reply is fallacious, the query is now not “what did the immediate say.” The query is “what landed within the context window for that decision.” The collection persists each brick output to disk: parsing/, questions//parsed_question.json, retrieval//retrieved_pages.parquet, retrieval//retrieval_audit.json. The auditor reconstructs the context payload from these recordsdata. Then the query turns into particular: was the doc_context fallacious, had been the fallacious pages chosen, did the system immediate drift between releases, was the consumer template stale. Every of these has a distinct repair.

Value. Two levers compound. The system immediate is fastened throughout calls on the identical mannequin, so it pays cached-input tariff. The consumer content material has been compressed through as_prompt_json and chosen through retrieval, so the variable half is small. On a corpus of 100 paperwork with 10 questions every, the dominant value is the variable half occasions 1000 calls. Naming the follow doesn’t change the mathematics, however it makes the funds for every name legible: each line within the context payload has a generator that somebody can level at.

Composition throughout follow-up work. The PromptContext aggregator has one discipline activated right this moment, with two extra reserved for the corpus-context and project-context layers a later piece of the collection provides. When these land, this text doesn’t want a rewrite. The signature stays. The physique of render_context_block grows by one department. Each brick that already takes context: PromptContext | None picks up the brand new sub-context free of charge. The self-discipline pays off in deferring breakage throughout releases.

5. Out of scope, with pointers

The only-document case stops right here. Context engineering at massive covers three issues this text doesn’t contact:

Corpus context. When the reply requires studying throughout many paperwork, the LLM wants a way of which paperwork are in scope and what they’ve in frequent. That lives in a future CorpusContext Pydantic, fed by an aggregator over per-document parsing_summary values. The slot is reserved in PromptContext so the brick signatures don’t change. A later article walks the construct and the buyer wiring.
Dialog historical past. Multi-turn chat carries prior query / reply pairs the LLM ought to contemplate earlier than answering the brand new query. That could be a state downside (the place does the historical past dwell, when is it summarised, when is it pruned) on prime of a context downside. A later article within the collection treats it as a first-class brick.
Software calls. Agent loops carry software definitions, software outputs, and intermediate state into the context window. The choice / compression / isolation issues get sharper there as a result of the context window fills up rapidly throughout turns. A later article within the collection treats agentic context engineering as its personal subject.

The 4 canonical methods the LangChain weblog names (write, choose, compress, isolate) had been developed with the agent loop in thoughts. Two of them (write and choose) translate cleanly to the single-document case because the system immediate and the retrieval dispatcher. The opposite two (compress and isolate) apply in spirit however chew more durable as soon as corpus and dialog enter the image, which is why this text doesn’t drive the four-way mapping.

See it dwell

A brief dwell companion runs within the shipai dashboard. Click on any candidate web page within the audit path, then click on anchor / paragraph / part / web page within the picker above.

*The shipai dwell demo: similar anchor, 4 context-scope selections facet by facet, the consumer widens the spotlight to see the tradeoff – Picture by writer*

Identical anchor, 4 context-scope selections facet by facet. anchor is one line. paragraph is ±5 strains on the identical web page. part makes use of the TOC to widen to the part physique. web page fills the entire web page. The article’s trade-off (value vs precision) turns into a slider you may really feel on an actual PDF as a substitute of a paragraph of prose.

6. Conclusion

The 2025 business dialog round context engineering provides a reputation to a self-discipline single-document RAG already practises brick by brick. Parsing emits relational tables and a doc-level synthesis. Query parsing emits a typed ParsedQuestion whose fields every drive a distinct downstream brick. Retrieval emits a filtered line set plus an audit. Era consumes the assembled payload via a set system immediate, a templated consumer content material, and a PromptContext aggregator with one typed slot per upstream layer.

The label is what modifications: an auditor, a hiring supervisor, or a vendor studying the structure can place it contained in the 2025 vocabulary with out additional translation. The bricks, the schemas, and the cost-versus-cache trade-offs are unchanged. The corpus, the dialog, and the tool-call circumstances come up as follow-up work, every with its personal typed slot reserved in the identical aggregator.

7. Sources and additional studying

The 2025 dialog, in chronological order.

Walden Yan, Don’t construct multi-agents, Cognition, June 12 2025. The earliest piece that names the self-discipline. Yan’s declare that “context engineering is successfully the #1 job of engineers constructing AI brokers” is the road Lance Martin later quotes when he introduces the four-strategy taxonomy.
Tobi Lütke, X, June 18 2025. The naming tweet: “I actually just like the time period ‘context engineering’ over immediate engineering. It describes the core ability higher: the artwork of offering all of the context for the duty to be plausibly solvable by the LLM.”
Lance Martin, Context Engineering for Brokers, June 23 2025. The taxonomy paper. Additionally republished on the LangChain weblog underneath the LangChain Crew byline.
Andrej Karpathy, X, June 25 2025. The endorsement: “+1 for ‘context engineering’ over ‘immediate engineering’. Folks affiliate prompts with brief process descriptions you’d give an LLM in your day-to-day use. In each industrial-strength LLM app, context engineering is the fragile artwork and science of filling the context window with simply the correct info for the following step.”
Drew Breunig, Learn how to Repair Your Context, June 26 2025. A parallel taxonomy: six concrete ways (RAG, Software Loadout, Context Quarantine, Context Pruning, Context Summarization, Context Offloading) for holding the context window wholesome.

The taxonomies, facet by facet.

Lance Martin: 4 methods for the agent loop (write, choose, compress, isolate). Single-document RAG interprets the primary two cleanly; the opposite two chew more durable as soon as corpus and dialog enter the image.
Drew Breunig: six ways (RAG, Software Loadout, Context Quarantine, Pruning, Summarization, Offloading). Extra fine-grained, much less summary. Helpful when the agent loop is already operating and the context window is filling up.

The longer remedies.

Counterpoints.

Weaviate, Context Engineering e book (23 p, December 2025). The seller framing: six parts (Brokers, Question Augmentation, Retrieval, Prompting Methods, Reminiscence, Instruments). The collection’ place on this rebrand, the place the relabelling tracks the product line quite than the follow, is roofed in a follow-up critique put up.
Roadie weblog, Why Conflating RAG with Context Engineering Prices You in Manufacturing. The other framing: holding RAG and context engineering distinct, with retrieval as one slot amongst many.

The collection primitives this text references.

PromptContext aggregator and DocContext projection: src/docintel/core/schemas/.
render_context_block helper: src/docintel/core/prompts.py.
Module-level system prompts and consumer templates: each LLM-calling module underneath src/docintel/, by conference. Earlier within the collection:
Amplify the Skilled: A Philosophy for Constructing Enterprise RAG. The collection’ manifesto: the 4 bricks (parsing, query parsing, retrieval, era) are designed to scale the professional’s judgement, not change it.

Half I: What works, what breaks

Baseline Enterprise RAG, from PDF to highlighted reply. The four-brick pipeline finish to finish: PDF in, highlighted reply out.
Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. The place embedding similarity wins (synonyms, typos, paraphrase), the place it predictably breaks (unknown phrases, negation, term-vs-answer relevance), and use it anyway.
RAG shouldn’t be machine studying, and the ML toolkit solves the fallacious downside. Why chunk-size sweeps and finetuning optimize the fallacious factor; route by query kind as a substitute.
From regex to imaginative and prescient fashions: which RAG approach suits which downside. Two axes, doc complexity and query management, that choose the approach for every case.

Half II: The 4 bricks

Doc parsing

Past extract_text: the 2 layers of a PDF that drive RAG high quality. The primary half of the parsing brick: the doc’s nature, alerts, and abstract.
Cease returning flat textual content from a PDF: the relational tables RAG wants. The second half of the parsing brick: the relational tables each downstream brick reads.
- When PyMuPDF can’t see the desk: parse PDFs for RAG with Azure Structure. The identical tables from Azure Structure: native desk cells, OCR, paragraph roles.
- Parse PDFs for RAG domestically with Docling: wealthy tables, no cloud add. The identical tables computed domestically with Docling: TableFormer cells, nothing leaves the machine.
- Imaginative and prescient LLMs are PDF parsers too: studying charts and diagrams for RAG. Imaginative and prescient as a parser: the photographs grow to be searchable textual content.
- Parse scanned PDFs for RAG with EasyOCR: free OCR provides you phrases, not a doc. The place conventional OCR stops: textual content recovered, construction misplaced.
- Making a PDF’s pictures searchable for RAG, with out paying to learn all of them. The picture cascade: filter low cost, classify, describe solely what’s value studying.
- Reconstructing the desk of contents a PDF forgot to ship, so RAG can scope by part. Rebuilding toc_df when the PDF prints a contents web page however ships no define.

Query parsing

Parse the query earlier than you search: the lacking step in most RAG pipelines. The thesis of query parsing: why a consumer string wants the identical parsing as a doc, and the way it splits right into a retrieval temporary and a era temporary.
5 fields RAG ought to extract from any query: key phrases, scope, form, decomposition, clarification. The 5 households of columns the parser reads straight from the consumer’s query, with the code that fills each.
One parsed RAG query, 4 choices: chunk technique, mannequin tier, fragments, audit path. The selections the parser makes on prime of the consumer string, utilizing the doc’s profile: dispatch, activations, full schema, the audit path (pipeline_trace.json), and a broker-corpus walkthrough.

Retrieval

Context Engineering for RAG : The 4 Typed Inputs Behind Each RAG Reply

Implement a backup technique for Amazon Fast Sight BI belongings

Leave a Reply Cancel reply

Popular News

Greatest practices for Amazon SageMaker HyperPod activity governance

How Cursor Really Indexes Your Codebase

Context Engineering — A Complete Fingers-On Tutorial with DSPy

Construct a serverless audio summarization resolution with Amazon Bedrock and Whisper

Speed up edge AI improvement with SiMa.ai Edgematic with a seamless AWS integration

About Us

Category

Recent Posts