I was working on a feature where I needed to transform 100 messy compliance PDFs into structured JSON rules.
The brute force approach was obvious: give the agent the source text, explain the task, provide examples, and ask it to generate the rules. Because it was the lowest-hanging fruit, I tried it first.
At a glance, the output looked fine. The output JSON was valid and matched what I expected.
But as I was manually sampling the results to check for accuracy, the cracks appeared. Some rules were too broad, others were missed entirely. Some rules did not preserve the nuances of the original text. I tried using another agent to catch and fix the errors, but with such a large corpus, it was impossible to confidently verify the output.
That was the frustrating part. The errors weren’t obvious. This implementation was far too fragile to scale.
Although I can’t share the exact implementation details, what I can share are the architectural lessons I learnt and how I eventually implemented it. Hopefully, these insights will be useful if you’re building AI systems that need to scale, stay reliable, and deal with messy data. And if you have better ways of doing things, do reach out to chat!
Okay, let’s get to it.
The problem
The 100 PDFs I worked with had already been parsed and chunked before they reached me. But the raw content was still messy. There were bullet points, tables, OCR artefacts, translated sections, semi-structured headings, footers, headers, inconsistent formatting and document-specific quirks.
I chose to use an agent because deciding what mattered required semantic judgement. The documents didn’t follow one consistent pattern, so relevance couldn’t be determined through simple rules alone.
You had to understand the surrounding context. None of this was difficult when done on a small chunk of data. The challenge was doing it reliably at scale.
The generated rules were then passed to a downstream system that evaluated them deterministically.
What eventually worked
After a few experiments, I realised the biggest improvement didn’t come from a better prompt, a new tool, an MCP server, or a more sophisticated agent harness.
It came from changing the shape of the problem.
Instead of trying to make the agent smarter, I made the agent’s job smaller.
The first change was to prepare the source data upfront. Instead of asking the agent to query a database, retrieve information, decide whether it had the right inputs, and then perform the extraction, I gave it a more controlled starting point.
In my case, that meant temporarily storing the relevant raw data locally.
This may not always be practical. But the underlying principle is to reduce the amount of retrieval uncertainty the agent has to deal with. If the agent’s job is to reason over content, don’t also make it responsible for figuring out whether it has found the right content.
Another option would be to prepare the query upfront.
I also used a script to strip away unnecessary metadata and fields before passing the raw content to the agent. Less irrelevant context meant fewer distractions, fewer chances for the agent to latch onto the wrong details and a cleaner reasoning task overall.
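As an illustration only (not the actual implementation), a clean-up script of this kind could be as small as an allow-list filter. The sketch assumes each parsed chunk is a dict; the field names `chunk_id`, `text`, `embedding`, `parser_version` and `page` are hypothetical.

```python
# Hypothetical allow-list: only these fields are passed on to the agent.
KEEP_FIELDS = ("chunk_id", "text")


def strip_chunk(chunk: dict) -> dict:
    """Return a copy of the chunk containing only the allow-listed fields."""
    return {key: chunk[key] for key in KEEP_FIELDS if key in chunk}


def prepare_document(chunks: list[dict]) -> list[dict]:
    """Strip every chunk and drop chunks that have no usable text."""
    cleaned = [strip_chunk(chunk) for chunk in chunks]
    return [c for c in cleaned if c.get("text", "").strip()]


# Made-up example input: one real chunk and one whitespace-only chunk.
raw_chunks = [
    {"chunk_id": "doc1-c1", "text": "Records must be kept for 5 years.",
     "embedding": [0.1, 0.2], "parser_version": "1.4", "page": 3},
    {"chunk_id": "doc1-c2", "text": "   ", "page": 4},
]
agent_input = prepare_document(raw_chunks)
```

An allow-list is used here rather than a deny-list so that any new field the parser starts emitting is excluded by default.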
But the most important change was the unit of work.
Instead of processing everything at once, I did things iteratively and processed one document at a time.
That made each job smaller, easier to inspect, easier to retry, and easier to audit. I spun up 5 subagents to process documents in parallel, with each agent logging its progress to a file.
If one document failed, I could retry only that document. If one output had formatting issues, I could fix that specific case without rerunning the whole batch. If the pipeline stopped halfway, the cached progress meant it could resume from the last successful checkpoint.
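A rough sketch of this per-document pattern, under stated assumptions: `extract_rules` is a hypothetical stand-in for the agent call, each document's output file doubles as its checkpoint, and a thread pool of 5 workers stands in for the subagents. It illustrates the idea and is not the real pipeline.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_rules(doc_id: str) -> list[dict]:
    """Hypothetical stand-in for the agent call on a single document."""
    if doc_id == "doc-3":
        raise RuntimeError("simulated agent failure")
    return [{"doc_id": doc_id, "rule": f"placeholder rule from {doc_id}"}]


def process_one(doc_id: str, out_dir: Path) -> str:
    """Process one document, using its output file as the checkpoint."""
    out_path = out_dir / f"{doc_id}.json"
    if out_path.exists():
        return "skipped"  # already finished in an earlier run
    try:
        rules = extract_rules(doc_id)
    except Exception:
        return "failed"  # nothing is written, so the next run retries it
    tmp_path = out_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(rules))
    tmp_path.replace(out_path)  # rename last, so a crash leaves no partial checkpoint
    return "done"


def run_batch(doc_ids: list[str], out_dir: Path, workers: int = 5) -> dict[str, str]:
    """Run all documents in parallel and report a status per document."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(lambda d: process_one(d, out_dir), doc_ids))
    return dict(zip(doc_ids, statuses))


out_dir = Path(tempfile.mkdtemp()) / "rules"
doc_ids = [f"doc-{i}" for i in range(1, 6)]
first_run = run_batch(doc_ids, out_dir)   # doc-3 fails, the rest are written
second_run = run_batch(doc_ids, out_dir)  # only doc-3 is attempted again
```

Because a document counts as finished only once its file exists, resuming after a crash is just a matter of running the same batch again.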
This was also where the separation of responsibilities became clearer.
The agent handled the semantic work: understanding the content, identifying the relevant parts and writing the JSON output.
The surrounding code handled the mechanical parts: parallelising jobs, enforcing the schema, generating IDs, writing files, caching progress, validating references, and checking whether the output could be traced back to the original source.
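To make the split concrete, here is a minimal sketch of two of those mechanical jobs, schema enforcement and ID generation, done in plain code instead of by the agent. The schema and its field names (`rule_text`, `source_chunk_id`, `source_quote`) are assumptions for illustration, not the real rule format.

```python
import hashlib

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"rule_text": str, "source_chunk_id": str, "source_quote": str}


def schema_errors(rule: dict) -> list[str]:
    """Return a list of problems; an empty list means the rule is well-formed."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in rule:
            errors.append(f"missing field: {field}")
        elif not isinstance(rule[field], expected_type):
            errors.append(f"wrong type for {field}")
    extra = set(rule) - set(REQUIRED_FIELDS)
    if extra:
        errors.append(f"unexpected fields: {sorted(extra)}")
    return errors


def rule_id(doc_id: str, rule: dict) -> str:
    """Derive a stable ID from the content, so reruns produce the same ID."""
    key = f"{doc_id}|{rule['source_chunk_id']}|{rule['rule_text']}"
    return "rule-" + hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


# Made-up agent output: one well-formed rule and one malformed rule.
agent_output = [
    {"rule_text": "Keep records for 5 years.", "source_chunk_id": "doc1-c1",
     "source_quote": "Records must be kept for 5 years."},
    {"rule_text": "Report incidents promptly.", "id": "made-up-by-agent"},
]

accepted, rejected = [], []
for rule in agent_output:
    errors = schema_errors(rule)
    if errors:
        rejected.append((rule, errors))  # candidates for a targeted retry
    else:
        accepted.append({"id": rule_id("doc1", rule), **rule})
```

IDs are assigned by code after validation, so the agent never has to invent identifiers and any ID it does invent is rejected as an unexpected field.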
I additionally had an orchestrator watch over the progress of the script.
Making the output auditable
A useful design decision was adding reference IDs to every generated rule. This meant that each output item pointed back to a specific source.
This made the output easier to audit. Instead of asking, “Does this generated rule look right?”, I could ask more precise questions such as: does the referenced source chunk exist? Is the quoted source text actually present in that chunk?
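Those two questions are mechanical enough to check in code. A minimal sketch, assuming each rule carries a hypothetical `source_chunk_id` and `source_quote` field and that chunks are available as a dict from ID to text:

```python
def normalise(text: str) -> str:
    """Collapse whitespace and case so layout differences don't cause false alarms."""
    return " ".join(text.split()).lower()


def audit_rule(rule: dict, chunks_by_id: dict[str, str]) -> list[str]:
    """Return the audit failures for one rule (an empty list means traceable)."""
    chunk_text = chunks_by_id.get(rule["source_chunk_id"])
    if chunk_text is None:
        return ["referenced chunk does not exist"]
    quote = normalise(rule["source_quote"])
    # An empty quote would trivially match any chunk, so treat it as a failure.
    if not quote or quote not in normalise(chunk_text):
        return ["quoted text not found in referenced chunk"]
    return []


# Made-up data: one traceable rule, one dangling reference, one altered quote.
chunks_by_id = {"doc1-c1": "Records must be kept\nfor 5 years."}
rules = [
    {"id": "r1", "source_chunk_id": "doc1-c1",
     "source_quote": "Records must be kept for 5 years."},
    {"id": "r2", "source_chunk_id": "doc1-c9", "source_quote": "Anything"},
    {"id": "r3", "source_chunk_id": "doc1-c1",
     "source_quote": "Records must be kept for 7 years."},
]
report = {rule["id"]: audit_rule(rule, chunks_by_id) for rule in rules}
```

A check like this only confirms that a rule is traceable to its source. Whether the rule interprets that source correctly still needs a human or a second agent.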
I could also get another agent to selectively run audits on larger and more complex documents to ensure that important nuances were preserved.
On top of that, I did a lightweight version of evals. I ran a small batch of raw documents through the workflow and manually reviewed the results for coverage and accuracy. A full golden dataset was not practical for the scope of this task, but I still needed a way to prove to myself that the workflow was working.
My goal was not to build a perfect benchmark but to make the system auditable enough that I could inspect the outputs, catch failures, and iterate towards a higher accuracy bar.
If you’ve got ideas on how I could have done this better, let me know!
My biggest takeaway
The pattern that worked was to stop treating the LLM as the whole system.
The system became more reliable not because the agent became perfect, but because the workflow made its outputs easier to trace, validate, and recover from.
Coincidentally, I was building this shortly before attending the inaugural AI Engineer Singapore conference, held from 15–17 May 2026.
On the last day, JJ Geewax, Director of Applied AI at Google DeepMind, shared a framing that captured what I had been learning the hard way: we need to stop using LLMs like giant problem solvers.
That resonated with me because it’s such an easy trap to fall into. It’s easy to just give the model the data, schema, business rules, edge cases, and the responsibility to verify itself, and then get frustrated when the result is inconsistent.
But for reliable production systems, the better pattern is usually a hybrid. Let the agent handle the parts that require semantic judgement, and let code handle the parts that require structure, validation, and control.
I’ll be sharing more reflections from AI Engineer Singapore and the workshops I attended. The YouTube snippet of JJ’s talk is here.
That’s all from me. I hope this helped, and see you in the next article 🙂

