Introduction
I built an agentic AI network for my company that advises manufacturing plants on how to mature their operations. The system was designed to be data-driven, allowing users to upload assessment data directly through the chat interface. The first working prototype was finished surprisingly quickly, and at first glance the results looked promising.
There was only one problem: most of the results were wrong!
Even worse, the AI quickly learned which numerical ranges looked plausible and began producing convincing but fabricated outputs. Combined with the eloquent language generation of the LLM, these results could easily be mistaken for truth. And this behavior was not limited to a single model. Similar patterns appeared across all tested systems: ChatGPT, Gemini Enterprise, DIA Mind, and Microsoft Copilot.
However, plausible data is not enough: enterprise AI systems require reliable data!
Further investigation revealed recurring failure modes. Even with “Code Interpreter” enabled, the systems:
- skipped rows or columns,
- applied incorrect filters,
- returned identical results for different inputs,
- silently mixed parts of the dataset,
- or simply collapsed under more complex analytical tasks.
This led to an important realization:
Probabilistic reasoning is extremely powerful for interpretation and interaction, but foundational data analysis requires deterministic execution.
Table of contents
1 The Use Case
2 The Hybrid Architecture
3 The Analysis Planner
4 The Analysis Engine
5 An End-to-End Example
6 Why AI Architecture Matters
1 The Use Case
Although the specific use case is of secondary importance, it is briefly outlined here to support the practical understanding of the underlying architectural challenge.
The primary task of our agent is to advise manufacturing plants and value streams on how to improve their operational maturity: optimizing processes, improving productivity, reducing inventory levels, and ultimately lowering operational costs. To achieve this, the consulting agent operates in two modes:
- It provides generic recommendations for improving specific operational topics based on the retrieval of specialized “how-to” documentation and assessment questionnaires.
- It analyzes the current situation of a plant or value stream based on assessment results and assessors’ written feedback. Based on this analysis, it is expected to provide highly specific recommendations for the next improvement steps.
In both modes, as with most LLM-based AI models, the user can interactively discuss ideas and recommendations with the agent in order to derive the most suitable action plan.
For the second operating mode, it is essential that the agent can reliably process and analyze assessment data. In our case, this data is provided as an Excel export from a central database. Ideally, the agent should be able to process the file without any prior manual preparation.
The structure of the file, however, is challenging. Since all assessment results, intermediate calculations, metadata, and detailed assessment questions are stored in separate columns, the worksheet contains more than 800 columns. The number of rows corresponds to the number of assessments in the database and can range from one to several hundred (Fig. 1). Assessment scores are represented as integers from 0 to 4. In addition, the file contains more than 160 free-text fields with qualitative observations, strengths, weaknesses, and recommendations from the assessors.

The analytical tasks of the agent include filtering relevant rows and columns for a specific request, calculating averages, aggregating maturity scores, summarizing textual feedback, and deriving meaningful improvement suggestions from the results.
Initially, these tasks appeared to be well within the capabilities of modern LLM-based AI systems, especially with “Code Interpreter” mode enabled. As already mentioned in the introduction, this assumption quickly turned out to be a misconception.
2 The Hybrid Architecture
The core idea for overcoming the analytical challenge was to clearly separate deterministic data analysis from LLM-based reasoning and interpretation. Fig. 2 shows the chosen system architecture after several improvement iterations. The system was implemented in Microsoft Copilot Studio because the platform allows deterministic workflow components, such as topics and flows, to be combined with LLM-based reasoning components.

The parent agent handles all communication with the user. It orchestrates the sub-agents and the analytics module, delegates tasks to them, receives their responses, and composes the final answer.
The sub-agents are specialized LLM-based modules with access to specific knowledge sources. These include descriptions of maturity-level expectations for the value streams, questionnaires with detailed assessment questions, and more general guidelines for operational excellence. The sub-agents are called by the parent agent according to their specific capabilities and respond to the parent agent rather than directly to the user.
The analytics module is the main focus of this article. It performs the deterministic data analysis and is designed to provide reproducible and reliable analytical results. It receives an analysis instruction in natural language from the parent agent, called Parent_Instruction. The analytics module itself consists of topics, flows, and AI modules, which are called “prompts” in Copilot Studio.
The topic T_receive_Excel_File handles the upload and storage of assessment files. It is triggered when a file is uploaded in the chat window, indicated by the variable System.Activity.Attachments having a value. The topic checks whether the uploaded file is an Excel file and, if so, stores it in the global variable Assessment_File.
The topic T_analyze_assessments is actively called by the parent agent whenever it has an analytics task to conduct, and it receives Parent_Instruction as input. A second input is the assessment data stored in the global variable Assessment_File. The topic contains the two core analytics components: Analysis_Planner and Analysis_Engine. Both are embedded in agentic flows, F_Call_Analysis_Planner and F_Call_Analysis_Engine. These flows serve as connectors between the topic T_analyze_assessments and the AI prompts P_Analysis_Planner and P_Analysis_Engine.
F_Call_Analysis_Planner receives only one input, Parent_Instruction, and forwards it to P_Analysis_Planner. This component generates the Selection_Rule, the core analysis instruction to be executed by P_Analysis_Engine. The inner workings of P_Analysis_Planner are discussed in Chapter 3.
F_Call_Analysis_Engine receives three inputs: the Selection_Rule from Analysis_Planner, a Mapping_File provided from SharePoint, and the Assessment_File. All three inputs are forwarded to the AI prompt P_Analysis_Engine, which conducts the data analysis as specified by Analysis_Planner. The P_Analysis_Engine is discussed in detail in Chapter 4.
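The control flow of the analytics module can be summarized in a few lines. The following is only a minimal sketch under stated assumptions: in the real system this logic lives in Copilot Studio topics and flows, not in Python, and the names run_analytics, fake_planner, and fake_engine are hypothetical stand-ins for the two flows and their AI prompts.

```python
import json
from typing import Any, Callable

# Hypothetical stand-ins: in the article these steps are Copilot Studio
# flows that call the AI prompts P_Analysis_Planner and P_Analysis_Engine.
Planner = Callable[[str], str]
Engine = Callable[[dict, Any, Any], str]


def run_analytics(parent_instruction: str, assessment_file: Any,
                  mapping_file: Any, planner: Planner, engine: Engine) -> dict:
    """Planner first, engine second; the engine never sees the raw request."""
    plan = json.loads(planner(parent_instruction))
    if plan.get("status") != "success":
        # Unclear request: hand the planner's warnings back to the parent agent.
        return {"status": "error", "warnings": plan.get("warnings", [])}
    raw = engine(plan["selection_rule"], mapping_file, assessment_file)
    return {"status": "success", "result": json.loads(raw)}


# Usage with stub callables (made-up values, no LLM involved):
def fake_planner(instruction: str) -> str:
    return json.dumps({"status": "success",
                       "selection_rule": {"analysis_type": "text_summary"},
                       "warnings": []})


def fake_engine(rule: dict, mapping_file: Any, assessment_file: Any) -> str:
    return json.dumps({"analysis_type": rule["analysis_type"], "entries": []})


response = run_analytics("Summarize strengths", None, None,
                         fake_planner, fake_engine)
print(response["status"])
```

The point of the sketch is the one-way hand-over: the engine is called with the structured rule only, never with the natural-language request.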
3 The Analysis Planner
The P_Analysis_Planner is the intelligent part of the data analysis pipeline and generates the analysis instruction, called Selection_Rule. This instruction is a translation of the natural-language Parent_Instruction and is generally unique for each request. In order to minimize probabilistic variation, the translation process is constrained by strict rules.
The Analysis_Planner does not analyze the assessment data itself. Its sole responsibility is to translate the probabilistic Parent_Instruction into a deterministic analysis specification.
In the following, we will examine selected parts of the instruction in more detail. You can download the full instruction here.
You are Analysis_Planner, an expert assistant for translating natural-language assessment analysis requests into structured Selection_Rules.
Your task is to create a Selection_Rule JSON object for the Analysis_Engine.
You receive only one input:
1. Parent_Instruction :
A natural-language analysis request from the parent agent (orchestrator).
You must analyze Parent_Instruction and determine:
- which type of analysis is required,
- which assessment content categories are relevant,
- whether concept or execution maturity/findings are requested,
- whether specific chapters are requested,
- and whether row filters are required.
The Selection_Rule you generate will later be used by the Analysis_Engine together with:
- the actual assessment data file,
- and the Mapping_File
to execute the analysis deterministically.
The code box above shows the initial instruction for P_Analysis_Planner. It clearly defines purpose and scope and explicitly separates planning from execution. The planner interprets the request, while the actual execution is delegated to the P_Analysis_Engine.
Next follows a longer section describing the semantics of the assessment data. Of course, this part is highly specific to the individual use case and dataset. It defines semantic categories used for row filtering and categories used to select the actual analysis targets (TARGET CONTENT CATEGORIES and TARGET SELECTION ATTRIBUTES).
ASSESSMENT DATA SEMANTICS
The assessment data can be addressed through the following semantic categories.
ROW FILTER CATEGORIES
Use these categories only for row_filters:
- VS_Nr:
  Unique identifier of the value stream.
  Use when filtering by value stream number.
- Value Stream:
  Name of the value stream.
  Use when filtering by value stream name.
- ...
TARGET CONTENT CATEGORIES
Use these categories only in target_selection_rules.data_category:
- chapter_score:
  Numeric maturity score.
  Use for maturity calculations, score analysis, and average maturity analysis.
- strength:
  Assessor statements describing strengths.
- ...
TARGET SELECTION ATTRIBUTES
Use these attributes only inside target_selection_rules:
- data_category:
  Defines which target content category is required.
- aggregation_allowed:
  Use:
  - mean for numeric maturity averages
  - summary for textual summaries
- ...
The planner never interacts directly with physical dataset columns. Instead, it operates on a semantic abstraction layer that decouples natural language from the underlying dataset structure.
This separation is important because the assessment dataset contains more than 800 columns, including:
- maturity scores,
- textual assessor findings,
- metadata,
- organizational mappings,
- questionnaire variants,
- and concept/execution distinctions.
Selecting the correct target columns therefore becomes a critical part of the analysis process.
Restricting the allowed analysis types is equally important. The planner is intentionally prevented from inventing arbitrary analytical operations. The section ANALYSIS TYPES therefore defines the only valid analysis types, currently just two. This significantly improves the predictability and robustness of downstream execution. Of course, the list can easily be extended for individual use cases.
ANALYSIS TYPES
Use exactly one of these analysis_type values:
- numeric_mean
  Use for:
  - average maturity
  - mean maturity
  - ...
- text_summary
  Use for:
  - strengths
  - improvement potentials
  - ...
The next section defines how the planner selects the relevant target columns in an abstract and deterministic way. The rules distinguish between the two predefined analysis types numeric_mean and text_summary and finally determine which dataset columns are selected for a specific request.
RULES FOR target_selection_rules
NUMERIC MATURITY ANALYSIS
For numeric maturity analysis:
- analysis_type must be:
  "numeric_mean"
- data_category must be:
  ["chapter_score"]
- ...
TEXT SUMMARY ANALYSIS
For textual summary analysis:
- analysis_type must be:
  "text_summary"
- data_category:
  include only requested categories:
  - "strength"
  - "potential"
  - "recommendation"
  - "comment"
- ...
A similar logic applies to the row filtering process.
RULES FOR row_filters
Use row_filters only for filtering rows in the assessment dataset.
Allowed row filter keys are:
- VS_Nr
- Value Stream
- ...
Do NOT use row_filters for:
- chapter_id
- ...
These belong only to target_selection_rules.
Finally, the instruction defines the required output structure together with several strict “do-not rules”. This section is particularly important because the generated output is directly forwarded to the P_Analysis_Engine and therefore must follow a clearly defined and machine-readable structure.
OUTPUT FORMAT
Return only valid JSON.
Do not return markdown.
Do not return Python code.
...
Use exactly this structure:
{
  "status": "success",
  "parent_instruction_summary": "",
  "selection_rule": {
    "analysis_type": "",
    "target_selection_rules": {
      "data_category": [],
      "aggregation_allowed": [],
      "concept_execution": null,
      "chapter_id": null
    },
    "row_filters": {}
  },
  "warnings": []
}
If the request is unclear, the planner must explicitly return an error structure instead of “guessing” a potentially wrong analysis instruction.
If the task is unclear, return:
{
  "status": "error",
  "parent_instruction_summary": "",
  "selection_rule": {
    "analysis_type": null,
    "target_selection_rules": {
      "data_category": [],
      "aggregation_allowed": [],
      "concept_execution": null,
      "chapter_id": null
    },
    "row_filters": {}
  },
  "warnings": [
    "The analysis task is not clearly understood."
  ]
}
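Because the planner output is machine-readable, it can be checked before it reaches the engine. The following is a minimal sketch of such a check, not part of the described system: whether the real flow validates the planner output is not stated here, the function name validate_planner_output is hypothetical, and the whitelists only contain the entries visible in the prompt excerpts (the elided "..." entries are not filled in).

```python
import json
from typing import List

# Assumed whitelists, taken from the visible prompt excerpts only.
ALLOWED_ANALYSIS_TYPES = {"numeric_mean", "text_summary"}
TARGET_RULE_KEYS = {"data_category", "aggregation_allowed",
                    "concept_execution", "chapter_id"}


def validate_planner_output(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the plan is usable."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(plan, dict):
        return ["top level is not a JSON object"]
    if plan.get("status") != "success":
        return [f"planner status is {plan.get('status')!r}"]
    rule = plan.get("selection_rule")
    if not isinstance(rule, dict):
        return ["selection_rule is missing or not an object"]

    problems = []
    if rule.get("analysis_type") not in ALLOWED_ANALYSIS_TYPES:
        problems.append(f"unknown analysis_type: {rule.get('analysis_type')!r}")
    targets = rule.get("target_selection_rules") or {}
    unknown = set(targets) - TARGET_RULE_KEYS
    if unknown:
        problems.append(f"unknown target attributes: {sorted(unknown)}")
    if "chapter_id" in (rule.get("row_filters") or {}):
        problems.append("chapter_id belongs in target_selection_rules")
    return problems


good = json.dumps({
    "status": "success",
    "selection_rule": {
        "analysis_type": "numeric_mean",
        "target_selection_rules": {"data_category": ["chapter_score"]},
        "row_filters": {},
    },
})
print(validate_planner_output(good))  # []
```

A check of this kind keeps the "do not guess" principle enforceable in code rather than relying on the prompt alone.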
At this point, the planner has transformed ambiguous natural language into a deterministic analysis specification. However, the actual data execution has still not occurred.
In chapter 5, we will follow a real user request through the entire pipeline and examine how P_Analysis_Planner generates the Selection_Rule and how P_Analysis_Engine executes it on the assessment dataset.
4 The Analysis Engine
Unlike the P_Analysis_Planner, the P_Analysis_Engine does not reason about the task. It only executes the analysis specification generated by P_Analysis_Planner.
As in chapter 3, we will focus only on the most relevant parts of the instruction. The full specification can be downloaded here.
The instruction of P_Analysis_Engine starts with the basic task definition. In essence, the AI prompt is used as a controlled Python execution environment. The code is predefined in the prompt instruction and must only be executed, not modified.
You are Analysis_Engine, a deterministic pandas-based analysis executor.
Your task is to analyze an Excel assessment dataset using Code Interpreter.
You receive three inputs:
1. doc
   The Excel file containing the assessment data.
2. Mapping_File
   The Excel file describing the columns of doc.
3. Selection_Rule
   A JSON object that defines:
   - which columns to select from Mapping_File
   - which row filters to apply to doc
   - which type of analysis to perform
You must not reinterpret the original user request.
You must not infer additional columns.
You must not change Selection_Rule.
You must not generate a new analysis approach.
You must only execute the deterministic Python script below.
Use Code Interpreter to execute the Python script.
Return only the JSON result printed by the script.
Do not return markdown.
Do not explain the code.
Do not add text before or after the JSON result.
P_Analysis_Engine receives three inputs:
- The Assessment_File uploaded by the user in the chat interface. It is stored in the prompt-internal variable doc.
- A Mapping_File, which the flow F_Call_Analysis_Engine loads from SharePoint in preparation for the execution.
- The Selection_Rule generated by P_Analysis_Planner (see chapter 3).
The Mapping_File plays a crucial role in defining the semantics of the many columns in Assessment_File on a higher level of abstraction. With this abstraction layer, the Selection_Rule only needs to specify which type of information is required, while the P_Analysis_Engine selects the corresponding dataset columns during execution.

Mapping_File | image by author
Fig. 3 shows the structure of Mapping_File. It contains a row for each column of Assessment_File that is potentially relevant for the data analysis. Data columns that are clearly irrelevant are not represented in Mapping_File and are therefore not visible to P_Analysis_Engine. For each row, the file specifies the selection criteria:
- data_category: Functional meaning of the column, e.g. maturity score, strength, plant name, region, or season.
- chapter_id: Unique identifier of the assessment chapter.
- chapter_name: Human-readable name of the assessment chapter.
- concept_execution: Indicates whether the column belongs to concept or execution maturity.
- aggregation_allowed: Defines which type of aggregation is valid for the column, e.g. mean for numeric maturity scores or summary for textual findings.
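To make the structure tangible, here is a toy version of the Mapping_File as a pandas DataFrame. The attribute names follow the list above and source_column_name is the column the script reads the dataset column name from; all row values, including the dataset column names, are made up for illustration.

```python
import pandas as pd

# Illustrative rows only: one row per relevant column of Assessment_File.
mapping_df = pd.DataFrame([
    {"source_column_name": "Plant", "data_category": "Plant",
     "chapter_id": None, "chapter_name": None,
     "concept_execution": None, "aggregation_allowed": None},
    {"source_column_name": "3.5 CON Score", "data_category": "chapter_score",
     "chapter_id": "3.5", "chapter_name": "Example chapter",
     "concept_execution": "concept", "aggregation_allowed": "mean"},
    {"source_column_name": "3.5 EXE Score", "data_category": "chapter_score",
     "chapter_id": "3.5", "chapter_name": "Example chapter",
     "concept_execution": "execution", "aggregation_allowed": "mean"},
    {"source_column_name": "3.5 CON Strengths", "data_category": "strength",
     "chapter_id": "3.5", "chapter_name": "Example chapter",
     "concept_execution": "concept", "aggregation_allowed": "summary"},
])

# Inverse view: which dataset columns stand behind each semantic category?
by_category = (
    mapping_df.groupby("data_category")["source_column_name"]
    .apply(list)
    .to_dict()
)
print(by_category["chapter_score"])  # ['3.5 CON Score', '3.5 EXE Score']
```

Row-filter categories such as Plant and target categories such as chapter_score live in the same table, which is why a single lookup mechanism can serve both purposes.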
Next in P_Analysis_Engine’s instruction comes a paragraph on how to interpret the Selection_Rule.
Rules for Selection_Rule:
- analysis_type = "numeric_mean":
  Calculate arithmetic means for all selected numeric target columns.
- analysis_type = "text_summary":
  Collect non-empty text entries from all selected text target columns.
- target_selection_rules:
  Select target columns by matching Mapping_File attributes.
  A rule value of null means: do not filter by this attribute.
  A list means: keep rows where the Mapping_File attribute is in the list.
- row_filters:
  Apply row filters to doc.
  Keys are data_category values from Mapping_File, such as "Plant", "Region", "Production Principle", "Season".
  Values are lists of accepted values.
The Selection_Rule specifies:
- which analysis operation must be executed (analysis_type),
- how relevant target columns are selected from the Mapping_File (target_selection_rules),
- and how the assessment dataset is filtered before the analysis is performed (row_filters).
This instruction is intentionally deterministic. The P_Analysis_Engine is not allowed to reinterpret the original user request or invent additional analytical operations.
After the instruction block, the P_Analysis_Engine receives the actual Python script. The full script contains more than 300 lines of code and is part of the AI prompt instruction. It is linked at the top of this chapter and can be downloaded. Many of the code lines are not conceptually important for the architecture. They deal with practical robustness: cleaning column names, normalizing input values, handling missing columns, converting Copilot wrapper objects, and returning structured error messages.
For this article, I will focus only on the central logic.
The first important step is that the engine loads the uploaded assessment data (now available in doc) and the Mapping_File. From this point on, the LLM is no longer interpreting the user request. It only executes the deterministic script based on the Selection_Rule.
mapping_df = pd.read_excel(Mapping_File)
data_df = pd.read_excel(doc)
mapping_df = strip_column_names(mapping_df)
data_df = strip_column_names(data_df)
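The helper functions used in the excerpts are part of the full script and are not reproduced in this article. As a rough orientation, they could look like the following sketch. The bodies are my assumption of a plausible implementation (trim whitespace, compare case-insensitively, treat null as "no filter"), not the original code.

```python
import pandas as pd


def strip_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column headers have no surrounding whitespace."""
    out = df.copy()
    out.columns = [str(c).strip() for c in out.columns]
    return out


def normalize_for_matching(value):
    """Compare values case-insensitively and ignore surrounding whitespace."""
    if pd.api.types.is_scalar(value) and pd.isna(value):
        return None  # empty Excel cells arrive as None/NaN
    return str(value).strip().casefold()


def normalize_rule_value(rule_value):
    """Treat null as 'no filter' and wrap a single value in a list."""
    if rule_value is None:
        return None
    if isinstance(rule_value, (list, tuple)):
        return list(rule_value)
    return [rule_value]


def normalize_list_for_matching(values):
    """Normalize every list entry; pass None through unchanged."""
    if values is None:
        return None
    return [normalize_for_matching(v) for v in values]


df = strip_column_names(pd.DataFrame({" Plant ": ["AbcP"]}))
print(list(df.columns), normalize_for_matching("  AbcP "))
```

Normalizing both sides of every comparison is what makes a filter such as "AbcP" match cells like "abcp " without any interpretation by the LLM.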
The key architectural element is the selection of target columns. The P_Analysis_Engine never guesses which Excel columns may be relevant. Instead, it filters the Mapping_File according to the attributes defined in target_selection_rules.
target_mapping = mapping_df.copy()
for attr, rule_value in target_selection_rules.items():
    values = normalize_rule_value(rule_value)
    values = normalize_list_for_matching(values)
    if values is None:
        # null in the Selection_Rule: do not filter by this attribute
        continue
    target_mapping = target_mapping[
        target_mapping[attr]
        .apply(normalize_for_matching)
        .isin(values)
    ]
selected_target_columns = (
    target_mapping["source_column_name"]
    .dropna()
    .tolist()
)
This is the point where the abstract analysis instruction becomes concrete. For example, a rule such as chapter_id = ["3.5"], data_category = ["chapter_score"], and aggregation_allowed = ["mean"] is translated into the actual Excel columns containing the Concept and Execution maturity scores for chapter 3.5.
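The chapter 3.5 example can be reproduced on a toy mapping table. This is a self-contained sketch: the mapping rows and dataset column names are invented, and norm is a simplified stand-in for the script's normalize_for_matching helper.

```python
import pandas as pd

# Toy Mapping_File; column names and values are illustrative assumptions.
mapping_df = pd.DataFrame({
    "source_column_name": ["3.5 CON Score", "3.5 EXE Score",
                           "3.5 CON Strengths", "1.4 CON Score"],
    "data_category": ["chapter_score", "chapter_score",
                      "strength", "chapter_score"],
    "chapter_id": ["3.5", "3.5", "3.5", "1.4"],
    "concept_execution": ["concept", "execution", "concept", "concept"],
    "aggregation_allowed": ["mean", "mean", "summary", "mean"],
})

target_selection_rules = {
    "data_category": ["chapter_score"],
    "aggregation_allowed": ["mean"],
    "concept_execution": None,  # null: do not filter by this attribute
    "chapter_id": ["3.5"],
}


def norm(value) -> str:
    # Simplified stand-in for normalize_for_matching.
    return str(value).strip().casefold()


target_mapping = mapping_df.copy()
for attr, rule_value in target_selection_rules.items():
    if rule_value is None:
        continue
    accepted = [norm(v) for v in rule_value]
    target_mapping = target_mapping[
        target_mapping[attr].apply(norm).isin(accepted)
    ]

selected_target_columns = target_mapping["source_column_name"].dropna().tolist()
print(selected_target_columns)  # ['3.5 CON Score', '3.5 EXE Score']
```

Each attribute narrows the mapping table further, so the rule behaves like a logical AND across attributes and a logical OR within each list.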
The same principle is applied to row filters. Again, the engine does not infer anything from natural language. It only applies the filters explicitly provided in the Selection_Rule.
filtered_df = data_df.copy()
for filter_category, filter_values in row_filters.items():
    # Normalize the accepted values (this step is not shown in the original
    # excerpt; it is assumed to mirror the target-column loop)
    values = normalize_list_for_matching(normalize_rule_value(filter_values))
    # Look up the dataset column that represents this filter category
    filter_mapping = mapping_df[
        mapping_df["data_category"]
        .apply(normalize_for_matching)
        == normalize_for_matching(filter_category)
    ]
    filter_col = filter_mapping["source_column_name"].iloc[0]
    filtered_df = filtered_df[
        filtered_df[filter_col]
        .apply(normalize_for_matching)
        .isin(values)
    ]
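The indirection from filter category to dataset column is easy to miss, so here is a small self-contained illustration. The data, the dataset column name "Plant Name", and the explicit KeyError guard are assumptions for this sketch; norm again stands in for normalize_for_matching.

```python
import pandas as pd

# The filter key "Plant" is a data_category, not a dataset column name.
mapping_df = pd.DataFrame({
    "source_column_name": ["Plant Name", "Region"],
    "data_category": ["Plant", "Region"],
})
data_df = pd.DataFrame({
    "Plant Name": ["AbcP", "abcp ", "XyzP"],
    "Region": ["EU", "EU", "US"],
    "1.4 CON Score": [2, 3, 4],
})
row_filters = {"Plant": ["AbcP"]}


def norm(value) -> str:
    # Simplified stand-in for normalize_for_matching.
    return str(value).strip().casefold()


filtered_df = data_df.copy()
for filter_category, filter_values in row_filters.items():
    match = mapping_df[
        mapping_df["data_category"].apply(norm) == norm(filter_category)
    ]
    if match.empty:
        # Fail loudly instead of silently skipping an unknown filter.
        raise KeyError(f"No mapping entry for row filter {filter_category!r}")
    filter_col = match["source_column_name"].iloc[0]
    accepted = [norm(v) for v in filter_values]
    filtered_df = filtered_df[
        filtered_df[filter_col].apply(norm).isin(accepted)
    ]

print(len(filtered_df))  # 2
```

Both "AbcP" and "abcp " survive the filter because values are normalized on both sides before they are compared.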
After column selection and row filtering, the actual analysis logic becomes intentionally simple. For numeric maturity analysis, the engine calculates arithmetic means for all selected numeric target columns.
if analysis_type == "numeric_mean":
    numeric_result = {}
    for col in available_target_columns:
        # Non-numeric cells become NaN and are excluded from the mean
        series = pd.to_numeric(filtered_df[col], errors="coerce")
        valid_count = int(series.notna().sum())
        numeric_result[col] = {
            "mean": float(series.mean()) if valid_count > 0 else None,
            "valid_count": valid_count
        }
    result["result"] = numeric_result
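The effect of errors="coerce" and of reporting valid_count is easiest to see on a column with mixed content. The following is a small sketch with invented cell values; the function name column_mean is hypothetical and simply wraps the per-column logic.

```python
import pandas as pd


def column_mean(df: pd.DataFrame, col: str) -> dict:
    """Mean of all cells that can be parsed as numbers, plus their count."""
    series = pd.to_numeric(df[col], errors="coerce")  # unparsable -> NaN
    valid_count = int(series.notna().sum())
    return {
        "mean": float(series.mean()) if valid_count > 0 else None,
        "valid_count": valid_count,
    }


# Invented cells: an integer, a numeric string, an empty cell, and text.
scores = pd.DataFrame({"3.5 CON Score": [2, "3", None, "n/a"]})
print(column_mean(scores, "3.5 CON Score"))
# {'mean': 2.5, 'valid_count': 2}

# A column without any usable score yields None instead of NaN.
empty = pd.DataFrame({"3.5 CON Score": ["n/a", None]})
print(column_mean(empty, "3.5 CON Score"))
# {'mean': None, 'valid_count': 0}
```

Returning valid_count alongside the mean lets the interpreting LLM see how many assessments actually support a number, and None serializes cleanly to JSON null.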
For textual analysis, the engine collects non-empty assessor statements instead of calculating values.
elif analysis_type == "text_summary":
    text_result = {}
    for col in available_target_columns:
        values = [
            clean_text_value(v)
            for v in filtered_df[col].tolist()
        ]
        values = [v for v in values if v is not None]
        text_result[col] = {
            "entries": values,
            "entry_count": len(values)
        }
    result["result"] = text_result
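The helper clean_text_value is not shown in the excerpt. A plausible minimal version, written here as an assumption rather than as the original implementation, trims whitespace and maps empty cells to None so that they can be dropped afterwards.

```python
import math


def clean_text_value(value):
    """Hypothetical helper: return trimmed text, or None for empty cells."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None  # pandas represents empty Excel cells as NaN
    text = str(value).strip()
    return text if text else None


# Invented cells: one real statement and four kinds of "empty".
cells = ["Root causes are not tracked. ", float("nan"), "", None, "  "]
entries = [t for t in (clean_text_value(c) for c in cells) if t is not None]
print(entries)  # ['Root causes are not tracked.']
```

Only genuine assessor statements reach the JSON output, so entry_count reflects real findings and not blank cells.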
Finally, the result is returned as JSON. This is important because the output is not yet the final user-facing answer. It is the reliable analytical foundation for the next LLM step: interpretation by the parent agent.
print(json.dumps(result, indent=2, ensure_ascii=False))
This design deliberately keeps the P_Analysis_Engine “boring”. It does not reason, it does not explain, and it does not improve the analysis. It only executes. And that is exactly the point. The more deterministic this layer is, the more trust can be placed in the later LLM-generated interpretation.
5 End-to-End Example
To illustrate the complete workflow, let us follow a practical example through the full pipeline.
Triggered by the user interaction, the parent agent might pass the following Parent_Instruction to the analytics module:
“Summarize the main improvement potentials for chapter 1.4 Failure Prevention System in plant AbcP.”
The request looks simple to a human reader, but it already contains several semantic tasks:
- identify the requested assessment chapters,
- detect the requested content type,
- apply a row filter,
- retrieve the correct text columns,
- aggregate textual findings,
- and finally generate a meaningful interpretation (→ parent agent).
This is exactly the type of task where a pure LLM-based analysis becomes unreliable. The system therefore separates the workflow into deterministic execution steps and probabilistic interpretation steps.
5.1 Translation by the Analysis Planner
The first step is performed by P_Analysis_Planner.
It translates the natural-language request into a deterministic Selection_Rule.
{
  "status": "success",
  "parent_instruction_summary": "Summarize improvement potentials for chapter 1.4 Failure Prevention System in plant AbcP.",
  "selection_rule": {
    "analysis_type": "text_summary",
    "target_selection_rules": {
      "data_category": ["potential"],
      "aggregation_allowed": ["summary"],
      "concept_execution": null,
      "chapter_id": ["1.4"]
    },
    "row_filters": {
      "Plant": ["AbcP"]
    }
  },
  "warnings": []
}
The Selection_Rule already contains the complete deterministic analysis specification:
- analysis_type = "text_summary" means that textual assessor findings must be collected instead of numeric calculations.
- data_category = ["potential"] restricts the analysis to improvement potentials.
- chapter_id = ["1.4"] limits the analysis to the Failure Prevention System chapter.
- row_filters = {"Plant": ["AbcP"]} restricts the dataset to the requested plant.
At this stage, no data analysis has happened yet. The result is only an execution instruction for the next step.
5.2 Execution by the Analysis Engine
This Selection_Rule is handed over to P_Analysis_Engine for execution. First, the engine selects all matching target columns from the Mapping_File.
target_mapping = target_mapping[
    target_mapping[attr]
    .apply(normalize_for_matching)
    .isin(values)
]
This translates the abstract selection criteria into actual dataset columns, for example:
selected_target_columns = [
    "1.4 CON L2 Improvement potentials",
    "1.4 CON L3 Improvement potentials",
    "1.4 EXE L2 Improvement potentials",
    "1.4 EXE L3 Improvement potentials"
]
Next, the row filters are applied:
filtered_df = filtered_df[
    filtered_df[filter_col]
    .apply(normalize_for_matching)
    .isin(values)
]
In this example, the dataset is reduced to assessment rows belonging to plant AbcP.
Finally, the engine collects all non-empty text entries from the selected columns.
values = [
    clean_text_value(v)
    for v in filtered_df[col].tolist()
]
values = [v for v in values if v is not None]
As we can see, the engine does not interpret the findings. It only retrieves and structures them according to the Python script.
The engine’s output is a collection of assessors’ written statements about the value stream’s improvement potentials as a JSON object.
{
  "entry_count": 6,
  "entries": [
    "Root causes are not systematically tracked.",
    "Escalation rules for recurring failures are unclear.",
    "Lessons learned are not transferred between shifts.",
    "Preventive maintenance findings are not integrated into CIP activities.",
    "Failure trends are visualized inconsistently.",
    "Problem-solving activities focus mainly on symptoms instead of root causes."
  ]
}
At this point, the system has still not generated any recommendations. It has only produced a reliable collection of relevant assessment findings. This JSON object is returned to the parent agent for interpretation and generation of the final response to the user.
5.3 Interpretation by the Parent Agent
In the final step, the parent agent collects all responses (potentially including additional responses from the sub-agents) and generates the final output.
The collected findings indicate that the Failure Prevention System is
currently more reactive than preventive. Most gaps are related to missing
systematic root-cause management and weak organizational learning across
shifts and teams. The highest-leverage improvements would likely come from
strengthening escalation routines, integrating preventive maintenance findings
into CIP activities, and establishing consistent cross-shift learning
mechanisms.
To summarize the central architectural idea of the system:
The LLM no longer creates the analytical foundation itself. Instead, it interprets a deterministic set of already validated findings.
The probabilistic reasoning capability of the LLM is used where it creates value: interpretation, prioritization, explanation, and communication, not data processing itself.
6 Why AI Architecture Matters
Large Language Models are naturally strong at interpretation, reasoning, and language generation, but still weak at reliable numerical analytics. Their optimization objective is plausibility, not deterministic reproducibility. Even with extensions such as “Code Interpreter”, this weakness remains visible in more complex analytical scenarios.
The good news is that this limitation can largely be compensated through intelligent system architecture. The key is a clear separation of responsibilities: deterministic data-processing layers execute the analytical foundation, while LLMs focus on interpretation, prioritization, explanation, and communication.
In the presented approach, the most important design decision was therefore not adding more AI to the system. It was defining very carefully where probabilistic reasoning should end and deterministic execution should begin.
Reliable agentic systems will likely require exactly these kinds of hybrid architectures: combining the robustness of classical data science pipelines with the inference capabilities of Large Language Models.

