In the first post of this series (Agentic AI 101: Starting Your Journey Building AI Agents), we talked about the fundamentals of creating AI Agents and introduced concepts like reasoning, memory, and tools.
Of course, that first post only scratched the surface of this new area of the data industry. There is much more that can be done, and we will learn more along the way in this series.
So, it's time to take one step further.
In this post, we will cover three topics:
- Guardrails: these are safety blocks that prevent a Large Language Model (LLM) from responding about certain topics.
- Agent Evaluation: Have you ever thought about how accurate an LLM's responses are? I bet you have. So we will look at the main ways to measure that.
- Monitoring: We will also learn about the built-in monitoring app in Agno's framework.
We will start now.
Guardrails
Our first topic is the simplest one, in my opinion. Guardrails are rules that keep an AI agent from responding about a given topic or list of topics.
I believe there is a good chance that you have asked something to ChatGPT or Gemini and received a response like "I can't talk about this topic" or "Please consult a professional specialist", something like that. Usually, that occurs with sensitive topics like health advice, psychological conditions, or financial advice.
These blocks are safeguards to prevent people from hurting themselves, harming their health, or their wallets. As we know, LLMs are trained on vast amounts of text, ergo inheriting a lot of bad content with it, which can easily lead to bad advice in these areas. And I didn't even mention hallucinations!
Think about how many stories there are of people who lost money by following investment tips from online forums. Or how many people took the wrong medicine because they read about it on the internet.
Well, I guess you get the point. We must prevent our agents from talking about certain topics or taking certain actions. For that, we will use guardrails.
The best framework I found to impose these blocks is Guardrails AI [1]. There, you will see a hub full of predefined rules that a response must follow in order to pass and be displayed to the user.
To get started quickly, first go to this link [2] and get an API key. Then, install the package and run the guardrails configure command. It will ask you a couple of questions that you can answer n (for No), and it will ask you to enter the API key you generated.
pip install guardrails-ai
guardrails configure
Once that's done, go to the Guardrails AI Hub [3] and choose a guardrail that you need. Each one has instructions on how to implement it. Basically, you install it via the command line and then use it like a module in Python.
For this example, we're choosing one called Restrict to Topic [4], which, as its name says, lets the user talk only about what's in the list. So, go back to the terminal and install it using the command below.
guardrails hub install hub://tryolabs/restricttotopic
Next, let's open our Python script and import some modules.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os

# Import Guard and Validator
from guardrails import Guard
from guardrails.hub import RestrictToTopic
Next, we create the guard. We will restrict our agent to talking only about sports or the weather, and we will block the topic of stocks.
# Setup Guard
guard = Guard().use(
    RestrictToTopic(
        valid_topics=["sports", "weather"],
        invalid_topics=["stocks"],
        disable_classifier=True,
        disable_llm=False,
        on_fail="filter"
    )
)
Now we will run the agent and the guard.
# Create agent
agent = Agent(
    model=Gemini(id="gemini-1.5-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description="An assistant agent",
    instructions=["Be succinct. Reply in maximum two sentences"],
    markdown=True
)

# Run the agent
response = agent.run("What is the ticker symbol for Apple?").content

# Validate the response with the guard
validation_step = guard.validate(response)

# Print the response only if validation passes
if validation_step.validation_passed:
    print(response)
else:
    print("Validation Failed", validation_step.validation_summaries[0].failure_reason)
This is the response when we ask about a stock symbol.
Validation Failed Invalid topics found: ['stocks']
If I ask about a topic that is not on the valid_topics list, I will also see a block.
"What is the primary soda drink?"
Validation Failed No valid topic was found.
Finally, let's ask about sports.
"Who's Michael Jordan?"
Michael Jordan is a former professional basketball player widely considered one of
the greatest of all time. He won six NBA championships with the Chicago Bulls.
And we saw a response this time, as it is a valid topic.
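Before moving on, it is handy to fold this run-then-validate flow into a single helper so every prompt goes through the guard. Below is a minimal sketch under the setup above; the guarded_run function name is my own, and it relies only on the agent.run, guard.validate, and validation-result attributes we just used.
# Minimal sketch: wrap "run the agent, then validate" in one call.
# guarded_run is a hypothetical helper, not part of Agno or Guardrails AI.
def guarded_run(agent, guard, prompt):
    response = agent.run(prompt).content   # raw agent answer
    validation = guard.validate(response)  # apply the RestrictToTopic guard
    if validation.validation_passed:
        return response
    # Surface the failure reason instead of the blocked content
    return f"Validation Failed: {validation.validation_summaries[0].failure_reason}"

print(guarded_run(agent, guard, "Who is Michael Jordan?"))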
Let's move on to the evaluation of agents now.
Agent Evaluation
Since I started studying LLMs and Agentic AI, one of my main questions has been about model evaluation. Unlike traditional Data Science modeling, where you have structured metrics that are adequate for each case, for AI Agents this is more blurry.
Fortunately, the developer community is pretty quick to find solutions for almost everything, and so they created this nice package for LLM evaluation: deepeval.
DeepEval [5] is a library created by Confident AI that gathers many methods to evaluate LLMs and AI Agents. In this section, let's learn a couple of the main methods, just so we can build some intuition on the subject, and also because the library is quite extensive.
The first evaluation is the most basic one we can use, and it's called G-Eval. As AI tools like ChatGPT become more common in everyday tasks, we have to make sure they're giving helpful and accurate responses. That's where G-Eval from the DeepEval Python package comes in.
G-Eval is like a smart reviewer that uses another AI model to evaluate how well a chatbot or AI assistant is performing. For example, my agent runs Gemini, and I'm using OpenAI to assess it. This method takes a more advanced approach than a human review by asking an AI to "grade" another AI's answers based on things like relevance, correctness, and clarity.
It's a nice way to test and improve generative AI systems in a more scalable way. Let's quickly code an example. We will import the modules, create a prompt and a simple chat agent, and ask it for a description of the weather in NYC for the month of May.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
import os

# Evaluation Modules
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

# Prompt
prompt = "Describe the weather in NYC for May"

# Create agent
agent = Agent(
    model=Gemini(id="gemini-1.5-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description="An assistant agent",
    instructions=["Be succinct"],
    markdown=True,
    monitoring=True
)

# Run agent
response = agent.run(prompt)

# Print response
print(response.content)
It responds: "Mild, with average highs in the 60s°F and lows in the 50s°F. Expect some rain."
Nice. Seems pretty good to me.
But how can we put a number on it and show a potential manager or client how our agent is doing?
Here is how:
- Create a test case, passing the prompt and the response to the LLMTestCase class.
- Create a metric. We will use the GEval method and add a prompt for the model to test the output for coherence, and then I give it the meaning of what coherence is to me.
- Pass the output as evaluation_params.
- Run the measure method and get the score and reason from it.
# Test Case
test_case = LLMTestCase(input=prompt, actual_output=response.content)

# Setup the Metric
coherence_metric = GEval(
    name="Coherence",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)

# Run the metric
coherence_metric.measure(test_case)
print(coherence_metric.score)
print(coherence_metric.reason)
The output looks like this.
0.9
The response directly addresses the prompt about NYC weather in May,
maintains logical consistency, flows naturally, and uses clear language.
However, it could be slightly more detailed.
0.9 seems pretty good, given that the default threshold is 0.5.
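If you prefer a hard pass/fail over eyeballing the score, you can set the threshold yourself when building the metric. Here is a small sketch, assuming the same test_case from above; threshold and is_successful() belong to DeepEval's metric interface as far as I know, so double-check the docs [5].
# Same metric, but with a stricter, explicit threshold
strict_coherence = GEval(
    name="Coherence",
    criteria="Coherence. The agent can answer the prompt and the response makes sense.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8  # scores below 0.8 now fail
)
strict_coherence.measure(test_case)
print(strict_coherence.is_successful())  # True here, since the score was 0.9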
If you want to check the logs, use this next snippet.
# Check the logs
print(coherence_metric.verbose_logs)
Here's the response.
Criteria:
Coherence. The agent can answer the prompt and the response makes sense.
Evaluation Steps:
[
"Assess whether the response directly addresses the prompt; if it aligns,
it scores higher on coherence.",
"Evaluate the logical flow of the response; responses that present ideas
in a clear, organized manner rank better in coherence.",
"Consider the relevance of examples or evidence provided; responses that
include pertinent information enhance their coherence.",
"Check for clarity and consistency in terminology; responses that maintain
clear language without contradictions achieve a higher coherence rating."
]
Very nice. Now let us learn about another interesting use case: the evaluation of task completion for AI Agents. Elaborating a little more, this measures how our agent does when it is asked to perform a task, and how much of it the agent can deliver.
First, we're creating a simple agent that can access Wikipedia and summarize the topic of the query.
# Imports
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.wikipedia import WikipediaTools
import os

from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric
from deepeval import evaluate

# Prompt
prompt = "Search wikipedia for 'Time series analysis' and summarize the 3 main points"

# Create agent
agent = Agent(
    model=Gemini(id="gemini-2.0-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description="You are a researcher specialized in searching the wikipedia.",
    tools=[WikipediaTools()],
    show_tool_calls=True,
    markdown=True,
    read_tool_call_history=True
)

# Run agent
response = agent.run(prompt)

# Print response
print(response.content)
The result looks very good. Let's evaluate it using the TaskCompletionMetric class.
# Create a Metric
metric = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

# Test Case
test_case = LLMTestCase(
    input=prompt,
    actual_output=response.content,
    tools_called=[ToolCall(name="wikipedia")]
)

# Evaluate
evaluate(test_cases=[test_case], metrics=[metric])
Output, including the agent's response.
======================================================================
Metrics Summary
- ✅ Task Completion (score: 1.0, threshold: 0.7, strict: False,
evaluation model: gpt-4o-mini,
reason: The system successfully searched for 'Time series analysis'
on Wikipedia and provided a clear summary of the 3 main points,
fully aligning with the user's goal., error: None)
For test case:
- input: Search wikipedia for 'Time series analysis' and summarize the 3 main points
- actual output: Here are the 3 main points about Time series analysis based on the
Wikipedia search:
1. **Definition:** A time series is a sequence of data points indexed in time order,
often taken at successive, equally spaced points in time.
2. **Applications:** Time series analysis is used in various fields like statistics,
signal processing, econometrics, weather forecasting, and more, wherever temporal
measurements are involved.
3. **Purpose:** Time series analysis involves methods for extracting meaningful
statistics and characteristics from time series data, and time series forecasting
uses models to predict future values based on past observations.
- expected output: None
- context: None
- retrieval context: None
======================================================================
Overall Metric Pass Rates
Task Completion: 100.00% pass rate
======================================================================
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
Our agent passed the test with honors: 100%!
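If you want this kind of check running in CI rather than a notebook, DeepEval also integrates with pytest through its assert_test helper. Below is a short sketch that reuses the prompt, agent response, and metric defined above (assume they are available in the test module); you would run it with deepeval test run test_agent.py.
# test_agent.py: a sketch of the pytest-style flow with DeepEval
from deepeval import assert_test
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

def test_wikipedia_task_completion():
    # prompt and response come from the agent run shown above
    test_case = LLMTestCase(
        input=prompt,
        actual_output=response.content,
        tools_called=[ToolCall(name="wikipedia")]
    )
    metric = TaskCompletionMetric(threshold=0.7, model="gpt-4o-mini")
    assert_test(test_case, [metric])  # fails the test if the score is below 0.7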
You can learn much more about the DeepEval library at this link [8].
Finally, in the next section, we will learn about the capabilities of Agno's framework for monitoring agents.
Agent Monitoring
Like I told you in my previous post [9], I chose Agno to learn more about Agentic AI. Just to be clear, this is not a sponsored post. It's just that I think this is the best option for those starting their journey learning about this topic.
So, one of the cool things we can take advantage of in Agno's framework is the app they make available for model monitoring.
Take this agent that can search the internet and write Instagram posts, for example.
# Imports
import os
from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.file import FileTools
from agno.tools.googlesearch import GoogleSearchTools

# Topic
topic = "Healthy Eating"

# Create agent
agent = Agent(
    model=Gemini(id="gemini-1.5-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    description=f"""You are a social media marketer specialized in creating engaging content.
    Search the internet for 'trending topics about {topic}' and use them to create a post.""",
    tools=[FileTools(save_files=True),
           GoogleSearchTools()],
    expected_output="""A short post for instagram and a prompt for a picture related to the content of the post.
    Don't use emojis or special characters in the post. If you find an error in the character encoding, remove the character before saving the file.
    Use the template:
    - Post
    - Prompt for the picture
    Save the post to a file named 'post.txt'.""",
    show_tool_calls=True,
    monitoring=True)

# Writing and saving a file
agent.print_response(f"""Write a short post for instagram with tips and tricks that positions me as
an authority in {topic}.""",
                     markdown=True)
To monitor its performance, follow these steps:
- Go to https://app.agno.com/settings and get an API key.
- Open a terminal and type ag setup.
- If it's the first time, it might ask for the API key. Copy and paste it into the terminal prompt.
- You will see the Dashboard tab open in your browser.
- If you want to monitor your agent, add the argument monitoring=True (see the sketch after this list).
- Run your agent.
- Go back to the Dashboard in the web browser.
- Click on Sessions. As it is a single agent, you will see it under the Agents tab at the top of the page.
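As a quick recap, the monitoring flag is the per-agent switch we already used in the examples above. Here is a minimal sketch; the AGNO_MONITOR environment variable is my recollection of Agno's docs and may have changed, so treat it as an assumption and verify it there.
# Two ways to opt in to Agno's monitoring
import os
from agno.agent import Agent
from agno.models.google import Gemini

# Option 1 (assumption): opt in globally via environment variable
os.environ["AGNO_MONITOR"] = "true"

# Option 2: opt in per agent, as in the examples above
agent = Agent(
    model=Gemini(id="gemini-1.5-flash",
                 api_key=os.environ.get("GEMINI_API_KEY")),
    monitoring=True  # sends session data to app.agno.com
)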

The cool features we can see there are:
- Info about the model
- The response
- Tools used
- Tokens consumed

Pretty neat, huh?
This is useful for us to know where the agent is spending more or fewer tokens, and where it is taking more time to perform a task, for example.
Well, let's wrap up then.
Before You Go
We have learned a lot in this second round. In this post, we covered:
- Guardrails for AI are essential safety measures and ethical guidelines implemented to prevent unintended harmful outputs and ensure responsible AI behavior.
- Model evaluation, exemplified by GEval for broad assessment and TaskCompletion with DeepEval for agent output quality, is crucial for understanding AI capabilities and limitations.
- Model monitoring with Agno's app, including tracking token usage and response time, is essential for managing costs, ensuring performance, and identifying potential issues in deployed AI systems.
Contact & Follow Me
If you liked this content, find more of my work on my website.
GitHub Repository
https://github.com/gurezende/agno-ai-labs
References
[1. Guardrails AI] https://www.guardrailsai.com/docs/getting_started/guardrails_server
[2. Guardrails AI Auth Key] https://hub.guardrailsai.com/keys
[3. Guardrails AI Hub] https://hub.guardrailsai.com/
[4. Guardrails Restrict to Topic] https://hub.guardrailsai.com/validator/tryolabs/restricttotopic
[5. DeepEval] https://www.deepeval.com/docs/getting-started
[6. DataCamp – DeepEval Tutorial] https://www.datacamp.com/tutorial/deepeval
[7. DeepEval TaskCompletion] https://www.deepeval.com/docs/metrics-task-completion
[8. LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide] https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
[9. Agentic AI 101: Starting Your Journey Building AI Agents] https://towardsdatascience.com/agentic-ai-101-starting-your-journey-building-ai-agents/