At Amazon.ae, we serve roughly 10 million customers monthly across five countries in the Middle East and North Africa region—United Arab Emirates (UAE), Saudi Arabia, Egypt, Türkiye, and South Africa. Our AMET (Africa, Middle East, and Türkiye) Payments team manages payment selections, transactions, experiences, and affordability features across these diverse countries, launching on average five new features monthly. Each feature requires comprehensive test case generation, which traditionally consumed 1 week of manual effort per project. Our quality assurance (QA) engineers spent this time analyzing business requirement documents (BRDs), design documents, UI mocks, and historical test plans—a process that required one full-time engineer annually just for test case creation.
To improve this manual process, we developed SAARAM (QA Lifecycle App), a multi-agent AI solution that helps reduce test case generation from 1 week to hours. Using Amazon Bedrock with Anthropic's Claude Sonnet and the Strands Agents SDK, we reduced the time needed to generate test cases while also improving test coverage quality. Our solution demonstrates how studying human cognitive patterns, rather than optimizing AI algorithms alone, can create production-ready systems that augment rather than replace human expertise.
In this post, we explain how we overcame the limitations of single-agent AI systems through a human-centric approach, implemented structured outputs to significantly reduce hallucinations, and built a scalable solution now positioned for expansion across the AMET QA team and later across other QA teams in the International Emerging Stores and Payments (IESP) organization.
Solution overview
The AMET Payments QA team validates code deployments affecting payment functionality for millions of customers across diverse regulatory environments and payment methods. Our manual test case generation process added turnaround time (TAT) to the product cycle, consuming valuable engineering resources on repetitive test preparation and documentation tasks rather than strategic testing initiatives. We needed an automated solution that could maintain our quality standards while reducing the time investment.
Our objectives included reducing test case creation time from 1 week to under a few hours, capturing institutional knowledge from experienced testers, standardizing testing approaches across teams, and minimizing the hallucination issues common in AI systems. The solution needed to handle complex business requirements spanning multiple payment methods, regional regulations, and customer segments while producing specific, actionable test cases aligned with our existing test management systems.
The architecture employs a sophisticated multi-agent workflow. To achieve this, we went through three different iterations, and we continue to improve and enhance the system as new techniques are developed and new models are deployed.
The challenge with traditional AI approaches
Our initial attempts followed conventional AI approaches, feeding entire BRDs to a single AI agent for test case generation. This method frequently produced generic outputs like "verify payment works correctly" instead of the specific, actionable test cases our QA team requires. For example, we need test cases as specific as "verify that when a UAE customer selects cash on delivery (COD) for an order above 1,000 AED with a saved credit card, the system displays the COD fee of 11 AED and processes the payment through the COD gateway with the order state transitioning to 'pending delivery.'"
The single-agent approach presented several critical limitations. Context length restrictions prevented processing large documents effectively, and the lack of specialized processing stages meant the AI couldn't understand testing priorities or risk-based approaches. Additionally, hallucination issues created irrelevant test scenarios that could mislead QA efforts. The root cause was clear: the AI attempted to compress complex business logic without the iterative thinking process that experienced testers employ when analyzing requirements.
The following flow chart illustrates the issues we encountered when attempting to use a single agent with a comprehensive prompt.
The human-centric breakthrough
Our breakthrough came from a fundamental shift in approach. Instead of asking, "How should AI think about testing?", we asked, "How do experienced humans think about testing?" so we could focus on following a specific step-by-step process instead of relying on the large language model (LLM) to figure this out on its own. This philosophy change led us to conduct research interviews with senior QA professionals, studying their cognitive workflows in detail.
We discovered that experienced testers don't process documents holistically—they work through specialized mental stages. First, they analyze documents by extracting acceptance criteria, identifying customer journeys, understanding UX requirements, mapping product requirements, analyzing user data, and assessing workstream capabilities. Then they develop tests through a systematic process: journey analysis, scenario identification, data flow mapping, test case development, and finally, organization and prioritization.
We then decomposed our original agent into sequential thinking activities that served as individual steps. We built and tested each step using Amazon Q Developer CLI to verify that the basic ideas were sound and incorporated both primary and secondary inputs.
This insight led us to design SAARAM with specialized agents that mirror these expert testing approaches. Each agent focuses on a specific aspect of the testing process, similar to how human experts mentally compartmentalize different analysis stages.
Multi-agent architecture with Strands Agents
Based on our understanding of human QA workflows, we initially tried to build our own agents from scratch. We had to create our own looping, serial, or parallel execution. We also created our own orchestration and workflow graphs, which demanded considerable manual effort. To address these challenges, we migrated to the Strands Agents SDK. This provided the multi-agent orchestration capabilities essential for coordinating complex, interdependent tasks while maintaining clear execution paths, helping improve our performance and reduce our development time.
Workflow iteration 1: End-to-end test generation
Our first iteration of SAARAM took a single input and introduced our first specialized agents. It processed a Word document through five specialized agents to generate comprehensive test coverage.
Agent 1 is called the Customer Segment Creator, and it focuses on customer segmentation analysis, using four subagents:
- Customer Segment Discovery identifies product user segments
- Decision Matrix Generator creates parameter-based matrices
- E2E Scenario Creation develops end-to-end (E2E) scenarios per segment
- Test Steps Generation handles detailed test case development
Agent 2 is called the User Journey Mapper, and it employs four subagents to map product journeys comprehensively:
- The Flow Diagram and Sequence Diagram creators use Mermaid syntax.
- The E2E Scenarios generator builds on these diagrams.
- The Test Steps Generator produces detailed test documentation.
Agent 3 is called Customer Segment x Journey Coverage, and it combines inputs from agents 1 and 2 to create detailed segment-specific analyses using four subagents.
Agent 4 is called the State Transition Agent. It analyzes the various product state points in customer journey flows. Its subagents create Mermaid state diagrams representing different journey states and segment-specific state scenario diagrams, and generate related test scenarios and steps.
The workflow, shown in the following diagram, concludes with a basic extract, transform, and load (ETL) process that consolidates and deduplicates the data from the agents, saving the final output as a text file.
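The consolidation step above can be sketched in a few lines. This is a minimal illustration, not SAARAM's actual code, and the field names are hypothetical: merge the test cases emitted by each agent and drop duplicates keyed on a normalized scenario string.

```python
def consolidate(agent_outputs):
    """Merge per-agent test case lists, dropping duplicates by normalized scenario text."""
    seen, merged = set(), []
    for cases in agent_outputs:
        for case in cases:
            # Normalize whitespace and case so near-identical scenarios collapse
            key = " ".join(case["scenario"].lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(case)
    return merged

suite = consolidate([
    [{"scenario": "COD fee shown for orders above 1,000 AED"}],
    [{"scenario": "COD  fee shown for orders above 1,000 AED"},  # duplicate, extra space
     {"scenario": "Refund processed to original card"}],
])
```

Here `suite` keeps one copy of the COD scenario plus the refund scenario; a production version would key on richer fields than the scenario text alone.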
This systematic approach provides comprehensive coverage of customer journeys, segments, and multiple diagram types, enabling thorough test coverage generation through iterative processing by agents and subagents.
Addressing limitations and enhancing capabilities
In our journey to develop a more robust and efficient tool using Strands Agents, we identified five critical limitations in our initial approach:
- Context and hallucination challenges – Our first workflow faced limitations from segregated agent operations in which individual agents independently collected data and created visual representations. This isolation led to limited contextual understanding, resulting in reduced accuracy and increased hallucinations in the outputs.
- Data generation inefficiencies – The limited context available to agents caused another critical issue: the generation of excessive irrelevant data. Without proper contextual awareness, agents produced less focused outputs, leading to noise that obscured valuable insights.
- Limited parsing capabilities – The initial system's data parsing scope proved too narrow, limited to only customer segments, journey mapping, and basic requirements. This restriction prevented agents from accessing the full spectrum of information needed for comprehensive analysis.
- Single-source input constraint – The workflow could only process Word documents, creating a significant bottleneck. Modern development environments require data from multiple sources, and this limitation prevented holistic data collection.
- Rigid architecture concerns – Importantly, the first workflow employed a tightly coupled system with rigid orchestration. This architecture made it difficult to modify, extend, or reuse components, limiting the system's adaptability to changing requirements.
In our second iteration, we implemented strategic solutions to address these issues.
Workflow iteration 2: Comprehensive analysis workflow
Our second iteration represents a complete reimagining of the agentic workflow architecture. Rather than patching individual problems, we rebuilt from the ground up with modularity, context-awareness, and extensibility as core principles:
Agent 1 is the intelligent gateway. The file type decision agent serves as the system's entry point and router. Processing documentation files, Figma designs, and code repositories, it categorizes and directs data to the appropriate downstream agents. This intelligent routing is critical for maintaining both efficiency and accuracy throughout the workflow.
Agent 2 handles specialized data extraction. The Data Extractor agent employs six specialized subagents, each focused on a specific extraction domain. This parallel processing approach provides thorough coverage while maintaining practical speed. Each subagent operates with domain-specific knowledge, extracting nuanced information that generalized approaches might overlook.
Agent 3 is the Visualizer agent, and it transforms extracted data into six distinct Mermaid diagram types, each serving a specific analytical purpose. Entity relationship diagrams map data relationships and structures, and flow diagrams visualize processes and workflows. Requirement diagrams clarify product specifications, and UX requirement visualizations illustrate user experience flows. Process flow diagrams detail system operations, and mind maps reveal feature relationships and hierarchies. These visualizations provide multiple perspectives on the same information, helping both human reviewers and downstream agents understand patterns and connections within complex datasets.
Agent 4 is the Data Condenser agent, and it performs crucial synthesis through intelligent context distillation, making sure each downstream agent receives exactly the information needed for its specialized task. This agent, powered by its condensed knowledge generator, merges outputs from both the Data Extractor and Visualizer agents while performing sophisticated analysis.
The agent extracts critical elements from the full text context—acceptance criteria, business rules, customer segments, and edge cases—creating structured summaries that preserve essential details while reducing token usage. It compares each text file with its corresponding Mermaid diagram, capturing information that might be missed in visual representations alone. This careful processing maintains information integrity across agent handoffs, making sure important data is not lost as it flows through the system. The result is a set of condensed addendums that enrich the Mermaid diagrams with comprehensive context. This synthesis ensures that when information moves to test generation, it arrives complete, structured, and optimized for processing.
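The distillation step described above can be sketched as follows. This is an illustrative simplification (the key names are hypothetical, not SAARAM's actual schema): keep only the critical elements the downstream test generator needs and drop the bulky source text before the handoff.

```python
# Fields the downstream agents actually consume (illustrative names)
CRITICAL_KEYS = ("acceptance_criteria", "business_rules", "customer_segments", "edge_cases")

def condense(full_context: dict) -> dict:
    """Produce a structured addendum that preserves essential detail while cutting token usage."""
    return {k: full_context[k] for k in CRITICAL_KEYS if k in full_context}

full = {
    "acceptance_criteria": ["COD fee of 11 AED shown above 1,000 AED"],
    "business_rules": ["COD unavailable for digital goods"],
    "customer_segments": ["UAE retail"],
    "edge_cases": ["saved card present while COD selected"],
    "raw_brd_text": "..." * 10_000,  # bulky source text, dropped from the handoff
}
addendum = condense(full)
```

In the real system the condensation is itself performed by an LLM-backed agent rather than a fixed key filter, but the contract is the same: a small structured summary instead of the full document.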
Agent 5 is the Test Generator agent, which brings together the collected, visualized, and condensed information to produce comprehensive test suites. Working with six Mermaid diagrams plus condensed knowledge from Agent 4, this agent employs a pipeline of five subagents. The Journey Analysis Mapper, Scenario Identification Agent, and Data Flow Mapping subagents generate comprehensive test cases based on their respective views of the input data flowing from Agent 4. With the test cases generated across three critical perspectives, the Test Cases Generator evaluates them, reformatting according to internal guidelines for consistency. Finally, the Test Suite Organizer performs deduplication and optimization, delivering a final test suite that balances comprehensiveness with efficiency.
The system now handles far more than the basic requirements and journey mapping of Workflow 1—it processes product requirements, UX specifications, acceptance criteria, and workstream extraction while accepting inputs from Figma designs, code repositories, and multiple document types. Most importantly, the shift to a modular architecture fundamentally changed how the system operates and evolves. Unlike our rigid first workflow, this design allows for reusing outputs from earlier agents, integrating new testing type agents, and intelligently selecting test case generators based on user requirements, positioning the system for continuous adaptation.
The following figure shows our second iteration of SAARAM with five main agents and multiple subagents with context engineering and compression.
Additional Strands Agents features
Strands Agents provided the foundation for our multi-agent system, offering a model-driven approach that simplified complex agent development. Because the SDK can connect models with tools through advanced reasoning capabilities, we built sophisticated workflows with just a few lines of code. Beyond its core functionality, two key features proved essential for our production deployment: reducing hallucinations with structured outputs and workflow orchestration.
Reducing hallucinations with structured outputs
The structured output feature of Strands Agents uses Pydantic models to transform traditionally unpredictable LLM outputs into reliable, type-safe responses. This approach addresses a fundamental challenge in generative AI: although LLMs excel at producing humanlike text, they can struggle with the consistently formatted outputs needed for production systems. By enforcing schemas through Pydantic validation, we make sure responses conform to predefined structures, enabling seamless integration with existing test management systems.
The following sample implementation demonstrates how structured outputs work in practice:
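This is a minimal sketch of the validation pattern; the schema and field names are illustrative, not SAARAM's actual models. A Pydantic schema defines the required structure, well-formed responses parse into typed objects, and malformed ones fail fast with a clear error.

```python
from typing import List
from pydantic import BaseModel, ValidationError

class TestStep(BaseModel):
    step_number: int
    action: str
    expected_result: str

class TestCase(BaseModel):
    test_id: str
    scenario: str
    steps: List[TestStep]

# A well-formed LLM response parses into a typed, validated object
raw = {
    "test_id": "TC006",
    "scenario": "Credit card payment success",
    "steps": [
        {"step_number": 1, "action": "Add items to cart and proceed to checkout",
         "expected_result": "Checkout form displayed"},
    ],
}
case = TestCase(**raw)

# A malformed response (steps missing) is rejected before it can propagate
try:
    TestCase(test_id="TC999", scenario="No steps provided")
    valid = True
except ValidationError:
    valid = False
```

In Strands Agents, the same Pydantic model is passed to the agent's structured output mechanism so the SDK performs this validation on the model's response automatically.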
Pydantic automatically validates LLM responses against the defined schemas to verify type correctness and the presence of required fields. When responses don't match the expected structure, validation errors provide clear feedback about what needs correction, helping prevent malformed data from propagating through the system. In our environment, this approach delivered consistent, predictable outputs across the agents regardless of prompt variations or model updates, eliminating an entire class of data formatting errors. As a result, our development team worked more efficiently with full IDE support.
Workflow orchestration benefits
The Strands Agents workflow architecture provided the sophisticated coordination capabilities our multi-agent system required. The framework enabled structured coordination with explicit task definitions, automatic parallel execution for independent tasks, and sequential processing for dependent operations. This meant we could build complex agent-to-agent communication patterns that would have been difficult to implement manually.
The following sample snippet shows how to create a workflow in the Strands Agents SDK:
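As a stand-in for the SDK-specific code, the following plain-Python sketch illustrates the pattern the workflow system implements: explicit task definitions with declared dependencies, parallel execution of independent tasks, and sequential processing of dependent ones. The task names mirror the test-generation stages described in this post; the outputs are placeholder strings.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical task graph: the first three tasks are independent and run in
# parallel; the later ones declare dependencies on earlier results.
tasks = {
    "journey_analysis":        {"deps": [], "run": lambda ctx: "journeys"},
    "scenario_identification": {"deps": [], "run": lambda ctx: "scenarios"},
    "data_flow_mapping":       {"deps": [], "run": lambda ctx: "flows"},
    "test_development": {
        "deps": ["scenario_identification"],
        "run": lambda ctx: f"tests from {ctx['scenario_identification']}",
    },
    "organization": {
        "deps": ["test_development"],
        "run": lambda ctx: f"suite({ctx['test_development']})",
    },
}

def run_workflow(tasks):
    """Execute level by level: every task whose dependencies are met runs concurrently."""
    results, pending = {}, dict(tasks)
    with ThreadPoolExecutor() as pool:
        while pending:
            ready = [name for name, t in pending.items()
                     if all(dep in results for dep in t["deps"])]
            futures = {name: pool.submit(pending.pop(name)["run"], results)
                       for name in ready}
            for name, future in futures.items():
                results[name] = future.result()
    return results

results = run_workflow(tasks)
```

The Strands Agents workflow tooling provides this dependency-aware scheduling (plus retries, state persistence, and audit logging) out of the box, with agents rather than lambdas as the task bodies.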
The workflow system delivered three critical capabilities for our use case. First, parallel processing optimization allowed journey analysis, scenario identification, and coverage analysis to run concurrently, with independent agents processing different aspects without blocking one another. The system automatically allocated resources based on availability, maximizing throughput.
Second, intelligent dependency management made sure that test development waited for scenario identification to complete, and organization tasks depended on the test cases being generated. Context was preserved and passed efficiently between dependent stages, maintaining information integrity throughout the workflow.
Finally, the built-in reliability features provided the resilience our system required. Automatic retry mechanisms handled transient failures gracefully, state persistence enabled pause and resume capabilities for long-running workflows, and comprehensive audit logging supported both debugging and performance optimization efforts.
The following table shows an example input to the workflow and the corresponding outputs.
| Input: Business requirement document | Output: Test cases generated |
| --- | --- |
| Functional requirements: | TC006: Credit card payment success. Scenario: Customer completes purchase using a valid credit card. Steps: 1. Add items to cart and proceed to checkout. Expected result: Checkout form displayed. 2. Enter shipping information. Expected result: Shipping details saved. 3. Select credit card payment method. Expected result: Card form shown. 4. Enter valid card details. Expected result: Card validated. 5. Submit payment. Expected result: Payment processed, order confirmed. TC008: Payment failure handling. Scenario: Payment fails due to insufficient funds or card decline. Steps: 1. Enter card with insufficient funds. Expected result: Payment declined message. 2. System offers retry option. Expected result: Payment form redisplayed. 3. Try alternative payment method. Expected result: Alternative payment successful. TC009: Payment gateway timeout. TC010: Refund processing. |
Integration with Amazon Bedrock
Amazon Bedrock served as the foundation for our AI capabilities, providing seamless access to Anthropic's Claude Sonnet through the Strands Agents built-in AWS service integration. We selected Anthropic's Claude Sonnet for its exceptional reasoning capabilities and ability to understand complex payment domain requirements. The flexible LLM API integration in Strands Agents made this implementation straightforward. The following snippet shows how to create an agent in Strands Agents:
The managed service architecture of Amazon Bedrock removed infrastructure complexity from our deployment. The service provided automatic scaling that adjusted to our workload demands, ensuring consistent performance across the agents regardless of traffic patterns. Built-in retry logic and error handling improved system reliability significantly, reducing the operational overhead typically associated with managing AI infrastructure at scale. The combination of the sophisticated orchestration capabilities of Strands Agents and the robust infrastructure of Amazon Bedrock created a production-ready system that could handle complex test generation workflows while maintaining high reliability and performance standards.
The following diagram shows the deployment of the SAARAM agent with Amazon Bedrock AgentCore and Amazon Bedrock.
Results and business impact
The implementation of SAARAM has improved our QA processes with measurable gains across multiple dimensions. Before SAARAM, our QA engineers spent 3–5 days manually analyzing BRD documents and UI mocks to create comprehensive test cases. This manual process is now reduced to hours, with the system achieving:
- Test case generation time: Reduced from 1 week to hours
- Resource optimization: QA effort decreased from 1.0 full-time equivalent (FTE) to 0.2 FTE for validation
- Coverage improvement: 40% more edge cases identified compared to the manual process
- Consistency: 100% adherence to test case standards and formats
The accelerated test case generation has driven improvements in our core business metrics:
- Payment success rate: Increased through comprehensive edge case testing and risk-based test prioritization
- Payment experience: Improved customer satisfaction because teams can now iterate on test coverage during the design phase
- Developer velocity: Product and development teams generate initial test cases during design, enabling early quality feedback
SAARAM captures and preserves institutional knowledge that was previously dependent on individual QA engineers:
- Testing patterns from experienced professionals are now codified
- Historical test case learnings are automatically applied to new features
- Consistent testing approaches across different payment methods and industries
- Reduced onboarding time for new QA team members
This iterative improvement means that the system becomes more valuable over time.
Lessons learned
Our journey developing SAARAM provided crucial insights for building production-ready AI systems. Our breakthrough came from studying how domain experts think rather than optimizing how AI processes information. Understanding the cognitive patterns of testers and QA professionals led to an architecture that naturally aligns with human reasoning. This approach produced better results than purely technical optimizations. Organizations building similar systems should invest time observing and interviewing domain experts before designing their AI architecture—the insights gained translate directly into more effective agent design.
Breaking complex tasks into specialized agents dramatically improved both accuracy and reliability. Our multi-agent architecture, enabled by the orchestration capabilities of Strands Agents, handles nuances that monolithic approaches consistently miss. Each agent's focused responsibility allows deeper domain expertise while providing better error isolation and debugging capabilities.
A key discovery was that the Strands Agents workflow and graph-based orchestration patterns significantly outperformed traditional supervisor agent approaches. Although supervisor agents make dynamic routing decisions that can introduce variability, workflows provide "agents on rails"—a structured path producing consistent, reproducible results. Strands Agents offers multiple patterns, including supervisor-based routing, workflow orchestration for sequential processing with dependencies, and graph-based coordination for complex scenarios. For test generation, where consistency is paramount, the workflow pattern with its explicit task dependencies and parallel execution capabilities delivered the optimal balance of flexibility and control. This structured approach aligns well with production environments, where reliability matters more than theoretical flexibility.
Implementing Pydantic models through the Strands Agents structured output feature effectively reduced type-related hallucinations in our system. By requiring AI responses to conform to strict schemas, we obtain reliable, programmatically usable outputs. This approach has proven essential when consistency and reliability are nonnegotiable. The type-safe responses and automatic validation have become foundational to our system's reliability.
Our condensed knowledge generator pattern demonstrates how intelligent context management maintains quality throughout multistage processing. This approach of deciding what to preserve, condense, and pass between agents helps prevent the context degradation that typically occurs in token-limited environments. The pattern is broadly applicable to multistage AI systems facing similar constraints.
What’s subsequent
The modular structure we’ve constructed with Strands Brokers permits simple adaptation to different domains inside Amazon. The identical patterns that generate cost take a look at instances may be utilized to retail methods testing, customer support situation technology for assist workflows, and cell utility UI and UX take a look at case technology. Every adaptation requires solely domain-specific prompts and schemas whereas reusing the core orchestration logic. All through the event of SAARAM, the group efficiently addressed many challenges in take a look at case technology—from lowering hallucinations by way of structured outputs to implementing subtle multi-agent workflows. Nevertheless, one crucial hole stays: the system hasn’t but been supplied with examples of what high-quality take a look at instances truly appear to be in observe.
To bridge this hole, integrating Amazon Bedrock Information Bases with a curated repository of historic take a look at instances would offer SAARAM with concrete, real-world examples throughout the technology course of. Through the use of the mixing capabilities of Strands Brokers with Amazon Bedrock Information Bases, the system might search by way of previous profitable take a look at instances to seek out related eventualities earlier than producing new ones. When processing a BRD for a brand new cost function, SAARAM would first question the information base for comparable take a look at instances—whether or not for related cost strategies, buyer segments, or transaction flows—and use these as contextual examples to information its output.
Future deployment will use Amazon Bedrock AgentCore for complete agent lifecycle administration. Amazon Bedrock AgentCore Runtime gives the manufacturing execution atmosphere with ephemeral, session-specific state administration that maintains conversational context throughout energetic classes whereas facilitating isolation between completely different consumer interactions. The observability capabilities of Bedrock AgentCore assist ship detailed visualizations of every step in SAARAM’s multi-agent workflow, which the group can use to hint execution paths by way of the 5 brokers, audit intermediate outputs from the Knowledge Condenser and Take a look at Generator brokers, and determine efficiency bottlenecks by way of real-time dashboards powered by Amazon CloudWatch with standardized OpenTelemetry-compatible telemetry.
The service permits a number of superior capabilities important for manufacturing deployment: centralized agent administration and versioning by way of the Amazon Bedrock AgentCore management aircraft, A/B testing of various workflow methods and immediate variations throughout the 5 subagents throughout the Take a look at Generator, efficiency monitoring with metrics monitoring token utilization and latency throughout the parallel execution phases, automated agent updates with out disrupting energetic take a look at technology workflows, and session persistence for sustaining context when QA engineers iteratively refine take a look at suite outputs. This integration positions SAARAM for enterprise-scale deployment whereas offering the operational visibility and reliability controls that rework it from a proof of idea right into a manufacturing system able to dealing with the AMET group’s formidable objective of increasing past Funds QA to serve the broader group.
Conclusion
SAARAM demonstrates how AI can transform traditional QA processes when designed with human expertise at its core. By reducing test case creation from 1 week to hours while improving quality and coverage, we've enabled faster feature deployment and better payment experiences for millions of customers across the MENA region. The key to our success wasn't merely advanced AI technology—it was the combination of human expertise, thoughtful architecture design, and robust engineering practices. Through careful study of how experienced QA professionals think, implementation of multi-agent systems that mirror these cognitive patterns, and minimization of AI limitations through structured outputs and context engineering, we've created a system that augments rather than replaces human expertise.
For teams considering similar initiatives, our experience highlights three critical success factors: invest time in understanding the cognitive processes of domain experts, implement structured outputs to minimize hallucinations, and design multi-agent architectures that mirror human problem-solving approaches. These QA tools aren't meant to replace human testers; they amplify their expertise through intelligent automation. If you're interested in starting your agents journey on AWS, check out our sample Strands Agents implementations repository or our recent launch, Amazon Bedrock AgentCore, and the end-to-end examples with deployment in our Amazon Bedrock AgentCore samples repository.
About the authors
Jayashree is a Quality Assurance Engineer at Amazon Music Tech, where she combines rigorous manual testing expertise with a growing passion for GenAI-powered automation. Her work focuses on maintaining high system quality standards while exploring innovative approaches to make testing more intelligent and efficient. Committed to reducing testing monotony and improving product quality across Amazon's ecosystem, Jayashree is at the forefront of integrating artificial intelligence into quality assurance practices.
Harsha Pradha G is a Senior Quality Assurance Engineer in MENA Payments at Amazon. With a strong foundation in building comprehensive quality systems, she brings a unique perspective to the intersection of QA and AI as an emerging QA-AI integrator. Her work focuses on bridging the gap between traditional testing methodologies and cutting-edge AI innovations, while also serving as an AI content strategist and AI author.
Fahim Surani is a Senior Solutions Architect at AWS, helping customers across Financial Services, Energy, and Telecommunications design and build cloud and generative AI solutions. His focus since 2022 has been driving enterprise cloud adoption, spanning cloud migrations, cost optimization, and event-driven architectures, including leading implementations recognized as early adopters of Amazon's latest AI capabilities. Fahim's work covers a wide range of use cases, with a primary interest in generative AI and agentic architectures. He is a regular speaker at AWS summits and industry events across the region.






