Design patterns for scalable voice agents matter for organizations that must deliver fast, natural, and reliable voice experiences. Many teams face challenges like high latency, managing real-time audio, and coordinating multiple agents in complex workflows.
In this post, you’ll learn how to use Amazon Nova Sonic, Amazon Bedrock AgentCore, and Strands BidiAgent to build scalable, maintainable voice agents that handle these challenges efficiently, resulting in more responsive and intelligent customer interactions.
We’ll explore three popular architectural patterns for voice agents, highlighting their trade-offs and best practices for minimizing latency.
The building blocks
Before diving deeper into the architecture patterns, here’s a quick overview of the three key components used in the sample solution for this post.

Amazon Nova Sonic is a foundation model that creates natural, human-like speech-to-speech conversations for generative AI applications. Users can interact with AI through voice in real time, with capabilities for understanding tone, natural conversational flow, and performing actions.
Amazon Bedrock AgentCore Runtime is a serverless hosting environment for AI agents. You package your agent as a container, deploy it to AgentCore Runtime, and it handles scaling, session isolation, and billing. For voice agents, it provides bidirectional WebSocket streaming with SigV4 auth, microVM-level session isolation to avoid noisy-neighbor latency spikes, AgentCore Gateway for shared tool hosting using the open source Model Context Protocol (MCP), persistent memory across sessions, and telemetry for voice-specific metrics like time-to-first-audio.
Strands Agents is an open source framework for building AI agents. Its BidiAgent class is one integration option between Nova Sonic and your application. It manages the bidirectional stream lifecycle, routes tool calls, and handles session management, simplifying the voice agent application through the model SDK interface.
Three integration patterns: tool, agent-as-tool (sub-agent), and session segmentation
Instead of building one omnipotent agent, modern voice systems are increasingly composed of tool-driven agents, sub-agents acting as tools, and session segmentation strategies that isolate prompts, memory, and permissions. These patterns allow teams to decompose large assistants into smaller, specialized, and reusable components while maintaining clear security boundaries.
Before running the samples in the following sections, install Python and the required dependencies, including strands-agents and boto3, and make sure your IAM setup has the necessary permissions for the required services. For the full example, refer to the GitHub repository.
Pattern 1: AgentCore Gateway – tool selection for low latency
A tool call is when a voice agent sends input to an external function or service, which processes it and returns output. It lets the agent perform tasks like querying a database or triggering a service quickly and securely, without additional reasoning steps.
With AgentCore Gateway, you expose your existing business logic as tools: discrete functions that Nova Sonic can call directly during a conversation. The voice model selects which tool to invoke, passes parameters, gets a result, and speaks it back. There’s no intermediate reasoning layer between the model and the tool.

AgentCore Gateway hosts MCP servers as managed endpoints. MCP is the protocol; AgentCore Gateway is the AWS feature that runs them. The voice agent connects via Gateway ARNs.
When a user says “What’s my account balance?”, Nova Sonic:
- Understands the intent from speech.
- Selects get_account_balance from the available MCP tools.
- Calls the tool with the right parameters.
- Speaks the result back.
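The steps above can be sketched without any framework. This is a minimal, hypothetical illustration of direct tool dispatch, not the AgentCore Gateway or MCP API: the `TOOLS` registry, the body of `get_account_balance`, and the hard-coded balance are assumptions made for the example. In the real flow, Nova Sonic produces the tool name and arguments, and the Gateway routes the call.

```python
import json

def get_account_balance(account_id: str) -> dict:
    """Hypothetical business-logic tool; a real one would query a backend."""
    balances = {"acct-001": 1250.75}  # made-up demo data
    return {"account_id": account_id, "balance": balances[account_id]}

# Registry of callable tools, standing in for the tool list exposed via MCP.
TOOLS = {"get_account_balance": get_account_balance}

def handle_tool_use(tool_use: dict) -> str:
    """Dispatch one model-issued tool request and return a JSON result string."""
    tool = TOOLS[tool_use["name"]]          # tool selected by the voice model
    result = tool(**tool_use["arguments"])  # call it with the model's parameters
    return json.dumps(result)               # result the model speaks back

# Shape of the request is an assumption for illustration only.
request = {"name": "get_account_balance", "arguments": {"account_id": "acct-001"}}
tool_result = handle_tool_use(request)
print(tool_result)
```

Note that nothing sits between the request and the function call: there is no second model or planning step, which is where the low latency of this pattern comes from.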
Trade-off: Nova Sonic makes all the decisions. If a tool call requires multi-step validation, conditional logic, or chaining multiple operations together, that reasoning burden falls entirely on the voice model’s system prompt. For simple tools this is fine. For complex workflows, it gets brittle.
Pattern 2: Sub-agent – more reasoning with decoupled agents
With the sub-agent or agent-as-tool pattern, your existing business logic runs in autonomous agents, each with its own model, system prompt, tools, and reasoning capabilities. The voice orchestrator delegates whole tasks to these sub-agents instead of calling individual tools.
There are many ways to connect to a sub-agent from your voice agent. Agent-to-Agent (A2A) and Strands Agent-as-Tool are two common approaches:
- Local agent-as-tool: The sub-agent runs in-process, wrapped as a @tool function using the Agents as Tools pattern in Strands. This is the most straightforward approach, with no network hop and no separate deployment. The trade-off is that the sub-agent shares the same process and scales with the orchestrator.
- Remote agent via A2A protocol: The sub-agent is deployed as an independent A2A server on AgentCore Runtime (or a remote server) and invoked over the network. A2A is an open protocol for agent-to-agent communication. As MCP connects agents to tools, A2A connects agents to other agents. As the AWS blog on A2A protocol support in AgentCore Runtime explains, agents built with different frameworks (Strands, OpenAI, LangGraph, Google ADK) can share context and reasoning in a common format. This provides full deployment independence and cross-framework interoperability.

Strands Agents has built-in support for both protocols: MCP for tool access and A2A for agent-to-agent communication. For a hands-on walkthrough, see the community guide on Agent Collaboration: Strands Agents, MCP, and the Agent2Agent Protocol.
In the local agent-as-tool approach, each sub-agent is a @tool function wrapping a full Strands Agent.
The voice orchestrator then uses BidiAgent with these sub-agent tools.
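A framework-free sketch of the idea follows. It is not the Strands API: `SubAgent`, `fake_model`, and `VoiceOrchestrator` are hypothetical stand-ins chosen so the example runs with the standard library alone. In Strands, the wrapper function would carry the `@tool` decorator, its body would call a Strands Agent, and the orchestrator role would be played by BidiAgent with Nova Sonic.

```python
from typing import Callable, Dict

def fake_model(system_prompt: str, query: str) -> str:
    """Stand-in for a sub-agent's model call (no real inference happens)."""
    return f"[{system_prompt}] answer to: {query}"

class SubAgent:
    """Hypothetical sub-agent with its own prompt and reasoning step."""
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt

    def __call__(self, query: str) -> str:
        return fake_model(self.system_prompt, query)

banking_agent = SubAgent("banking specialist")

def banking_assistant(query: str) -> str:
    """Tool wrapper: exposes the whole sub-agent as a single callable tool."""
    return banking_agent(query)

class VoiceOrchestrator:
    """Stand-in for the voice orchestrator: it only delegates whole tasks."""
    def __init__(self, tools: Dict[str, Callable[[str], str]]):
        self.tools = tools

    def delegate(self, tool_name: str, query: str) -> str:
        # The orchestrator does not plan the sub-steps; the sub-agent does.
        return self.tools[tool_name](query)

orchestrator = VoiceOrchestrator({"banking_assistant": banking_assistant})
spoken = orchestrator.delegate("banking_assistant", "What is my balance?")
print(spoken)
```

The design point is the shape of the interface: the orchestrator sees one tool that takes a natural-language query, while the prompt, tools, and reasoning behind it stay private to the sub-agent.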
The sub-agent does its own thinking. Nova Sonic doesn’t need to orchestrate the individual steps. It delegates and speaks the result.
Trade-off: Each sub-agent call adds latency: the sub-agent’s own model inference plus its tool calls. In a voice conversation, this means longer silence while the sub-agent reasons. The AWS blog on multi-agent voice assistants recommends starting with smaller, efficient models like Amazon Nova 2 Lite for sub-agents to reduce latency while still handling specialized tasks effectively.
Amazon Nova 2 Sonic supports asynchronous tool calling, so the conversation continues naturally while tools run in the background. It keeps accepting input, can run multiple tools in parallel, and gracefully adapts if the user changes their request mid-process, delivering all results while focusing on what’s still relevant.
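The behavior described above can be pictured with plain `asyncio`. This sketch is an analogy, not the Nova 2 Sonic API: `slow_tool`, the delays, and the event log are invented for illustration. It shows a tool running as a background task while the conversation loop keeps handling input.

```python
import asyncio

async def slow_tool(name: str, delay: float) -> str:
    """Hypothetical tool that takes a while (simulated with sleep)."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def conversation() -> list:
    events = []
    # Start the tool in the background instead of awaiting it immediately.
    task = asyncio.create_task(slow_tool("lookup_loan_status", 0.05))
    # The conversation keeps accepting input while the tool runs.
    for utterance in ["hello", "also check my balance"]:
        events.append(f"heard: {utterance}")
        await asyncio.sleep(0.01)
    # Deliver the tool result once it is ready.
    events.append(await task)
    return events

events = asyncio.run(conversation())
print(events)
```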
Pattern 3: Session segmentation for ultra-low latency
There’s a third approach worth considering. It doesn’t map neatly to the MCP or sub-agent patterns, but is purpose-built for voice scenarios where latency is the overriding concern.
Instead of delegating to external tools or sub-agents, you segment the conversation into logical phases, each with its own Nova Sonic session, system prompt, and tool set. When the conversation transitions from one phase to the next (for example, from authentication to account inquiry), you close the current session and open a new one with a different prompt and tools, within the same WebSocket connection. Each sub-voice-agent can use its own MCP gateways, tools, and even sub-agents. The difference is that it operates with a focused prompt and minimal tool surface, reducing reasoning overhead and latency.

Consider a banking voice assistant with three conversation phases: authentication, account management, and loan inquiry. Rather than loading one big system prompt with every tool, you run each phase as a focused Nova Sonic session.
Each phase gets a clean Nova Sonic session with:
- A focused system prompt: shorter, more specific, with less room for the model to get confused.
- Only the relevant tools: via MCP gateways, local tools, or both. The model doesn’t waste reasoning cycles choosing between 15 tools when it only needs 3.
- Optionally its own sub-agents: a phase that requires deeper reasoning can use Pattern 2 internally, while simpler phases stay tool-only.
- Carried-over context: the previous session context can be passed into the new session as chat history, so the overall conversation keeps continuity.
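One way to picture the phase handoff is a small state machine. The sketch below is hypothetical and makes no Nova Sonic or Strands calls: `Phase`, `Session`, the prompts, and the tool names are assumptions for illustration. It shows each phase getting its own prompt and tool set, with prior history passed into the next session.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Phase:
    name: str
    system_prompt: str
    tools: tuple  # names of the tools visible in this phase only

@dataclass
class Session:
    phase: Phase
    history: List[str] = field(default_factory=list)

# Hypothetical phase definitions for the banking example.
PHASES = {
    "authentication": Phase("authentication", "Verify the caller.", ("verify_identity",)),
    "account": Phase("account", "Answer account questions.",
                     ("get_account_balance", "list_transactions")),
    "loan": Phase("loan", "Answer loan questions.", ("get_loan_status",)),
}

def transition(current: Session, next_phase: str) -> Session:
    """Close the current session and open a new one, carrying history over."""
    return Session(phase=PHASES[next_phase], history=list(current.history))

session = Session(PHASES["authentication"])
session.history.append("user: caller verified")
session = transition(session, "account")
print(session.phase.name, session.phase.tools, len(session.history))
```

After the transition, the model sees only the two account tools and a short account prompt, while the copied history preserves what was said during authentication.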
Comparing the tool, sub-agent, and session segmentation patterns
| Factor | Tool | Sub-Agent (Agent-as-Tool) | Session Segmentation |
| --- | --- | --- | --- |
| Latency | Low | Higher (sub-agent reasoning) | Lowest (with latency during session transitions) |
| Tool set per turn | All tools loaded | Sub-agent’s tools | Only phase-relevant tools |
| System prompt | One large prompt | Orchestrator + sub-agent prompts | Small, phase-specific prompts |
| Reasoning depth | Voice model only | Voice model + sub-agent | Voice model only (per phase) |
| Reuse of existing agents | High (same MCP tools) | Highest (same sub-agents) | Medium (composes tools/sub-agents per phase) |
| Conversation continuity | Seamless | Seamless | Requires handoff logic between phases |
Latency best practices for voice agents
Latency is a key consideration when building voice agents, more so than for text agents. Here are practical techniques to keep response times short:
Start with small models for sub-agents. Your voice orchestrator uses Nova Sonic for the conversation, but sub-agents don’t need a large model. Start with Amazon Nova 2 Lite or Nova 2 Micro. They’re fast, cost optimized, and handle most specialized tasks well. You can always upgrade a specific sub-agent to a larger model if quality requires it, but default to small.
Design stateful sub-agents with caching. A stateless sub-agent that hits a database or API on every call adds latency each time. Instead, design sub-agents to cache results from data sources (APIs, AWS Lambda functions, databases) within a session. If the banking sub-agent fetches account details once, it should keep that data in memory and serve subsequent questions (balance, transactions, summary) from cache rather than making repeated backend calls.
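A minimal sketch of session-scoped caching, under stated assumptions: `CachingAccountAgent` and `fake_backend` are hypothetical names, and the backend is a stub rather than a real API, Lambda function, or database.

```python
from typing import Callable, Dict

class CachingAccountAgent:
    """Hypothetical stateful sub-agent that caches backend data per session."""
    def __init__(self, fetch_account: Callable[[str], Dict]):
        self._fetch_account = fetch_account
        self._cache: Dict[str, Dict] = {}
        self.backend_calls = 0  # counter kept only to make the effect visible

    def _details(self, account_id: str) -> Dict:
        if account_id not in self._cache:  # only the first question pays
            self.backend_calls += 1
            self._cache[account_id] = self._fetch_account(account_id)
        return self._cache[account_id]

    def balance(self, account_id: str) -> float:
        return self._details(account_id)["balance"]

    def transactions(self, account_id: str) -> list:
        return self._details(account_id)["transactions"]

def fake_backend(account_id: str) -> Dict:
    """Stand-in for a slow API, Lambda function, or database call."""
    return {"balance": 1250.75, "transactions": ["coffee", "rent"]}

agent = CachingAccountAgent(fake_backend)
agent.balance("acct-001")
agent.transactions("acct-001")
print(agent.backend_calls)
```

Create one such object per session so that cached data is discarded when the call ends and is never shared between customers.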
Prefetch data after authentication. This is especially valuable for contact center scenarios. After a customer authenticates, you know who they are. Don’t wait for them to ask before pulling their data. Immediately fetch account balances, recent transactions, pending alerts, and loan status in the background. When the customer asks “What’s my balance?”, the answer is already in memory.
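Prefetching can be sketched with background tasks. Everything named here is an assumption for illustration: `fetch`, `CustomerContext`, the resource names, and the simulated delays do not come from the sample repository.

```python
import asyncio

async def fetch(resource: str, customer_id: str) -> str:
    """Hypothetical backend fetch; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    return f"{resource} for {customer_id}"

class CustomerContext:
    def __init__(self):
        self._tasks = {}

    def prefetch(self, customer_id: str) -> None:
        """Kick off background fetches right after authentication succeeds."""
        for resource in ("balances", "transactions", "alerts", "loan_status"):
            self._tasks[resource] = asyncio.create_task(fetch(resource, customer_id))

    async def get(self, resource: str) -> str:
        """Usually already finished by the time the customer asks."""
        return await self._tasks[resource]

async def call_flow() -> str:
    ctx = CustomerContext()
    ctx.prefetch("cust-42")            # customer just authenticated
    await asyncio.sleep(0.05)          # customer is still talking
    return await ctx.get("balances")   # served from the finished task

answer = asyncio.run(call_flow())
print(answer)
```

If the customer asks before a fetch has completed, `get` simply waits for the remaining time, so the worst case is no slower than fetching on demand.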
Parallelize independent tool calls. If the user asks “Give me an overview of my accounts”, don’t call get_checking_balance, then get_savings_balance, then get_credit_card_balance sequentially. Use concurrent execution so the three calls happen at once. Strands supports this natively. The agent’s tool executor runs independent calls in parallel by default.
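Strands handles this for you, so the following is only an illustration of the underlying idea using plain `asyncio.gather`. The three function names come from the text; their bodies, return values, and the 0.1-second delays are made up for the example.

```python
import asyncio
import time

async def get_checking_balance() -> float:
    await asyncio.sleep(0.1)  # simulated backend latency
    return 1250.75

async def get_savings_balance() -> float:
    await asyncio.sleep(0.1)
    return 8300.00

async def get_credit_card_balance() -> float:
    await asyncio.sleep(0.1)
    return -420.10

async def account_overview() -> dict:
    # gather() runs the three independent calls concurrently and
    # returns their results in the order the calls were passed in.
    checking, savings, card = await asyncio.gather(
        get_checking_balance(), get_savings_balance(), get_credit_card_balance()
    )
    return {"checking": checking, "savings": savings, "credit_card": card}

start = time.perf_counter()
overview = asyncio.run(account_overview())
elapsed = time.perf_counter() - start
print(overview, f"{elapsed:.2f}s")
```

Run sequentially, the three simulated calls would take about 0.3 seconds in total; run concurrently, the total is close to the slowest single call.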
Use filler phrases to mask tool latency. When a tool call or sub-agent delegation is unavoidable, instruct the voice model to speak a brief filler while waiting: “Let me check that for you…” or “One moment while I look that up…” This keeps the conversation feeling alive instead of dropping into silence.
Minimize tool count per session. Tool selection gets slower as the number of available tools grows. If your agent has 15 tools but a typical conversation only uses 3 to 4, consider the session segmentation pattern to load only the relevant tools per phase.
Clean up
After you finish testing the sample, remember to clean up the resources you created to avoid unnecessary costs. Follow the repository instructions to stop services and delete any deployed infrastructure.
Conclusion
Migrating a text chatbot to a voice assistant isn’t a straightforward wrapper job. The interaction model is fundamentally different, from response design to latency budgets to turn-taking behavior. But with a well-structured multi-agent architecture and Amazon Bedrock AgentCore, the business logic layer stays intact.
The sub-agents you’ve already built are your biggest asset. Reuse them.
For a working example of a Strands BidiAgent voice assistant deployed on AgentCore Runtime with WebSocket streaming, see the AgentCore bidirectional streaming sample.
Next steps
Next, you can extend the sample to fit your own use case, integrate your business tools, refine prompts for voice interactions, and test the agent in real-world scenarios to prepare for production deployment.