Providing effective multilingual customer support in global enterprises presents significant operational challenges. Through a collaboration between AWS and DXC Technology, we've developed a scalable voice-to-voice (V2V) translation prototype that transforms how contact centers handle multilingual customer interactions.
In this post, we discuss how AWS and DXC used Amazon Connect and other AWS AI services to deliver near real-time V2V translation capabilities.
Challenge: Serving customers in multiple languages
In Q3 2024, DXC Technology approached AWS with a critical business challenge: their global contact centers needed to serve customers in multiple languages without the exponential cost of hiring language-specific agents for the lower-volume languages. Previously, DXC had explored several existing solutions but found limitations in each approach, from communication constraints to infrastructure requirements that impacted reliability, scalability, and operational costs. DXC and AWS decided to organize a focused hackathon where DXC and AWS Solution Architects collaborated to:
- Define essential requirements for real-time translation
- Establish latency and accuracy benchmarks
- Create seamless integration paths with existing systems
- Develop a phased implementation strategy
- Prepare and test an initial proof of concept setup
Business impact
For DXC, this prototype served as an enabler, allowing technical talent maximization, operational transformation, and cost improvements through:
- Best technical expertise delivery – Hiring and matching agents based on technical knowledge rather than spoken language, making sure customers get top technical support regardless of language barriers
- Global operational flexibility – Removing geographical and language constraints in hiring, placement, and support delivery while maintaining consistent service quality across all languages
- Cost reduction – Eliminating multi-language expertise premiums, specialized language training, and infrastructure costs through a pay-per-use translation model
- Comparable experience to native speakers – Maintaining natural conversation flow with near real-time translation and audio feedback, while delivering premium technical support in the customer's preferred language
Solution overview
The Amazon Connect V2V translation prototype uses advanced AWS speech recognition and machine translation technologies to enable real-time conversation translation between agents and customers, allowing them to speak in their preferred languages while having natural conversations. It consists of the following key components:
- Speech recognition – The customer's spoken language is captured and converted into text using Amazon Transcribe, which serves as the speech recognition engine. The transcript (text) is then fed into the machine translation engine.
- Machine translation – Amazon Translate, the machine translation engine, translates the customer's transcript into the agent's preferred language in near real time. The translated transcript is converted back into speech using Amazon Polly, which serves as the text-to-speech engine.
- Bidirectional translation – The process is reversed for the agent's response, translating their speech into the customer's language and delivering the translated audio to the customer.
- Seamless integration – The V2V translation sample project integrates with Amazon Connect, enabling agents to handle customer interactions in multiple languages without any additional effort or training, using the Amazon Connect Streams JS and Amazon Connect RTC JS libraries.
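The translate-then-synthesize leg of the pipeline above can be sketched as follows. This is a minimal illustration, not the sample project's actual code: the clients are injected as parameters (in a real deployment they would be boto3 clients for Amazon Translate and Amazon Polly), so the flow can be followed without live AWS calls.

```python
def translate_and_synthesize(transcript, source_lang, target_lang,
                             translate_client, polly_client,
                             voice_id="Joanna"):
    """Translate a transcript, then synthesize the translation as speech.

    translate_client / polly_client would be boto3.client("translate")
    and boto3.client("polly") in a real deployment.
    """
    # Amazon Translate: transcript text -> target-language text
    translated = translate_client.translate_text(
        Text=transcript,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target_lang,
    )["TranslatedText"]

    # Amazon Polly: translated text -> audio the agent app can play back
    audio_stream = polly_client.synthesize_speech(
        Text=translated,
        OutputFormat="pcm",
        VoiceId=voice_id,
    )["AudioStream"]
    return translated, audio_stream
```

The same function serves both directions of the conversation; only the language codes and voice are swapped for the agent-to-customer leg.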
The prototype can be extended with other AWS AI services to further customize the translation capabilities. It's open source and ready for customization to meet your specific needs.
The following diagram illustrates the solution architecture.
The following screenshot illustrates a sample agent web application.
The user interface consists of three sections:
- Contact Control Panel – A softphone client using Amazon Connect
- Customer Controls – Customer-to-agent interaction controls, including Transcribe Customer Voice, Translate Customer Voice, and Synthesize Customer Voice
- Agent Controls – Agent-to-customer interaction controls, including Transcribe Agent Voice, Translate Agent Voice, and Synthesize Agent Voice
Challenges when implementing near real-time voice translation
The Amazon Connect V2V sample project was designed to minimize the audio processing time from the moment the customer or agent finishes speaking until the translated audio stream is started. However, even with the shortest audio processing time, the user experience still doesn't match the experience of a real conversation in which both parties speak the same language. This is because of the specific pattern of the customer only hearing the agent's translated speech, and the agent only hearing the customer's translated speech. The following diagram displays that pattern.
The example workflow consists of the following steps:
- The customer starts talking in their own language, and speaks for 10 seconds.
- Because the agent only hears the customer's translated speech, the agent first hears 10 seconds of silence.
- When the customer finishes talking, the audio processing takes 1–2 seconds, during which time both the customer and agent hear silence.
- The customer's translated speech is streamed to the agent. During that time, the customer hears silence.
- When the customer's translated speech playback is complete, the agent starts talking, and speaks for 10 seconds.
- Because the customer only hears the agent's translated speech, the customer hears 10 seconds of silence.
- When the agent finishes talking, the audio processing takes 1–2 seconds, during which time both the customer and agent hear silence.
- The agent's translated speech is streamed to the customer. During that time, the agent hears silence.
In this scenario, the customer hears a single block of 22–24 seconds of complete silence, from the moment they finish speaking until they hear the agent's translated voice. This creates a suboptimal experience, because the customer cannot be sure what is happening during those 22–24 seconds, for instance whether the agent was able to hear them, or whether there was a technical issue.
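The 22–24 second figure follows directly from the step durations above, which can be checked with a quick back-of-the-envelope calculation (the 10-second utterances and 1–2 second processing times are the illustrative values from the example workflow):

```python
def customer_silence_seconds(translation_playback_s, agent_utterance_s,
                             processing_s):
    """Total silence the customer hears in the baseline pattern.

    From the moment the customer stops talking, they hear, as silence:
    processing of their own speech, playback of their translation to the
    agent, the agent's reply (audible to them only once translated), and
    processing of that reply.
    """
    return (processing_s            # process customer's speech
            + translation_playback_s  # customer translation plays to agent
            + agent_utterance_s       # agent speaks their reply
            + processing_s)           # process agent's reply

# 10-second utterances, 1-2 seconds of processing per leg:
print(customer_silence_seconds(10, 10, 1))  # -> 22
print(customer_silence_seconds(10, 10, 2))  # -> 24
```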
Audio streaming add-ons
In a face-to-face conversation between two people who don't speak the same language, they might have another person act as a translator or interpreter. An example workflow consists of the following steps:
- Person A speaks in their own language, which is heard by Person B and the translator.
- The translator translates what Person A said into Person B's language. The translation is heard by Person B and Person A.
Essentially, Person A and Person B hear each other speaking their own language, and they also hear the translation (from the translator). There is no waiting in silence, which is even more important in non-face-to-face conversations (such as contact center interactions).
To optimize the customer/agent experience, the Amazon Connect V2V sample project implements audio streaming add-ons to simulate a more natural conversation experience. The following diagram illustrates an example workflow.
The workflow consists of the following steps:
- The customer starts talking in their own language, and speaks for 10 seconds.
- The agent hears the customer's original voice, at a lower volume (“Stream Customer Mic to Agent” enabled).
- When the customer finishes talking, the audio processing takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback (contact center background noise) at a very low volume (“Audio Feedback” enabled).
- The customer's translated speech is then streamed to the agent. During that time, the customer hears their translated speech, at a lower volume (“Stream Customer Translation to Customer” enabled).
- When the customer's translated speech playback is complete, the agent starts talking, and speaks for 10 seconds.
- The customer hears the agent's original voice, at a lower volume (“Stream Agent Mic to Customer” enabled).
- When the agent finishes talking, the audio processing takes 1–2 seconds. During that time, the customer and agent hear subtle audio feedback (contact center background noise) at a very low volume (“Audio Feedback” enabled).
- The agent's translated speech is then streamed to the customer. During that time, the agent hears their translated speech, at a lower volume (“Stream Agent Translation to Agent” enabled).
In this scenario, the customer hears two short blocks (1–2 seconds) of subtle audio feedback, instead of a single block of 22–24 seconds of complete silence. This pattern is much closer to a face-to-face conversation that includes a translator.
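The "original voice at a lower volume" effect amounts to mixing an attenuated copy of one audio stream into another. The sample project does this in the browser (via the Amazon Connect RTC JS audio path); the sketch below is a plain-Python illustration of the same idea on signed 16-bit PCM samples, not the project's actual implementation:

```python
def duck_and_mix(primary, background, background_gain=0.25):
    """Mix attenuated background samples into the primary stream.

    primary / background are sequences of signed 16-bit PCM samples;
    background_gain scales the background (e.g. the original mic feed)
    so it stays audible under the primary (translated) audio.
    """
    mixed = []
    for p, b in zip(primary, background):
        sample = p + int(b * background_gain)
        # Clamp to the signed 16-bit range to avoid wrap-around clicks.
        mixed.append(max(-32768, min(32767, sample)))
    return mixed

# A translated-speech sample mixed with a quiet copy of the original voice:
print(duck_and_mix([0, 1000], [4000, 4000]))  # -> [1000, 2000]
```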
The audio streaming add-ons provide additional benefits, including:
- Voice characteristics – In cases when the agent and customer only hear their translated and synthesized speech, the actual voice characteristics are lost. For instance, the agent can't hear whether the customer was talking slowly or quickly, or whether the customer was upset or calm. The translated and synthesized speech doesn't carry over that information.
- Quality assurance – In cases when call recording is enabled, only the customer's original voice and the agent's synthesized speech are recorded, because the translation and the synthesis are performed on the agent (client) side. This makes it difficult for QA teams to properly evaluate and audit the conversations, including the many silent blocks within them. Instead, when the audio streaming add-ons are enabled, there are no silent blocks, and the QA team can hear the agent's original voice, the customer's original voice, and their respective translated and synthesized speech, all in a single audio file.
- Transcription and translation accuracy – Having both the original and translated speech available in the call recording makes it straightforward to detect specific words that would improve transcription accuracy (by using Amazon Transcribe custom vocabularies) or translation accuracy (using Amazon Translate custom terminologies), to make sure that your brand names, character names, model names, and other unique content are transcribed and translated to the desired result.
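As a sketch of the custom terminology idea, the snippet below pins preferred renderings of brand names and references the terminology on every translation call. The terminology name and CSV content are hypothetical examples, and the client is injected (with live AWS it would be boto3.client("translate")):

```python
TERMINOLOGY_NAME = "contact-center-terms"   # hypothetical terminology name
# CSV terminology: keep the (fictional) brand name unchanged in Spanish.
TERMINOLOGY_CSV = b"en,es\nAnyCompany Router,AnyCompany Router\n"

def import_terminology(translate_client):
    """One-time setup: upload the custom terminology."""
    translate_client.import_terminology(
        Name=TERMINOLOGY_NAME,
        MergeStrategy="OVERWRITE",
        TerminologyData={"File": TERMINOLOGY_CSV, "Format": "CSV"},
    )

def translate_with_terminology(translate_client, text, source, target):
    """Translate text, forcing the terminology's preferred renderings."""
    return translate_client.translate_text(
        Text=text,
        SourceLanguageCode=source,
        TargetLanguageCode=target,
        TerminologyNames=[TERMINOLOGY_NAME],
    )["TranslatedText"]
```

Amazon Transcribe custom vocabularies work analogously on the speech-recognition side, improving recognition of the same brand and product names before they ever reach the translation step.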
Get started with Amazon Connect V2V
Ready to transform your contact center's communication? Our Amazon Connect V2V sample project is now available on GitHub. We invite you to explore, deploy, and experiment with this powerful prototype. You can use it as a foundation for developing innovative multilingual communication solutions in your own contact center, through the following key steps:
- Clone the GitHub repository.
- Test different configurations for audio streaming add-ons.
- Review the sample project's limitations in the README.
- Develop your implementation strategy:
  - Implement robust security and compliance controls that meet your organization's standards.
  - Collaborate with your customer experience team to define your specific use case requirements.
  - Balance between automation and the agent's manual controls (for example, use an Amazon Connect contact flow to automatically set contact attributes for preferred languages and audio streaming add-ons).
  - Use your preferred transcription, translation, and text-to-speech engines, based on specific language support requirements and business, legal, and regional preferences.
  - Plan a phased rollout, starting with a pilot group, then iteratively optimize your transcription custom vocabularies and translation custom terminologies.
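The contact-attribute idea from the steps above can be sketched as follows. The attribute keys and IDs are hypothetical placeholders, and the client is injected (boto3.client("connect") in a real deployment); a contact flow could set the same attributes natively without any code:

```python
def set_language_attributes(connect_client, instance_id, contact_id,
                            customer_lang, agent_lang):
    """Pre-set language preferences as Amazon Connect contact attributes,
    so the agent application can auto-configure translation and add-ons."""
    connect_client.update_contact_attributes(
        InstanceId=instance_id,
        InitialContactId=contact_id,
        Attributes={
            "customerLanguage": customer_lang,    # e.g. "es-ES"
            "agentLanguage": agent_lang,          # e.g. "en-US"
            "audioStreamingAddOns": "enabled",    # hypothetical flag
        },
    )
```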
Conclusion
The Amazon Connect V2V sample project demonstrates how Amazon Connect and advanced AWS AI services can break down language barriers, enhance operational flexibility, and reduce support costs. Get started now and revolutionize how your contact center communicates across language barriers!
About the Authors
Milos Cosic is a Principal Solutions Architect at AWS.
EJ Ferrell is a Senior Solutions Architect at AWS.
Adam El Tanbouli is a Technical Program Manager for Prototyping and Support Services at DXC Modern Workplace.