- To deploy Voxtral-Mini, use the deployment code in the notebook (a combined sketch for both variants follows this list).
- To deploy Voxtral-Small, use the same code with the Voxtral-Small model ID and a multi-GPU instance matching its tensor parallelism settings.
- Open and run Voxtral-vLLM-BYOC-SageMaker.ipynb to deploy your endpoint and test it with text, audio, and function calling capabilities.
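The deployment cells for the two variants differ mainly in model ID and instance size. The following is a minimal sketch of the pattern, assuming the BYOC image is already pushed to Amazon ECR and the inference code is packaged in Amazon S3 (the image URI, S3 paths, endpoint name, and instance types are illustrative placeholders, not values from the repository):

```python
# Minimal deployment sketch; URIs, names, and instance types are placeholders
import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()

model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/voxtral-vllm-byoc:latest",
    model_data="s3://<bucket>/voxtral/code/model.tar.gz",  # model.py + serving.properties
    role=role,
)

# Voxtral-Mini fits on a single-GPU instance; Voxtral-Small needs a multi-GPU
# instance that matches the tensor_parallel_degree in serving.properties
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # e.g., ml.g5.12xlarge for Voxtral-Small
    endpoint_name="voxtral-mini-endpoint",
)
```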
Docker container configuration
The GitHub repo contains the full Dockerfile. The following code snippet highlights the key components:
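The original Dockerfile is not reproduced here; the sketch below illustrates the structure the text describes, with the base image tag, environment variable names, and entrypoint as assumptions:

```dockerfile
# Illustrative sketch; base image tag, env vars, and entrypoint are assumptions
FROM vllm/vllm-openai:v0.10.0

# Audio processing libraries required by Voxtral
RUN pip install --no-cache-dir mistral_common librosa soundfile pydub

# SageMaker serving contract: listen on port 8080; cache weights in a writable path
ENV SAGEMAKER_BIND_TO_PORT=8080 \
    HF_HOME=/tmp/hf_cache

# model.py and serving.properties are injected from S3 into /opt/ml/model at
# runtime, so the container stays generic across Voxtral variants
ENTRYPOINT ["python", "/opt/ml/model/model.py"]
```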
This Dockerfile creates a specialized container that extends the official vLLM server with Voxtral-specific capabilities by adding essential audio processing libraries (mistral_common for tokenization, librosa/soundfile/pydub for audio handling) while configuring the correct SageMaker environment variables for model loading and caching. The approach separates infrastructure from business logic by keeping the container generic and allowing SageMaker to dynamically inject model-specific code (model.py and serving.properties) from Amazon S3 at runtime, enabling flexible deployment of different Voxtral variants without requiring container rebuilds.
Model configurations
The full model configurations are in the serving.properties file located in the code folder. The following code snippet highlights the key configurations:
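The following is an illustrative serving.properties sketch; the keys and values are assumptions inferred from the behavior described below, not copied from the repository:

```properties
# Illustrative sketch; keys and values are assumptions, not repository contents
model_id=mistralai/Voxtral-Mini-3B-2507
tensor_parallel_degree=1            # e.g., 4 for Voxtral-Small on a multi-GPU instance

# Mistral-recommended modes for vLLM
tokenizer_mode=mistral
config_format=mistral
load_format=mistral

# Audio limits: up to 8 audio files per prompt, long-form transcription
limit_mm_per_prompt={"audio": 8}
max_model_len=32768

# vLLM v0.10.0+ performance features
enable_chunked_prefill=true
enable_prefix_caching=true
```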
This configuration file provides Voxtral-specific optimizations that follow Mistral's official recommendations for vLLM server deployment, setting up the correct tokenization modes and audio processing parameters (supporting up to eight audio files per prompt with 30-minute transcription capability), and using the latest vLLM v0.10.0+ performance features such as chunked prefill and prefix caching. The modular design supports seamless switching between Voxtral-Mini and Voxtral-Small by simply changing the model_id and tensor_parallel_degree parameters, while maintaining optimal memory utilization and enabling advanced caching mechanisms for improved inference performance.
Custom inference handler
The full custom inference code is in the model.py file located in the code folder. The following code snippet highlights the key capabilities:
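The following is a simplified sketch of the handler's structure: only the routes required by the SageMaker serving contract (/ping and /invocations) are shown, load_serving_properties is a hypothetical helper, and the payload fields are assumptions:

```python
from fastapi import FastAPI, Request
from vllm import LLM, SamplingParams

app = FastAPI()

# Hypothetical helper that parses the key=value pairs in serving.properties
config = load_serving_properties("/opt/ml/model/serving.properties")

llm = LLM(
    model=config["model_id"],
    tokenizer_mode="mistral",
    config_format="mistral",
    load_format="mistral",
    tensor_parallel_size=int(config.get("tensor_parallel_degree", 1)),
)

@app.get("/ping")
def ping():
    # SageMaker health check
    return {"status": "healthy"}

@app.post("/invocations")
async def invocations(request: Request):
    body = await request.json()
    # messages may mix text parts with base64-encoded audio or audio URLs;
    # the real handler decodes or fetches audio before handing it to vLLM
    outputs = llm.chat(
        body["messages"],
        SamplingParams(
            temperature=body.get("temperature", 0.2),
            max_tokens=body.get("max_tokens", 1024),
        ),
    )
    return {"choices": [{"message": {"content": outputs[0].outputs[0].text}}]}
```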
This custom inference handler creates a FastAPI-based server that directly integrates with the vLLM server for optimal Voxtral performance. The handler processes multimodal content including base64-encoded audio and audio URLs, dynamically loads model configurations from the serving.properties file, and supports advanced features like function calling for Voxtral-Small deployments.
SageMaker deployment code
The Voxtral-vLLM-BYOC-SageMaker.ipynb notebook included in the Voxtral-vllm-byoc folder orchestrates the entire deployment process for both Voxtral models:
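The following outline sketches the notebook's steps at the boto3 level, lower-level than the sagemaker SDK sketch earlier; the bucket, image URI, and resource names are placeholders:

```python
import boto3
import sagemaker

session = sagemaker.Session()
sm = boto3.client("sagemaker")

# 1. Upload the model-specific code (model.py + serving.properties) to S3
code_uri = session.upload_data("code/model.tar.gz", key_prefix="voxtral/code")

# 2. Register the BYOC container plus the S3 code as a SageMaker model
sm.create_model(
    ModelName="voxtral-mini",
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/voxtral-vllm-byoc:latest",
        "ModelDataUrl": code_uri,
    },
    ExecutionRoleArn=sagemaker.get_execution_role(),
)

# 3. Create the endpoint configuration and the endpoint
sm.create_endpoint_config(
    EndpointConfigName="voxtral-mini-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "voxtral-mini",
        "InstanceType": "ml.g5.2xlarge",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(
    EndpointName="voxtral-mini-endpoint",
    EndpointConfigName="voxtral-mini-config",
)
```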
Model use cases
The Voxtral models support various text and speech-to-text use cases, and the Voxtral-Small model supports tool use with voice input. Refer to the GitHub repository for the complete code. In this section, we provide code snippets for the different use cases that the model supports.
Text-only
The following code shows a basic text-based conversation with the model. The user sends a text query and receives a structured response:
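A hedged invocation example follows, assuming the OpenAI-style messages schema used in the handler sketch above; the endpoint name is a placeholder:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "messages": [
        {"role": "user", "content": "Give me a two-sentence summary of what Voxtral is."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}

response = runtime.invoke_endpoint(
    EndpointName="voxtral-mini-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read())["choices"][0]["message"]["content"])
```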
Transcription-only
The following example focuses on speech-to-text transcription by setting temperature to 0 for deterministic output. The model processes an audio file URL or an audio file converted to base64, then returns the transcribed text without additional interpretation:
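The following sketch shows one way to send the audio; the content-part field names are assumptions about the custom handler's schema, and the file and endpoint names are placeholders:

```python
import base64
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

with open("meeting.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [{
        "role": "user",
        "content": [
            # or an audio URL part such as {"type": "audio_url", "audio_url": "https://..."}
            {"type": "audio", "audio": audio_b64},
            {"type": "text", "text": "Transcribe this audio."},
        ],
    }],
    "temperature": 0,  # deterministic transcription
}

response = runtime.invoke_endpoint(
    EndpointName="voxtral-mini-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read())["choices"][0]["message"]["content"])
```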
Text and audio understanding
The following code combines both text instructions and audio input for multimodal processing. The model can follow specific text commands while analyzing the provided audio file in a single inference pass, enabling more complex interactions like guided transcription or audio analysis tasks:
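This reuses the invoke_endpoint pattern from the transcription example; the text part now carries an instruction so the model analyzes the audio instead of just transcribing it (field names remain assumptions):

```python
import base64
import json
import boto3

runtime = boto3.client("sagemaker-runtime")
audio_b64 = base64.b64encode(open("meeting.wav", "rb").read()).decode("utf-8")

payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_b64},
            {"type": "text", "text": "Summarize the key decisions in this recording as three bullet points."},
        ],
    }],
    "max_tokens": 512,
    "temperature": 0.2,
}

response = runtime.invoke_endpoint(
    EndpointName="voxtral-mini-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read())["choices"][0]["message"]["content"])
```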
Tool use
The following code showcases function calling capabilities, where the model can interpret voice commands and execute predefined tools. The example demonstrates weather queries via voice input, with the model automatically calling the appropriate function and returning structured results:
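The following is a hedged sketch of a function-calling request against a Voxtral-Small endpoint; the OpenAI-style tools schema and response shape are assumptions based on the handler description, and get_weather is a hypothetical tool:

```python
import base64
import json
import boto3

runtime = boto3.client("sagemaker-runtime")
audio_b64 = base64.b64encode(open("weather_question.wav", "rb").read()).decode("utf-8")

# Hypothetical tool definition in the OpenAI-style "tools" convention
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    # Spoken query, e.g., "What's the weather like in Paris right now?"
    "messages": [{"role": "user", "content": [{"type": "audio", "audio": audio_b64}]}],
    "tools": tools,
    "temperature": 0.2,
}

response = runtime.invoke_endpoint(
    EndpointName="voxtral-small-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
# Expect a tool call such as {"name": "get_weather", "arguments": {"city": "Paris"}}
# for the client to execute
print(json.loads(response["Body"].read()))
```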
Strands Agents integration
The following example shows how to integrate Voxtral with the Strands Agents framework to create intelligent agents capable of using multiple tools. The agent can automatically select and execute appropriate tools (such as calculator, file operations, or shell commands from the Strands prebuilt tools) based on user queries, enabling complex multi-step workflows through natural language interaction:
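The following is a minimal sketch; the tool names come from the strands-agents-tools package, and the model wiring to the SageMaker endpoint is elided because the exact provider class depends on your Strands version (see the Strands Agents documentation):

```python
from strands import Agent
from strands_tools import calculator, file_read, shell

# voxtral_model: a Strands model object pointing at the Voxtral-Small endpoint
# (construction elided; consult the Strands docs for the SageMaker provider)
agent = Agent(
    model=voxtral_model,
    tools=[calculator, file_read, shell],
)

# The agent selects and chains the appropriate tools for a multi-step request
agent("Compute 17% of 2350, write the result to /tmp/result.txt, then read it back.")
```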
Clean up
When you finish experimenting with this example, delete the SageMaker endpoints that you created in the notebook to avoid unnecessary costs:
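For example, with boto3 (the endpoint names are the placeholders used in the sketches above; also delete the endpoint configs and models if you created them):

```python
import boto3

sm = boto3.client("sagemaker")
for name in ["voxtral-mini-endpoint", "voxtral-small-endpoint"]:
    sm.delete_endpoint(EndpointName=name)
```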
Conclusion
In this post, we demonstrated how to successfully self-host Mistral's open source Voxtral models on SageMaker using the BYOC approach. We created a production-ready system that uses the latest vLLM framework and official Voxtral optimizations for both the Mini and Small model variants. The solution supports the full spectrum of Voxtral capabilities, including text-only conversations, audio transcription, sophisticated multimodal understanding, and function calling directly from voice input. With this flexible architecture, you can switch between the Voxtral-Mini and Voxtral-Small models through simple configuration updates without requiring container rebuilds.
Take your multimodal AI applications to the next level by trying out the complete code from the GitHub repository to host the Voxtral models on SageMaker and start building your own voice-enabled applications. Explore Voxtral's full potential by visiting Mistral's official website to discover detailed capabilities, performance benchmarks, and technical specifications. Finally, explore the Strands Agents framework to seamlessly create agentic applications that can execute complex workflows.
About the authors
Ying Hou, PhD, is a Sr. Specialist Solutions Architect for GenAI at AWS, where she collaborates with model providers to onboard the latest and most intelligent AI models onto AWS platforms. With deep expertise in generative AI, ASR, computer vision, NLP, and time-series forecasting models, she works closely with customers to design and build cutting-edge ML and generative AI applications.


