Deploy Mistral AI’s Voxtral on Amazon SageMaker AI

Configure your model in code/serving.properties:

  1. To deploy Voxtral-Mini, use the following code:
    option.model_id=mistralai/Voxtral-Mini-3B-2507
    option.tensor_parallel_degree=1

  2. To deploy Voxtral-Small, use the following code:
    option.model_id=mistralai/Voxtral-Small-24B-2507
    option.tensor_parallel_degree=4

  3. Open and run Voxtral-vLLM-BYOC-SageMaker.ipynb to deploy your endpoint and test with text, audio, and function calling capabilities.

Docker container configuration

The GitHub repo contains the complete Dockerfile. The following code snippet highlights the key components:

# Custom vLLM Container for Voxtral Model Deployment on SageMaker
FROM --platform=linux/amd64 vllm/vllm-openai:latest

# Set environment variables for SageMaker
ENV MODEL_CACHE_DIR=/opt/ml/model
ENV TRANSFORMERS_CACHE=/tmp/transformers_cache
ENV HF_HOME=/tmp/hf_home
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn

# Install audio processing dependencies
RUN pip install --no-cache-dir \
    "mistral_common>=1.8.1" \
    "librosa>=0.10.2" \
    "soundfile>=0.12.1" \
    "pydub>=0.25.1"

This Dockerfile creates a specialized container that extends the official vLLM server with Voxtral-specific capabilities by adding essential audio processing libraries (mistral_common for tokenization, librosa/soundfile/pydub for audio handling) while configuring the correct SageMaker environment variables for model loading and caching. The approach separates infrastructure from business logic by keeping the container generic and allowing SageMaker to dynamically inject model-specific code (model.py and serving.properties) from Amazon S3 at runtime, enabling flexible deployment of different Voxtral variants without requiring container rebuilds.
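As an illustration of that separation, the container can expose a generic serve entrypoint that simply imports whatever handler code SageMaker has downloaded into the model directory. The script below is a minimal sketch under those assumptions, not the repository's actual entrypoint; the code directory and port fall back to the environment variables set in the Dockerfile and deployment code:

#!/usr/bin/env python3
# serve (illustrative sketch): generic entrypoint that loads the injected handler code
import os
import sys

import uvicorn

# Assumption: SageMaker downloads the S3 code prefix into the model directory at startup
CODE_DIR = os.environ.get("MODEL_CACHE_DIR", "/opt/ml/model")

if __name__ == "__main__":
    # Make the injected model.py importable without baking it into the image
    sys.path.insert(0, CODE_DIR)
    from model import app  # model.py defines the FastAPI app shown later in this post

    # SageMaker routes inference traffic to port 8080 by default
    port = int(os.environ.get("SAGEMAKER_BIND_TO_PORT", "8080"))
    uvicorn.run(app, host="0.0.0.0", port=port)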

Model configurations

The complete model configurations are in the serving.properties file located in the code folder. The following code snippet highlights the key configurations:

# Model configuration
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
option.dtype=bfloat16
# Voxtral-specific settings (as per official documentation)
option.tokenizer_mode=mistral
option.config_format=mistral
option.load_format=mistral
option.trust_remote_code=true
# Audio processing (Voxtral specs)
option.limit_mm_per_prompt=audio:8
option.mm_processor_kwargs={"audio_sampling_rate": 16000, "audio_max_length": 1800.0}
# Performance optimizations (vLLM v0.10.0+ features)
option.enable_chunked_prefill=true
option.enable_prefix_caching=true
option.use_v2_block_manager=true

This configuration file provides Voxtral-specific optimizations that follow Mistral's official recommendations for vLLM server deployment, setting up the correct tokenization modes, audio processing parameters (supporting up to eight audio files per prompt with 30-minute transcription capability), and the latest vLLM v0.10.0+ performance features such as chunked prefill and prefix caching. The modular design supports seamless switching between Voxtral-Mini and Voxtral-Small by simply changing the model_id and tensor_parallel_degree parameters, while maintaining optimal memory utilization and enabling advanced caching mechanisms for improved inference performance.
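The custom handler in the next section reads these values at startup through a load_serving_properties() helper. The repository's implementation isn't reproduced here; a minimal sketch of such a parser, assuming plain key=value lines and the model directory layout above, could look like this:

import os

def load_serving_properties(filename="serving.properties"):
    """Parse serving.properties into a dict of string values (illustrative sketch)."""
    # Assumption: the file sits in the SageMaker model directory alongside model.py
    path = os.path.join(os.environ.get("MODEL_CACHE_DIR", "/opt/ml/model"), filename)
    config = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config

# Switching between Voxtral-Mini and Voxtral-Small only changes two keys:
# option.model_id and option.tensor_parallel_degree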

Custom inference handler

The complete custom inference code is in the model.py file located in the code folder. The following code snippet highlights the key functions:

# FastAPI app for SageMaker compatibility
app = FastAPI(title="Voxtral vLLM Inference Server", version="1.1.0")
model_engine = None

# vLLM server initialization for Voxtral
def start_vllm_server():
    """Start vLLM server with Voxtral-specific configuration"""
    config = load_serving_properties()

    cmd = [
        "vllm", "serve", config.get("option.model_id"),
        "--tokenizer-mode", "mistral",
        "--config-format", "mistral",
        "--tensor-parallel-size", config.get("option.tensor_parallel_degree"),
        "--host", "127.0.0.1",
        "--port", "8000"
    ]

    vllm_server_process = subprocess.Popen(cmd, env=vllm_env)
    server_ready = wait_for_server()
    return server_ready

@app.post("/invocations")
async def invoke_model(request: Request):
    """Handle chat, transcription, and function calling"""
    request_data = await request.json()

    # Transcription requests
    if "transcription" in request_data:
        audio_source = request_data["transcription"]["audio"]
        return transcribe_audio(audio_source)

    # Chat requests with multimodal support
    messages = format_messages_for_openai(request_data["messages"])
    tools = request_data.get("tools")

    # Generate via the vLLM OpenAI client
    response = openai_client.chat.completions.create(
        model=model_config["model_id"],
        messages=messages,
        tools=tools if supports_function_calling() else None
    )
    return response

This custom inference handler creates a FastAPI-based server that directly integrates with the vLLM server for optimal Voxtral performance. The handler processes multimodal content including base64-encoded audio and audio URLs, dynamically loads model configurations from the serving.properties file, and supports advanced features like function calling for Voxtral-Small deployments.
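For example, a helper along the following lines could normalize the two accepted audio forms (a URL or a base64-encoded string) into a local file before handing them to the vLLM server. The function name and payload convention are illustrative assumptions, not the repository's code:

import base64
import tempfile

import requests

def resolve_audio_source(audio_source: str) -> str:
    """Return a local file path for an audio URL or base64-encoded audio string (sketch)."""
    if audio_source.startswith(("http://", "https://")):
        # Download remote audio to a temporary file
        response = requests.get(audio_source, timeout=30)
        response.raise_for_status()
        data = response.content
    else:
        # Treat anything else as base64-encoded audio bytes
        data = base64.b64decode(audio_source)

    with tempfile.NamedTemporaryFile(suffix=".audio", delete=False) as tmp:
        tmp.write(data)
        return tmp.name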

SageMaker deployment code

The Voxtral-vLLM-BYOC-SageMaker.ipynb notebook included in the Voxtral-vllm-byoc folder orchestrates the complete deployment process for both Voxtral models:

import boto3
import sagemaker
from sagemaker.model import Model

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "your-s3-bucket"

# Upload model artifacts to S3
byoc_config_uri = sagemaker_session.upload_data(
    path="./code",
    bucket=bucket,
    key_prefix="voxtral-vllm-byoc/code"
)

# Configure custom container image
account_id = boto3.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/voxtral-vllm-byoc:latest"

# Create SageMaker model
voxtral_model = Model(
    image_uri=image_uri,
    model_data={
        "S3DataSource": {
            "S3Uri": f"{byoc_config_uri}/",
            "S3DataType": "S3Prefix",
            "CompressionType": "None"
        }
    },
    role=role,
    env={
        'MODEL_CACHE_DIR': '/opt/ml/model',
        'TRANSFORMERS_CACHE': '/tmp/transformers_cache',
        'SAGEMAKER_BIND_TO_PORT': '8080'
    }
)

# Deploy to endpoint
predictor = voxtral_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.12xlarge",  # For Voxtral-Small
    container_startup_health_check_timeout=1200,
    wait=True
)

Model use cases

The Voxtral models support various text and speech-to-text use cases, and the Voxtral-Small model supports tool use with voice input. Refer to the GitHub repository for the complete code. In this section, we provide code snippets for the different use cases that the model supports.

Text-only

The following code shows a basic text-based conversation with the model. The user sends a text query and receives a structured response:

payload = {
    "messages": [
        {
            "role": "user",
            "content": "Hello! Can you tell me about the advantages of using vLLM for model inference?"
        }
    ],
    "max_tokens": 200,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
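Because the handler proxies requests through the vLLM OpenAI-compatible client, the response is expected to follow the chat completion format; the exact shape depends on the handler, so treat the following as an assumption:

# Assumes an OpenAI-style chat completion response (verify against your endpoint's output)
print(response["choices"][0]["message"]["content"])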

Transcription-only

The following example focuses on speech-to-text transcription by setting temperature to 0 for deterministic output. The model processes an audio file URL or an audio file converted to base64, then returns the transcribed text without additional interpretation:

payload = {
    "transcription": {
        "audio": "https://audiocdn.frenchtoday.com/file/ft-public-files/audiobook-samples/AMPFE/AMP%20FE%20Ch%2002%20Story%20Slower.mp3",
        "language": "fr",
        "temperature": 0.0
    }
}
response = predictor.predict(payload)
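Because the handler also accepts base64-encoded audio, a local file can be sent instead of a URL. The snippet below assumes the audio field takes a raw base64 string; the file name is a placeholder:

import base64

# Encode a local audio file and send it in place of a URL (field convention assumed)
with open("sample_audio.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "transcription": {
        "audio": audio_b64,
        "language": "fr",
        "temperature": 0.0
    }
}
response = predictor.predict(payload)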

Text and audio understanding

The following code combines both text instructions and audio input for multimodal processing. The model can follow specific text commands while analyzing the provided audio file in a single inference pass, enabling more complex interactions like guided transcription or audio analysis tasks:

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Can you summarise this audio file"
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3"
                }
            ]
        }
    ],
    "max_tokens": 300,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)

Tool use

The following code showcases function calling capabilities, where the model can interpret voice commands and execute predefined tools. The example demonstrates weather queries by voice input, with the model automatically calling the appropriate function and returning structured results:

# Define weather tool configuration
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a specific location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use."
                }
            },
            "required": ["location", "format"]
        }
    }
}

# Mock weather function
def mock_weather(location, format="celsius"):
    """Always returns sunny weather at 25°C/77°F"""
    temp = 77 if format.lower() == "fahrenheit" else 25
    unit = "°F" if format.lower() == "fahrenheit" else "°C"
    return f"It's sunny in {location} with {temp}{unit}"

# Test payload with audio
payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/patrickvonplaten/audio_samples/resolve/main/fn_calling.wav"
                }
            ]
        }
    ],
    "temperature": 0.2,
    "top_p": 0.95,
    "tools": [WEATHER_TOOL]
}
response = predictor.predict(payload)
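To complete the round trip, the tool call returned by the model can be dispatched to mock_weather. This sketch assumes an OpenAI-style tool_calls layout in the response; adjust the parsing to the handler's actual output:

import json

# Dispatch the model's tool call to the local mock implementation (response shape assumed)
tool_call = response["choices"][0]["message"]["tool_calls"][0]
if tool_call["function"]["name"] == "get_current_weather":
    args = json.loads(tool_call["function"]["arguments"])
    print(mock_weather(args["location"], args.get("format", "celsius")))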

Strands Agents integration

The following example shows how to integrate Voxtral with the Strands framework to create intelligent agents capable of using multiple tools. The agent can automatically select and execute appropriate tools (such as calculator, file operations, or shell commands from Strands prebuilt tools) based on user queries, enabling complex multi-step workflows through natural language interaction:

# SageMaker integration with Strands agents
from strands import Agent
from strands.models.sagemaker import SageMakerAIModel
from strands_tools import calculator, current_time, file_read, shell

model = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": endpoint_name,
        "region_name": "us-west-2",
    },
    payload_config={
        "max_tokens": 1000,
        "temperature": 0.7,
        "stream": False,
    }
)
agent = Agent(model=model, tools=[calculator, current_time, file_read, shell])
response = agent("What is the square root of 12?")

Clean up

When you finish experimenting with this example, delete the SageMaker endpoints that you created in the notebook to avoid unnecessary costs:

# Delete SageMaker endpoint
print(f"Deleting endpoint: {endpoint_name}")
predictor.delete_endpoint(delete_endpoint_config=True)
print("Endpoint deleted successfully")

Conclusion

In this post, we demonstrated how to successfully self-host Mistral's open source Voxtral models on SageMaker using the BYOC approach. We created a production-ready system that uses the latest vLLM framework and official Voxtral optimizations for both the Mini and Small model variants. The solution supports the full spectrum of Voxtral capabilities, including text-only conversations, audio transcription, sophisticated multimodal understanding, and function calling directly from voice input. With this flexible architecture, you can switch between Voxtral-Mini and Voxtral-Small models through simple configuration updates without requiring container rebuilds.

Take your multimodal AI applications to the next level by trying out the complete code from the GitHub repository to host the Voxtral model on SageMaker and start building your own voice-enabled applications. Discover Voxtral's full potential by visiting Mistral's official website to find detailed capabilities, performance benchmarks, and technical specifications. Finally, explore the Strands Agents framework to seamlessly create agentic applications that can execute complex workflows.


About the authors

Ying Hou, PhD, is a Sr. Specialist Solution Architect for GenAI at AWS, where she collaborates with model providers to onboard the latest and most intelligent AI models onto AWS platforms. With deep expertise in GenAI, ASR, computer vision, NLP, and time-series forecasting models, she works closely with customers to design and build cutting-edge ML and GenAI applications.
