
Building a custom model provider for Strands Agents with LLMs hosted on SageMaker AI endpoints

By admin | March 7, 2026 | Artificial Intelligence


Organizations increasingly deploy custom large language models (LLMs) on Amazon SageMaker AI real-time endpoints using their preferred serving frameworks—such as SGLang, vLLM, or TorchServe—to gain greater control over their deployments, optimize costs, and align with compliance requirements. However, this flexibility introduces a critical technical challenge: response format incompatibility with Strands Agents. While these custom serving frameworks typically return responses in OpenAI-compatible formats to facilitate broad ecosystem support, Strands Agents expect model responses aligned with the Bedrock Messages API format.

The problem is particularly significant because support for the Messages API is not guaranteed for models hosted on SageMaker AI real-time endpoints. While the Amazon Bedrock Mantle distributed inference engine has supported OpenAI messaging formats since December 2025, the flexibility of SageMaker AI allows customers to host a wide variety of foundation models—some requiring esoteric prompt and response formats that don't conform to standard APIs. This creates a gap between the serving framework's output structure and what Strands expects, preventing seamless integration despite both systems being technically functional. The solution lies in implementing custom model parsers that extend SageMakerAIModel and translate the model server's response format into what Strands expects, enabling organizations to use their preferred serving frameworks without sacrificing compatibility with the Strands Agents SDK.

This post demonstrates how to build custom model parsers for Strands Agents when working with LLMs hosted on SageMaker that don't natively support the Bedrock Messages API format. We'll walk through deploying Llama 3.1 with SGLang on SageMaker using awslabs/ml-container-creator, then implementing a custom parser to integrate it with Strands Agents.

Strands Custom Parsers

Strands Agents expect model responses in a specific format aligned with the Bedrock Messages API. When you deploy models using custom serving frameworks like SGLang, vLLM, or TorchServe, they typically return responses in their own formats—often OpenAI-compatible for broad ecosystem support. Without a custom parser, you'll encounter errors like:

TypeError: 'NoneType' object is not subscriptable

This happens because the default Strands Agents SageMakerAIModel class attempts to parse responses assuming a specific structure that your custom endpoint doesn't provide. In this post and the companion code base, we illustrate how to extend the SageMakerAIModel class with custom parsing logic that translates your model server's response format into what Strands expects.
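To make the failure mode concrete, here is a minimal sketch (not the actual Strands internals) of why a parser written for one response shape raises exactly this error on another. The `parse_bedrock_style` helper and the `"output"` key it reads are hypothetical stand-ins for Bedrock-Messages-style field access:

```python
# An OpenAI-compatible chat completion, as returned by SGLang/vLLM servers
openai_style = {
    "choices": [{"message": {"role": "assistant", "content": "Hi!"}}]
}

def parse_bedrock_style(response):
    # Hypothetical Bedrock-Messages-style access: "output" is absent from
    # an OpenAI-compatible payload, so .get() returns None and the
    # following subscript raises the TypeError shown above.
    return response.get("output")["message"]["content"]

try:
    parse_bedrock_style(openai_style)
except TypeError as err:
    print(err)  # 'NoneType' object is not subscriptable
```

A custom parser resolves this by reading the fields the serving framework actually emits (`choices[0].message.content` here) instead of the fields the default parser assumes.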

Implementation Overview

Our implementation consists of three layers:

  1. Model Deployment Layer: Llama 3.1 served by SGLang on SageMaker, returning OpenAI-compatible responses
  2. Parser Layer: A custom LlamaModelProvider class that extends SageMakerAIModel to handle Llama 3.1's response format
  3. Agent Layer: A Strands agent that uses the custom provider for conversational AI, correctly parsing the model's response

[Figure: custom parser process flow]

We start by using awslabs/ml-container-creator, an AWS Labs open-source Yeoman generator that automates the creation of SageMaker BYOC (Bring Your Own Container) deployment projects. It generates the artifacts needed to build LLM serving containers, including Dockerfiles, CodeBuild configurations, and deployment scripts.

Install ml-container-creator

The first step is to build the serving container for our model. We use an open-source project to build the container and generate deployment scripts for it. The following commands show how to install awslabs/ml-container-creator and its dependencies, which include npm and Yeoman. For more information, review the project's README and Wiki to get started.

# Install Yeoman globally
npm install -g yo

# Clone and install ml-container-creator
git clone https://github.com/awslabs/ml-container-creator
cd ml-container-creator
npm install && npm link

# Verify installation
yo --generators # Should show ml-container-creator

Generate the Deployment Project

Once installed and linked, the yo command lets you run installed generators; yo ml-container-creator runs the generator we need for this exercise.

# Run the generator
yo ml-container-creator

# Configuration options:
# - Framework: transformers
# - Model Server: sglang
# - Model: meta-llama/Llama-3.1-8B-Instruct
# - Deploy Target: codebuild
# - Instance Type: ml.g6.12xlarge (GPU)
# - Region: us-east-1

The generator creates a complete project structure:

/
├── Dockerfile            # Container with SGLang and dependencies
├── buildspec.yml         # CodeBuild configuration
├── code/
│   └── serve             # SGLang server startup script
├── deploy/
│   ├── submit_build.sh   # Triggers CodeBuild
│   └── deploy.sh         # Deploys to SageMaker
└── test/
    └── test_endpoint.sh  # Endpoint testing script

Build and Deploy

Projects generated by awslabs/ml-container-creator include templatized build and deployment scripts. The ./deploy/submit_build.sh and ./deploy/deploy.sh scripts build the image, push it to Amazon Elastic Container Registry (Amazon ECR), and deploy it to an Amazon SageMaker AI real-time endpoint.

cd llama-31-deployment

# Build container with CodeBuild (no local Docker required)
./deploy/submit_build.sh

# Deploy to SageMaker
./deploy/deploy.sh arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole

The deployment process:

  1. CodeBuild builds the Docker image with SGLang and Llama 3.1
  2. The image is pushed to Amazon ECR
  3. SageMaker creates a real-time endpoint
  4. SGLang downloads the model from Hugging Face and loads it into GPU memory
  5. The endpoint reaches InService status (roughly 10–15 minutes)

We can test the endpoint using ./test/test_endpoint.sh, or with a direct invocation:

import boto3
import json

runtime_client = boto3.client('sagemaker-runtime', region_name="us-east-1")

payload = {
  "messages": [
    {"role": "user", "content": "Hello, how are you?"}
  ],
  "max_tokens": 100,
  "temperature": 0.7
}

response = runtime_client.invoke_endpoint(
  EndpointName="llama-31-deployment-endpoint",
  ContentType="application/json",
  Body=json.dumps(payload)
)

result = json.loads(response['Body'].read().decode('utf-8'))
print(result['choices'][0]['message']['content'])

Understanding the Response Format

Llama 3.1 returns OpenAI-compatible responses, while Strands expects model responses that adhere to the Bedrock Messages API format. Until late last year, this was a standard compatibility mismatch. Since December 2025, the Amazon Bedrock Mantle distributed inference engine supports OpenAI messaging formats:

{
  "id": "cmpl-abc123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "I'm doing well, thank you for asking!"},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 12,
    "total_tokens": 35
  }
}
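The translation a custom parser performs on a non-streaming payload like the one above is mechanical. A minimal sketch (the helper name is ours; the target field names follow the Bedrock Messages conventions of role plus a list of content blocks, as used throughout this post):

```python
def openai_to_bedrock_message(completion: dict) -> dict:
    """Map an OpenAI-style chat completion onto the Bedrock-Messages-style
    shape Strands expects: a role plus a list of content blocks."""
    choice = completion["choices"][0]
    return {
        "role": choice["message"]["role"],
        "content": [{"text": choice["message"]["content"]}],
    }

completion = {
    "choices": [{
        "index": 0,
        "message": {"role": "assistant",
                    "content": "I'm doing well, thank you for asking!"},
        "finish_reason": "stop",
    }],
}
print(openai_to_bedrock_message(completion))
# {'role': 'assistant', 'content': [{'text': "I'm doing well, thank you for asking!"}]}
```

The streaming case shown later applies the same mapping chunk by chunk, reading `choices[0].delta` instead of `choices[0].message`.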

However, support for the Messages API is not guaranteed for models hosted on SageMaker AI real-time endpoints. SageMaker AI allows customers to host many kinds of foundation models on managed GPU-accelerated infrastructure, some of which may require esoteric prompt/response formats. For example, the default SageMakerAIModel uses the legacy Bedrock Messages API format and attempts to access fields that don't exist in the standard OpenAI Messages format, causing TypeError-style failures.

Implementing a Custom Model Parser

Custom model parsers are a feature of the Strands Agents SDK that provides robust compatibility and flexibility for customers building agents powered by LLMs hosted on SageMaker AI. Here, we describe how to create a custom provider that extends SageMakerAIModel:

def stream(self, messages: List[Dict[str, Any]], tool_specs: list, system_prompt: Optional[str], **kwargs):
    # Build payload messages
    payload_messages = []
    if system_prompt:
        payload_messages.append({"role": "system", "content": system_prompt})

    # Extract message content from Strands format
    for msg in messages:
        payload_messages.append({"role": "user", "content": msg['content'][0]['text']})

    # Build full payload with streaming enabled
    payload = {
        "messages": payload_messages,
        "max_tokens": kwargs.get('max_tokens', self.max_tokens),
        "temperature": kwargs.get('temperature', self.temperature),
        "top_p": kwargs.get('top_p', self.top_p),
        "stream": True
    }

    try:
        # Invoke SageMaker endpoint with streaming
        response = self.runtime_client.invoke_endpoint_with_response_stream(
            EndpointName=self.endpoint_name,
            ContentType="application/json",
            Accept="application/json",
            Body=json.dumps(payload)
        )

        # Process streaming response
        accumulated_content = ""
        for event in response['Body']:
            chunk = event['PayloadPart']['Bytes'].decode('utf-8')
            if not chunk.strip():
                continue

            # Parse SSE format: "data: {json}\n"
            for line in chunk.split('\n'):
                if line.startswith('data: '):
                    try:
                        json_str = line.replace('data: ', '').strip()
                        if not json_str:
                            continue

                        chunk_data = json.loads(json_str)
                        if 'choices' in chunk_data and chunk_data['choices']:
                            delta = chunk_data['choices'][0].get('delta', {})

                            # Yield content delta in Strands format
                            if 'content' in delta:
                                content_chunk = delta['content']
                                accumulated_content += content_chunk
                                yield {
                                    "type": "contentBlockDelta",
                                    "delta": {"text": content_chunk},
                                    "contentBlockIndex": 0
                                }

                            # Check for completion
                            finish_reason = chunk_data['choices'][0].get('finish_reason')
                            if finish_reason:
                                yield {
                                    "type": "messageStop",
                                    "stopReason": finish_reason
                                }

                        # Yield usage metadata
                        if 'usage' in chunk_data:
                            yield {
                                "type": "metadata",
                                "usage": chunk_data['usage']
                            }

                    except json.JSONDecodeError:
                        continue

    except Exception as e:
        yield {
            "type": "error",
            "error": {
                "message": f"Endpoint invocation failed: {str(e)}",
                "type": "EndpointInvocationError"
            }
        }

The stream method overrides the behavior of SageMakerAIModel and allows the agent to parse responses based on the requirements of the underlying model. While the vast majority of models do support OpenAI's Messages API protocol, this capability enables power users to leverage highly specialized LLMs on SageMaker AI to power agent workloads with the Strands Agents SDK. Once the custom model response logic is built, the Strands Agents SDK makes it simple to initialize agents with custom model providers:

from strands.agent import Agent

# Initialize custom provider
provider = LlamaModelProvider(
  endpoint_name="llama-31-deployment-endpoint",
  region_name="us-east-1",
  max_tokens=1000,
  temperature=0.7
)

# Create agent with custom provider
agent = Agent(
  name="llama-assistant",
  model=provider,
  system_prompt=(
    "You are a helpful AI assistant powered by Llama 3.1, "
    "deployed on Amazon SageMaker. You provide clear, accurate, "
    "and friendly responses to user questions."
  )
)

# Test the agent
response = agent("What are the key benefits of deploying LLMs on SageMaker?")
print(response.content)

The complete implementation for this custom parser, including a Jupyter notebook with detailed explanations and the ml-container-creator deployment project, is available in the companion GitHub repository.

Conclusion

Building custom model parsers for Strands Agents lets users leverage different LLM deployments on SageMaker, regardless of their response format. By extending SageMakerAIModel and implementing the stream() method, you can integrate custom-hosted models while maintaining the clean agent interface of Strands.

Key takeaways:

  1. awslabs/ml-container-creator simplifies SageMaker BYOC deployments with production-ready infrastructure code
  2. Custom parsers bridge the gap between model server response formats and Strands expectations
  3. The stream() method is the essential integration point for custom providers

About the author

Dan Ferguson is a Sr. Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan works to help customers on their journey to integrating ML workflows efficiently, effectively, and sustainably.
