Announcing OpenAI-compatible API support for Amazon SageMaker AI endpoints

Today, Amazon SageMaker AI introduces OpenAI-compatible API support for real-time inference endpoints. If you use the OpenAI SDK, LangChain, or Strands Agents, you can now invoke models on SageMaker AI by changing only your endpoint URL. You don’t need a custom client, a SigV4 wrapper, or code rewrites.

Overview

With this launch, SageMaker AI endpoints expose an /openai/v1 path that accepts Chat Completions requests and returns responses as is from the container, including streaming. OpenAI endpoints are turned on for all endpoints and inference components using standard SageMaker AI APIs and SDK.

SageMaker AI routes based on the endpoint name in the URL, so any OpenAI-compatible client works out of the box. You can now create time-limited bearer tokens for your endpoints and use them with your OpenAI clients.

For a working example that includes deployment and invocation, see the accompanying notebook on GitHub.

“We run AI coding agents that use multiple LLM providers through an LLM gateway (Bifrost) speaking the OpenAI chat completions protocol. The bearer token feature lets us add SageMaker as a drop-in OpenAI-compatible inference endpoint (no custom SigV4 signing), so it works natively with our gateway, Vercel AI SDK, and standard OpenAI clients,” says Giorgio Piatti (AI/ML Engineer – Caffeine.AI).

Use cases

Agentic workflows on owned infrastructure

If you build multi-step AI agents with frameworks like Strands Agents or LangChain, you can now run those workflows entirely on your own SageMaker AI endpoints. Your agents call models using the same OpenAI-compatible interface they were built on, but inference runs on dedicated GPU instances in your own account.

Multi-model hosting with a single interface

If you run multiple models (for example, Llama for general tasks, a fine-tuned Mistral for domain-specific work, and a smaller model for classification), you can host all of them on a single SageMaker AI endpoint using inference components. Each model gets its own resource allocation, and each one is callable through the same OpenAI SDK. You don’t need separate API clients or routing logic in application code.

Serving fine-tuned models without code changes

If you fine-tune open source models for your specific use case, you can deploy them on SageMaker AI and call them through the same OpenAI-compatible interface that your applications already use. The only change is the endpoint URL. The rest of the application (the SDK calls, the streaming logic, the prompt formatting) stays the same.

Solution overview

In this post, we walk through the following:

How bearer token authentication works with SageMaker AI endpoints.
Deploying and invoking a single-model endpoint.
Deploying and invoking inference components for multi-model deployments.
Integration with the Strands Agents framework.

Prerequisites

To follow along with this walkthrough, you must have the following:

An AWS account with permissions to create SageMaker AI endpoints.
The SageMaker Python SDK (pip install sagemaker).
The OpenAI Python SDK (pip install openai).
A model stored in Amazon Simple Storage Service (Amazon S3). For example, Qwen3-4B downloaded from Hugging Face.
An AWS Identity and Access Management (IAM) execution role to create the endpoints, with the AmazonSageMakerFullAccess policy.
An IAM execution role with the sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint permissions to invoke the endpoint.

Authentication with bearer tokens

SageMaker AI OpenAI-compatible endpoints use bearer token authentication. The SageMaker Python SDK includes a token generator that creates time-limited tokens (valid for up to 12 hours) from your existing AWS credentials. No additional secrets or API keys are required.

The token contains your role or user credentials, and it requires the sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint action permissions.

Generate a token

Use the following Python script to generate a token.

from sagemaker.core.token_generator import generate_token
from datetime import timedelta

token = generate_token(region="us-west-2", expiry=timedelta(minutes=5))

The token generator uses whatever AWS credentials are available in your environment: IAM user credentials, an instance profile on Amazon Elastic Compute Cloud (Amazon EC2), or an AWS IAM Identity Center (SSO) session.

The generate_token function generates a time-limited bearer token for authenticating with SageMaker APIs. By default, tokens are valid for 12 hours, though you can override this with the expiry parameter using a timedelta value anywhere between 1 second and 12 hours. The function accepts a region, an optional aws_credentials_provider, and the expiry duration. If no AWS Region is provided, it falls back to the AWS_REGION environment variable. If no credentials provider is supplied, it resolves credentials using the default AWS credential chain, which searches multiple sources, including environment variables, ~/.aws/credentials, ~/.aws/config, container credentials, and instance profiles. For the full resolution order, see the Boto3 credentials documentation.
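Because the expiry must fall between 1 second and 12 hours, it can help to check the value before requesting a token. The following is a minimal sketch of such a check. The check_expiry helper and the two constants are hypothetical names introduced for illustration and are not part of the SageMaker Python SDK.

```python
from datetime import timedelta

# Bounds taken from the documented range: 1 second to 12 hours.
MIN_EXPIRY = timedelta(seconds=1)
MAX_EXPIRY = timedelta(hours=12)


def check_expiry(expiry: timedelta) -> timedelta:
    """Return expiry unchanged if it lies within the documented range.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    if not MIN_EXPIRY <= expiry <= MAX_EXPIRY:
        raise ValueError(
            f"expiry must be between {MIN_EXPIRY} and {MAX_EXPIRY}, got {expiry}"
        )
    return expiry
```

A caller could then pass check_expiry(timedelta(minutes=5)) as the expiry argument, so that an out-of-range value fails early with a clear message.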

Auto-refresh tokens for long-running applications

For applications that run continuously, you can implement an auto-refreshing pattern using httpx so that a fresh token is generated on each request:

import httpx
from sagemaker.core.token_generator import generate_token

class SageMakerAuth(httpx.Auth):
    def __init__(self, region: str):
        self.region = region

    def auth_flow(self, request):
        # Generate a fresh token for every outgoing request
        request.headers["Authorization"] = f"Bearer {generate_token(region=self.region)}"
        yield request

http_client = httpx.Client(auth=SageMakerAuth(region="us-west-2"))

IAM permissions

The IAM role or user invoking the endpoint needs the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": "arn:aws:sagemaker:::endpoint/"
        },
        {
            "Effect": "Allow",
            "Action": "sagemaker:CallWithBearerToken",
            "Resource": "*"
        }
    ]
}

As a best practice, always restrict the Resource to specific endpoint ARNs for InvokeEndpoint rather than using a wildcard. The bearer token generated from this role has the same level of access, so a narrowly scoped policy limits the blast radius if a token is inadvertently exposed. Note that CallWithBearerToken requires a wildcard ("*") for the Resource field. It doesn’t support resource-level restrictions.
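To illustrate the scoping advice, the following sketch builds a policy document in which InvokeEndpoint is limited to a single endpoint ARN while CallWithBearerToken keeps the required wildcard. The build_invoke_policy helper is a hypothetical name, and the Region, account ID, and endpoint name are placeholder values chosen for this example.

```python
import json


def build_invoke_policy(region: str, account_id: str, endpoint_name: str) -> dict:
    """Build an IAM policy that scopes InvokeEndpoint to one endpoint.

    Hypothetical helper for illustration only.
    """
    endpoint_arn = f"arn:aws:sagemaker:{region}:{account_id}:endpoint/{endpoint_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:InvokeEndpoint",
                "Resource": endpoint_arn,
            },
            {
                # CallWithBearerToken does not support resource-level restrictions.
                "Effect": "Allow",
                "Action": "sagemaker:CallWithBearerToken",
                "Resource": "*",
            },
        ],
    }


# Placeholder values; substitute your own Region, account ID, and endpoint name.
policy = build_invoke_policy("us-west-2", "111122223333", "my-endpoint")
print(json.dumps(policy, indent=4))
```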

How the token works

The bearer token is a base64-encoded SigV4 pre-signed URL. When you call generate_token, the SageMaker AI SDK constructs a request to the SageMaker AI service for the CallWithBearerToken action, signs it locally using your AWS credentials, and encodes the resulting signed URL as a portable token string. No network call is made during token generation. The signing happens entirely on the client side. When you present this token to a SageMaker AI endpoint, the service decodes it, validates the SigV4 signature, verifies that the token hasn’t expired, and confirms that the originating IAM identity has the required permissions. The token’s effective lifetime is the lesser of the expiry value and the remaining validity of the AWS credentials used to sign it.
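The encode-then-check idea can be sketched with the standard library alone. This is a conceptual illustration, not the SDK's implementation: the URL below is made up, the signature is a placeholder, and the exact token layout is an assumption. The sketch relies only on the standard SigV4 query parameters X-Amz-Date and X-Amz-Expires to show how an expiry check on a decoded pre-signed URL might work.

```python
import base64
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def encode_token(presigned_url: str) -> str:
    """Base64-encode a signed URL into an opaque token string."""
    return base64.b64encode(presigned_url.encode("utf-8")).decode("ascii")


def is_expired(token: str, now: datetime) -> bool:
    """Decode the token and compare signing time plus lifetime against now."""
    url = base64.b64decode(token).decode("utf-8")
    query = parse_qs(urlsplit(url).query)
    signed_at = datetime.strptime(
        query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    lifetime = timedelta(seconds=int(query["X-Amz-Expires"][0]))
    return now >= signed_at + lifetime


# Made-up URL: the host and the signature value are placeholders.
example_url = (
    "https://example.invalid/?Action=CallWithBearerToken"
    "&X-Amz-Date=20250101T000000Z&X-Amz-Expires=300"
    "&X-Amz-Signature=placeholder"
)
token = encode_token(example_url)
signed_at = datetime(2025, 1, 1, tzinfo=timezone.utc)

print(is_expired(token, signed_at + timedelta(seconds=299)))
print(is_expired(token, signed_at + timedelta(seconds=301)))
```

The real service also validates the signature and the caller's IAM permissions, which this sketch does not attempt.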

Security best practice: The bearer token carries the same authorization as the underlying AWS credentials used to generate it. Treat tokens with the same care as credentials. Scope the IAM role used for token generation to the minimum permissions required, specifically sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint, with InvokeEndpoint restricted to only the endpoint ARNs that the caller needs to access. Don’t generate tokens from roles with expansive permissions, such as those granted by the AdministratorAccess or AmazonSageMakerFullAccess managed policies.

Don’t store tokens on disk, in environment variables, in configuration files, in databases, or in distributed caches. Don’t log tokens, and only transmit them over encrypted communication protocols such as HTTPS. Token generation is a local operation with no network overhead, so the recommended practice is to generate a fresh token at the point of use or use the auto-refreshing httpx.Auth pattern shown in the preceding example. This reduces the risk of token leakage and helps you use a token with maximum remaining validity. As a best practice, set the token expiry to the shortest duration your workload requires.
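One way to reduce the chance of a token reaching your logs is a logging filter that masks anything following a Bearer scheme. The following is a minimal sketch using only the Python standard library. The redact function, the RedactBearerFilter class, and the logger name are hypothetical, and the pattern assumes that tokens consist of base64-style characters.

```python
import logging
import re

# Matches "Bearer <token>" and captures the scheme so it can be kept.
_BEARER = re.compile(r"(Bearer\s+)[A-Za-z0-9+/=_\-.]+")


def redact(text: str) -> str:
    """Replace any bearer token in text with a fixed marker."""
    return _BEARER.sub(r"\1[REDACTED]", text)


class RedactBearerFilter(logging.Filter):
    """Logging filter that masks bearer tokens in formatted messages."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so tokens passed as arguments are covered.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True


logger = logging.getLogger("sagemaker_demo")
logger.addFilter(RedactBearerFilter())
```

A filter like this is a safety net and not a substitute for keeping tokens out of log statements in the first place.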

Deploy a single-model endpoint

A single-model endpoint hosts one model and serves requests directly. The following example deploys Qwen3-4B using the SageMaker AI vLLM Deep Learning Container on an ml.g6.2xlarge instance.

Note: SageMaker AI endpoints incur charges while in service, regardless of traffic. For more details, see the Amazon SageMaker AI pricing page.

import boto3
import sagemaker
import time
from sagemaker.core.helper.session_helper import Session
from sagemaker.core.helper.session_helper import get_execution_role

# AWS configuration
REGION = "us-west-2"

# Automatically resolve the account ID and default SageMaker execution role
session = Session(boto_session=boto3.Session(region_name=REGION))
ACCOUNT_ID = boto3.client("sts", region_name=REGION).get_caller_identity()["Account"]
EXECUTION_ROLE = get_execution_role(sagemaker_session=session)

# Hugging Face model ID
MODEL_HF_ID = "Qwen/Qwen3-4B"

# SageMaker vLLM Deep Learning Container
VLLM_IMAGE = f"763104351884.dkr.ecr.{REGION}.amazonaws.com/vllm:0.20.2-gpu-py312-cu130-ubuntu22.04-sagemaker"

# Instance type (1x NVIDIA L4 GPU)
INSTANCE_TYPE = "ml.g6.2xlarge"

sagemaker_client = boto3.client("sagemaker", region_name=REGION)

print(f"Region: {REGION}")
print(f"Account ID: {ACCOUNT_ID}")
print(f"Execution role: {EXECUTION_ROLE}")
print(f"Model HF ID: {MODEL_HF_ID}")

import time

TIMESTAMP = str(int(time.time()))
SME_MODEL_NAME = f"openai-compat-sme-model-{TIMESTAMP}"
SME_ENDPOINT_CONFIG_NAME = f"openai-compat-sme-epc-{TIMESTAMP}"
SME_ENDPOINT_NAME = f"openai-compat-sme-ep-{TIMESTAMP}"

print(f"Timestamp suffix: {TIMESTAMP}")
print(f"Model: {SME_MODEL_NAME}")
print(f"Endpoint config: {SME_ENDPOINT_CONFIG_NAME}")
print(f"Endpoint: {SME_ENDPOINT_NAME}")

sagemaker_client.create_model(
    ModelName=SME_MODEL_NAME,
    ExecutionRoleArn=EXECUTION_ROLE,
    PrimaryContainer={
        "Image": VLLM_IMAGE,
        "Environment": {
            "HF_MODEL_ID": MODEL_HF_ID,
            "SM_VLLM_TENSOR_PARALLEL_SIZE": "1",
            "SM_VLLM_MAX_NUM_SEQS": "4",
            "SM_VLLM_ENABLE_AUTO_TOOL_CHOICE": "true",
            "SM_VLLM_TOOL_CALL_PARSER": "hermes",
            "SAGEMAKER_ENABLE_LOAD_AWARE": "1",
        },
    },
)
print(f"Model created: {SME_MODEL_NAME}")

sagemaker_client.create_endpoint_config(
    EndpointConfigName=SME_ENDPOINT_CONFIG_NAME,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": SME_MODEL_NAME,
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
        }
    ],
)
print(f"Endpoint configuration created: {SME_ENDPOINT_CONFIG_NAME}")

sagemaker_client.create_endpoint(
    EndpointName=SME_ENDPOINT_NAME,
    EndpointConfigName=SME_ENDPOINT_CONFIG_NAME,
)
print(f"Endpoint creation initiated: {SME_ENDPOINT_NAME}")

print("Waiting for endpoint to reach InService status (this takes 5-10 minutes)...")
waiter = sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(
    EndpointName=SME_ENDPOINT_NAME,
    WaiterConfig={"Delay": 30, "MaxAttempts": 40},
)
print(f"Endpoint is InService: {SME_ENDPOINT_NAME}")

The endpoint transitions to InService status within a few minutes. When ready, it serves both the standard SageMaker AI /invocations path and the OpenAI-compatible path at /openai/v1/chat/completions.

Invoke a single-model endpoint

With the endpoint in service, invoke it using the OpenAI Python SDK. The base URL follows this format:

https://runtime.sagemaker..amazonaws.com/endpoints//openai/v1

from openai import OpenAI
from sagemaker.core.token_generator import generate_token

REGION = "us-west-2"

sme_base_url = f"https://runtime.sagemaker.{REGION}.amazonaws.com/endpoints/{SME_ENDPOINT_NAME}/openai/v1"

client = OpenAI(
    base_url=sme_base_url,
    api_key=generate_token(region=REGION)
)

print(f"Base URL: {sme_base_url}")

stream = client.chat.completions.create(
    model="",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how transformers work in machine learning, in three sentences."},
    ],
    stream=True,
)

# Print each streamed text fragment as it arrives
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()

The model field is passed through to the container. Because SageMaker AI routes requests based on the endpoint name in the URL, you can leave this field empty or set it to match the model name your container expects.

Deploy an inference component endpoint

With inference components, you can host multiple models on a single endpoint, each with dedicated compute resource allocations. In this setup, the model is associated with the component rather than the endpoint configuration:

IC_MODEL_NAME = f"openai-compat-ic-model-{TIMESTAMP}"
IC_ENDPOINT_CONFIG_NAME = f"openai-compat-ic-epc-{TIMESTAMP}"
IC_ENDPOINT_NAME = f"openai-compat-ic-ep-{TIMESTAMP}"
IC_NAME = f"openai-compat-ic-qwen3-4b-{TIMESTAMP}"

print(f"Model: {IC_MODEL_NAME}")
print(f"Endpoint config: {IC_ENDPOINT_CONFIG_NAME}")
print(f"Endpoint: {IC_ENDPOINT_NAME}")
print(f"Inference comp: {IC_NAME}")

sagemaker_client.create_model(
    ModelName=IC_MODEL_NAME,
    ExecutionRoleArn=EXECUTION_ROLE,
    PrimaryContainer={
        "Image": VLLM_IMAGE,
        "Environment": {
            "HF_MODEL_ID": MODEL_HF_ID,
            "SM_VLLM_TENSOR_PARALLEL_SIZE": "1",
            "SM_VLLM_MAX_NUM_SEQS": "4",
            "SM_VLLM_ENABLE_AUTO_TOOL_CHOICE": "true",
            "SM_VLLM_TOOL_CALL_PARSER": "hermes",
            "SAGEMAKER_ENABLE_LOAD_AWARE": "1",
        },
    },
)
print(f"Model created: {IC_MODEL_NAME}")

sagemaker_client.create_endpoint_config(
    EndpointConfigName=IC_ENDPOINT_CONFIG_NAME,
    ExecutionRoleArn=EXECUTION_ROLE,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
        }
    ],
)
print(f"Endpoint configuration created: {IC_ENDPOINT_CONFIG_NAME}")

sagemaker_client.create_endpoint(
    EndpointName=IC_ENDPOINT_NAME,
    EndpointConfigName=IC_ENDPOINT_CONFIG_NAME,
)
print(f"Endpoint creation initiated: {IC_ENDPOINT_NAME}")

print("Waiting for endpoint to reach InService status (this takes 5-10 minutes)...")
waiter = sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(
    EndpointName=IC_ENDPOINT_NAME,
    WaiterConfig={"Delay": 30, "MaxAttempts": 40},
)
print(f"Endpoint is InService: {IC_ENDPOINT_NAME}")

sagemaker_client.create_inference_component(
    InferenceComponentName=IC_NAME,
    EndpointName=IC_ENDPOINT_NAME,
    VariantName="variant1",
    Specification={
        "ModelName": IC_MODEL_NAME,
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": 1024,
            "NumberOfCpuCoresRequired": 2,
            "NumberOfAcceleratorDevicesRequired": 1,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)
print(f"Inference component creation initiated: {IC_NAME}")

print("Waiting for inference component to reach InService status...")
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=IC_NAME)
    status = desc["InferenceComponentStatus"]
    if status == "InService":
        print(f"Inference component is InService: {IC_NAME}")
        break
    elif status == "Failed":
        raise RuntimeError(f"Inference component failed: {desc.get('FailureReason', 'unknown')}")
    time.sleep(30)

You can create additional inference components on the same endpoint to host multiple models with independent scaling and resource allocation.
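The polling loop in the preceding example runs until the component reaches InService or Failed, with no upper bound on the wait. The following sketch shows one way to add a deadline. The wait_for_status helper and its parameters are hypothetical names for illustration. It takes any zero-argument callable that returns the current status string, so it makes no assumptions about the SageMaker API beyond the status values used earlier in this post.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "InService",
    failed: str = "Failed",
    timeout_s: float = 1800.0,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll get_status() until it returns target; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return
        if status == failed:
            raise RuntimeError(f"resource entered {failed} status")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status is still {status!r} after {timeout_s} seconds")
        sleep(poll_s)
```

For the inference component, get_status could be a lambda that calls describe_inference_component and returns the InferenceComponentStatus field. The sleep parameter exists so that the helper can be exercised in tests without real delays.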

Invoke inference components

To invoke a specific inference component, include its name in the URL path:

https://runtime.sagemaker..amazonaws.com/endpoints//inference-components//openai/v1
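Both URL shapes used in this post, with and without an inference component, can be produced by one small function. The following is a sketch under that assumption. The openai_base_url helper is a hypothetical name, and the Region and resource names in the usage lines are placeholders.

```python
from typing import Optional
from urllib.parse import quote


def openai_base_url(
    region: str,
    endpoint_name: str,
    inference_component: Optional[str] = None,
) -> str:
    """Build the OpenAI-compatible base URL for an endpoint or one of its components.

    Hypothetical helper for illustration only.
    """
    path = f"/endpoints/{quote(endpoint_name, safe='')}"
    if inference_component:
        path += f"/inference-components/{quote(inference_component, safe='')}"
    return f"https://runtime.sagemaker.{region}.amazonaws.com{path}/openai/v1"


# Placeholder names; substitute your own Region, endpoint, and component.
print(openai_base_url("us-west-2", "my-endpoint"))
print(openai_base_url("us-west-2", "my-endpoint", "my-component"))
```

The returned string is what you would pass as base_url when constructing an OpenAI client.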

The following example shows how to target an inference component on a shared endpoint. Each inference component gets its own OpenAI client, and the clients share a connection pool:

import httpx
from openai import OpenAI
from sagemaker.core.token_generator import generate_token

shared_http = httpx.Client()

client_a = OpenAI(
    base_url=(
        f"https://runtime.sagemaker.{REGION}.amazonaws.com"
        f"/endpoints/{IC_ENDPOINT_NAME}/inference-components/{IC_NAME}/openai/v1"
    ),
    api_key=generate_token(region=REGION),
    http_client=shared_http,
)

response = client_a.chat.completions.create(
    model="",
    messages=[{"role": "user", "content": "What is 42 * 3? Reply with the number."}],
)
print(f"Response: {response.choices[0].message.content}")
print("Connection pool active: shared_http is reusable across multiple IC clients")

The shared httpx.Client lets every OpenAI client instance that uses it reuse the same TLS sessions and connection pool.

Integrate with Strands Agents

Strands Agents is an open source SDK for building AI agents. Because Strands Agents supports OpenAI-compatible model providers, you can now run multi-agent workflows entirely on your own SageMaker AI infrastructure. This gives you the flexibility of agentic applications with the control of dedicated endpoints. Your data never leaves your account, and you choose exactly which model version your agents run.

from openai import AsyncOpenAI
from strands import Agent, tool
from strands.models.openai import OpenAIModel
from sagemaker.core.token_generator import generate_token

@tool
def calculator(expression: str) -> str:
    """Evaluate a math expression."""
    # Note: eval executes arbitrary Python; use only with trusted input.
    return str(eval(expression))

strands_client = AsyncOpenAI(
    base_url=f"https://runtime.sagemaker.{REGION}.amazonaws.com/endpoints/{SME_ENDPOINT_NAME}/openai/v1",
    api_key=generate_token(region=REGION),
)

model = OpenAIModel(client=strands_client, model_id="", params={"temperature": 0.7})

coder = Agent(
    model=model,
    system_prompt=(
        "You are an expert Python developer. Write clean, well-documented "
        "Python code with type hints. Output ONLY the code, no explanation."
    ),
    tools=[calculator],
)

reviewer = Agent(
    model=model,
    system_prompt=(
        "You are a senior code reviewer. Review Python code for correctness, "
        "performance, and PEP 8 style. Give a concise review with specific suggestions."
    ),
    tools=[calculator],
)
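The calculator tool in the preceding example passes model-generated text to eval, which can execute arbitrary Python. One more restrictive option is to parse the expression with the standard library ast module and allow only basic arithmetic. The following sketch illustrates that alternative. It is not part of the original example or of the Strands Agents SDK, and safe_eval is a hypothetical name.

```python
import ast
import operator
from typing import Union

Number = Union[int, float]

# Only these operators are permitted; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> Number:
    """Evaluate a basic arithmetic expression without calling eval."""

    def _eval(node: ast.AST) -> Number:
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


print(safe_eval("42 * 3"))
```

A tool body could return str(safe_eval(expression)) in place of str(eval(expression)). Function calls, names, and attribute access all raise ValueError.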

Clean up

To avoid ongoing charges, delete your endpoints and associated resources when you’re done. SageMaker AI endpoints incur charges while in service, regardless of whether they’re receiving traffic.

import boto3
sagemaker_client = boto3.client("sagemaker", region_name="us-west-2")

sagemaker_client.delete_inference_component(InferenceComponentName="")
sagemaker_client.delete_endpoint(EndpointName="")
sagemaker_client.delete_endpoint_config(EndpointConfigName="")
sagemaker_client.delete_model(ModelName="")

Conclusion

With OpenAI-compatible API support, Amazon SageMaker AI removes the integration barrier between where most AI applications are today and the infrastructure they need to scale. You can keep your existing code, use any OpenAI-compatible framework, and run inference on dedicated endpoints with the GPU, scaling, and data residency controls you need. To get started, deploy a model on a SageMaker AI real-time endpoint using a supported container, install the SageMaker Python SDK, and point your OpenAI client at the endpoint URL. To learn more, see Use SageMaker AI with OpenAI-compatible APIs in the Amazon SageMaker AI Developer Guide, or open the Amazon SageMaker AI console to create your first endpoint.