At present, Amazon SageMaker AI introduces OpenAI-compatible API assist for real-time inference endpoints. Should you use the OpenAI SDK, LangChain, or Strands Brokers, now you can invoke fashions on SageMaker AI by altering solely your endpoint URL. You don’t want a customized shopper, a SigV4 wrapper, or code rewrites.
Overview
With this launch, SageMaker AI endpoints expose an /openai/v1 path that accepts Chat Completions requests and returns responses as is from the container, together with streaming. OpenAI endpoints are turned on for all endpoints and inference elements utilizing commonplace SageMaker AI APIs and SDK.
SageMaker AI routes primarily based on the endpoint identify within the URL, so any OpenAI-compatible shopper works out of the field. Now you can create time-limited bearer tokens on your endpoints and use them together with your OpenAI shoppers.
For a working instance that features deployment and invocation, see the accompanying pocket book on GitHub.
“We run AI coding brokers that use a number of LLM suppliers by means of an LLM gateway (Bifrost) talking the OpenAI chat completions protocol. The bearer token function lets us add SageMaker as a drop-in OpenAI-compatible inference endpoint — no customized SigV4 signing — so it really works natively with our gateway, Vercel AI SDK, and commonplace OpenAI shoppers.” says Giorgio Piatti (AI/ML Engineer – Caffeine.AI)
Use instances
Agentic workflows on owned infrastructure
Should you construct multi-step AI brokers with frameworks like Strands Brokers or LangChain, now you can run these workflows fully by yourself SageMaker AI endpoints. Your brokers name fashions utilizing the identical OpenAI-compatible interface they have been constructed on, however inference runs on devoted GPU situations in your individual account.
Multi-model internet hosting with a single interface
Should you run a number of fashions—for instance, Llama for basic duties, a fine-tuned Mistral for domain-specific work, and a smaller mannequin for classification—you’ll be able to host all of them on a single SageMaker AI endpoint utilizing inference elements. Every mannequin will get its personal useful resource allocation, and each one is callable by means of the identical OpenAI SDK. You don’t want separate API shoppers or routing logic in utility code.
Serving fine-tuned fashions with out code adjustments
Should you fine-tune open supply fashions on your particular use case, you’ll be able to deploy them on SageMaker AI and name them by means of the identical OpenAI-compatible interface that your purposes already use. The one change is the endpoint URL. The remainder of the appliance—the SDK calls, the streaming logic, the immediate formatting—stays the identical.
Resolution overview
On this submit, we stroll by means of the next:
- How bearer token authentication works with SageMaker AI endpoints.
- Deploying and invoking a single-model endpoint.
- Deploying and invoking inference elements for multi-model deployments.
- Integration with the Strands Brokers framework.
Stipulations
To comply with together with this walkthrough, you need to have the next:
- An AWS account with permissions to create SageMaker AI endpoints.
- The SageMaker Python SDK (
pip set up sagemaker). - The OpenAI Python SDK (
pip set up openai). - A mannequin saved in Amazon Easy Storage Service (Amazon S3). For instance, Qwen3-4B downloaded from Hugging Face.
- An AWS Id and Entry Administration (IAM) execution function to create the endpoints, with the
AmazonSageMakerFullAccesscoverage. - An IAM execution function with the
sagemaker:CallWithBearerTokenandsagemaker:InvokeEndpointpermissions to invoke the endpoint.
Authentication with bearer tokens
SageMaker AI OpenAI-compatible endpoints use bearer token authentication. The SageMaker Python SDK features a token generator that creates time-limited tokens (legitimate for as much as 12 hours) out of your present AWS credentials. No extra secrets and techniques or API keys are required.
The token accommodates your function or consumer credentials, and it requires the sagemaker:CallWithBearerToken and sagemaker:InvokeEndpoint motion permissions.
Generate a token
Use the next Python script to generate a token.
The token generator makes use of no matter AWS credentials can be found in your surroundings: IAM consumer credentials, an occasion profile on Amazon Elastic Compute Cloud (Amazon EC2), or an AWS IAM Id Heart (SSO) session.
The generate_token operate generates a time-limited bearer token for authenticating with SageMaker APIs. By default, tokens are legitimate for 12 hours, although you’ll be able to override this with the expiry parameter utilizing a timedelta worth anyplace between 1 second and 12 hours. The operate accepts a area, an non-obligatory aws_credentials_provider, and the expiry period. If no AWS Area is offered, it falls again to the AWS_REGION surroundings variable. If no credentials supplier is equipped, it resolves credentials utilizing the default AWS credential chain, which searches a number of sources, together with surroundings variables, ~/.aws/credentials, ~/.aws/config, container credentials, and occasion profiles. For the total decision order, see the Boto3 credentials documentation.
Auto-refresh tokens for long-running purposes
For purposes that run repeatedly, you’ll be able to implement an auto-refreshing sample utilizing httpx so {that a} contemporary token is generated on every request:
IAM permissions
The IAM function or consumer invoking the endpoint wants the next permissions:
As a greatest observe, all the time prohibit the Useful resource to particular endpoint ARNs for InvokeEndpoint fairly than utilizing a wildcard. The bearer token generated from this function has the identical degree of entry, so a narrowly scoped coverage limits the blast radius if a token is inadvertently uncovered. Word that CallWithBearerToken requires a wildcard ("*") for the Useful resource discipline. It doesn’t assist resource-level restrictions.
How the token works
The bearer token is a base64-encoded SigV4 pre-signed URL. If you name generate_token, the SageMaker AI SDK constructs a request to the SageMaker AI service for the CallWithBearerToken motion, indicators it domestically utilizing your AWS credentials, and encodes the ensuing signed URL as a conveyable token string. No community name is made throughout token technology. The signing occurs fully on the shopper facet. If you current this token to a SageMaker AI endpoint, the service decodes it, validates the SigV4 signature, verifies that the token hasn’t expired, and confirms that the originating IAM identification has the required permissions. The token’s efficient lifetime is the lesser of the expiry worth and the remaining validity of the AWS credentials used to signal it.
Safety greatest observe: The bearer token carries the identical authorization because the underlying AWS credentials used to generate it. Deal with tokens with the identical care as credentials. Scope the IAM function used for token technology to the minimal permissions required, particularly sagemaker:InvokeEndpoint and sagemaker:CallWithBearerToken on solely the endpoint ARNs that the caller must entry. Don’t generate tokens from roles with expansive permissions, equivalent to these granted by AdministratorAccess or SageMakerFullAccess managed insurance policies.
Don’t retailer tokens on disk, in surroundings variables, in configuration information, in databases, or in distributed caches. Don’t log tokens, and solely transmit them over encrypted communication protocols equivalent to HTTPS. Token technology is an area operation with no community overhead, so the beneficial observe is to generate a contemporary token on the level of use or use the auto-refreshing httpx.Auth sample proven within the previous instance. This avoids the chance of token leakage and helps you employ a token with most remaining validity. As a greatest observe, set the token expiry to the shortest period your workload requires.
Deploy a single-model endpoint
A single-model endpoint hosts one mannequin and serves requests instantly. The next instance deploys Qwen3-4B utilizing the SageMaker AI vLLM Deep Studying Container on an ml.g6.2xlarge occasion.
Word: SageMaker AI endpoints incur prices whereas in service, no matter site visitors. For extra particulars, see the Amazon SageMaker AI pricing web page.
The endpoint transitions to InService standing inside a couple of minutes. When prepared, it serves each the usual SageMaker AI /invocations path and the OpenAI-compatible path at /openai/v1/chat/completions.
Invoke a single-model endpoint
With the endpoint in service, invoke it utilizing the OpenAI Python SDK. The bottom URL follows this format:
The mannequin discipline is handed by means of to the container. As a result of SageMaker AI routes requests primarily based on the endpoint identify within the URL, you’ll be able to preserve this discipline empty or set it to match the mannequin identify your container expects.
Deploy an inference element endpoint
With inference elements, you’ll be able to host a number of fashions on a single endpoint, every with devoted compute useful resource allocations. With inference elements, the mannequin is related to the element fairly than the endpoint configuration:
You possibly can create extra inference elements on the identical endpoint to host a number of fashions with impartial scaling and useful resource allocation.
Invoke inference elements
To invoke a selected inference element, embody its identify within the URL path:
The next instance reveals two inference elements on a shared endpoint, every focused by a separate OpenAI shopper that shares a connection pool:
The shared httpx.Shopper permits each OpenAI shopper situations to reuse the identical TLS classes and connection pool.
Combine with Strands Brokers
Strands Brokers is an open supply SDK for constructing AI brokers. As a result of Strands Brokers helps OpenAI-compatible mannequin suppliers, now you can run multi-agent workflows fully by yourself SageMaker AI infrastructure. This offers you the flexibleness of agentic purposes with the management of devoted endpoints. Your knowledge by no means leaves your account, and also you select precisely which mannequin model your brokers run.
Clear up
To keep away from ongoing prices, delete your endpoints and related assets if you’re carried out. SageMaker AI endpoints incur prices whereas in service, no matter whether or not they’re receiving site visitors.
Conclusion
With OpenAI-compatible API assist, Amazon SageMaker AI removes the mixing barrier between the place most AI purposes are right this moment and the infrastructure they should scale. You possibly can preserve your present code, use any OpenAI-compatible framework, and run inference on devoted endpoints with the GPU, scaling, and knowledge residency controls you want. To get began, deploy a mannequin on a SageMaker AI real-time endpoint utilizing a supported container, set up the SageMaker Python SDK, and level your OpenAI shopper on the endpoint URL. To be taught extra, see Use SageMaker AI with OpenAI-compatible APIs within the Amazon SageMaker AI Developer Information, or open the Amazon SageMaker AI console to create your first endpoint.
In regards to the authors

