Optimizing LLM inference on Amazon SageMaker AI with BentoML’s LLM- Optimizer

The rise of highly effective giant language fashions (LLMs) that may be consumed through API calls has made it remarkably simple to combine synthetic intelligence (AI) capabilities into purposes. But regardless of this comfort, a major variety of enterprises are selecting to self-host their very own fashions—accepting the complexity of infrastructure administration, the price of GPUs within the serving stack, and the problem of retaining fashions up to date. The choice to self-host typically comes down to 2 crucial elements that APIs can’t deal with. First, there’s information sovereignty: the necessity to ensure that delicate data doesn’t go away the infrastructure, whether or not resulting from regulatory necessities, aggressive considerations, or contractual obligations with prospects. Second, there’s mannequin customization: the power to fantastic tune fashions on proprietary information units for industry-specific terminology and workflows or create specialised capabilities that general-purpose APIs can’t supply.

Amazon SageMaker AI addresses the infrastructure complexity of self-hosting by abstracting away the operational burden. By way of managed endpoints, SageMaker AI handles the provisioning, scaling, and monitoring of GPU assets, permitting groups to deal with mannequin efficiency relatively than infrastructure administration. The system gives inference-optimized containers with common frameworks like vLLM pre-configured for optimum throughput and minimal latency. For example, the Giant Mannequin Inference (LMI) v16 container picture makes use of vLLM v0.10.2, which makes use of the V1 engine and comes with help for brand new mannequin architectures and new {hardware}, such because the Blackwell/SM100 technology. This managed method transforms what usually requires devoted machine studying operations (MLOps) experience right into a deployment course of that takes only a few strains of code.

Reaching optimum efficiency with these managed containers nonetheless requires cautious configuration. Parameters like tensor parallelism diploma, batch dimension, most sequence size, and concurrency limits can dramatically affect each latency and throughput—and discovering the appropriate steadiness to your particular workload and value constraints is an iterative course of that may be time-consuming.

BentoML’s LLM-Optimizer addresses this problem by enabling systematic benchmarking throughout totally different parameter configurations, changing guide trial-and-error with an automatic search course of. The device lets you outline constraints akin to particular latency targets or throughput necessities, making it simple to determine configurations that meet your service degree aims. You should use LLM-Optimizer to search out optimum serving parameters for vLLM regionally or in your improvement surroundings, apply those self same configurations on to the SageMaker AI endpoint for a seamless transition to manufacturing. This publish illustrates this course of by discovering an optimum deployment for a Qwen-3-4B mannequin on an Amazon SageMaker AI endpoint.

This publish is written for working towards ML engineers, options architects, and system builders who already deploy fashions on Amazon SageMaker or comparable infrastructure. We assume familiarity with GPU cases, endpoints, and mannequin serving, and deal with sensible efficiency optimization. The reasons of inference metrics are included not as a newbie tutorial, however to construct shared instinct. For particular parameters like batch dimension & tensor parallelism, and the way they instantly affect price and latency in manufacturing.

Answer overview

The step-by-step breakdown is as follows:

Outline constraints in Jupyter Pocket book: The method begins inside SageMaker AI Studio, the place customers open a Jupyter Pocket book to outline the deployment targets and constraints of the use case. These constraints can embrace goal latency, desired throughput, and output tokens.
Run theoretical and empirical benchmarks with the BentoML LLM-Optimizer: The LLM-Optimizer first runs a theoretical GPU efficiency estimate to determine possible configurations for the chosen {hardware} (on this instance, an ml.g6.12xlarge). It executes benchmark assessments utilizing the vLLM serving engine throughout a number of parameter combos akin to tensor parallelism, batch dimension, and sequence size to empirically measure latency and throughput. Primarily based on these benchmarks, the optimizer mechanically determines probably the most environment friendly serving configuration that satisfies the offered constraints.
Generate and deploy optimized configuration in a SageMaker endpoint: As soon as the benchmarking is full, the optimizer returns a JSON configuration file containing the optimum parameter values. This JSON is handed from the Jupyter Pocket book to the SageMaker Endpoint configuration, which deploys the LLM (on this instance, the Qwen/Qwen3-4B mannequin utilizing the vLLM-based LMI container) in a managed HTTP endpoint utilizing the optimum runtime parameters.

The next determine is an summary of the workflow carried out all through the publish.

Earlier than leaping into the theoretical underpinnings of inference optimization, it’s value grounding why these ideas matter within the context of real-world deployments. When groups transfer from API-based fashions to self-hosted endpoints, they inherit the accountability for tuning efficiency parameters that instantly have an effect on price and person expertise. Understanding how latency and throughput work together by way of the lens of GPU structure and arithmetic depth permits engineers to make these trade-offs intentionally relatively than by trial and error.

Transient overview of LLM efficiency

Earlier than diving into the sensible utility of this workflow, we cowl key ideas that construct instinct for why inference optimization is crucial for LLM-powered purposes. The next primer isn’t educational; it’s to offer the psychological mannequin wanted to interpret LLM-Optimizer’s outputs and perceive why sure configurations yield higher outcomes.

Key efficiency metrics

Throughput (requests/second): What number of requests your system completes per second. Greater throughput means serving extra customers concurrently.

Latency (seconds): The entire time from when a request arrives till the whole response is returned. Decrease latency means sooner person expertise.

Arithmetic depth: The ratio of computation carried out to information moved. This determines whether or not your workload is:

Reminiscence-bound: Restricted by how briskly you possibly can transfer information (low arithmetic depth)

Compute-bound: Restricted by uncooked GPU processing energy (excessive arithmetic depth)

The roofline mannequin

The roofline mannequin visualizes efficiency by plotting throughput towards arithmetic depth. For deeper content material on the roofline mannequin, go to the AWS Neuron Batching documentation. The mannequin reveals whether or not your utility is bottlenecked by reminiscence bandwidth or computational capability. For LLM inference, this mannequin helps determine when you’re restricted by:

Reminiscence bandwidth: Information switch between GPU reminiscence and compute models (typical for small batch sizes)
Compute capability: Uncooked floating-point operations (FLOPS) obtainable on the GPU (typical for giant batch sizes)

The throughput-latency trade-off

In follow, optimizing LLM inference follows a elementary trade-off: as you enhance throughput, latency rises. This occurs as a result of:

Bigger batch sizes → Extra requests processed collectively → Greater throughput
Extra concurrent requests → Longer queue wait occasions → Greater latency
Tensor parallelism → Distributes mannequin throughout GPUs → Impacts each metrics otherwise

The problem lies to find the optimum configuration throughout a number of interdependent parameters:

Tensor parallelism diploma (what number of GPUs to make use of)
Batch dimension (most variety of tokens processed collectively)
Concurrency limits (most variety of simultaneous requests)
KV cache allocation (reminiscence for consideration states)

Every parameter impacts throughput and latency otherwise whereas respecting {hardware} constraints like GPU reminiscence and compute bandwidth. This multi-dimensional optimization drawback is exactly why LLM-Optimizer is efficacious—it systematically explores the configuration house relatively than counting on guide trial-and-error.

For an summary on LLM Inference as an entire, BentoML has offered invaluable assets of their LLM Inference Handbook.

Sensible utility: Discovering an optimum deployment of Qwen3-4B on Amazon SageMaker AI

Within the following sections, we stroll by way of a hands-on instance of figuring out and making use of optimum serving configurations for LLM deployment. Particularly, we:

Deploy the Qwen/Qwen3-4B mannequin utilizing vLLM on an ml.g6.12xlarge occasion (4x NVIDIA L4 GPUs, 24GB VRAM every).
Outline real looking workload constraints:
- Goal: 10 requests per second (RPS)
- Enter size: 1,024 tokens
- Output size: 512 tokens
Discover a number of serving parameter combos:
- Tensor parallelism diploma (1, 2, or 4 GPUs)
- Max batched tokens (4K, 8K, 16K)
- Concurrency ranges (32, 64, 128)
Analyze outcomes utilizing:
- Theoretical GPU reminiscence calculations
- Benchmarking information
- Throughput vs. latency trade-offs

By the top, you’ll see how theoretical evaluation, empirical benchmarking, and managed endpoint deployment come collectively to ship a production-ready LLM setup that balances latency, throughput, and value.

Conditions

The next are the conditions wanted to run by way of this instance:

Entry to SageMaker Studio. This makes deployment & inference simple, or an interactive improvement surroundings (IDE) akin to PyCharm or Visible Studio Code.
To benchmark and deploy the mannequin, examine that the really useful occasion sorts are accessible, based mostly on the mannequin dimension. To confirm the mandatory service quotas, full the next steps:
- On the Service Quotas console, beneath AWS Companies, choose Amazon SageMaker.
- Confirm adequate quota for the required occasion sort for “endpoint deployment” (within the right area).
- If wanted, request a quota enhance/contact AWS for help.

The next code particulars set up the mandatory packages:

pip set up vllm
pip set up git+https://github.com/bentoml/llm-optimizer.git

Run the LLM-Optimizer

To get began, instance constraints should be outlined based mostly on the focused workflow.

Instance constraints:

Enter tokens: 1024
Output tokens: 512
E2E Latency: <= 60 seconds
Throughput: >= 5 RPS

Run the estimate

Step one with llm-optimizer is to run an estimation. Working an estimate analyzes the Qwen/Qwen3-4b mannequin on 4x L4 GPUs and estimate the efficiency for an enter size of 1024 tokens, and an output of 512 tokens. As soon as run, the theoretical bests for latency and throughput are calculated mathematically and returned. The roofline evaluation returned identifies the workloads bottlenecks, and quite a few server and consumer arguments are returned, to be used within the following step, working the precise benchmark.

Beneath the hood, LLM-Optimizer performs roofline evaluation to estimate LLM serving efficiency. It begins by fetching the mannequin structure from HuggingFace to extract parameters like hidden dimensions, variety of layers, consideration heads, and whole parameters. Utilizing these architectural particulars, it calculates the theoretical FLOPs required for each prefill (processing enter tokens) and decode (producing output tokens) phases, accounting for consideration operations, MLP layers, and KV cache entry patterns. It compares the arithmetic depth (FLOPs per byte moved) of every part towards the GPU’s {hardware} traits—particularly the ratio of compute capability (TFLOPs) to reminiscence bandwidth (TB/s)—to find out whether or not prefill and decode are memory-bound or compute-bound. From this evaluation, the device estimates TTFT (time-to-first-token), ITL (inter-token latency), and end-to-end latency at varied concurrency ranges. It additionally calculates three theoretical concurrency limits: KV cache reminiscence capability, prefill compute capability, and decode throughput capability. Lastly, it generates tuning instructions that sweep throughout totally different tensor parallelism configurations, batch sizes, and concurrency ranges for empirical benchmarking to validate the theoretical predictions.

The next code particulars run an preliminary estimation based mostly on the chosen constraints:

llm-optimizer estimate  
--model Qwen/Qwen3-4B  
--input-len 1024  
--output-len 512  
--gpu L40  
--num-gpus 4

Anticipated output:

Auto-detected 4 GPU(s)
💡 Inferred precision from mannequin config: bf16

=== Configuration ===
Mannequin: Qwen/Qwen3-4B
GPU: 4x L40
Precision: bf16
Enter/Output: 1024/512 tokens
Goal: throughput

Fetching mannequin configuration...
Mannequin: 3668377600.0B parameters, 36 layers

=== Efficiency Evaluation ===
Finest Latency (concurrency=1):
  TTFT: 16.8 ms
  ITL: 1.4 ms
  E2E: 0.72 s

Finest Throughput (concurrency=1024):
  Output: 21601.0 tokens/s
  Enter: 61062.1 tokens/s
  Requests: 24.71 req/s
  Bottleneck: Reminiscence

=== Roofline Evaluation ===
{Hardware} Ops/Byte Ratio: 195.1 ops/byte
Prefill Arithmetic Depth: 31846.2 ops/byte
Decode Arithmetic Depth: 31.1 ops/byte
Prefill Section: Compute Sure
Decode Section: Reminiscence Sure

=== Concurrency Evaluation ===
KV Cache Reminiscence Restrict: 1258 concurrent requests
Prefill Compute Restrict: 21 concurrent requests
Decode Capability Restrict: 25 concurrent requests
Theoretical Total Restrict: 21 concurrent requests
Empirical Optimum Concurrency: 16 concurrent requests

=== Tuning Instructions ===

--- VLLM ---
Easy (concurrency + TP/DP):
  llm-optimizer --framework vllm --model Qwen/Qwen3-4B --gpus 4 --host 127.0.0.1 --server-args "tensor_parallel_size*data_parallel_size=[(1, 4), (2, 2), (4, 1)]" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=3072;max_concurrency=[512, 1024, 1536]" --output-dir tuning_results --output-json tuning_results/config_1_vllm.json
Superior (extra parameters):
  llm-optimizer --framework vllm --model Qwen/Qwen3-4B --gpus 4 --host 127.0.0.1 --server-args "tensor_parallel_size*data_parallel_size=[(1, 4), (2, 2), (4, 1)];max_num_batched_tokens=[16384, 24576, 32768]" --client-args "dataset_name=random;random_input_len=1024;random_output_len=512;random_range_ratio=0.95;num_prompts=3072;max_concurrency=[512, 1024, 1536]" --output-dir tuning_results --output-json tuning_results/config_1_vllm.json

Run the benchmark

With the estimation outputs in hand, an knowledgeable choice will be made on what parameters to make use of for benchmarking based mostly on the beforehand outlined constraints. Beneath the hood, LLM-Optimizer transitions from theoretical estimation to empirical validation by launching a distributed benchmarking loop that evaluates real-world serving efficiency on the goal {hardware}. For every permutation of server and consumer arguments, the device mechanically spins up a vLLM occasion with the desired tensor parallelism, batch dimension, and token limits, then drives load utilizing an artificial or dataset-based request generator (e.g., ShareGPT). Every run captures low-level metrics—time-to-first-token (TTFT), inter-token latency (ITL), end-to-end latency, tokens per second, and GPU reminiscence utilization—throughout concurrent request patterns. These measurements are aggregated right into a Pareto frontier, permitting LLM-Optimizer to determine configurations that finest steadiness latency and throughput inside the person’s constraints. In essence, this step grounds the sooner theoretical roofline evaluation in actual efficiency information, producing reproducible metrics that instantly inform deployment tuning.

The next code runs the benchmark, utilizing data from the estimate:

llm-optimizer 
  --framework vllm 
  --model Qwen/Qwen3-4B 
  --server-args "tensor_parallel_size=[1,2,4];max_num_batched_tokens=[4096,8192,16384]" 
  --client-args "max_concurrency=[32,64,128];num_prompts=1000;dataset_name=sharegpt" 
  --output-json vllm_results.json

This outputs the next permutations to the vLLM engine for testing. The next are easy calculations on the totally different combos of consumer & server arguments that the benchmark runs:

3 tensor_parallel_size x 3 max_num_batched_tokens settings = 9
3 max_concurrency x 1 num prompts = 3
9 * 3 = 27 totally different assessments

As soon as accomplished, three artifacts are generated:

An HTML file containing a Pareto dashboard of the outcomes: An interactive visualization that highlights the trade-offs between latency and throughput throughout the examined configurations.
A JSON file summarizing the benchmark outcomes: This compact output aggregates the important thing efficiency metrics (e.g., latency, throughput, GPU utilization) for every take a look at permutation and is used for programmatic evaluation or downstream automation.
A JSONL file containing the complete document of particular person benchmark runs: Every line represents a single take a look at configuration with detailed metadata, enabling fine-grained inspection, filtering, or customized plotting.

Instance benchmark document output:

{"config": {"client_args": {"max_concurrency": 32, "num_prompts": 1000, "dataset_name": "sharegpt"}, "server_args": {"tensor_parallel_size": 4, "max_num_batched_tokens": 8192}, "server_cmd_args": ["--tensor-parallel-size=4", "--max-num-batched-tokens=8192"]}, "outcomes": {"backend": "vllm", "dataset_name": "sharegpt", "max_concurrency": 32, "length": 178.69010206999883, "accomplished": 1000, "total_input_tokens": 302118, "total_output_tokens": 195775, "total_output_tokens_retokenized": 195764, "request_throughput": 5.5962808707125085, "input_throughput": 1690.7371840979215, "output_throughput": 1095.6118874637414, "mean_e2e_latency_ms": 5516.473195931989, "median_e2e_latency_ms": 3601.3218250000136, "std_e2e_latency_ms": 6086.249975393793, "p95_e2e_latency_ms": 17959.23558074991, "p99_e2e_latency_ms": 23288.202798799084, "mean_ttft_ms": 134.24923809297798, "median_ttft_ms": 75.87540699933015, "std_ttft_ms": 219.7887602629944, "p95_ttft_ms": 315.9690581494033, "p99_ttft_ms": 1222.5397153301492, "mean_tpot_ms": 28.140094508604655, "median_tpot_ms": 27.28665116875758, "std_tpot_ms": 7.497764233364623, "p95_tpot_ms": 36.30593537913286, "p99_tpot_ms": 48.05242155004177, "mean_itl_ms": 27.641122410215683, "median_itl_ms": 21.38108600047417, "std_itl_ms": 28.983685761892183, "p95_itl_ms": 64.98022639971161, "p99_itl_ms": 133.48110956045272, "concurrency": 30.871733420192484, "accept_length": null}, "cmd": "vllm serve Qwen/Qwen3-4B --host 127.0.0.1 --port 8000 --tensor-parallel-size=4 --max-num-batched-tokens=8192", "constraints": [], "metadata": {"gpu_type": "NVIDIA L4", "gpu_count": 4, "model_tag": "Qwen/Qwen3-4B", "input_tokens": -1, "output_tokens": -1}}
{"config": {"client_args": {"max_concurrency": 64, "num_prompts": 1000, "dataset_name": "sharegpt"}, "server_args": {"tensor_parallel_size": 4, "max_num_batched_tokens": 8192}, "server_cmd_args": ["--tensor-parallel-size=4", "--max-num-batched-tokens=8192"]}, "outcomes": {"backend": "vllm", "dataset_name": "sharegpt", "max_concurrency": 64, "length": 151.1696548789987, "accomplished": 1000, "total_input_tokens": 302118, "total_output_tokens": 195775, "total_output_tokens_retokenized": 195768, "request_throughput": 6.615084229704922, "input_throughput": 1998.5360173099916, "output_throughput": 1295.068115070481, "mean_e2e_latency_ms": 8939.159275709007, "median_e2e_latency_ms": 6008.622306500911, "std_e2e_latency_ms": 9605.635172303826, "p95_e2e_latency_ms": 27139.969452801306, "p99_e2e_latency_ms": 37183.75254391998, "mean_ttft_ms": 251.3472756509782, "median_ttft_ms": 116.74506849976751, "std_ttft_ms": 491.6096066277092, "p95_ttft_ms": 1224.981592999029, "p99_ttft_ms": 2902.0978502906837, "mean_tpot_ms": 48.65581712437634, "median_tpot_ms": 45.59879392866151, "std_tpot_ms": 31.47685312628492, "p95_tpot_ms": 65.96288688333136, "p99_tpot_ms": 130.59083745436504, "mean_itl_ms": 44.61668980280019, "median_itl_ms": 33.35350599991216, "std_itl_ms": 44.581804322583615, "p95_itl_ms": 111.47860099845275, "p99_itl_ms": 222.5829249997332, "concurrency": 59.133291551563126, "accept_length": null}, "cmd": "vllm serve Qwen/Qwen3-4B --host 127.0.0.1 --port 8000 --tensor-parallel-size=4 --max-num-batched-tokens=8192", "constraints": [], "metadata": {"gpu_type": "NVIDIA L4", "gpu_count": 4, "model_tag": "Qwen/Qwen3-4B", "input_tokens": -1, "output_tokens": -1}}
{"config": {"client_args": {"max_concurrency": 128, "num_prompts": 1000, "dataset_name": "sharegpt"}, "server_args": {"tensor_parallel_size": 4, "max_num_batched_tokens": 8192}, "server_cmd_args": ["--tensor-parallel-size=4", "--max-num-batched-tokens=8192"]}, "outcomes": {"backend": "vllm", "dataset_name": "sharegpt", "max_concurrency": 128, "length": 133.0894289429998, "accomplished": 1000, "total_input_tokens": 302118, "total_output_tokens": 195775, "total_output_tokens_retokenized": 195771, "request_throughput": 7.513744765020255, "input_throughput": 2270.0375409183894, "output_throughput": 1471.0033813718405, "mean_e2e_latency_ms": 14910.240386960006, "median_e2e_latency_ms": 10384.713371499856, "std_e2e_latency_ms": 15223.620712896502, "p95_e2e_latency_ms": 43486.963950149395, "p99_e2e_latency_ms": 61421.81745829036, "mean_ttft_ms": 663.0696945789732, "median_ttft_ms": 189.89979050093098, "std_ttft_ms": 1407.5295299267668, "p95_ttft_ms": 4652.777336598592, "p99_ttft_ms": 7000.883197711337, "mean_tpot_ms": 91.83800469031593, "median_tpot_ms": 77.46479336456856, "std_tpot_ms": 94.19538916493616, "p95_tpot_ms": 125.3206487750731, "p99_tpot_ms": 500.0748501195875, "mean_itl_ms": 73.16857466775902, "median_itl_ms": 49.85373300041829, "std_itl_ms": 72.57371615955182, "p95_itl_ms": 172.3669967985188, "p99_itl_ms": 328.1056552407972, "concurrency": 112.03174065271433, "accept_length": null}, "cmd": "vllm serve Qwen/Qwen3-4B --host 127.0.0.1 --port 8000 --tensor-parallel-size=4 --max-num-batched-tokens=8192", "constraints": [], "metadata": {"gpu_type": "NVIDIA L4", "gpu_count": 4, "model_tag": "Qwen/Qwen3-4B", "input_tokens": -1, "output_tokens": -1}}

Unpacking the benchmark outcomes, we will use the metrics p99 e2e latency and request throughput at varied ranges of concurrency to make an knowledgeable choice. The benchmark outcomes revealed that tensor parallelism of 4 throughout the obtainable GPUs persistently outperformed decrease parallelism settings, with the optimum configuration being tensor_parallel_size=4, max_num_batched_tokens=8192, and max_concurrency=128, reaching 7.51 requests/second and a pair of,270 enter tokens/second—a 2.7x throughput enchancment over the naive single-GPU baseline (2.74 req/s).Whereas this configuration delivered peak throughput, it got here with elevated p99 end-to-end latency of 61.4 seconds beneath heavy load; for latency-sensitive workloads, the candy spot was tensor_parallel_size=4 with max_num_batched_tokens=4096 at reasonable concurrency (32), which maintained sub-24-second p99 latency whereas nonetheless delivering 5.63 req/s—greater than double the baseline throughput. The info demonstrates that shifting from a naive single-GPU setup to optimized 4-way tensor parallelism with tuned batch sizes can unlock substantial efficiency features, with the precise configuration selection relying on whether or not the deployment prioritizes most throughput or latency assurances.

To visualise the outcomes, LLM-Optimizer gives a handy perform to view the outputs plotted in a Pareto dashboard. The Pareto dashboard will be displayed with the next line of code:

llm-optimizer visualize --data-file vllm_results.json --port 8080 --serve

With the proper artifacts now in hand, the mannequin with the proper configurations will be deployed.

Deploying to Amazon SageMaker AI

With the optimum serving parameters recognized by way of LLM-Optimizer, the ultimate step is to deploy the tuned mannequin into manufacturing. Amazon SageMaker AI gives a really perfect surroundings for this transition, abstracting away the infrastructure complexity of distributed GPU internet hosting whereas preserving fine-grained management over inference parameters. Through the use of LMI containers, builders can deploy open-source frameworks like vLLM at scale, with out managing CUDA dependencies, GPU scheduling, or load balancing manually.

SageMaker AI LMI containers are high-performance Docker pictures particularly designed for LLM inference. These containers combine natively with frameworks akin to vLLM and TensorRT, and supply built-in help for multi-GPU tensor parallelism, steady batching, streaming token technology, and different optimizations crucial to low-latency serving. The LMI v16 container used on this instance contains vLLM v0.10.2 and the V1 engine, supporting new mannequin architectures and bettering each latency and throughput in comparison with earlier variations.

Now that one of the best quantitative values for inference serving have been decided, these configurations will be handed on to the container as surroundings variables. (please refer right here for in-depth steerage):

env = {
    "HF_MODEL_ID": "Qwen/Qwen3-4B",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
    "OPTION_MAX_ROLLING_BATCH_PREFILL_TOKENS": "8192",
    "OPTION_TENSOR_PARALLEL_DEGREE": "4",
}

When these surroundings variables are utilized, SageMaker mechanically injects them into the container’s runtime configuration layer, which initializes the vLLM engine with the specified arguments. Throughout startup, the container downloads the mannequin weights from Hugging Face, configures the GPU topology for tensor parallel execution throughout the obtainable units (on this case, on the ml.g6.12xlarge occasion), and registers the mannequin with the SageMaker Endpoint Runtime. This makes positive that the mannequin runs with the identical optimized settings validated by LLM-Optimizer, bridging the hole between experimentation and manufacturing deployment.

The next code demonstrates bundle and deploy the mannequin for real-time inference on SageMaker AI:

image_uri = f"763104351884.dkr.ecr.{area}.amazonaws.com/djl-inference:0.34.0-lmi16.0.0-cu128"
model_name = name_from_base("qwen3-4b-stateful")

create_model = sm_client.create_model(
ModelName = model_name,
ExecutionRoleArn = position,
PrimaryContainer = {
"Picture": image_uri,
"Setting": env,
},
)
model_arn = create_model["ModelArn"]

As soon as the mannequin assemble is created, you possibly can create and activate the endpoint:

create_endpoint = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

After deployment, the endpoint is able to deal with reside site visitors and will be invoked instantly for inference:

request = {
    "messages": [
            {"role": "user", "content": "What is Amazon SageMaker?"}
            ],
            "max_tokens": 50,
            "temperature": 0.75,
            "cease": None
}

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Physique=json.dumps(request),
    ContentType="utility/json",
)
response = response_model["Body"].learn()
response = “Amazon SageMaker is AWS's absolutely managed machine studying service that permits builders and information scientists to construct, prepare, and deploy machine studying fashions at scale.”

These code snippets exhibit the deployment circulate conceptually. For an entire end-to-end pattern on deploying an LMI container for actual time inference on SageMaker AI, discuss with this instance.

Conclusion

The journey from mannequin choice to manufacturing deployment now not must depend on trial and error. By combining BentoML’s LLM-Optimizer with Amazon SageMaker AI, organizations can now transfer from speculation to deployment by way of a data-driven, automated optimization loop. This workflow replaces guide parameter tuning with a repeatable course of that quantifies efficiency trade-offs, aligns with business-level latency and throughput aims, and deploys one of the best configuration instantly right into a managed inference surroundings. This workflow addresses a crucial problem in manufacturing LLM deployment: with out systematic optimization, groups face an costly guessing sport between over-provisioning GPU assets and risking degraded person expertise. As demonstrated on this walkthrough, the efficiency variations are substantial—misconfigured setups can require 2-4x extra GPUs whereas delivering 2-3x increased latency. What might historically take an engineer days or even weeks of guide trial-and-error testing turns into a couple of hours of automated benchmarking. By combining LLM-Optimizer’s clever configuration search with SageMaker AI’s managed infrastructure, groups could make data-driven deployment selections that instantly affect each cloud prices and person satisfaction, focusing their efforts on constructing differentiated AI experiences relatively than tuning inference parameters.

The mixture of automated benchmarking and managed large-model deployment represents a major step ahead in making enterprise AI each accessible and economically environment friendly. By leveraging LLM-Optimizer for clever configuration search and SageMaker AI for scalable, fault-tolerant internet hosting, groups can deal with constructing differentiated AI experiences relatively than managing infrastructure or tuning inference stacks manually. Finally, one of the best LLM configuration isn’t simply the one which runs quickest—it’s the one which meets particular latency, throughput, and value targets in manufacturing. With BentoML’s LLM-Optimizer and Amazon SageMaker AI, that steadiness will be found systematically, reproduced persistently, and deployed confidently.

Extra assets

Concerning the authors

Josh Longenecker is a Generative AI/ML Specialist Options Architect at AWS, partnering with prospects to architect and deploy cutting-edge AI/ML options. He’s a part of the Neuron Information Science Skilled TFC and obsessed with pushing boundaries within the quickly evolving AI panorama. Outdoors of labor, you’ll discover him on the health club, open air, or having fun with time along with his household.

Mohammad Tahsin is a Generative AI/ML Specialist Options Architect at AWS, the place he works with prospects to design, optimize, and deploy fashionable AI/ML options. He’s obsessed with steady studying and staying on the frontier of recent capabilities within the subject. In his free time, he enjoys gaming, digital artwork, and cooking.

Optimizing LLM inference on Amazon SageMaker AI with BentoML’s LLM- Optimizer

Assume Your Python Code Is Sluggish? Cease Guessing and Begin Measuring

Leave a Reply Cancel reply

Popular News

Greatest practices for Amazon SageMaker HyperPod activity governance

Speed up edge AI improvement with SiMa.ai Edgematic with a seamless AWS integration

Optimizing Mixtral 8x7B on Amazon SageMaker with AWS Inferentia2

Unlocking Japanese LLMs with AWS Trainium: Innovators Showcase from the AWS LLM Growth Assist Program

The Good-Sufficient Fact | In direction of Knowledge Science

About Us

Category

Recent Posts