Open foundation models (FMs) allow organizations to build customized AI applications by fine-tuning for their specific domains or tasks, while retaining control over costs and deployments. However, deployment can be a significant portion of the effort, often requiring 30% of project time because engineers must carefully optimize instance types and configure serving parameters through careful testing. This process can be both complex and time-consuming, requiring specialized knowledge and iterative testing to achieve the desired performance.
Amazon Bedrock Custom Model Import simplifies deployments of custom models by offering a straightforward API for model deployment and invocation. You can upload model weights and let AWS handle an optimal, fully managed deployment. This makes sure that deployments are performant and cost effective. Amazon Bedrock Custom Model Import also handles automatic scaling, including scaling to zero. When the model is not in use and there are no invocations for 5 minutes, it scales to zero. You pay only for what you use, in 5-minute increments. It also handles scaling up, automatically increasing the number of active model copies when higher concurrency is required. These features make Amazon Bedrock Custom Model Import an attractive solution for organizations looking to use custom models on Amazon Bedrock, providing simplicity and cost-efficiency.
Before deploying these models in production, it is important to evaluate their performance using benchmarking tools. These tools help to proactively detect potential production issues such as throttling and verify that deployments can handle expected production loads.
This post begins a blog series exploring DeepSeek and open FMs on Amazon Bedrock Custom Model Import. It covers the process of performance benchmarking of custom models in Amazon Bedrock using popular open source tools: LLMPerf and LiteLLM. It includes a notebook with step-by-step instructions to deploy a DeepSeek-R1-Distill-Llama-8B model, but the same steps apply for any other model supported by Amazon Bedrock Custom Model Import.
Prerequisites
This post requires an Amazon Bedrock custom model. If you don't have one in your AWS account yet, follow the instructions from Deploy DeepSeek-R1 distilled Llama models with Amazon Bedrock Custom Model Import.
Using open source tools LLMPerf and LiteLLM for performance benchmarking
To conduct performance benchmarking, you will use LLMPerf, a popular open source library for benchmarking foundation models. LLMPerf simulates load tests on model invocation APIs by creating concurrent Ray clients and analyzing their responses. A key advantage of LLMPerf is wide support of foundation model APIs. This includes LiteLLM, which supports all models available on Amazon Bedrock.
Setting up your custom model invocation with LiteLLM
LiteLLM is a versatile open source tool that can be used both as a Python SDK and a proxy server (AI gateway) for accessing over 100 different FMs using a standardized format. LiteLLM standardizes inputs to match each FM provider's specific endpoint requirements. It supports Amazon Bedrock APIs, including the InvokeModel and Converse APIs, and FMs available on Amazon Bedrock, including imported custom models.
To invoke a custom model with LiteLLM, you use the model parameter (see the Amazon Bedrock documentation on LiteLLM). This is a string that follows the bedrock/provider_route/model_arn format.
The provider_route indicates the LiteLLM implementation of the request/response specification to use. DeepSeek R1 models can be invoked using their custom chat template with the DeepSeek R1 provider route, or with the Llama chat template using the Llama provider route.
The model_arn is the model Amazon Resource Name (ARN) of the imported model. You can get the model ARN of your imported model in the console or by sending a ListImportedModels request, as shown in the sketch that follows.
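If you prefer to look up the ARN programmatically, the following sketch uses boto3 (the AWS SDK for Python) to call ListImportedModels and build the LiteLLM model strings for both provider routes. The model name and response field names shown here are illustrative; verify them against your account and the current API documentation.

import boto3

# Control-plane Bedrock client (not bedrock-runtime)
bedrock = boto3.client("bedrock")

# List the custom models imported into this account and Region
response = bedrock.list_imported_models()
for summary in response["modelSummaries"]:
    print(summary["modelName"], summary["modelArn"])

# Pick the ARN of the model you imported (the name below is hypothetical)
model_id = next(
    s["modelArn"]
    for s in response["modelSummaries"]
    if s["modelName"] == "DeepSeek-R1-Distill-Llama-8B"
)

# LiteLLM model strings for the two provider routes
deepseek_route_model = f"bedrock/deepseek_r1/{model_id}"
llama_route_model = f"bedrock/llama/{model_id}"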
For example, the following script invokes the custom model using the DeepSeek R1 chat template.
import time
from litellm import completion

while True:
    try:
        response = completion(
            # model_id is the ARN of your imported model
            model=f"bedrock/deepseek_r1/{model_id}",
            messages=[{"role": "user", "content": """Given the following financial data:
- Company A's revenue grew from $10M to $15M in 2023
- Operating costs increased by 20%
- Initial operating costs were $7M
Calculate the company's operating margin for 2023. Please reason step by step."""},
                {"role": "assistant", "content": ""}],
            max_tokens=4096,
        )
        print(response['choices'][0]['message']['content'])
        break
    except Exception:
        # The first invocation after scale-to-zero can fail while Bedrock
        # provisions a model copy; wait and retry until it is ready
        time.sleep(60)
After the invocation parameters for the imported model have been verified, you can configure LLMPerf for benchmarking.
Configuring a token benchmark test with LLMPerf
To benchmark performance, LLMPerf uses Ray, a distributed computing framework, to simulate realistic loads. It spawns a number of remote clients, each capable of sending concurrent requests to model invocation APIs. These clients are implemented as actors that execute in parallel. llmperf.requests_launcher manages the distribution of requests across the Ray clients, and allows for simulation of various load scenarios and concurrent request patterns. At the same time, each client collects performance metrics during the requests, including latency, throughput, and error rates.
Two important metrics for performance include latency and throughput, as illustrated in the sketch after this list:
- Latency refers to the time it takes for a single request to be processed.
- Throughput measures the number of tokens that are generated per second.
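The following illustrative sketch (not LLMPerf code) shows how the per-request quantities reported later in this post, such as time to first token, inter-token latency, end-to-end latency, and output token throughput, can be derived from request timestamps. The timestamps and token count are hypothetical, and the exact definitions used by LLMPerf may differ slightly.

# Illustrative only: hypothetical timestamps (in seconds) for one streamed request
request_start = 0.0
first_token_time = 0.38      # when the first output token arrived
last_token_time = 10.41      # when the final output token arrived
num_output_tokens = 1000

ttft_s = first_token_time - request_start
end_to_end_latency_s = last_token_time - request_start
# Average gap between consecutive generated tokens
inter_token_latency_s = (last_token_time - first_token_time) / (num_output_tokens - 1)
# Output tokens generated per second over the whole request
request_output_throughput = num_output_tokens / end_to_end_latency_s

print(f"TTFT: {ttft_s:.2f} s, end-to-end latency: {end_to_end_latency_s:.2f} s")
print(f"Inter-token latency: {inter_token_latency_s * 1000:.1f} ms")
print(f"Output throughput: {request_output_throughput:.1f} tokens/s")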
Selecting the right configuration to serve FMs usually involves experimenting with different batch sizes while closely monitoring GPU utilization and considering factors such as available memory, model size, and the specific requirements of the workload. To learn more, see Optimizing AI responsiveness: A practical guide to Amazon Bedrock latency-optimized inference. Although Amazon Bedrock Custom Model Import simplifies this by offering pre-optimized serving configurations, it is still important to verify your deployment's latency and throughput.
Start by configuring token_benchmark.py, a sample script that facilitates the configuration of a benchmarking test. In the script, you can define parameters such as:
- LLM API: Use LiteLLM to invoke Amazon Bedrock custom imported models.
- Model: Define the route, API, and model ARN to invoke, similarly to the previous section.
- Mean/standard deviation of input tokens: Parameters to use in the probability distribution from which the number of input tokens will be sampled.
- Mean/standard deviation of output tokens: Parameters to use in the probability distribution from which the number of output tokens will be sampled.
- Number of concurrent requests: The number of users that the application is likely to support when in use.
- Number of completed requests: The total number of requests to send to the LLM API in the test.
The following script shows an example of how to run the benchmark against the model. See this notebook for step-by-step instructions on importing a custom model and running a benchmarking test.
python3 ${{LLM_PERF_SCRIPT_DIR}}/token_benchmark_ray.py \
    --model "bedrock/llama/{model_id}" \
    --mean-input-tokens {mean_input_tokens} \
    --stddev-input-tokens {stddev_input_tokens} \
    --mean-output-tokens {mean_output_tokens} \
    --stddev-output-tokens {stddev_output_tokens} \
    --max-num-completed-requests ${{LLM_PERF_MAX_REQUESTS}} \
    --timeout 1800 \
    --num-concurrent-requests ${{LLM_PERF_CONCURRENT}} \
    --results-dir "${{LLM_PERF_OUTPUT}}" \
    --llm-api litellm \
    --additional-sampling-params '{{}}'
At the end of the test, LLMPerf will output two JSON files: one with aggregate metrics, and one with separate entries for every invocation.
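The following sketch shows one way to load those results for further analysis. It assumes the summary file name ends in _summary.json and the per-request file name ends in _individual_responses.json; the exact names and key names can vary between LLMPerf versions, so adjust as needed.

import glob
import json

results_dir = "results"  # replace with the value passed to --results-dir

# Glob for the aggregate summary and the per-request results written by LLMPerf
summary_path = glob.glob(f"{results_dir}/*_summary.json")[0]
individual_path = glob.glob(f"{results_dir}/*_individual_responses.json")[0]

with open(summary_path) as f:
    summary = json.load(f)
with open(individual_path) as f:
    individual = json.load(f)

print(f"Loaded {len(individual)} per-request records")
# Print a few aggregate metrics of interest (key names can vary between versions)
for key, value in summary.items():
    if "ttft" in key or "throughput" in key:
        print(key, value)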
Scale to zero and cold-start latency
One thing to keep in mind is that because Amazon Bedrock Custom Model Import scales down to zero when the model is unused, you need to first make a request to make sure there is at least one active model copy. If you obtain an error indicating that the model isn't ready, wait for approximately ten seconds and up to 1 minute for Amazon Bedrock to prepare at least one active model copy. When it is ready, run a test invocation again, and proceed with benchmarking.
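The following sketch shows one way to warm up the deployment with boto3 before starting a benchmark. It assumes a Llama-style request body (adjust it to the prompt format your imported model expects) and that the not-ready condition surfaces with a ModelNotReadyException error code; if your SDK version reports a different error code, adjust the check accordingly.

import json
import time

import boto3
from botocore.exceptions import ClientError

bedrock_runtime = boto3.client("bedrock-runtime")

# Llama-style request body; adjust to your model's prompt format
body = json.dumps({"prompt": "Hello", "max_gen_len": 16})

while True:
    try:
        # model_id is the imported model ARN from earlier
        bedrock_runtime.invoke_model(modelId=model_id, body=body)
        print("At least one model copy is active; ready to benchmark")
        break
    except ClientError as err:
        if err.response["Error"]["Code"] == "ModelNotReadyException":
            # Cold start: Bedrock is provisioning a model copy; wait and retry
            time.sleep(30)
        else:
            raise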
Example scenario for DeepSeek-R1-Distill-Llama-8B
Consider a DeepSeek-R1-Distill-Llama-8B model hosted on Amazon Bedrock Custom Model Import, supporting an AI application with low traffic of no more than two concurrent requests. To account for variability, you can adjust parameters for the token count of prompts and completions. For example:
- Number of clients: 2
- Mean input token count: 500
- Standard deviation of input token count: 25
- Mean output token count: 1000
- Standard deviation of output token count: 100
- Number of requests per client: 50
This illustrative test takes approximately 8 minutes. At the end of the test, you will obtain a summary of results with aggregate metrics:
inter_token_latency_s
p25 = 0.010615988283217918
p50 = 0.010694698716183695
p75 = 0.010779359342088015
p90 = 0.010945443657517748
p95 = 0.01100556307365132
p99 = 0.011071086908721675
mean = 0.010710014800224604
min = 0.010364670612635254
max = 0.011485444453299149
stddev = 0.0001658793389904756
ttft_s
p25 = 0.3356793452499005
p50 = 0.3783651359990472
p75 = 0.41098671700046907
p90 = 0.46655246950049334
p95 = 0.4846706690498647
p99 = 0.6790834719300077
mean = 0.3837810468001226
min = 0.1878921090010408
max = 0.7590946710006392
stddev = 0.0828713133225014
end_to_end_latency_s
p25 = 9.885957818500174
p50 = 10.561580732000039
p75 = 11.271923759749825
p90 = 11.87688222009965
p95 = 12.139972019549713
p99 = 12.6071144856102
mean = 10.406450886010116
min = 2.6196457750011177
max = 12.626598834998731
stddev = 1.4681851822617253
request_output_throughput_token_per_s
p25 = 104.68609252502657
p50 = 107.24619111072519
p75 = 108.62997591951486
p90 = 110.90675007239598
p95 = 113.3896235445618
p99 = 116.6688412475626
mean = 107.12082450567561
min = 97.0053466021563
max = 129.40680882698936
stddev = 3.9748004356837137
number_input_tokens
p25 = 484.0
p50 = 500.0
p75 = 514.0
p90 = 531.2
p95 = 543.1
p99 = 569.1200000000001
mean = 499.06
min = 433
max = 581
stddev = 26.549294727074212
number_output_tokens
p25 = 1050.75
p50 = 1128.5
p75 = 1214.25
p90 = 1276.1000000000001
p95 = 1323.75
p99 = 1372.2
mean = 1113.51
min = 339
max = 1392
stddev = 160.9598415942952
Number Of Errored Requests: 0
Overall Output Throughput: 208.0008834264341
Number Of Completed Requests: 100
Completed Requests Per Minute: 11.20784995697034
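As a quick, illustrative sanity check, you can relate these aggregates to one another using only the numbers above: the completion rate implies a test duration just under 9 minutes, consistent with the rough 8-minute estimate, and the overall throughput is close to the number of concurrent clients times the mean per-request throughput.

# Rough consistency checks using only the aggregate numbers above
completed_requests = 100
requests_per_minute = 11.2078
concurrent_clients = 2
mean_request_throughput = 107.12   # tokens/s per request
overall_output_throughput = 208.0  # tokens/s across all clients

# Test duration implied by the completion rate (just under 9 minutes)
print(f"Approximate test duration: {completed_requests / requests_per_minute:.1f} minutes")

# With 2 concurrent clients, overall throughput should be close to
# 2 x the mean per-request throughput (~214 vs. the measured ~208 tokens/s)
print(f"Expected overall throughput: {concurrent_clients * mean_request_throughput:.0f} tokens/s")
print(f"Measured overall throughput: {overall_output_throughput:.0f} tokens/s")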
In addition to the summary, you will receive metrics for individual requests that can be used to prepare detailed reports, like the following histograms for time to first token and token throughput.
Analyzing performance results from LLMPerf and estimating costs using Amazon CloudWatch
LLMPerf gives you the ability to benchmark the performance of custom models served in Amazon Bedrock without having to inspect the specifics of the serving properties and configuration of your Amazon Bedrock Custom Model Import deployment. This information is valuable because it represents the expected end-user experience of your application.
In addition, the benchmarking exercise can serve as a valuable tool for cost estimation. By using Amazon CloudWatch, you can observe the number of active model copies that Amazon Bedrock Custom Model Import scales to in response to the load test. ModelCopy is exposed as a CloudWatch metric in the AWS/Bedrock namespace and is reported using the imported model ARN as a label. The plot for the ModelCopy metric is shown in the figure below. This data can help in estimating costs, because billing is based on the number of active model copies at a given time.
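The following sketch shows one way you might retrieve the ModelCopy metric with boto3. The dimension name used to attach the imported model ARN is an assumption here; confirm the metric's actual dimensions in the CloudWatch console for your account before relying on it.

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# model_id is the imported model ARN used earlier; the dimension name is an assumption
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="ModelCopy",
    Dimensions=[{"Name": "ModelId", "Value": model_id}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=60,                # 1-minute granularity
    Statistics=["Maximum"],   # peak number of active model copies per minute
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])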
Conclusion
While Amazon Bedrock Custom Model Import simplifies model deployment and scaling, performance benchmarking remains essential to predict production performance and to compare models across key metrics such as cost, latency, and throughput.
To learn more, try the example notebook with your custom model.
About the Authors
Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on the serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.
Paras Mehra is a Senior Product Manager at AWS. He is focused on helping build Amazon Bedrock. In his spare time, Paras enjoys spending time with his family and biking around the Bay Area.
Prashant Patel is a Senior Software Development Engineer in AWS Bedrock. He is passionate about scaling large language models for enterprise applications. Prior to joining AWS, he worked at IBM on productionizing large-scale AI/ML workloads on Kubernetes. Prashant has a master's degree from NYU Tandon School of Engineering. When not at work, he enjoys traveling and playing with his dogs.