DeepSeek-R1 is a large language model (LLM) developed by DeepSeek AI that uses reinforcement learning to enhance reasoning capabilities through a multi-stage training process from a DeepSeek-V3-Base foundation. A key distinguishing feature is its reinforcement learning (RL) step, which was used to refine the model’s responses beyond the standard pre-training and fine-tuning process. By incorporating RL, DeepSeek-R1 can adapt more effectively to user feedback and objectives, ultimately enhancing both relevance and clarity. In addition, DeepSeek-R1 employs a chain-of-thought (CoT) approach, meaning it’s equipped to break down complex queries and reason through them in a step-by-step manner. This guided reasoning process allows the model to provide more accurate, transparent, and detailed answers. The model combines RL-based fine-tuning with CoT capabilities, aiming to generate structured responses while focusing on interpretability and user interaction. With its wide-ranging capabilities, DeepSeek-R1 has captured the industry’s attention as a versatile text-generation model that can be integrated into various workflows such as agents, logical reasoning, and data interpretation tasks.
DeepSeek-R1 uses a Mixture of Experts (MoE) architecture and is 671 billion parameters in size. The MoE architecture activates 37 billion parameters per query, enabling efficient inference by routing requests to the most relevant expert clusters. This approach allows the model to specialize in different problem domains while maintaining overall efficiency.
DeepSeek-R1 distilled models bring the reasoning capabilities of the main R1 model to more efficient architectures based on popular open models like Meta’s Llama (8B and 70B) and Qwen (1.5B, 7B, 14B, and 32B). Distillation refers to a process of training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model, using it as a teacher model. For example, DeepSeek-R1-Distill-Llama-8B offers an excellent balance of performance and efficiency. By integrating this model with Amazon SageMaker AI, you can benefit from the AWS scalable infrastructure while maintaining high-quality language model capabilities.
In this post, we show how to use the distilled models in SageMaker AI, which offers several options to deploy the distilled versions of the R1 model.
Solution overview
You can use DeepSeek’s distilled models within the AWS managed machine learning (ML) infrastructure. We demonstrate how to deploy these models on SageMaker AI inference endpoints.
SageMaker AI offers a choice of which serving container to use for deployments:
- LMI container – A Large Model Inference (LMI) container with different backends (vLLM, TensorRT-LLM, and Neuron). See the following GitHub repo for more details.
- TGI container – A Hugging Face Text Generation Inference (TGI) container. You can find more details in the following GitHub repo.
In the following code snippets, we use the LMI container example. See the following GitHub repo for more deployment examples using TGI, TensorRT-LLM, and Neuron.
LMI containers
LMI containers are a set of high-performance Docker containers purpose built for LLM inference. With these containers, you can use high-performance open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on SageMaker endpoints. These containers bundle together a model server with open source inference libraries to deliver an all-in-one LLM serving solution.
LMI containers provide many features, including:
- Optimized inference performance for popular model architectures like Meta Llama, Mistral, Falcon, and more
- Integration with open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX
- Continuous batching for maximizing throughput at high concurrency
- Token streaming
- Quantization through AWQ, GPTQ, FP8, and more
- Multi-GPU inference using tensor parallelism
- Serving LoRA fine-tuned models
- Text embedding to convert text data into numeric vectors
- Speculative decoding support to decrease latency
LMI containers provide these features through integrations with popular inference libraries. A unified configuration format lets you use the latest optimizations and technologies across libraries. To learn more about the LMI components, see Components of LMI.
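To make that unified configuration format concrete, the following minimal sketch shows how LMI settings can be expressed as container environment variables. The specific option names and values beyond those discussed later in this post are illustrative assumptions; verify them against the LMI documentation for your container version.

```python
# Illustrative LMI configuration expressed as container environment variables.
# Option names and values are assumptions to verify against the LMI docs.
lmi_env = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # model to serve
    "OPTION_ROLLING_BATCH": "vllm",            # select the vLLM backend
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",    # shard across all visible GPUs
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",     # cap concurrent requests
    "OPTION_DTYPE": "fp16",                    # weight/activation precision
}
```

The same keys work across backends, so switching from vLLM to TensorRT-LLM or Neuron is mostly a matter of choosing a different container image rather than rewriting the configuration.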
Prerequisites
To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage the resources created. For details, refer to Create an AWS account.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you might need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, you host DeepSeek-R1-Distill-Llama-8B on an ml.g5.2xlarge SageMaker hosting instance.
Deploy DeepSeek-R1 for inference
The following is a step-by-step example that demonstrates how to programmatically deploy DeepSeek-R1-Distill-Llama-8B for inference. The code for deploying the model is provided in the GitHub repo. You can clone the repo and run the notebook from SageMaker Studio.
- Configure the SageMaker execution role and import the required libraries:
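The post’s original snippet isn’t reproduced here; the following is a minimal sketch of this step using the SageMaker Python SDK, assuming it runs in a SageMaker Studio notebook with an execution role attached.

```python
import boto3
import sagemaker
from sagemaker import Model, Predictor, image_uris, serializers, deserializers

# SageMaker session, Region, and the execution role the endpoint will assume
sess = sagemaker.Session()
region = sess.boto_region_name
role = sagemaker.get_execution_role()
print(f"Region: {region}\nRole: {role}")
```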
There are two ways to deploy an LLM like DeepSeek-R1 or its distilled variants on SageMaker:
- Deploy uncompressed model weights from an Amazon S3 bucket – In this scenario, you need to set the HF_MODEL_ID variable to the Amazon Simple Storage Service (Amazon S3) prefix that contains the model artifacts. This method is generally much faster, with the model typically downloading in just a couple of minutes from Amazon S3.
- Deploy directly from the Hugging Face Hub (requires internet access) – To do this, set HF_MODEL_ID to the Hugging Face repository or model ID (for example, "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"). However, this method tends to be slower and can take considerably longer to download the model compared to using Amazon S3. This approach will not work if enable_network_isolation is enabled, because it requires internet access to retrieve model artifacts from the Hugging Face Hub.
- In this example, we deploy the model directly from the Hugging Face Hub:
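The original configuration snippet isn’t shown here; the following sketch continues from the previous step. HF_MODEL_ID and OPTION_MAX_ROLLING_BATCH_SIZE come from the post; the container image lookup parameters are assumptions to verify against the LMI documentation and the post’s notebook.

```python
# Look up an LMI (DJL) inference container image for this Region.
# The framework and version identifiers are assumptions; substitute the exact
# LMI image used in the post's notebook.
inference_image_uri = image_uris.retrieve(
    framework="djl-lmi", region=region, version="0.30.0"
)

# Deploy straight from the Hugging Face Hub (requires internet access from the endpoint)
env = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",  # bound concurrent requests to limit GPU memory
}
```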
The OPTION_MAX_ROLLING_BATCH_SIZE parameter limits the number of concurrent requests that the endpoint can process. We set it to 16 to limit GPU memory requirements. You should adjust it based on your latency and throughput requirements.
- Create and deploy the model:
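Again, the post’s exact code isn’t reproduced; this sketch continues from the configuration above. The model and endpoint names are illustrative placeholders.

```python
# Wrap the container image, environment, and role into a SageMaker model
model = Model(
    image_uri=inference_image_uri,
    env=env,
    role=role,
    name="deepseek-r1-distill-llama-8b",
    sagemaker_session=sess,
)

# Deploy to a real-time endpoint on a single ml.g5.2xlarge instance
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="deepseek-r1-distill-llama-8b-ep",
    container_startup_health_check_timeout=900,  # allow time for the weights to download
)
```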
- Make inference requests:
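The following is a sketch of querying the endpoint. The payload shape ("inputs" plus "parameters") follows the common LMI/TGI-style schema, which is an assumption to confirm against the container’s documentation.

```python
# Attach a Predictor to the deployed endpoint with JSON input/output
predictor = Predictor(
    endpoint_name="deepseek-r1-distill-llama-8b-ep",
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

response = predictor.predict(
    {
        "inputs": "Explain, step by step, why the sky appears blue.",
        "parameters": {"max_new_tokens": 256, "temperature": 0.6},
    }
)
print(response)
```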
Performance and cost considerations
The ml.g5.2xlarge instance provides a good balance of performance and cost. For large-scale inference, use larger batch sizes for real-time inference to optimize cost and performance. You can also use batch transform for offline, large-volume inference to reduce costs. Monitor endpoint utilization to optimize costs.
Clean up
Clean up your resources when they’re no longer needed:
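For example, a minimal cleanup sketch that removes the endpoint, its endpoint configuration, and the model created above:

```python
# Delete the endpoint (and its endpoint configuration), then the model, to stop incurring charges
predictor.delete_endpoint()
model.delete_model()
```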
Security
You can configure advanced security and infrastructure settings for the DeepSeek-R1 model, including virtual private cloud (VPC) networking, service role permissions, encryption settings, and EnableNetworkIsolation to restrict internet access. For production deployments, it’s essential to review these settings to maintain alignment with your organization’s security and compliance requirements.
By default, the model runs in a shared AWS managed VPC with internet access. To enhance security and control access, you should explicitly configure a private VPC with appropriate security groups and IAM policies based on your requirements.
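As an illustration, the following hedged sketch attaches a private VPC and network isolation to the model object from the deployment example. The subnet, security group, and S3 location are placeholders, and with network isolation enabled the weights must come from Amazon S3 rather than the Hugging Face Hub, as noted earlier.

```python
# Placeholder network settings -- substitute your own private subnets and security groups
vpc_config = {
    "Subnets": ["subnet-0123456789abcdef0"],
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
}

secure_model = Model(
    image_uri=inference_image_uri,
    env={"HF_MODEL_ID": "s3://your-bucket/deepseek-r1-distill-llama-8b/"},  # S3 prefix, not the Hub
    role=role,
    vpc_config=vpc_config,
    enable_network_isolation=True,  # blocks outbound internet access from the container
)
```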
SageMaker AI provides enterprise-grade security features to help keep your data and applications secure and private. We don’t share your data with model providers unless you direct us to, giving you full control over your data. This applies to all models, both proprietary and publicly available, including DeepSeek-R1 on SageMaker.
For more details, see Configure security in Amazon SageMaker AI.
Logging and monitoring
You can monitor SageMaker AI using Amazon CloudWatch, which collects and processes raw data into readable, near real-time metrics. These metrics are retained for 15 months, allowing you to analyze historical trends and gain deeper insights into your application’s performance and health.
Additionally, you can configure alarms to monitor specific thresholds and trigger notifications or automated actions when those thresholds are met, helping you proactively manage your deployment.
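As a hedged example, the following sketch creates a CloudWatch alarm on the endpoint’s ModelLatency metric using boto3; the alarm name, endpoint name, and threshold are illustrative.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average model latency stays high for three consecutive 5-minute periods
cloudwatch.put_metric_alarm(
    AlarmName="deepseek-r1-endpoint-high-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "deepseek-r1-distill-llama-8b-ep"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=2_000_000,  # ModelLatency is reported in microseconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```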
For more details, see Metrics for monitoring Amazon SageMaker AI with Amazon CloudWatch.
Best practices
It’s always recommended to deploy your LLM endpoints inside your VPC and behind a private subnet, without internet gateways, and preferably with no egress. Ingress from the internet should also be blocked to minimize security risks.
Always apply guardrails to make sure incoming requests and outgoing model responses are validated for safety, bias, and toxicity. You can guard your SageMaker endpoint model responses with Amazon Bedrock Guardrails. See DeepSeek-R1 model now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart for more details.
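For instance, a minimal sketch that screens a model response with a pre-created Amazon Bedrock guardrail through the ApplyGuardrail API; the guardrail ID and version are placeholders you would replace with your own.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Validate the endpoint's output with an existing Bedrock guardrail before returning it
guardrail_response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="your-guardrail-id",  # placeholder
    guardrailVersion="1",                     # placeholder
    source="OUTPUT",  # use "INPUT" to screen prompts before invoking the endpoint
    content=[{"text": {"text": "model response text to validate"}}],
)
print(guardrail_response["action"])  # "NONE" or "GUARDRAIL_INTERVENED"
```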
Inference performance evaluation
In this section, we focus on the inference performance of DeepSeek-R1 distilled variants on SageMaker AI. Evaluating LLM performance in terms of end-to-end latency, throughput, and resource efficiency is crucial for providing responsiveness, scalability, and cost-effectiveness in real-world applications. Optimizing these metrics directly impacts user experience, system reliability, and deployment feasibility at scale. For this post, we test all DeepSeek-R1 distilled variants (1.5B, 7B, 8B, 14B, 32B, and 70B) against four performance metrics:
- End-to-end latency (time between sending a request and receiving the response)
- Throughput (tokens per second)
- Time to first token
- Inter-token latency
The main objective of this performance evaluation is to give you an indication of the relative performance of the distilled R1 models on different hardware for generic traffic patterns. We didn’t try to optimize the performance for each model/hardware/use case combination. These results shouldn’t be treated as the best possible performance of a particular model on a particular instance type. You should always perform your own testing using your own datasets, traffic patterns, and input/output sequence lengths.
Scenarios
We tested the following scenarios:
- Container/model configuration – We used LMI container v14 with default parameters, except MAX_MODEL_LEN, which was set to 10000 (no chunked prefill and no prefix caching). On instances with multiple accelerators, we sharded the model across all available GPUs.
- Tokens – We evaluated SageMaker endpoint hosted DeepSeek-R1 distilled variants on performance benchmarks using two sample input token lengths. We ran both tests 50 times each before measuring the average across the different metrics. Then we repeated the test with concurrency 10.
- Short-length test – 512 input tokens and 256 output tokens.
- Medium-length test – 3072 input tokens and 256 output tokens.
- Hardware – We tested the distilled variants on a variety of instance types with 1, 4, or 8 GPUs per instance. In the following table, a green cell indicates that a model was tested on that particular instance type, and red indicates that a model wasn’t tested with that instance type, either because the instance was oversized for the given model or too small to fit the model in memory.
Box plots
In the following sections, we use box plots to visualize model performance. A box plot is a concise visual summary that displays a dataset’s median, interquartile range (IQR), and potential outliers using a box for the middle 50% of the data, with whiskers extending to the smallest and largest non-outlier values. By examining the median’s placement within the box, the box’s size, and the whiskers’ lengths, you can quickly assess the data’s central tendency, variability, and skewness, as illustrated in the following figure.
DeepSeek-R1-Distill-Qwen-1.5B
This model can be deployed on a single GPU instance. The results indicate that the ml.g5.xlarge instance outperforms the ml.g6.xlarge instance across all measured performance criteria and concurrency settings.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Qwen-7B
DeepSeek-R1-Distill-Qwen-7B was tested on ml.g5.2xlarge and ml.g6e.2xlarge. Among all instances, ml.g6e.2xlarge demonstrated the highest performance.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Llama-8B
DeepSeek-R1-Distill-Llama-8B was benchmarked across ml.g5.2xlarge, ml.g5.12xlarge, ml.g6e.2xlarge, and ml.g6e.12xlarge, with ml.g6e.12xlarge demonstrating the highest performance among all instances.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Qwen-14B
We tested this model on ml.g6.12xlarge, ml.g5.12xlarge, ml.g6e.48xlarge, and ml.g6e.12xlarge. The instance with 8 GPUs (ml.g6e.48xlarge) showed the best results.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Qwen-32B
This is a fairly large model, and we only deployed it on multi-GPU instances: ml.g6.12xlarge, ml.g5.12xlarge, and ml.g6e.12xlarge. The latest generation (ml.g6e.12xlarge) showed the best performance across all concurrency settings.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Llama-70B
We tested this model on two different 8-GPU instances: ml.g6e.48xlarge and ml.p4d.24xlarge. The latter showed the best performance.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
Conclusion
Deploying DeepSeek models on SageMaker AI provides a robust solution for organizations seeking to use state-of-the-art language models in their applications. The combination of DeepSeek’s powerful models and SageMaker AI managed infrastructure offers a scalable and efficient approach to natural language processing tasks.
The performance evaluation section presents a comprehensive assessment of all DeepSeek-R1 distilled models across four key inference metrics, using 13 different NVIDIA accelerator instance types. This analysis offers valuable insights to assist in selecting the optimal instance type for deploying the DeepSeek-R1 solution.
Check out the complete code in the following GitHub repos:
For additional resources, refer to:
About the Authors
Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.
Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.
Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy, and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using cutting-edge ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.