DeepSeek-R1 is a large language model (LLM) developed by DeepSeek AI that uses reinforcement learning to enhance reasoning capabilities through a multi-stage training process from a DeepSeek-V3-Base foundation. A key distinguishing feature is its reinforcement learning (RL) step, which was used to refine the model’s responses beyond the standard pre-training and fine-tuning process. By incorporating RL, DeepSeek-R1 can adapt more effectively to user feedback and objectives, ultimately enhancing both relevance and clarity. In addition, DeepSeek-R1 employs a chain-of-thought (CoT) approach, meaning it’s equipped to break down complex queries and reason through them in a step-by-step manner. This guided reasoning process allows the model to provide more accurate, transparent, and detailed answers. The model combines RL-based fine-tuning with CoT capabilities, aiming to generate structured responses while focusing on interpretability and user interaction. With its wide-ranging capabilities, DeepSeek-R1 has captured the industry’s attention as a versatile text-generation model that can be integrated into various workflows such as agents, logical reasoning, and data interpretation tasks.
DeepSeek-R1 uses a Mixture of Experts (MoE) architecture and is 671 billion parameters in size. The MoE architecture activates 37 billion parameters per query, enabling efficient inference by routing requests to the most relevant expert clusters. This approach allows the model to specialize in different problem domains while maintaining overall efficiency.
DeepSeek-R1 distilled models bring the reasoning capabilities of the main R1 model to more efficient architectures based on popular open models like Meta’s Llama (8B and 70B) and Qwen (1.5B, 7B, 14B, and 32B). Distillation refers to a process of training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model, using it as a teacher model. For example, DeepSeek-R1-Distill-Llama-8B offers an excellent balance of performance and efficiency. By integrating this model with Amazon SageMaker AI, you can benefit from the AWS scalable infrastructure while maintaining high-quality language model capabilities.
In this post, we show how to use the distilled models in SageMaker AI, which offers several options to deploy the distilled versions of the R1 model.
Solution overview
You can use DeepSeek’s distilled models within the AWS managed machine learning (ML) infrastructure. We demonstrate how to deploy these models on SageMaker AI inference endpoints.
SageMaker AI offers a choice of which serving container to use for deployments:
- LMI container – A Large Model Inference (LMI) container with different backends (vLLM, TensorRT-LLM, and Neuron). See the following GitHub repo for more details.
- TGI container – A Hugging Face Text Generation Inference (TGI) container. You can find more details in the following GitHub repo.
In the following code snippets, we use the LMI container example. See the following GitHub repo for more deployment examples using TGI, TensorRT-LLM, and Neuron.
LMI containers
LMI containers are a set of high-performance Docker containers purpose built for LLM inference. With these containers, you can use high-performance open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on SageMaker endpoints. These containers bundle together a model server with open source inference libraries to deliver an all-in-one LLM serving solution.
LMI containers provide many features, including:
- Optimized inference performance for popular model architectures like Meta Llama, Mistral, Falcon, and more
- Integration with open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX
- Continuous batching for maximizing throughput at high concurrency
- Token streaming
- Quantization through AWQ, GPTQ, FP8, and more
- Multi-GPU inference using tensor parallelism
- Serving LoRA fine-tuned models
- Text embedding to convert text data into numeric vectors
- Speculative decoding support to decrease latency
LMI containers provide these features through integrations with popular inference libraries. A unified configuration format lets you use the latest optimizations and technologies across libraries. To learn more about the LMI components, see Components of LMI.
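To make that unified configuration format concrete, the following minimal sketch shows how LMI settings can be expressed as container environment variables. The specific option names and values beyond those discussed later in this post are illustrative assumptions; verify them against the LMI documentation for your container version.

```python
# Illustrative LMI configuration expressed as container environment variables.
# Option names and values are assumptions to verify against the LMI docs.
lmi_env = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # model to serve
    "OPTION_ROLLING_BATCH": "vllm",            # select the vLLM backend
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",    # shard across all visible GPUs
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",     # cap concurrent requests
    "OPTION_DTYPE": "fp16",                    # weight/activation precision
}
```

The same keys work across backends, so switching from vLLM to TensorRT-LLM or Neuron is mostly a matter of choosing a different container image rather than rewriting the configuration.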
Prerequisites
To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage the resources created. For details, refer to Create an AWS account.
If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you might need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, you host DeepSeek-R1-Distill-Llama-8B on an ml.g5.2xlarge SageMaker hosting instance.
Deploy DeepSeek-R1 for inference
The following is a step-by-step example that demonstrates how to programmatically deploy DeepSeek-R1-Distill-Llama-8B for inference. The code for deploying the model is provided in the GitHub repo. You can clone the repo and run the notebook from SageMaker Studio.
- Configure the SageMaker execution role and import the required libraries:
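The post’s original snippet isn’t reproduced here; the following is a minimal sketch of this step using the SageMaker Python SDK, assuming it runs in a SageMaker Studio notebook with an execution role attached.

```python
import boto3
import sagemaker
from sagemaker import Model, Predictor, image_uris, serializers, deserializers

# SageMaker session, Region, and the execution role the endpoint will assume
sess = sagemaker.Session()
region = sess.boto_region_name
role = sagemaker.get_execution_role()
print(f"Region: {region}\nRole: {role}")
```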
There are two ways to deploy an LLM like DeepSeek-R1 or its distilled variants on SageMaker:
- Deploy uncompressed model weights from an Amazon S3 bucket – In this scenario, you need to set the HF_MODEL_ID variable to the Amazon Simple Storage Service (Amazon S3) prefix that contains the model artifacts. This method is generally much faster, with the model typically downloading in just a couple of minutes from Amazon S3.
- Deploy directly from the Hugging Face Hub (requires internet access) – To do this, set HF_MODEL_ID to the Hugging Face repository or model ID (for example, "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"). However, this method tends to be slower and can take considerably longer to download the model compared to using Amazon S3. This approach will not work if enable_network_isolation is enabled, because it requires internet access to retrieve model artifacts from the Hugging Face Hub.
- In this example, we deploy the model directly from the Hugging Face Hub:
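The original configuration snippet isn’t shown here; the following sketch continues from the previous step. HF_MODEL_ID and OPTION_MAX_ROLLING_BATCH_SIZE come from the post; the container image lookup parameters are assumptions to verify against the LMI documentation and the post’s notebook.

```python
# Look up an LMI (DJL) inference container image for this Region.
# The framework and version identifiers are assumptions; substitute the exact
# LMI image used in the post's notebook.
inference_image_uri = image_uris.retrieve(
    framework="djl-lmi", region=region, version="0.30.0"
)

# Deploy straight from the Hugging Face Hub (requires internet access from the endpoint)
env = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",  # bound concurrent requests to limit GPU memory
}
```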
The OPTION_MAX_ROLLING_BATCH_SIZE parameter limits the number of concurrent requests that the endpoint can process. We set it to 16 to limit GPU memory requirements. You should adjust it based on your latency and throughput requirements.
- Create and deploy the model:
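Again, the post’s exact code isn’t reproduced; this sketch continues from the configuration above. The model and endpoint names are illustrative placeholders.

```python
# Wrap the container image, environment, and role into a SageMaker model
model = Model(
    image_uri=inference_image_uri,
    env=env,
    role=role,
    name="deepseek-r1-distill-llama-8b",
    sagemaker_session=sess,
)

# Deploy to a real-time endpoint on a single ml.g5.2xlarge instance
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="deepseek-r1-distill-llama-8b-ep",
    container_startup_health_check_timeout=900,  # allow time for the weights to download
)
```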
- Make inference requests:
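The following is a sketch of querying the endpoint. The payload shape ("inputs" plus "parameters") follows the common LMI/TGI-style schema, which is an assumption to confirm against the container’s documentation.

```python
# Attach a Predictor to the deployed endpoint with JSON input/output
predictor = Predictor(
    endpoint_name="deepseek-r1-distill-llama-8b-ep",
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

response = predictor.predict(
    {
        "inputs": "Explain, step by step, why the sky appears blue.",
        "parameters": {"max_new_tokens": 256, "temperature": 0.6},
    }
)
print(response)
```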
Performance and cost considerations
The ml.g5.2xlarge instance provides a good balance of performance and cost. For large-scale inference, use larger batch sizes for real-time inference to optimize cost and performance. You can also use batch transform for offline, large-volume inference to reduce costs. Monitor endpoint utilization to optimize costs.
Clean up
Clean up your resources when they’re no longer needed:
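For example, a minimal cleanup sketch that removes the endpoint, its endpoint configuration, and the model created above:

```python
# Delete the endpoint (and its endpoint configuration), then the model, to stop incurring charges
predictor.delete_endpoint()
model.delete_model()
```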
Security
You can configure advanced security and infrastructure settings for the DeepSeek-R1 model, including virtual private cloud (VPC) networking, service role permissions, encryption settings, and EnableNetworkIsolation to restrict internet access. For production deployments, it’s essential to review these settings to maintain alignment with your organization’s security and compliance requirements.
By default, the model runs in a shared AWS managed VPC with internet access. To enhance security and control access, you should explicitly configure a private VPC with appropriate security groups and IAM policies based on your requirements.
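As an illustration, the following hedged sketch attaches a private VPC and network isolation to the model object from the deployment example. The subnet, security group, and S3 location are placeholders, and with network isolation enabled the weights must come from Amazon S3 rather than the Hugging Face Hub, as noted earlier.

```python
# Placeholder network settings -- substitute your own private subnets and security groups
vpc_config = {
    "Subnets": ["subnet-0123456789abcdef0"],
    "SecurityGroupIds": ["sg-0123456789abcdef0"],
}

secure_model = Model(
    image_uri=inference_image_uri,
    env={"HF_MODEL_ID": "s3://your-bucket/deepseek-r1-distill-llama-8b/"},  # S3 prefix, not the Hub
    role=role,
    vpc_config=vpc_config,
    enable_network_isolation=True,  # blocks outbound internet access from the container
)
```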
SageMaker AI provides enterprise-grade security features to help keep your data and applications secure and private. We don’t share your data with model providers unless you direct us to, giving you full control over your data. This applies to all models, both proprietary and publicly available, including DeepSeek-R1 on SageMaker.
For more details, see Configure security in Amazon SageMaker AI.
Logging and monitoring
You can monitor SageMaker AI using Amazon CloudWatch, which collects and processes raw data into readable, near real-time metrics. These metrics are retained for 15 months, allowing you to analyze historical trends and gain deeper insights into your application’s performance and health.
Additionally, you can configure alarms to monitor specific thresholds and trigger notifications or automated actions when those thresholds are met, helping you proactively manage your deployment.
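As a hedged example, the following sketch creates a CloudWatch alarm on the endpoint’s ModelLatency metric using boto3; the alarm name, endpoint name, and threshold are illustrative.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average model latency stays high for three consecutive 5-minute periods
cloudwatch.put_metric_alarm(
    AlarmName="deepseek-r1-endpoint-high-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "deepseek-r1-distill-llama-8b-ep"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=2_000_000,  # ModelLatency is reported in microseconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```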
For more details, see Metrics for monitoring Amazon SageMaker AI with Amazon CloudWatch.
Best practices
It’s always recommended to deploy your LLM endpoints inside your VPC and behind a private subnet, without internet gateways, and preferably with no egress. Ingress from the internet should also be blocked to minimize security risks.
Always apply guardrails to make sure incoming requests and outgoing model responses are validated for safety, bias, and toxicity. You can guard your SageMaker endpoint model responses with Amazon Bedrock Guardrails. See DeepSeek-R1 model now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart for more details.
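For instance, a minimal sketch that screens a model response with a pre-created Amazon Bedrock guardrail through the ApplyGuardrail API; the guardrail ID and version are placeholders you would replace with your own.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Validate the endpoint's output with an existing Bedrock guardrail before returning it
guardrail_response = bedrock_runtime.apply_guardrail(
    guardrailIdentifier="your-guardrail-id",  # placeholder
    guardrailVersion="1",                     # placeholder
    source="OUTPUT",  # use "INPUT" to screen prompts before invoking the endpoint
    content=[{"text": {"text": "model response text to validate"}}],
)
print(guardrail_response["action"])  # "NONE" or "GUARDRAIL_INTERVENED"
```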
Inference performance evaluation
In this section, we focus on the inference performance of DeepSeek-R1 distilled variants on SageMaker AI. Evaluating LLM performance in terms of end-to-end latency, throughput, and resource efficiency is crucial for providing responsiveness, scalability, and cost-effectiveness in real-world applications. Optimizing these metrics directly impacts user experience, system reliability, and deployment feasibility at scale. For this post, we test all DeepSeek-R1 distilled variants (1.5B, 7B, 8B, 14B, 32B, and 70B) against four performance metrics:
- End-to-end latency (time between sending a request and receiving the response)
- Throughput (tokens per second)
- Time to first token
- Inter-token latency
The main objective of this performance evaluation is to give you an indication of the relative performance of the distilled R1 models on different hardware for generic traffic patterns. We didn’t try to optimize the performance for each model/hardware/use case combination. These results shouldn’t be treated as the best possible performance of a particular model on a particular instance type. You should always perform your own testing using your own datasets, traffic patterns, and input/output sequence lengths.
Scenarios
We tested the following scenarios:
- Container/model configuration – We used LMI container v14 with default parameters, except MAX_MODEL_LEN, which was set to 10000 (no chunked prefill and no prefix caching). On instances with multiple accelerators, we sharded the model across all available GPUs.
- Tokens – We evaluated SageMaker endpoint hosted DeepSeek-R1 distilled variants on performance benchmarks using two sample input token lengths. We ran both tests 50 times each before measuring the average across the different metrics. Then we repeated the test with concurrency 10.
- Short-length test – 512 input tokens and 256 output tokens.
- Medium-length test – 3072 input tokens and 256 output tokens.
- Hardware – We tested the distilled variants on a variety of instance types with 1, 4, or 8 GPUs per instance. In the following table, a green cell indicates that a model was tested on that particular instance type, and red indicates that a model wasn’t tested with that instance type, either because the instance was oversized for the given model or too small to fit the model in memory.
Box plots
In the following sections, we use box plots to visualize model performance. A box plot is a concise visual summary that displays a dataset’s median, interquartile range (IQR), and potential outliers using a box for the middle 50% of the data, with whiskers extending to the smallest and largest non-outlier values. By examining the median’s placement within the box, the box’s size, and the whiskers’ lengths, you can quickly assess the data’s central tendency, variability, and skewness, as illustrated in the following figure.
DeepSeek-R1-Distill-Qwen-1.5B
This model can be deployed on a single GPU instance. The results indicate that the ml.g5.xlarge instance outperforms the ml.g6.xlarge instance across all measured performance criteria and concurrency settings.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Qwen-7B
DeepSeek-R1-Distill-Qwen-7B was tested on ml.g5.2xlarge and ml.g6e.2xlarge. Among all instances, ml.g6e.2xlarge demonstrated the highest performance.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Llama-8B
DeepSeek-R1-Distill-Llama-8B was benchmarked across ml.g5.2xlarge, ml.g5.12xlarge, ml.g6e.2xlarge, and ml.g6e.12xlarge, with ml.g6e.12xlarge demonstrating the highest performance among all instances.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Qwen-14B
We tested this model on ml.g6.12xlarge, ml.g5.12xlarge, ml.g6e.48xlarge, and ml.g6e.12xlarge. The instance with 8 GPUs (ml.g6e.48xlarge) showed the best results.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Qwen-32B
This is a fairly large model, and we only deployed it on multi-GPU instances: ml.g6.12xlarge, ml.g5.12xlarge, and ml.g6e.12xlarge. The latest generation (ml.g6e.12xlarge) showed the best performance across all concurrency settings.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
DeepSeek-R1-Distill-Llama-70B
We tested this model on two different 8-GPU instances: ml.g6e.48xlarge and ml.p4d.24xlarge. The latter showed the best performance.
The following figure illustrates testing with concurrency = 1.
The following figure illustrates testing with concurrency = 10.
Conclusion
Deploying DeepSeek models on SageMaker AI provides a robust solution for organizations seeking to use state-of-the-art language models in their applications. The combination of DeepSeek’s powerful models and SageMaker AI managed infrastructure offers a scalable and efficient approach to natural language processing tasks.
The performance evaluation section presents a comprehensive assessment of all DeepSeek-R1 distilled models across four key inference metrics, using 13 different NVIDIA accelerator instance types. This analysis offers valuable insights to assist in selecting the optimal instance type for deploying the DeepSeek-R1 solution.
Check out the complete code in the following GitHub repos:
For additional resources, refer to:
About the Authors
Dmitry Soldatkin is a Senior AI/ML Solutions Architect at Amazon Web Services (AWS), helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. You can connect with Dmitry on LinkedIn.
Vivek Gangasani is a Lead Specialist Solutions Architect for Inference at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.
Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy, and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using cutting-edge ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.