Open foundation models (FMs) have become a cornerstone of generative AI innovation, enabling organizations to build and customize AI applications while maintaining control over their costs and deployment strategies. By providing high-quality, openly available models, the AI community fosters rapid iteration, knowledge sharing, and cost-effective solutions that benefit both developers and end users. DeepSeek AI, a research company focused on advancing AI technology, has emerged as a significant contributor to this ecosystem. Their DeepSeek-R1 models represent a family of large language models (LLMs) designed to handle a wide range of tasks, from code generation to general reasoning, while maintaining competitive performance and efficiency.
Amazon Bedrock Custom Model Import enables the import and use of your customized models alongside existing FMs through a single serverless, unified API. You can access your imported custom models on demand and without the need to manage underlying infrastructure. Accelerate your generative AI application development by integrating your supported custom models with native Bedrock tools and features like Knowledge Bases, Guardrails, and Agents.
In this post, we explore how to deploy distilled versions of DeepSeek-R1 with Amazon Bedrock Custom Model Import, making them accessible to organizations looking to use state-of-the-art AI capabilities within the secure and scalable AWS infrastructure at an effective cost.
DeepSeek-R1 distilled versions
From the foundation of DeepSeek-R1, DeepSeek AI has created a series of distilled models based on both Meta's Llama and Qwen architectures, ranging from 1.5–70 billion parameters. The distillation process involves training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model by using it as a teacher, essentially transferring the knowledge and capabilities of the 671-billion-parameter model into more compact architectures. The resulting distilled models, such as DeepSeek-R1-Distill-Llama-8B (from the base model Llama-3.1-8B) and DeepSeek-R1-Distill-Llama-70B (from the base model Llama-3.3-70B-Instruct), offer different trade-offs between performance and resource requirements. Although distilled models might show some reduction in reasoning capabilities compared to the original 671B model, they significantly improve inference speed and reduce computational costs. For instance, smaller distilled models like the 8B version can process requests much faster and consume fewer resources, making them more cost-effective for production deployments, whereas larger distilled versions like the 70B model maintain performance closer to the original while still offering meaningful efficiency gains.
Solution overview
In this post, we demonstrate how to deploy distilled versions of DeepSeek-R1 models using Amazon Bedrock Custom Model Import. We focus on importing the currently supported variants, DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Llama-70B, which offer an optimal balance between performance and resource efficiency. You can import these models from Amazon Simple Storage Service (Amazon S3) or an Amazon SageMaker AI model repository, and deploy them in a fully managed and serverless environment through Amazon Bedrock. The following diagram illustrates the end-to-end flow.
In this workflow, model artifacts stored in Amazon S3 are imported into Amazon Bedrock, which then handles the deployment and scaling of the model automatically. This serverless approach eliminates the need for infrastructure management while providing enterprise-grade security and scalability.
You can use the Amazon Bedrock console to deploy through the graphical interface, following the instructions in this post, or alternatively use the following notebook to deploy programmatically with the Amazon Bedrock SDK.
Prerequisites
You should have the following prerequisites:
Prepare the model package
Complete the following steps to prepare the model package:
- Download the DeepSeek-R1-Distill-Llama model artifacts from Hugging Face, from one of the following links, depending on the model you want to deploy:
For more information, you can follow Hugging Face's Downloading models or Download files from the Hub instructions.
You typically need the following files:
- Model configuration file: config.json
- Tokenizer files: tokenizer.json, tokenizer_config.json, and tokenizer.model
- Model weight files in .safetensors format
- Upload these files to a folder in your S3 bucket, in the same AWS Region where you plan to use Amazon Bedrock. Take note of the S3 path you're using.
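If you prefer to script this step, the following is a minimal sketch of the upload using the AWS SDK for Python (Boto3); the bucket name, Region, prefix, and local folder are placeholder assumptions you would replace with your own values.

```python
import os
import boto3

# Placeholder values -- use your own bucket (in the same Region as Amazon Bedrock),
# prefix, and the local folder holding the downloaded model artifacts
BUCKET = "your-model-bucket"
PREFIX = "DeepSeek-R1-Distill-Llama-8B/"
LOCAL_DIR = "./DeepSeek-R1-Distill-Llama-8B"

s3 = boto3.client("s3", region_name="us-east-1")

# Upload config.json, the tokenizer files, and the .safetensors weight shards
for name in os.listdir(LOCAL_DIR):
    path = os.path.join(LOCAL_DIR, name)
    if os.path.isfile(path):
        print(f"Uploading {name} ...")
        s3.upload_file(path, BUCKET, PREFIX + name)
```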
Import the model
Complete the following steps to import the model:
- On the Amazon Bedrock console, choose Imported models under Foundation models in the navigation pane.
- Choose Import model.
- For Model name, enter a name for your model (it's recommended to use a versioning scheme in your name, for tracking your imported model).
- For Import job name, enter a name for your import job.
- For Model import settings, select Amazon S3 bucket as your import source, and enter the S3 path you noted earlier (provide the full path in the form s3://<your-bucket>/folder-with-model-artifacts/).
- For Encryption, optionally choose to customize your encryption settings.
- For Service access role, choose to either create a new IAM role or provide your own.
- Choose Import model.
Importing the model takes several minutes depending on the model being imported (for example, the Distill-Llama-8B model can take 5–20 minutes to complete).
Watch this video demo for a step-by-step guide.
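If you prefer the programmatic route mentioned earlier, the following is a minimal sketch of the same import step using the AWS SDK for Python (Boto3); the Region, S3 path, IAM role ARN, and names are placeholder assumptions you would replace with your own values.

```python
import time
import boto3

# The model import APIs live on the Amazon Bedrock control-plane client
bedrock = boto3.client("bedrock", region_name="us-east-1")

# Placeholder names, role ARN, and S3 path -- substitute your own
job = bedrock.create_model_import_job(
    jobName="deepseek-r1-distill-llama-8b-import-v1",
    importedModelName="deepseek-r1-distill-llama-8b-v1",
    roleArn="arn:aws:iam::123456789012:role/BedrockModelImportRole",
    modelDataSource={
        "s3DataSource": {"s3Uri": "s3://your-model-bucket/DeepSeek-R1-Distill-Llama-8B/"}
    },
)

# Poll until the import job finishes (5-20 minutes for the 8B model)
while True:
    status = bedrock.get_model_import_job(jobIdentifier=job["jobArn"])["status"]
    print("Import job status:", status)
    if status in ("Completed", "Failed"):
        break
    time.sleep(30)
```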
Test the imported model
After you import the model, you can test it by using the Amazon Bedrock Playground or directly through the Amazon Bedrock invocation APIs. To use the Playground, complete the following steps:
- On the Amazon Bedrock console, choose Chat / Text under Playgrounds in the navigation pane.
- From the model selector, choose your imported model name.
- Adjust the inference parameters as needed and write your test prompt. For example:
<|begin▁of▁sentence|><|User|>Given the following financial data: - Company A's revenue grew from $10M to $15M in 2023 - Operating costs increased by 20% - Initial operating costs were $7M Calculate the company's operating margin for 2023. Please reason step by step, and put your final answer within \boxed{}<|Assistant|>
Because we're using an imported model in the playground, we must include the "begin_of_sentence" and "User/Assistant" tags to properly format the context for DeepSeek models; these tags help the model understand the structure of the conversation and provide more accurate responses. If you're following the programmatic approach in the following notebook, this is handled automatically when configuring the model.
- Review the model response and metrics provided.
Note: When you invoke the model for the first time, if you encounter a ModelNotReadyException error, the SDK automatically retries the request with exponential backoff. The recovery time varies depending on the on-demand fleet size and model size. You can customize the retry behavior using the AWS SDK for Python (Boto3) Config object. For more information, see Handling ModelNotReadyException.
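For the invocation API route, the following is a minimal sketch that applies the DeepSeek prompt tags and a customized retry configuration; the model ARN is a placeholder, and the request body assumes the Llama-style schema used by imported Llama-architecture models.

```python
import json
import boto3
from botocore.config import Config

# Raise the retry ceiling so first-invocation ModelNotReadyException errors
# are retried automatically with exponential backoff
config = Config(retries={"max_attempts": 10, "mode": "standard"})
runtime = boto3.client("bedrock-runtime", region_name="us-east-1", config=config)

# Placeholder -- use the ARN of your imported model from the Bedrock console
model_arn = "arn:aws:bedrock:us-east-1:123456789012:imported-model/abc1234example"

# DeepSeek models expect the begin-of-sentence and User/Assistant tags
prompt = (
    "<|begin▁of▁sentence|><|User|>"
    "Calculate the company's operating margin for 2023, given revenue of $15M "
    "and operating costs of $8.4M. Please reason step by step, and put your "
    "final answer within \\boxed{}<|Assistant|>"
)

response = runtime.invoke_model(
    modelId=model_arn,
    body=json.dumps({"prompt": prompt, "max_gen_len": 512, "temperature": 0.6}),
)
print(json.loads(response["body"].read())["generation"])
```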
Once you're ready to import the model, use this step-by-step video demo to help you get started.
Pricing
Custom Model Import enables you to use your custom model weights within Amazon Bedrock for supported architectures, serving them alongside Amazon Bedrock hosted FMs in a fully managed way through On-Demand mode. Custom Model Import doesn't charge for model import; you're charged for inference based on two factors: the number of active model copies and their duration of activity.
Billing occurs in 5-minute windows, starting from the first successful invocation of each model copy. The price per model copy per minute varies based on factors including architecture, context length, Region, and compute unit version, and is tiered by model copy size. The Custom Model Units required for hosting depend on the model's architecture, parameter count, and context length, with examples ranging from 2 Units for a Llama 3.1 8B 128K model to 8 Units for a Llama 3.1 70B 128K model.
Amazon Bedrock automatically manages scaling, maintaining zero to three model copies by default (adjustable through Service Quotas) based on your usage patterns. If there are no invocations for 5 minutes, it scales to zero and scales back up when needed, though this may involve cold-start latency of tens of seconds. Additional copies are added if inference volume consistently exceeds single-copy concurrency limits. The maximum throughput and concurrency per copy is determined during import, based on factors such as input/output token mix, hardware type, model size, architecture, and inference optimizations.
Consider the following pricing example: an application developer imports a customized Llama 3.1 type model that is 8B parameters in size with a 128K sequence length in the us-east-1 Region and deletes the model after 1 month. This requires 2 Custom Model Units. So, the price per minute will be $0.1570, and the model storage costs will be $3.90 for the month.
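To make the arithmetic concrete, here is a rough cost sketch built only from the numbers above; the number of active minutes per month is an illustrative assumption.

```python
# From the example above: 2 Custom Model Units for a Llama 3.1 8B 128K model in us-east-1
price_per_minute = 0.1570   # USD per minute while a model copy is active
storage_per_month = 3.90    # USD per month for model storage

# Illustrative assumption: one model copy active 2 hours per day for 30 days
active_minutes = 2 * 60 * 30

inference_cost = price_per_minute * active_minutes
total = inference_cost + storage_per_month
print(f"Inference: ${inference_cost:.2f} + storage: ${storage_per_month:.2f} = ${total:.2f}")
# Inference: $565.20 + storage: $3.90 = $569.10
```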
For more information, see Amazon Bedrock pricing.
Benchmarks
DeepSeek has published benchmarks evaluating their distilled models against the original DeepSeek-R1 and base Llama models, available in the model repositories. The benchmarks show that, depending on the task, DeepSeek-R1-Distill-Llama-70B maintains between 80–90% of the original model's reasoning capabilities, while the 8B version achieves between 59–92% performance with significantly reduced resource requirements. Both distilled versions demonstrate improvements over their corresponding base Llama models in specific reasoning tasks.
Other considerations
When deploying DeepSeek models in Amazon Bedrock, consider the following factors:
- Model versioning is essential. Because Custom Model Import creates unique models for each import, implement a clear versioning strategy in your model names to track different versions and variations.
- The currently supported model formats focus on Llama-based architectures. Although DeepSeek-R1 distilled versions offer excellent performance, the AI ecosystem continues to evolve rapidly. Keep an eye on the Amazon Bedrock model catalog as new architectures and larger models become available through the platform.
- Evaluate your use case requirements carefully. Although larger models like DeepSeek-R1-Distill-Llama-70B provide better performance, the 8B version might offer sufficient capability for many applications at a lower cost.
- Consider implementing monitoring and observability. Amazon CloudWatch provides metrics for your imported models, helping you track usage patterns and performance (see the sketch after this list). You can monitor costs with AWS Cost Explorer.
- Start with a lower concurrency quota and scale up based on actual usage patterns. The default limit of three concurrent model copies per account is suitable for most initial deployments.
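As an example of the monitoring suggestion above, the following sketch pulls daily invocation counts for an imported model from CloudWatch; it assumes the AWS/Bedrock namespace and Invocations metric that Bedrock publishes per model ID, and the model ARN is a placeholder.

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholder -- for an imported model, the ModelId dimension is its ARN
model_arn = "arn:aws:bedrock:us-east-1:123456789012:imported-model/abc1234example"

# Daily invocation counts over the past week
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="Invocations",
    Dimensions=[{"Name": "ModelId", "Value": model_arn}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=86400,  # one day, in seconds
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), int(point["Sum"]))
```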
Conclusion
Amazon Bedrock Custom Model Import empowers organizations to use powerful publicly available models like DeepSeek-R1 distilled versions, among others, while benefiting from enterprise-grade infrastructure. The serverless nature of Amazon Bedrock eliminates the complexity of managing model deployments and operations, allowing teams to focus on building applications rather than infrastructure. With features like auto scaling, pay-per-use pricing, and seamless integration with AWS services, Amazon Bedrock provides a production-ready environment for AI workloads. The combination of DeepSeek's innovative distillation approach and the Amazon Bedrock managed infrastructure offers an optimal balance of performance, cost, and operational efficiency. Organizations can start with smaller models and scale up as needed, while maintaining full control over their model deployments and benefiting from AWS security and compliance capabilities.
The ability to choose between proprietary and open FMs in Amazon Bedrock gives organizations the flexibility to optimize for their specific needs. Open models enable cost-effective deployment with full control over the model artifacts, making them ideal for scenarios where customization, cost optimization, or model transparency are critical. This flexibility, combined with the Amazon Bedrock unified API and enterprise-grade infrastructure, allows organizations to build resilient AI systems that can adapt as their requirements evolve.
For more information, refer to the Amazon Bedrock User Guide.
About the Authors
Raj Pathak is a Principal Solutions Architect and technical advisor to Fortune 50 and mid-sized FSI (banking, insurance, capital markets) customers across Canada and the United States. Raj specializes in machine learning with applications in generative AI, natural language processing, intelligent document processing, and MLOps.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Ishan Singh is a Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.
Morgan Rankey is a Solutions Architect based in New York City, specializing in hedge funds. He excels in helping customers build resilient workloads within the AWS ecosystem. Prior to joining AWS, Morgan led the Sales Engineering team at Riskified through its IPO. He started his career by focusing on AI/ML solutions for machine asset management, serving some of the largest automotive companies globally.
Harsh Patel is an AWS Solutions Architect supporting 200+ SMB customers across the United States to drive digital transformation through cloud-native solutions. As an AI&ML Specialist, he focuses on generative AI, computer vision, reinforcement learning, and anomaly detection. Outside the tech world, he recharges by hitting the golf course and embarking on scenic hikes with his dog.