The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out and code out). The Qwen 2.5 fine-tuned, text-only models are optimized for multilingual dialogue use cases and outperform both previous generations of Qwen models and many of the publicly available chat models on common industry benchmarks.
At its core, Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture. The Qwen2.5 collection supports over 29 languages and has enhanced role-playing abilities and condition-setting for chatbots.
In this post, we outline how to get started deploying the Qwen 2.5 family of models on an Inferentia instance using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker, with the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. The Qwen2.5 Coder and Math variants are also supported.
Preparation
Hugging Face provides two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.
The first time a model is run on Inferentia or Trainium, you compile the model to make sure you have a version that will perform optimally on Inferentia and Trainium chips. The Optimum Neuron library from Hugging Face, together with the Optimum Neuron cache, will transparently supply a compiled model when one is available. If you're using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying. For more information, see Compiling a model for Inferentia or Trainium.
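If you do need to compile ahead of time, a minimal sketch using the Optimum Neuron CLI looks like the following; the batch size, sequence length, core count, and output directory are assumptions to adjust for your model and instance size:

```bash
# Illustrative export of a Qwen2.5 model to a Neuron-compiled artifact.
# The flag values (batch size, sequence length, cores, cast type) are examples only.
optimum-cli export neuron \
  --model Qwen/Qwen2.5-7B-Instruct \
  --batch_size 1 \
  --sequence_length 4096 \
  --num_cores 2 \
  --auto_cast_type bf16 \
  ./qwen2.5-7b-instruct-neuron
```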
You can deploy TGI as a Docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.
Option 1: Deploy TGI on Amazon EC2 Inf2
In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)
For this option, you SSH into the instance and create a .env file (where you define your constants and specify where your model is cached) and a file named docker-compose.yaml (where you define all of the environment parameters you need to deploy your model for inference). You can copy the following files for this use case.
- Create a .env file with the following content:
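As a minimal sketch, a .env for the Hugging Face neuronx TGI container might look like the following; the values are illustrative and should be adapted to your model, sequence lengths, and cache location:

```bash
# Example constants consumed by docker-compose.yaml; values are illustrative.
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
HF_AUTO_CAST_TYPE='bf16'
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096
HF_CACHE_DIR=/home/ubuntu/hf_cache   # host path where the model is cached (assumed)
```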
- Create a file named docker-compose.yaml with the following content:
- Use Docker Compose to deploy the model:
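A minimal sketch of such a docker-compose.yaml, assuming the Hugging Face neuronx TGI image and a single Neuron device, might look like the following; the image tag, port, and device mapping are assumptions to adjust for your container version and instance size:

```yaml
# Illustrative compose file for serving the model with the neuronx TGI container.
version: '3.7'

services:
  tgi:
    image: ghcr.io/huggingface/neuronx-tgi:latest  # pin a specific tag in practice
    ports:
      - "8081:8081"
    environment:
      - PORT=8081
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
    volumes:
      - ${HF_CACHE_DIR}:/data   # cache location defined in the .env file
    devices:
      - "/dev/neuron0"
```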
```bash
docker compose -f docker-compose.yaml --env-file .env up
```
- To confirm that the model deployed correctly, send a test prompt to the model:
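A minimal sketch of a test request against TGI's /generate endpoint, assuming the port 8081 mapping from the compose sketch above:

```bash
curl 127.0.0.1:8081/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 128}}'
```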
- To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:
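For example, assuming the same endpoint and port, a Chinese prompt asking the model to introduce itself might look like:

```bash
curl 127.0.0.1:8081/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "请用中文介绍一下你自己。", "parameters": {"max_new_tokens": 128}}'
```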
Option 2: Deploy TGI on SageMaker
You can also use Hugging Face's Optimum Neuron library to quickly deploy models directly from SageMaker, using the instructions on the Hugging Face Model Hub.
- From the Qwen 2.5 model card on the Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.
- Copy the example code into a SageMaker notebook, then choose Run.
- The notebook you copied will look like the following:
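A minimal sketch of what such a notebook typically contains, assuming the SageMaker Python SDK's Hugging Face LLM (neuronx) container; the environment values, instance type, and timeouts are illustrative and should be taken from the snippet on the model card:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Environment for the Hugging Face neuronx TGI container; values are illustrative
# and should match the snippet shown on the model card.
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

# Create the model object backed by the neuronx TGI image.
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx"),
    env=hub,
    role=role,
)

# Deploy to an Inferentia2 endpoint; startup includes model loading,
# so the health check timeout is generous.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# Send a test request to the endpoint.
print(
    predictor.predict(
        {"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 128}}
    )
)
```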
Clean Up
Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.
Terminate EC2 instances through the AWS Management Console.
Terminate a SageMaker endpoint through the console or with the following commands:
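For example, if you deployed with the SageMaker Python SDK as in the notebook sketch above, the predictor object can remove both the model and the endpoint:

```python
# Delete the model and the SageMaker endpoint to stop ongoing charges.
predictor.delete_model()
predictor.delete_endpoint()
```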
Conclusion
AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We're excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.
About the Authors
Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as the team at Hugging Face. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor's degree in mathematics from Carnegie Mellon University and a master's degree in economics from the University of Virginia.
Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies in selecting and implementing the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.
Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early-stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in Generative BI. Rhia holds a bachelor's degree in Information Science from the University of Maryland.
Paul Aiuto is a Senior Solutions Architect Manager specializing in Startups at AWS. Paul created a team of AWS Startup Solutions Architects that focus on the adoption of Inferentia and Trainium. Paul holds a bachelor's degree in Computer Science from Siena College and has multiple Cyber Security certifications.