The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out and code out). The Qwen 2.5 fine-tuned, text-only models are optimized for multilingual dialogue use cases and outperform both previous generations of Qwen models and many of the publicly available chat models on common industry benchmarks.
At its core, Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture. The Qwen2.5 collection supports over 29 languages and has enhanced role-playing abilities and condition-setting for chatbots.
In this post, we outline how to get started deploying the Qwen 2.5 family of models on an Inferentia instance using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker, with the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. The Qwen2.5 Coder and Math variants are also supported.
Preparation
Hugging Face provides two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.
The first time a model is run on Inferentia or Trainium, you compile the model to make sure you have a version that will perform optimally on Inferentia and Trainium chips. The Optimum Neuron library from Hugging Face, together with the Optimum Neuron cache, will transparently supply a compiled model when one is available. If you're using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying. For more information, see Compiling a model for Inferentia or Trainium.
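If you do need to compile ahead of time, a minimal sketch using the Optimum Neuron CLI looks like the following; the batch size, sequence length, core count, and output directory are assumptions to adjust for your model and instance size:

```bash
# Illustrative export of a Qwen2.5 model to a Neuron-compiled artifact.
# The flag values (batch size, sequence length, cores, cast type) are examples only.
optimum-cli export neuron \
  --model Qwen/Qwen2.5-7B-Instruct \
  --batch_size 1 \
  --sequence_length 4096 \
  --num_cores 2 \
  --auto_cast_type bf16 \
  ./qwen2.5-7b-instruct-neuron
```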
You can deploy TGI as a Docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.
Option 1: Deploy TGI on Amazon EC2 Inf2
In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)
For this option, you SSH into the instance and create a .env file (where you define your constants and specify where your model is cached) and a file named docker-compose.yaml (where you define all of the environment parameters you need to deploy your model for inference). You can copy the following files for this use case.
- Create a .env file with the following content:
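As a minimal sketch, a .env for the Hugging Face neuronx TGI container might look like the following; the values are illustrative and should be adapted to your model, sequence lengths, and cache location:

```bash
# Example constants consumed by docker-compose.yaml; values are illustrative.
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
HF_AUTO_CAST_TYPE='bf16'
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096
HF_CACHE_DIR=/home/ubuntu/hf_cache   # host path where the model is cached (assumed)
```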
- Create a file named docker-compose.yaml with the following content:
- Use Docker Compose to deploy the model:
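A minimal sketch of such a docker-compose.yaml, assuming the Hugging Face neuronx TGI image and a single Neuron device, might look like the following; the image tag, port, and device mapping are assumptions to adjust for your container version and instance size:

```yaml
# Illustrative compose file for serving the model with the neuronx TGI container.
version: '3.7'

services:
  tgi:
    image: ghcr.io/huggingface/neuronx-tgi:latest  # pin a specific tag in practice
    ports:
      - "8081:8081"
    environment:
      - PORT=8081
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
    volumes:
      - ${HF_CACHE_DIR}:/data   # cache location defined in the .env file
    devices:
      - "/dev/neuron0"
```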
```bash
docker compose -f docker-compose.yaml --env-file .env up
```
- To confirm that the model deployed correctly, send a test prompt to the model:
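A minimal sketch of a test request against TGI's /generate endpoint, assuming the port 8081 mapping from the compose sketch above:

```bash
curl 127.0.0.1:8081/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 128}}'
```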
- To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:
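For example, assuming the same endpoint and port, a Chinese prompt asking the model to introduce itself might look like:

```bash
curl 127.0.0.1:8081/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "请用中文介绍一下你自己。", "parameters": {"max_new_tokens": 128}}'
```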
Option 2: Deploy TGI on SageMaker
You can also use Hugging Face's Optimum Neuron library to quickly deploy models directly from SageMaker, using the instructions on the Hugging Face Model Hub.
- From the Qwen 2.5 model card on the Hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.
- Copy the example code into a SageMaker notebook, then choose Run.
- The notebook you copied will look like the following:
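A minimal sketch of what such a notebook typically contains, assuming the SageMaker Python SDK's Hugging Face LLM (neuronx) container; the environment values, instance type, and timeouts are illustrative and should be taken from the snippet on the model card:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Environment for the Hugging Face neuronx TGI container; values are illustrative
# and should match the snippet shown on the model card.
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

# Create the model object backed by the neuronx TGI image.
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx"),
    env=hub,
    role=role,
)

# Deploy to an Inferentia2 endpoint; startup includes model loading,
# so the health check timeout is generous.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# Send a test request to the endpoint.
print(
    predictor.predict(
        {"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 128}}
    )
)
```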
Clean Up
Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.
Terminate EC2 instances through the AWS Management Console.
Terminate a SageMaker endpoint through the console or with the following commands:
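For example, if you deployed with the SageMaker Python SDK as in the notebook sketch above, the predictor object can remove both the model and the endpoint:

```python
# Delete the model and the SageMaker endpoint to stop ongoing charges.
predictor.delete_model()
predictor.delete_endpoint()
```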
Conclusion
AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We're excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.
About the Authors
Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as the team at Hugging Face. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor's degree in mathematics from Carnegie Mellon University and a master's degree in economics from the University of Virginia.
Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies in selecting and implementing the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.
Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early-stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in Generative BI. Rhia holds a bachelor's degree in Information Science from the University of Maryland.
Paul Aiuto is a Senior Solutions Architect Manager specializing in Startups at AWS. Paul created a team of AWS Startup Solutions Architects that focus on the adoption of Inferentia and Trainium. Paul holds a bachelor's degree in Computer Science from Siena College and has multiple Cyber Security certifications.