
How to run Qwen 2.5 on AWS AI chips using Hugging Face libraries

March 17, 2025


The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out and code out). The Qwen 2.5 fine-tuned text-only models are optimized for multilingual dialogue use cases and outperform both previous generations of Qwen models and many of the publicly available chat models on common industry benchmarks.

At its core, Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture. The Qwen2.5 collection supports over 29 languages and has enhanced role-playing abilities and condition-setting for chatbots.

In this post, we outline how to get started with deploying the Qwen 2.5 family of models on an Inferentia instance using Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker, using the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. Qwen2.5 Coder and Math variants are also supported.

Preparation

Hugging Face offers two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.

The first time a model is run on Inferentia or Trainium, you compile the model to make sure that you have a version that will perform optimally on Inferentia and Trainium chips. The Optimum Neuron library from Hugging Face, together with the Optimum Neuron cache, will transparently supply a compiled model when one is available. If you're using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying, as sketched below. For more information, see Compiling a model for Inferentia or Trainium.
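
If you do need to compile ahead of time, the Optimum Neuron Python API can export and save the model. The following is a minimal sketch; the compiler settings shown (batch size, sequence length, number of cores, cast type) are illustrative and should match how you plan to serve the model.

from optimum.neuron import NeuronModelForCausalLM

# Compile the model for Neuron; export=True triggers ahead-of-time compilation
model = NeuronModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    export=True,
    batch_size=4,
    sequence_length=4096,
    num_cores=2,
    auto_cast_type="bf16",
)

# Save the compiled artifacts to a local directory; MODEL_ID can then point
# here (for example, /data/exportedmodel inside the serving container)
model.save_pretrained("./exportedmodel")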

You can deploy TGI as a Docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.

Option 1: Deploy TGI on Amazon EC2 Inf2

In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)

For this option, you SSH into the instance and create a .env file (where you'll define your constants and specify where your model is cached) and a file named docker-compose.yaml (where you'll define all of the environment parameters that you'll need to deploy your model for inference). You can copy the following files for this use case.

  1. Create a .env file with the following content:
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
#MODEL_ID='/data/exportedmodel'
HF_AUTO_CAST_TYPE='bf16' # indicates the auto cast type that was used to compile the model
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096

  2. Create a file named docker-compose.yaml with the following content:
version: '3.7'

services:
  tgi-1:
    image: ghcr.io/huggingface/neuronx-tgi:latest
    ports:
      - "8081:8081"
    environment:
      - PORT=8081
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
      - MAX_CONCURRENT_REQUESTS=512
      #- HF_TOKEN=${HF_TOKEN} # only needed for gated models
    volumes:
      - $PWD:/data # can be removed if you aren't loading locally
    devices:
      - "/dev/neuron0"

  3. Use docker compose to deploy the model:

docker compose -f docker-compose.yaml --env-file .env up
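
On first launch, the container can take several minutes to pull the image, download the weights, and load the compiled model onto the Neuron cores. The Python snippet below is one way to wait for readiness; it assumes TGI's /health route and the port mapping from the compose file above.

import time

import requests

# Poll TGI's health route until the server reports ready
while True:
    try:
        if requests.get("http://127.0.0.1:8081/health", timeout=5).status_code == 200:
            print("TGI is ready")
            break
    except requests.exceptions.ConnectionError:
        pass  # server not accepting connections yet
    time.sleep(10)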

  4. To confirm that the model deployed correctly, send a test prompt to the model:
curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
  "inputs":"Tell me about AWS.",
  "parameters":{
    "max_new_tokens":60
  }
}' \
    -H 'Content-Type: application/json'
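
You can run the same test from Python instead of curl using the huggingface_hub client. This sketch assumes the local endpoint defined in the compose file above.

from huggingface_hub import InferenceClient

# Point the client at the local TGI endpoint
client = InferenceClient("http://127.0.0.1:8081")

print(client.text_generation("Tell me about AWS.", max_new_tokens=60))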

  5. To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:
#"Inform me tips on how to open an AWS account"
curl 127.0.0.1:8081/generate 
    -X POST 
    -d '{
  "inputs":"告诉我如何开设 AWS 账户。", 
  "parameters":{
    "max_new_tokens":60
  }
}' 
    -H 'Content material-Kind: software/json'
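
Note that the /generate route takes raw text. Instruction-tuned models such as Qwen2.5-7B-Instruct generally respond better when the prompt is wrapped in the model's chat template, which you can build with the Transformers tokenizer, for example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Wrap the user message in Qwen's chat template before sending it as "inputs"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "告诉我如何开设 AWS 账户。"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)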

Option 2: Deploy TGI on SageMaker

You can also use Hugging Face's Optimum Neuron library to quickly deploy models directly from SageMaker using instructions on the Hugging Face Model Hub.

  1. From the Qwen 2.5 model card hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.

(Screenshots: deploying the model on Amazon SageMaker, and finding the code you'll need to deploy the model using AWS Inferentia and Trainium.)

  2. Copy the example code into a SageMaker notebook, then choose Run.
  3. The notebook you copied will look like the following:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}


region = boto3.Session().region_name
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.27-neuronx-py310-ubuntu22.04"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)
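
The TGI container typically returns a list of dictionaries with a generated_text field, so a quick way to capture and print the answer looks like the following (adjust if your container version returns a different shape):

# Capture the response and print the generated text
response = predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {"max_new_tokens": 128},
    }
)
print(response[0]["generated_text"])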

Clean Up

Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.

Terminate EC2 instances through the AWS Management Console, or script it with boto3 as shown below.
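
The following snippet is a sketch of terminating the instance programmatically; the instance ID is a placeholder that you would replace with your own.

import boto3

# Terminate the Inf2 instance (replace the placeholder ID with yours)
ec2 = boto3.client("ec2")
ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])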

Delete a SageMaker endpoint through the console or with the following commands:

predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

Conclusion

AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We're excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.


About the Authors

Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as the team at Hugging Face. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor's degree in mathematics from Carnegie Mellon University and a master's degree in economics from the University of Virginia.

Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies to select and implement the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.

Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early-stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in Generative BI. Rhia holds a bachelor's degree in Information Science from the University of Maryland.

Paul Aiuto is a Senior Solutions Architect Manager specializing in Startups at AWS. Paul created a team of AWS Startup Solutions Architects that focus on the adoption of Inferentia and Trainium. Paul holds a bachelor's degree in Computer Science from Siena College and has several Cyber Security certifications.
