PixArt-Sigma is a diffusion transformer model capable of image generation at 4K resolution. It shows significant improvements over previous-generation PixArt models such as PixArt-Alpha, and over other diffusion models, through dataset and architectural improvements. AWS Trainium and AWS Inferentia are purpose-built AI chips that accelerate machine learning (ML) workloads, making them well suited for cost-effective deployment of large generative models. By using these AI chips, you can achieve strong performance and efficiency when running inference with diffusion transformer models like PixArt-Sigma.
This post is the first in a series in which we run multiple diffusion transformers on Trainium and Inferentia-powered instances. In this post, we show how you can deploy PixArt-Sigma to Trainium and Inferentia-powered instances.
Solution overview
The following steps deploy the PixArt-Sigma model on AWS Trainium and run inference on it to generate high-quality images.
- Step 1 – Prerequisites and setup
- Step 2 – Download and compile the PixArt-Sigma model for AWS Trainium
- Step 3 – Deploy the model on AWS Trainium to generate images
Step 1 – Prerequisites and setup
To get started, you need to set up a development environment on a trn1, trn2, or inf2 host. Complete the following steps:
- Launch a trn1.32xlarge or trn2.48xlarge instance with a Neuron DLAMI. For instructions on how to get started, refer to Get Started with Neuron on Ubuntu 22 with Neuron Multi-Framework DLAMI.
- Launch a Jupyter Notebook server. For instructions to set up a Jupyter server, refer to the following user guide.
- Clone the aws-neuron-samples GitHub repository.
- Navigate to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook.
The provided example script is designed to run on a Trn2 instance, but you can adapt it for Trn1 or Inf2 instances with minimal modifications. Specifically, within the notebook and in each of the component files under the neuron_pixart_sigma directory, you will find commented-out changes to accommodate Trn1 or Inf2 configurations.
Step 2 – Download and compile the PixArt-Sigma model for AWS Trainium
This section provides a step-by-step guide to compiling PixArt-Sigma for AWS Trainium.
Download the model
You will find a helper function in cache_hf_model.py in the GitHub repository mentioned above that shows how to download the PixArt-Sigma model from Hugging Face. If you are using PixArt-Sigma in your own workload and choose not to use the script included in this post, you can use the huggingface-cli to download the model instead.
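As a rough sketch of a programmatic download, the snippet below uses `snapshot_download` from the `huggingface_hub` package. The repository ID and cache directory are the ones passed to the pipeline later in this post; the helper name `download_pixart_sigma` is hypothetical, and this is not necessarily how cache_hf_model.py does it.

```python
from pathlib import Path

# Values taken from the pipeline instantiation shown later in this post.
REPO_ID = "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS"
CACHE_DIR = "pixart_sigma_hf_cache_dir_1024"


def download_pixart_sigma(repo_id: str = REPO_ID, cache_dir: str = CACHE_DIR) -> Path:
    """Download a model snapshot into a local cache directory (hypothetical helper)."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(repo_id=repo_id, cache_dir=cache_dir)
    return Path(local_path)


# Example (requires network access and the huggingface_hub package):
# print(download_pixart_sigma())
```

Downloading into the same cache directory that the pipeline reads from is what should let a later load with `local_files_only=True` work without network access.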
The Neuron PixArt-Sigma implementation contains a few scripts and classes. The files and scripts are broken down as follows:
├── compile_latency_optimized.sh # Full Model Compilation script for Latency Optimized
├── compile_throughput_optimized.sh # Full Model Compilation script for Throughput Optimized
├── hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb # Notebook to run Latency Optimized Pixart-Sigma
├── hf_pretrained_pixart_sigma_1k_throughput_optimized.ipynb # Notebook to run Throughput Optimized Pixart-Sigma
├── neuron_pixart_sigma
│   ├── cache_hf_model.py # Model downloading Script
│   ├── compile_decoder.py # Decoder Compilation Script and Wrapper Class
│   ├── compile_text_encoder.py # Text Encoder Compilation Script and Wrapper Class
│   ├── compile_transformer_latency_optimized.py # Latency Optimized Transformer Compilation Script and Wrapper Class
│   ├── compile_transformer_throughput_optimized.py # Throughput Optimized Transformer Compilation Script and Wrapper Class
│   ├── neuron_commons.py # Base Classes and Attention Implementation
│   └── neuron_parallel_utils.py # Sharded Attention Implementation
└── requirements.txt
The notebook helps you download the model, compile the individual component models, and invoke the generation pipeline to generate an image. Although the notebooks can be run as a standalone sample, the next few sections of this post walk through the key implementation details within the component files and scripts that support running PixArt-Sigma on Neuron.
For each component of PixArt (T5, Transformer, and VAE), the example uses Neuron-specific wrapper classes. These wrapper classes serve two purposes. First, they allow us to trace the models for compilation:
import torch.nn as nn
from transformers import T5EncoderModel

class InferenceTextEncoderWrapper(nn.Module):
    def __init__(self, dtype, t: T5EncoderModel, seqlen: int):
        super().__init__()
        self.dtype = dtype
        self.device = t.device
        self.t = t  # the wrapped T5 encoder
    def forward(self, text_input_ids, attention_mask=None):
        # Return the last hidden state, cast to the target dtype, inside a list
        return [self.t(text_input_ids, attention_mask)['last_hidden_state'].to(self.dtype)]
Please refer to the neuron_commons.py file for all wrapper modules and classes.
The second reason for using wrapper classes is to modify the attention implementation to run on Neuron. Because diffusion models like PixArt are typically compute-bound, you can improve performance by sharding the attention layer across multiple devices. To do this, you replace the linear layers with NeuronX Distributed's RowParallelLinear and ColumnParallelLinear layers:
from neuronx_distributed.parallel_layers.layers import ColumnParallelLinear, RowParallelLinear
from transformers.models.t5.modeling_t5 import T5Attention

# get_sharded_data is a helper from the sample repository.
def shard_t5_self_attention(tp_degree: int, selfAttention: T5Attention):
    # Each rank keeps n_heads // tp_degree attention heads
    orig_inner_dim = selfAttention.q.out_features
    dim_head = orig_inner_dim // selfAttention.n_heads
    original_nheads = selfAttention.n_heads
    selfAttention.n_heads = selfAttention.n_heads // tp_degree
    selfAttention.inner_dim = dim_head * selfAttention.n_heads
    # Q, K, and V projections: shard the output features (weight dim 0)
    orig_q = selfAttention.q
    selfAttention.q = ColumnParallelLinear(
        selfAttention.q.in_features,
        selfAttention.q.out_features,
        bias=False,
        gather_output=False)
    selfAttention.q.weight.data = get_sharded_data(orig_q.weight.data, 0)
    del(orig_q)
    orig_k = selfAttention.k
    selfAttention.k = ColumnParallelLinear(
        selfAttention.k.in_features,
        selfAttention.k.out_features,
        bias=(selfAttention.k.bias is not None),
        gather_output=False)
    selfAttention.k.weight.data = get_sharded_data(orig_k.weight.data, 0)
    del(orig_k)
    orig_v = selfAttention.v
    selfAttention.v = ColumnParallelLinear(
        selfAttention.v.in_features,
        selfAttention.v.out_features,
        bias=(selfAttention.v.bias is not None),
        gather_output=False)
    selfAttention.v.weight.data = get_sharded_data(orig_v.weight.data, 0)
    del(orig_v)
    # Output projection: shard the input features (weight dim 1)
    orig_out = selfAttention.o
    selfAttention.o = RowParallelLinear(
        selfAttention.o.in_features,
        selfAttention.o.out_features,
        bias=(selfAttention.o.bias is not None),
        input_is_parallel=True)
    selfAttention.o.weight.data = get_sharded_data(orig_out.weight.data, 1)
    del(orig_out)
    return selfAttention
Please refer to the neuron_parallel_utils.py file for more details on parallel attention.
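To see why pairing a column-parallel layer with a row-parallel layer preserves the result, the following pure-Python sketch splits two small weight matrices across a hypothetical `tp_degree` of 2 and checks that the summed partial outputs match the unsharded product. It is only an illustration of the arithmetic: the matrices and helper names are made up, the attention computation between the projections is omitted, and weights are stored here as (in_features, out_features), the transpose of the PyTorch linear-layer layout (which is why the code above shards dimension 0 for the column-parallel layers and dimension 1 for the row-parallel layer).

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks (column-parallel layout)."""
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks (row-parallel layout)."""
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


tp_degree = 2                              # illustrative value, not from the post
x = [[1, 2], [3, 4]]                       # (batch=2, hidden=2)
w_in = [[1, 0, 2, 1], [0, 1, 1, 3]]        # (hidden=2, inner=4), like q/k/v
w_out = [[1, 2], [0, 1], [2, 0], [1, 1]]   # (inner=4, hidden=2), like o

# Unsharded reference: x @ w_in @ w_out
reference = matmul(matmul(x, w_in), w_out)

# Each "core" holds one column block of w_in and the matching row block of w_out.
sharded = [[0, 0], [0, 0]]
for w_in_shard, w_out_shard in zip(split_columns(w_in, tp_degree),
                                   split_rows(w_out, tp_degree)):
    local_hidden = matmul(x, w_in_shard)         # no gather needed here
    partial = matmul(local_hidden, w_out_shard)  # partial output on this core
    sharded = add(sharded, partial)              # stands in for the all-reduce

print(sharded == reference)
```

Because each core's intermediate result feeds only its own slice of the second matrix, the only cross-core communication in this scheme is the final sum.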
Compile individual sub-models
The PixArt-Sigma model is composed of three components. Each component is compiled so the entire generation pipeline can run on Neuron:
- Text encoder – A 4-billion-parameter encoder, which translates a human-readable prompt into an embedding. In the text encoder, the attention layers are sharded, along with the feed-forward layers, with tensor parallelism.
- Denoising transformer model – A 700-million-parameter transformer, which iteratively denoises a latent (a numerical representation of a compressed image). In the transformer, the attention layers are sharded, along with the feed-forward layers, with tensor parallelism.
- Decoder – A VAE decoder that converts our denoiser-generated latent to an output image. For the decoder, the model is deployed with data parallelism.
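For a sense of scale, the parameter counts above can be turned into a back-of-the-envelope estimate of weight memory. The sketch below assumes bfloat16 weights (2 bytes per parameter, matching the `torch.bfloat16` dtype used when the pipeline is created) and an even split across cores. The `tp_degree` of 4 is only an illustrative assumption, and activations and compiler overhead are ignored.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def weight_gib(num_params: int, tp_degree: int = 1) -> float:
    """Approximate weight memory per shard in GiB, assuming an even split."""
    return num_params * BYTES_PER_BF16 / tp_degree / 2**30


text_encoder_params = 4_000_000_000  # "4-billion-parameter" text encoder
transformer_params = 700_000_000     # "700-million-parameter" transformer
tp_degree = 4                        # assumption for illustration only

for name, params in [("text encoder", text_encoder_params),
                     ("transformer", transformer_params)]:
    total = weight_gib(params)
    per_core = weight_gib(params, tp_degree)
    print(f"{name}: {total:.2f} GiB total, {per_core:.2f} GiB per core at tp_degree={tp_degree}")
```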
Now that the model definition is ready, you need to trace a model to run it on Trainium or Inferentia. The following code block shows how to use the trace() function to compile the decoder component model for PixArt:
compiled_decoder = torch_neuronx.trace(
    decoder,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/decoder",
    compiler_args=compiler_flags,
    inline_weights_to_neff=False
)
Please refer to the compile_decoder.py file for more on how to instantiate and compile the decoder.
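Tracing compiles the model for the shapes of the sample inputs, so the decoder's sample input has to match the latent produced for the target image size. The sketch below shows that shape arithmetic under two assumptions that are not stated in this post: a VAE with 4 latent channels and an 8x spatial downsampling factor. The helper name `latent_shape` is hypothetical.

```python
def latent_shape(height, width, batch_size=1, latent_channels=4, vae_scale_factor=8):
    """Return the (N, C, H, W) latent shape for a given output image size.

    latent_channels and vae_scale_factor are assumed defaults; check the
    VAE configuration of the model you actually load.
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("image size must be a multiple of the VAE scale factor")
    return (batch_size, latent_channels,
            height // vae_scale_factor, width // vae_scale_factor)


# 1024 x 1024 output, as used by the notebooks in this post
print(latent_shape(1024, 1024))
```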
To run models with tensor parallelism, a technique that splits a tensor into chunks across multiple NeuronCores, you need to trace with a pre-specified tp_degree, which sets the number of NeuronCores to shard the model across. The parallel_model_trace API then compiles the encoder and transformer component models for PixArt:
compiled_text_encoder = neuronx_distributed.trace.parallel_model_trace(
    get_text_encoder_f,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/text_encoder",
    compiler_args=compiler_flags,
    tp_degree=tp_degree,
)
Please refer to the compile_text_encoder.py file for more details on tracing the encoder with tensor parallelism.
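The choice of `tp_degree` is constrained by the attention layout: the sharding code shown earlier divides the head count by `tp_degree`, so the number of heads should be a multiple of it. The following sketch mirrors that arithmetic in isolation. The helper name `shard_heads` is hypothetical, and the 64 heads with an inner dimension of 4096 are assumed example values, not figures from this post.

```python
def shard_heads(n_heads: int, inner_dim: int, tp_degree: int):
    """Return (heads per core, inner dim per core) for a given tp_degree."""
    if n_heads % tp_degree != 0:
        raise ValueError(f"{n_heads} heads cannot be split evenly across {tp_degree} cores")
    dim_head = inner_dim // n_heads      # size of a single attention head
    local_heads = n_heads // tp_degree   # heads kept on each core
    return local_heads, dim_head * local_heads


# Assumed example values: 64 heads, inner dimension 4096
for degree in (1, 2, 4, 8):
    print(degree, shard_heads(64, 4096, degree))
```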
Finally, you trace the transformer model with tensor parallelism:
compiled_transformer = neuronx_distributed.trace.parallel_model_trace(
    get_transformer_model_f,
    sample_inputs,
    compiler_workdir=f"{compiler_workdir}/transformer",
    compiler_args=compiler_flags,
    tp_degree=tp_degree,
    inline_weights_to_neff=False,
)
Please refer to the compile_transformer_latency_optimized.py file for more details on tracing the transformer with tensor parallelism.
You will use the compile_latency_optimized.sh script to compile all three models as described in this post, so these functions run automatically when you run through the notebook.
Step 3 – Deploy the model on AWS Trainium to generate images
This section walks through the steps to run inference on PixArt-Sigma on AWS Trainium.
Create a diffusers pipeline object
The Hugging Face diffusers library is a library for pre-trained diffusion models, and includes model-specific pipelines that bundle the components (independently-trained models, schedulers, and processors) needed to run a diffusion model. The PixArtSigmaPipeline is specific to the PixArtSigma model, and is instantiated as follows:
pipe: PixArtSigmaPipeline = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    torch_dtype=torch.bfloat16,
    local_files_only=True,
    cache_dir="pixart_sigma_hf_cache_dir_1024")
Please refer to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook for details on pipeline execution.
Load compiled component models into the generation pipeline
After each component model has been compiled, load them into the overall generation pipeline for image generation. The VAE model is loaded with data parallelism, which allows us to parallelize image generation across the batch size or across multiple images per prompt. For more details, refer to the hf_pretrained_pixart_sigma_1k_latency_optimized.ipynb notebook.
vae_decoder_wrapper.model = torch_neuronx.DataParallel(
    torch.jit.load(decoder_model_path), [0, 1, 2, 3], False
)
text_encoder_wrapper.t = neuronx_distributed.trace.parallel_model_load(
    text_encoder_model_path
)
Finally, the loaded models are added to the generation pipeline:
pipe.text_encoder = text_encoder_wrapper
pipe.transformer = transformer_wrapper
pipe.vae.decoder = vae_decoder_wrapper
pipe.vae.post_quant_conv = vae_post_quant_conv_wrapper
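Conceptually, data parallelism for the VAE decoder means each core holds a full copy of the model and receives a slice of the batch. The pure-Python sketch below illustrates that idea only. It is not how torch_neuronx.DataParallel is implemented, the helper name `split_batch` is hypothetical, and the handling of uneven batches is an assumption. The four cores correspond to the `[0, 1, 2, 3]` list in the loading code above, which we read as four NeuronCore IDs.

```python
def split_batch(batch, num_cores):
    """Split batch items into contiguous chunks, one per core, as evenly as possible."""
    base, extra = divmod(len(batch), num_cores)
    chunks, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)  # spread any remainder over the first cores
        chunks.append(batch[start:start + size])
        start += size
    return chunks


latents = [f"latent_{i}" for i in range(8)]  # stand-ins for 8 latents to decode
for core_id, chunk in enumerate(split_batch(latents, num_cores=4)):
    print(f"core {core_id}: {chunk}")
```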
Compose a prompt
Now that the model is ready, you can write a prompt to convey what kind of image you want generated. When creating a prompt, be as specific as possible. Use a positive prompt to convey what you want in your new image, including a subject, action, style, and location, and use a negative prompt to indicate features that should be removed.
For example, you can use the following positive and negative prompts to generate a photo of an astronaut riding a horse on Mars without mountains:
# Subject: astronaut
# Action: riding a horse
# Location: Mars
# Style: photo
prompt = "a photo of an astronaut riding a horse on mars"
negative_prompt = "mountains"
Feel free to edit the prompt in your notebook using prompt engineering to generate an image of your choosing.
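If you want to vary the subject, action, location, and style independently, a small helper can assemble the positive prompt from those parts. This is only a sketch: the `compose_prompt` function and its sentence template are hypothetical and are not part of the notebook.

```python
def compose_prompt(subject: str, action: str, location: str, style: str) -> str:
    """Build a positive prompt from its parts (hypothetical template)."""
    return f"a {style} of {subject} {action} on {location}"


prompt = compose_prompt(
    subject="an astronaut",
    action="riding a horse",
    location="mars",
    style="photo",
)
negative_prompt = "mountains"  # features to keep out of the image
print(prompt)
```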
Generate an image
To generate an image, you pass the prompt to the PixArt model pipeline, and then save the generated image for later reference:
# pipe: variable holding the Pixart generation pipeline with each of
# the compiled component models
images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_images_per_prompt=1,
    height=1024, # number of pixels
    width=1024, # number of pixels
    num_inference_steps=25 # Number of passes through the denoising model
).images
for idx, img in enumerate(images):
    img.save(f"image_{idx}.png")
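Because the notebook is tuned for latency, you may want to measure how long a generation call takes. The sketch below is a generic timing helper with warm-up runs. The helper name `time_calls` and the `fake_generate` stand-in are hypothetical; in practice you would pass a callable that invokes the pipeline.

```python
import time
from statistics import mean


def time_calls(fn, warmup: int = 1, iters: int = 3):
    """Call fn repeatedly and return per-call latencies in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from the measurements
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_generate():
    """Stand-in for a pipeline call so the sketch runs anywhere."""
    time.sleep(0.01)


latencies = time_calls(fake_generate, warmup=1, iters=3)
print(f"mean latency: {mean(latencies):.4f} s over {len(latencies)} runs")
```

To time the real pipeline, replace `fake_generate` with something like `lambda: pipe(prompt=prompt, negative_prompt=negative_prompt)`.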
Cleanup
To avoid incurring additional costs, stop your EC2 instance using either the AWS Management Console or AWS Command Line Interface (AWS CLI).
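If you prefer to script the cleanup, a minimal sketch using the boto3 EC2 client's `stop_instances` call might look like the following. The helper name `stop_instance`, the `FakeEC2Client` stub, and the instance ID are hypothetical placeholders. The stub exists only so the example can be dry-run without AWS credentials.

```python
def stop_instance(instance_id: str, ec2_client=None) -> str:
    """Stop one EC2 instance and return its reported state name."""
    if ec2_client is None:
        # Imported lazily; requires boto3 and configured AWS credentials.
        import boto3
        ec2_client = boto3.client("ec2")
    response = ec2_client.stop_instances(InstanceIds=[instance_id])
    return response["StoppingInstances"][0]["CurrentState"]["Name"]


class FakeEC2Client:
    """Dry-run stand-in that mimics the shape of the stop_instances response."""

    def stop_instances(self, InstanceIds):
        return {"StoppingInstances": [
            {"InstanceId": InstanceIds[0], "CurrentState": {"Name": "stopping"}}
        ]}


# Dry run with a placeholder instance ID; omit ec2_client to call AWS for real.
state = stop_instance("i-0123456789abcdef0", ec2_client=FakeEC2Client())
print(state)
```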
Conclusion
In this post, we walked through how to deploy PixArt-Sigma, a state-of-the-art diffusion transformer, on Trainium instances. This post is the first in a series focused on running diffusion transformers for different generation tasks on Neuron. To learn more about running diffusion transformer models with Neuron, refer to Diffusion Transformers.
About the Authors
Achintya Pinninti is a Solutions Architect at Amazon Web Services. He supports public sector customers, enabling them to achieve their objectives using the cloud. He specializes in building data and machine learning solutions to solve complex problems.
Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies to select and implement the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.
Sadaf Rasool is a Solutions Architect in Annapurna Labs at AWS. Sadaf collaborates with customers to design machine learning solutions that address their critical business challenges. He helps customers train and deploy machine learning models leveraging AWS Trainium or AWS Inferentia chips to accelerate their innovation journey.
John Gray is a Solutions Architect in Annapurna Labs, AWS, based out of Seattle. In this role, John works with customers on their AI and machine learning use cases, architects solutions to cost-effectively solve their business problems, and helps them build a scalable prototype using AWS AI chips.