This post is the ninth in our series on performance profiling and optimization in PyTorch, aimed at emphasizing the essential role of performance analysis and optimization in machine learning development. Throughout the series we have reviewed a wide variety of practical tools and techniques for analyzing and boosting the runtime performance of PyTorch-based AI/ML models. Our goal has been twofold:
- To emphasize the importance of routine analysis and optimization of AI/ML workloads.
- To demonstrate the accessibility of a wide variety of tools and techniques for analyzing and optimizing AI/ML runtime performance. You don’t need to be a CUDA expert to meaningfully improve your model performance and reduce compute costs.
In this post, we’ll explore the use of CUDA streams, a powerful feature of NVIDIA’s CUDA programming model that offers a sophisticated means of overlapping GPU operations and running them concurrently. Although we typically associate our AI/ML model training workload with a single monolithic (a.k.a. “unbreakable”) computation graph G running on the GPU, there are some scenarios where the graph can be decomposed into two distinct subgraphs G1 and G2, where G = G2*G1. In such cases, CUDA streams enable “pipelining” the computation graph, i.e., programming our training step to run G1 (on batch input n+1) in parallel with G2 (on the n-th output of G1). This technique is especially useful when:
- Neither subgraph fully utilizes the GPU when run alone, and
- The two subgraphs are of comparable computational cost (i.e., neither dominates runtime).
We’ll explore two common scenarios where “pipelining” is feasible:
- Partial-model training or finetuning:
It is common to freeze a pre-trained model backbone (e.g., a feature extractor or encoder) and train only a model head (e.g., a decoder). Since the frozen backbone does not depend on gradients from the head, the two can be executed concurrently.
- Offloading data preprocessing to the GPU:
A common strategy for addressing bottlenecks in the input pipeline (also known as GPU starvation) is to move data preprocessing onto the GPU. While prepending the preprocessing operations to the model graph improves performance, additional gains can be achieved by running the preprocessing on a separate CUDA stream in parallel with model execution, assuming the preprocessing is not trivial compared to the model compute.
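Before turning to the experiments, here is a minimal sketch of the basic CUDA-stream pattern in PyTorch that both scenarios rely on. The computations are placeholders; the point is how torch.cuda.Stream, torch.cuda.stream, wait_stream, and record_stream combine to express dependencies between concurrently executing work:

import torch

device = torch.device("cuda")
stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

x = torch.randn(1024, 1024, device=device)

# enqueue work on a side stream
with torch.cuda.stream(stream_a):
    y = x @ x  # placeholder computation

# make stream_b wait for stream_a before consuming its output
with torch.cuda.stream(stream_b):
    stream_b.wait_stream(stream_a)
    z = y.relu()
    # mark y as used on stream_b so its memory is not reused prematurely
    y.record_stream(stream_b)

# block the host until all enqueued work has completed
torch.cuda.synchronize()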
To facilitate our discussion, we’ll define two toy training scripts and measure the training performance under different scenarios. The experiments were run on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) running a PyTorch (2.6) Deep Learning AMI (DLAMI).
Please note: the code snippets we share are for demonstration purposes only; please do not rely on their correctness or optimality. The impact of using CUDA streams will vary depending on model architecture and system configuration. We encourage you to conduct your own profiling and experimentation before integrating CUDA streams (or any other tool or technique we refer to) into your workflow.
Part 1: Pipelining an Encoder-Decoder Model
The first use case we explore involves a CNN-based image segmentation model consisting of a fixed (pre-trained) encoder and a trainable decoder. In this scenario, since the encoder weights are frozen and unaffected by backpropagation, the encoder can be executed independently of the decoder’s training. In this section, we assess the impact of pipelining the training process using CUDA streams.
A Toy Image Segmentation Training Experiment
We begin by defining a simple CNN-based image encoder along with its corresponding decoder.
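The exact architecture is not important for this discussion; the block below is a minimal, illustrative sketch of such an encoder-decoder pair (the layer configuration and the img_size and num_classes values are assumptions, chosen only so that the rest of the script runs end to end):

import torch
import torch.nn as nn

img_size = 256
num_classes = 10

# an illustrative convolutional encoder that downsamples the input image 8x
encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),    # 2x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),   # 4x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 8x downsample
    nn.ReLU(inplace=True),
)

# a matching decoder that upsamples back to per-pixel class logits
decoder = nn.Sequential(
    nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(32, num_classes, kernel_size=2, stride=2),
)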
Next, we construct a synthetic dataset of random images and segmentation maps.
from torch.utils.data import DataLoader
from torchvision.datasets.vision import VisionDataset

# A dataset with random images and per-pixel labels
class FakeDataset(VisionDataset):
    def __init__(self):
        super().__init__(root=None)
        self.size = 1000000

    def __getitem__(self, index):
        # create a random image
        img = torch.randint(0, 256, (3, img_size, img_size),
                            dtype=torch.uint8)
        # create a random label map
        target = torch.randint(0, num_classes, (img_size, img_size))
        return img, target

    def __len__(self):
        return self.size

train_set = FakeDataset()
train_loader = DataLoader(
    dataset=train_set,
    batch_size=8,
    num_workers=8
)
Finally, we define the loss function, optimizer, and training loop. Note that we freeze the encoder’s weights and train only the decoder.
import time

device = torch.device("cuda")
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(decoder.parameters())

# Freeze the encoder weights
encoder.requires_grad_(False)
encoder.eval().to(device)
decoder.train().to(device)

warmup = 10
active_batches = 100
total_iters = warmup + active_batches

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True).float()
    labels = data[1].to(device=device, non_blocking=True)
    optimizer.zero_grad()
    with torch.no_grad():
        features = encoder(inputs)
    output = decoder(features)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()
    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')
Our baseline training script achieves an average throughput of 83 steps per second, with an average GPU utilization of 85%.
Pipelining the Model Execution With CUDA Streams
In the revised version of the training loop shown below, we introduce two CUDA streams: one for executing the encoder and one for training the decoder. In each iteration, we perform two operations concurrently:
- Train the decoder using the image features and labels from batch N.
- Execute the encoder on input batch N+1 to generate its image features.
encoder_stream = torch.cuda.Stream()
decoder_stream = torch.cuda.Stream()

# initialize the features to None
features = None

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device, non_blocking=True).float()
    labels_next = data[1].to(device, non_blocking=True)
    if features is not None:
        with torch.cuda.stream(decoder_stream):
            decoder_stream.wait_stream(encoder_stream)
            optimizer.zero_grad()
            output = decoder(features)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
    with torch.cuda.stream(encoder_stream):
        with torch.no_grad():
            features = encoder(inputs)
        # record that features was produced on encoder_stream
        features.record_stream(encoder_stream)
    labels = labels_next

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()
    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')
This modification yields an average throughput of 91 steps per second, a 9.6% speedup. This is a significant improvement, especially considering that our baseline already exhibited high GPU utilization (85%).
Sensitivity of Pipelining to Workload Properties
The effectiveness of pipelining with CUDA streams is highly dependent on the specifics of the training workload and runtime environment. If the encoder is significantly larger than the decoder (or vice versa), pipelining may offer little benefit or even hinder performance. Conversely, when the GPU is underutilized, pipelining tends to yield more substantial gains.
To illustrate this dependency, we reran the experiment with varying batch sizes. The results are summarized below:

As the batch size increases, the benefit of pipelining diminishes. This is likely because larger batch sizes naturally lead to higher (and more efficient) GPU utilization, leaving less room for improvement through concurrent execution.
Part 2: Offloading Augmentations onto the GPU
In this section, we’ll apply the use of CUDA streams to the acceleration of data augmentation. In previous blog posts (e.g., here and here), we have studied the problem of bottlenecks in the data input pipeline from different perspectives and reviewed several techniques for diagnosing and addressing them. A common cause of these bottlenecks is CPU resource exhaustion, where the CPU cannot keep up with the computational demands of the preprocessing pipeline. The result is GPU starvation, a situation in which the expensive GPU sits idle, waiting for data to arrive.
One effective solution is to offload heavy data preprocessing to the GPU. We’ll demonstrate this technique and take it a step further by executing the augmentations on a dedicated CUDA stream, enabling concurrent execution with the model training.
A Toy Image Classification Training Experiment
We begin by defining a simple CNN-based image classification model:
import torch
import torch.nn as nn

img_size = 256
num_classes = 10

model = nn.Sequential(
    # Start with a 256x256 image
    nn.Conv2d(3, 16, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 32, kernel_size=2, stride=2),      # 2x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=2, stride=2),      # 4x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=2, stride=2),     # 8x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, kernel_size=2, stride=2),    # 16x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 512, kernel_size=2, stride=2),    # 32x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(512, 1024, kernel_size=2, stride=2),   # 64x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(1024, 2048, kernel_size=2, stride=2),  # 128x downsample
    nn.ReLU(inplace=True),
    nn.Conv2d(2048, 4096, kernel_size=2, stride=2),  # 256x downsample
    nn.Flatten(),
    nn.Linear(4096, num_classes)
)
Next, we create a synthetic dataset with an augmentation pipeline deliberately designed to cause a severe performance bottleneck:
import random
from torch.utils.data import DataLoader
import torchvision.transforms.v2 as T
from torchvision.datasets.vision import VisionDataset
import torchvision.transforms.v2.functional as F
import torchvision.ops as ops

# A dataset with random images and labels
class FakeDataset(VisionDataset):
    def __init__(self, transform=None):
        super().__init__(root=None, transform=transform)
        self.size = 1000000

    def __getitem__(self, index):
        # create a random image
        img = torch.randint(0, 256, (3, img_size, img_size),
                            dtype=torch.uint8)
        # create a random label
        target = torch.randint(0, num_classes, (1,))
        if self.transform:
            # apply transformations
            img = self.transform(img)
        return img, target

    def __len__(self):
        return self.size

augmentations = T.Compose([
    T.ToDtype(torch.float32),
    T.RandomCrop(img_size//2),
    T.Resize(img_size),
    T.RandomRotation(degrees=45.0),
    T.GaussianBlur(kernel_size=7),
    T.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
])

train_set = FakeDataset(transform=augmentations)
train_loader = DataLoader(
    dataset=train_set,
    batch_size=32,
    num_workers=8
)
Finally, we define the loss function, optimizer, and training loop:
import time

device = torch.device("cuda")
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
model.train().to(device)

warmup = 10
active_batches = 100
total_iters = warmup + active_batches

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True).squeeze()
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        # sync the GPU and start the timer
        torch.cuda.synchronize()
        t0 = time.perf_counter()
    if idx == total_iters:
        break

# wait for the GPU to finish and then stop the timer
torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')
Running this baseline script results in an average throughput of 20.41 steps per second and a GPU utilization of only 42%. The heavy data augmentations are choking the CPU, leading to GPU starvation. See our earlier post for more information on detecting bottlenecks in the data input pipeline.
Offloading Data Augmentations to the GPU
To address the performance bottleneck in the data input pipeline, we move the augmentations onto the GPU.
The first step is to define custom data transforms that apply random rotations and crops per sample in a batch. This is necessary because the built-in torchvision transforms apply the same augmentation across the entire batch, losing the per-sample randomness we had on the CPU.
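The following short check (illustrative, and reusing the img_size value defined earlier) shows the issue: a built-in v2 transform samples its random parameters once per call, so every image in a batched tensor receives the same crop window:

# illustrative: one set of crop coordinates is sampled per call,
# so all four images in the batch are cropped identically
batch = torch.randint(0, 256, (4, 3, img_size, img_size), dtype=torch.uint8)
cropped = T.RandomCrop(img_size // 2)(batch)
print(cropped.shape)  # torch.Size([4, 3, 128, 128])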
We implement the BatchRandomCrop transform using the roi_align operator.
class BatchRandomCrop(T.Transform):
    def __init__(self, output_size):
        super().__init__()
        self.output_size = output_size

    def transform(self, img: torch.Tensor, params: dict):
        batch_size, _, original_height, original_width = img.shape
        device = img.device
        max_top = original_height - self.output_size
        max_left = original_width - self.output_size
        # Generate random top and left coords for each image in the batch
        random_top = torch.randint(0, max_top + 1, (batch_size,),
                                   device=device, dtype=torch.float32)
        random_left = torch.randint(0, max_left + 1, (batch_size,),
                                    device=device, dtype=torch.float32)
        image_indices = torch.arange(batch_size, device=device,
                                     dtype=torch.float32)
        boxes = torch.stack([
            image_indices,
            random_left,
            random_top,
            random_left + self.output_size,
            random_top + self.output_size
        ], dim=1)
        cropped_batch = ops.roi_align(
            img,
            boxes,
            output_size=self.output_size
        )
        return cropped_batch
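As a quick, illustrative sanity check, the transform maps a batch of full-size images to a batch of crops, with an independent crop window per sample:

# illustrative sanity check: each of the 4 images gets its own random crop window
crop = BatchRandomCrop(img_size // 2)
dummy = torch.rand(4, 3, img_size, img_size)
print(crop(dummy).shape)  # expected: torch.Size([4, 3, 128, 128])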
We implement the BatchRandomRotation transform by iterating over the images in the batch and applying a random rotation to each one. Note that this version is not vectorized; a fully vectorized implementation would require greater effort (a sketch of one possible vectorized alternative appears after the class definition).
class BatchRandomRotation(T.Transform):
    def __init__(self, degrees):
        super().__init__()
        self.degrees = degrees

    def transform(self, inpt: torch.Tensor, params: dict):
        # split the batch into a list of individual images
        images = list(torch.unbind(inpt, dim=0))
        augmented_images = []
        for img_tensor in images:
            # generate a random angle
            angle = random.uniform(-self.degrees, self.degrees)
            # apply the rotation to the single image
            transformed_img = F.rotate(
                img_tensor,
                angle=angle
            )
            augmented_images.append(transformed_img)
        # stack the transformed images
        return torch.stack(augmented_images, dim=0)
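For reference, below is one possible sketch of a vectorized alternative (not used in the experiments that follow), based on torch.nn.functional.affine_grid and grid_sample with a per-sample rotation angle. Note that its interpolation and padding behavior differs slightly from that of F.rotate:

import math
import torch.nn.functional as nnF

def batch_random_rotate(imgs: torch.Tensor, degrees: float) -> torch.Tensor:
    # imgs: (N, C, H, W) float tensor; sample an independent angle per image
    n = imgs.shape[0]
    angles = (torch.rand(n, device=imgs.device) * 2 - 1) * math.radians(degrees)
    cos, sin = torch.cos(angles), torch.sin(angles)
    # per-sample 2x3 affine matrices describing a rotation about the image center
    theta = torch.zeros(n, 2, 3, device=imgs.device, dtype=imgs.dtype)
    theta[:, 0, 0] = cos
    theta[:, 0, 1] = -sin
    theta[:, 1, 0] = sin
    theta[:, 1, 1] = cos
    grid = nnF.affine_grid(theta, list(imgs.shape), align_corners=False)
    return nnF.grid_sample(imgs, grid, align_corners=False)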
We now define a batch_transform that mimics the CPU-based augmentation pipeline defined above:
batch_transform = T.Compose([
T.ToDtype(torch.float32),
BatchRandomCrop(img_size//2),
T.Resize(img_size),
BatchRandomRotation(degrees=45.0),
T.GaussianBlur(kernel_size=7),
T.Normalize(mean=[0, 0, 0], std=[1, 1, 1])
])
Finally, we reset the dataset and update the training loop to apply the new batch_transform:
train_set = FakeDataset(transform=None)
train_loader = DataLoader(
    dataset=train_set,
    batch_size=32,
    num_workers=8
)

for idx, data in enumerate(train_loader):
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True).squeeze()
    # apply augmentations
    inputs = batch_transform(inputs)
    optimizer.zero_grad()
    output = model(inputs)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()

    if idx == warmup:
        torch.cuda.synchronize()
        t0 = time.perf_counter()
    if idx == total_iters:
        break

torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')
This updated training script improves throughput to 35.22 steps per second, a 72.57% speedup over the baseline result.
Pipelining Augmentations With CUDA Streams
Next, we pipeline the augmentation and training steps using two separate CUDA streams: one for running the data transform and one for training the model. In each iteration of the loop we perform two concurrent operations:
- Train the model on the augmented batch N.
- Perform the GPU-based data augmentations on batch N+1.
transform_stream = torch.cuda.Stream()
model_stream = torch.cuda.Stream()

# initialize the transformed value to None
transformed = None

for idx, data in enumerate(train_loader):
    inputs = data[0]
    labels_next = data[1]
    if transformed is not None:
        with torch.cuda.stream(model_stream):
            labels = labels.to(device, non_blocking=True).squeeze()
            model_stream.wait_stream(transform_stream)
            optimizer.zero_grad()
            output = model(transformed)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
    with torch.cuda.stream(transform_stream):
        inputs = inputs.to(device, non_blocking=True)
        transformed = batch_transform(inputs)
        # record that the tensor was produced on transform_stream
        transformed.record_stream(transform_stream)
    labels = labels_next

    if idx == warmup:
        torch.cuda.synchronize()
        t0 = time.perf_counter()
    if idx == total_iters:
        break

torch.cuda.synchronize()
total_time = time.perf_counter() - t0
print(f'throughput: {active_batches / total_time}')
This further improves the throughput to 38.82 steps per second, a 10.2% increase over the serialized solution, and 90.20% faster than the original baseline.
Sensitivity of Pipelining to Workload Properties
As we saw in Part 1, the benefit of pipelining using CUDA streams varies based on the details of the workload. In the table below, we capture the results for several different batch sizes:

As the batch size increases, GPU offloading becomes more effective, significantly boosting performance. At the same time, the gains from pipelining decrease. This is likely due to the fact that larger batch sizes increase GPU efficiency, reducing the opportunities for overlap.
Summary
When it comes to running AI/ML workloads, every millisecond counts. In this post we explored the impact of pipelining an AI/ML training step using CUDA streams in two common scenarios: partial-model training and offloading data augmentations to the GPU. In both cases, the pipelined solution outperformed the serialized implementation, though the extent of the improvement varied considerably with the batch size.
As we’ve emphasized throughout the post, the expected impact of using CUDA streams can vary greatly based on the AI/ML workload. For example, in cases where the GPU is already being efficiently utilized, the overhead of using CUDA streams could actually lead to a degradation in runtime performance. We strongly recommend testing this technique on your own workloads before adopting it.
We hope you will find the technique described in this post useful. For more tips, tricks, and techniques for profiling and optimizing AI/ML workflows, check out the other posts in this series.