Bottlenecks in the data input pipeline of a machine learning model running on a GPU can be particularly frustrating. In most workloads, the host (CPU) and the device (GPU) work in tandem: the CPU is responsible for preparing and feeding data, while the GPU handles the heavy lifting, executing the model, performing backpropagation during training, and updating weights.
In an ideal scenario, we want the GPU, the most expensive component of our AI/ML infrastructure, to be highly utilized. This leads to faster development cycles, lower training costs, and reduced latency in deployment. To achieve this, the GPU must be continuously fed with input data. In particular, we want to prevent the onset of "GPU starvation", a scenario in which our most expensive resource lies idle while it waits for input data. Unfortunately, GPU starvation due to bottlenecks in the data input pipeline is quite common and can dramatically reduce system efficiency. As such, it is important for AI/ML developers to have reliable tools and techniques for diagnosing and addressing such issues.
This post, the eighth in our series on PyTorch Model Performance Analysis and Optimization, introduces a simple caching strategy for identifying bottlenecks in the data input pipeline. As in previous posts, we aim to reinforce two key ideas:
- AI/ML developers must take responsibility for the runtime performance of their models.
- You do not need to be a CUDA or systems expert to implement meaningful performance optimizations.
We will start by outlining some of the common causes of GPU starvation. Then we will introduce our caching-based strategy for identifying and analyzing input pipeline performance issues. We will close by reviewing a set of practical tools, tricks, and techniques (TTTs) for overcoming performance bottlenecks in the data input pipeline.
To facilitate our discussion, we will define a toy PyTorch model and an associated data input pipeline. The code we share is intended for demonstrative purposes; please do not rely on its correctness or optimality. Furthermore, please do not interpret our mention of any tool or technique as an endorsement of its use.
A Toy PyTorch Model
We define a simple PyTorch-based image classification model:
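The block below is a minimal sketch of what such a model might look like, assuming a small convolutional network named Net together with the img_size, input_img_size, and num_classes constants that the rest of the script relies on. The architecture and constant values here are illustrative assumptions rather than a definitive definition.

import torch
import torch.nn as nn

# assumed constants (illustrative values)
input_img_size = [1024, 1024]  # size of the synthetic "raw" images
img_size = 256                 # size of the cropped images fed to the model
num_classes = 10               # number of (random) target classes


class Net(nn.Module):
    # a minimal CNN classifier; the exact architecture is not important here
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))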
We define a synthetic dataset with a number of transformations, deliberately designed to include a severe input pipeline bottleneck. For more details on the dataset definition, please see this post.
import numpy as np
import torch
from PIL import Image
from torchvision.datasets.vision import VisionDataset
import torchvision.transforms as T


class FakeDataset(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.size = 10000

    def __getitem__(self, index):
        # create a random 1024x1024 image
        img = Image.fromarray(np.random.randint(
            low=0,
            high=256,
            size=(input_img_size[0], input_img_size[1], 3),
            dtype=np.uint8
        ))
        # create a random label
        target = np.random.randint(low=0, high=num_classes,
                                   dtype=np.uint8).item()
        # apply transformations
        img = self.transform(img)
        return img, target

    def __len__(self):
        return self.size


class RandomMask(torch.nn.Module):
    def __init__(self, ratio=0.25):
        super().__init__()
        self.ratio = ratio

    def dilate_mask(self, mask):
        # perform 4-neighbor dilation on the mask
        from scipy.signal import convolve2d
        dilated = convolve2d(mask, [[0, 1, 0],
                                    [1, 1, 1],
                                    [0, 1, 0]], mode='same').astype(bool)
        return dilated

    def forward(self, img):
        mask = np.random.uniform(size=(img_size, img_size)) < self.ratio
        dilated_mask = torch.unsqueeze(torch.tensor(self.dilate_mask(mask)), 0)
        dilated_mask = dilated_mask.expand(3, -1, -1)
        img[dilated_mask] = 0.
        return img


class ConvertColor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.A = torch.tensor(
            [[0.299, 0.587, 0.114],
             [-0.16874, -0.33126, 0.5],
             [0.5, -0.41869, -0.08131]]
        )
        self.b = torch.tensor([0., 128., 128.])

    def forward(self, img):
        img = img.to(dtype=torch.get_default_dtype())
        img = torch.matmul(self.A, img.view([3, -1])).view(img.shape)
        img = img + self.b[:, None, None]
        return img


class Scale(object):
    def __call__(self, img):
        return img.to(dtype=torch.get_default_dtype()).div(255)


transform = T.Compose(
    [T.PILToTensor(),
     T.RandomCrop(img_size),
     RandomMask(),
     ConvertColor(),
     Scale()])

train_set = FakeDataset(transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           num_workers=4, pin_memory=True)
Next, we define the model, loss function, optimizer, training step, and training loop, which we wrap with a PyTorch Profiler context manager to capture performance data.
import torch.nn as nn
from statistics import mean, variance
from time import time

device = torch.device("cuda:0")
model = Net().cuda(device)
criterion = nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)


def train_step(model, criterion, optimizer, inputs, labels):
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


model.train()

t0 = time()
times = []
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=10, warmup=2, active=10, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('/tmp/prof'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, data in enumerate(train_loader):
        # copy data to device
        inputs = data[0].to(device=device, non_blocking=True)
        labels = data[1].to(device=device, non_blocking=True)
        # run train step
        train_step(model, criterion, optimizer, inputs, labels)
        prof.step()
        times.append(time() - t0)
        t0 = time()
        if step >= 100:
            break

print(f'average time: {mean(times[1:])}, variance: {variance(times[1:])}')
For our experiments, we use an Amazon EC2 g5.xlarge instance (containing an NVIDIA A10G GPU and 4 vCPUs) running a PyTorch (2.6) Deep Learning AMI (DLAMI). Running our toy script in this environment results in an average throughput of 0.89 steps per second, an underwhelming GPU utilization of 22%, and the following profiling trace:

As discussed in detail in a previous post, the profiling trace shows a clear pattern of GPU starvation, in which the GPU spends most of its time waiting for data from the PyTorch DataLoader. This suggests that there is a performance bottleneck in the data input pipeline, which prevents input batches from being prepared quickly enough to keep the GPU fully occupied. Importantly, input pipeline performance issues can stem from a variety of sources. In the case of our toy example, the cause of the bottleneck is not apparent from the trace captured above.
A brief note for readers/developers who (despite all of our lecturing) remain averse to using PyTorch Profiler: the data-caching-based approach we discuss below presents an alternative way of identifying GPU starvation, so do not despair.
GPU Starvation: Finding the Root Cause
In this section, we briefly review common causes of performance bottlenecks in the data input pipeline.
Recall that in a typical model execution flow:
- Raw data is loaded or streamed from storage (e.g., local RAM or disk, a remote network file system, or a cloud-based object store such as Amazon S3 or Google Cloud Storage).
- It is then preprocessed on the CPU.
- Finally, the processed data is copied to the GPU for inference or training.
Correspondingly, bottlenecks can emerge at each of these stages:
- Slow data retrieval: Several factors can limit how quickly raw data can be retrieved by the CPU, including the choice of storage backend (e.g., cloud storage vs. local SSD), the available network bandwidth, the data format, and more.
- CPU resource exhaustion or misuse: Preprocessing tasks, such as data augmentation, image transformations, or decompression, can be CPU-intensive. When the volume or complexity of these operations exceeds the available CPU capacity, or if the CPU resources are managed inefficiently (e.g., a suboptimal choice for the number of workers), a bottleneck can occur. It is worth noting that CPUs are also responsible for other model-related tasks such as loading GPU kernels, memory management, metric reporting, and more.
- Host-to-device transfer bottlenecks: Once data is processed, it must be transferred to the GPU. This can become a bottleneck if data batches are large relative to the CPU-GPU memory bandwidth, or if the memory copy is performed inefficiently (e.g., individual samples are copied rather than full batches).
The Limitations of Performance Profilers
A common way to identify data pipeline bottlenecks is with a performance profiler. In part 4 of this series, Solving Bottlenecks on the Data Input Pipeline with PyTorch Profiler and TensorBoard, we demonstrated how to do this using PyTorch's built-in profiler. However, given that the input data pipeline runs on the CPU, any Python profiler could be used.
The problem with this approach is that we typically use multiple worker processes for data loading, which makes performance profiling particularly complex. In our previous post, we overcame this by running the data loading and the model execution in a single process (i.e., we set the num_workers argument of the DataLoader constructor to zero). However, this is a highly intrusive configuration change that can have a significant impact on the overall performance of our model.
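For reference, that intrusive alternative amounts to a change along the following lines (a sketch; it keeps all data loading in the main process so that standard Python profilers can see it, at the cost of changing the very behavior we want to measure):

# single-process data loading: easy to profile, but not representative
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           num_workers=0, pin_memory=True)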
The caching-based method we present in this post aims to pinpoint the source of the performance bottleneck in a far less intrusive manner. Specifically, it allows us to measure the model performance without altering the multi-worker data-loading behavior.
Bottleneck Detection via Caching
In this section, we propose a multi-step approach for analyzing the performance of the input data pipeline. We will demonstrate how this method can be applied to our toy training workload to identify the causes of the GPU starvation.
Step 1: Cache a Batch on the Device
We begin by creating a single input batch, copying it to the GPU, and then measuring the runtime performance of the model when iterating over just that batch. This provides a theoretical upper bound on the model's throughput, i.e., the maximum throughput achievable when the GPU is not data-starved.
In the following code block, we modify the training loop of our toy script so that it runs on a single batch that is cached on the GPU:
data = next(iter(train_loader))
inputs = data[0].to(device=device, non_blocking=True)
labels = data[1].to(device=device, non_blocking=True)

t0 = time()
times = []
for step in range(100):
    train_step(model, criterion, optimizer, inputs, labels)
    times.append(time() - t0)
    t0 = time()
The resultant average throughput is 3.45 steps per second, nearly four times higher than our baseline result. Not only does this confirm a significant data input pipeline bottleneck, but it also quantifies its impact.
Bonus Tip: Profile and Optimize with Device-Cached Data
Running a profiler on a single batch cached on the GPU isolates the model execution from the input pipeline. This helps you identify inefficiencies in the model's raw compute path. Ideally, GPU utilization here should approach 100%. In our case, utilization is around 95%, which is acceptable.
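As a sketch, combining the device-cached batch with the profiler configuration from our training script might look like this:

data = next(iter(train_loader))
inputs = data[0].to(device=device, non_blocking=True)
labels = data[1].to(device=device, non_blocking=True)

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=10, warmup=2, active=10, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('/tmp/prof_cached'),
    with_stack=True
) as prof:
    for step in range(100):
        # the same batch is reused on every step, so the GPU is never starved
        train_step(model, criterion, optimizer, inputs, labels)
        prof.step()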
Step 2: Cache a Batch on the Host (CPU)
Next, we cache a single input batch on the host (CPU) instead of the device. Now, each step consists of both a memory copy from CPU to GPU and the model execution.
Since PyTorch's memory pinning allows for asynchronous data transfers, we expect the host-to-device memory copy for batch N+1 to overlap with the model execution on batch N. Consequently, our expectation is that the throughput will be in the same ballpark as in the device-cached case. If not, this would be a clear indication of a bottleneck in the host-to-device memory copy.
The following block of code contains our application of this step to our toy model:
data = next(iter(train_loader))

t0 = time()
times = []
for step in range(100):
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True)
    train_step(model, criterion, optimizer, inputs, labels)
    times.append(time() - t0)
    t0 = time()
The resultant throughput following this change is 3.33 steps per second, a minor drop from the previous result, indicating that the host-to-device transfer is not a bottleneck. We need to keep searching for the source of our performance bottleneck.
Steps 3 and On: Cache at Intermediate Stages of the Data Pipeline
We continue our search by "climbing" up the data input pipeline, caching at various intermediate points to pinpoint the bottleneck. The precise application of this process will vary based on the details of the pipeline. Suppose the pipeline can be broken into K stages. If caching the output of stage N yields significantly lower throughput than caching the output of stage N+1, we can deduce that the processing performed by stage N+1 is what is slowing us down.
Step 3a: Cache a Single Processed Sample
In the code block below, we modify our dataset to cache one fully processed sample. This simulates a pipeline that includes only the data collation and the CPU-to-GPU data copy.
class FakeDataset(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.size = 10000
        self.cache = None

    def __getitem__(self, index):
        if self.cache is None:
            # create a random 1024x1024 image
            img = Image.fromarray(np.random.randint(
                low=0,
                high=256,
                size=(input_img_size[0], input_img_size[1], 3),
                dtype=np.uint8
            ))
            # create a random label
            target = np.random.randint(low=0, high=num_classes,
                                       dtype=np.uint8).item()
            # apply transformations
            img = self.transform(img)
            self.cache = img, target
        return self.cache
The resultant throughput is 3.23 steps per second, still far higher than our baseline of 0.89. We still have not found the culprit.
Step 3b: Cache Raw Data (Before Transformation)
Next, we modify the dataset so as to cache the raw data (e.g., the unprocessed image files). The input data pipeline now includes the data transformations, data collation, and the CPU-to-GPU data copy.
class FakeDataset(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.size = 10000
        self.cache = None

    def __getitem__(self, index):
        if self.cache is None:
            # create a random 1024x1024 image
            img = Image.fromarray(np.random.randint(
                low=0,
                high=256,
                size=(input_img_size[0], input_img_size[1], 3),
                dtype=np.uint8
            ))
            # create a random label
            target = np.random.randint(low=0, high=num_classes,
                                       dtype=np.uint8).item()
            self.cache = img, target
        # apply transformations
        img = self.transform(self.cache[0])
        return img, self.cache[1]
This time, the throughput drops sharply, all the way down to 1.72 steps per second. We have found our first culprit: the data transformation function.
Interim Results
Here is a summary of the experiments thus far:
- Baseline (no caching): 0.89 steps per second
- Batch cached on the device (GPU): 3.45 steps per second
- Batch cached on the host (CPU): 3.33 steps per second
- Single processed sample cached: 3.23 steps per second
- Raw data cached (transformations applied live): 1.72 steps per second
The results point to a significant slowdown introduced by the data transformation step. The gap between the raw data caching result and the baseline also suggests that raw data loading may be another culprit. Let's begin with the data processing bottleneck.
Optimizing the Data Transformation
We now proceed with our newfound discovery of a performance bottleneck in the data processing function. The next logical step would be to break the transform function into individual components and apply our caching strategy to each one in order to derive more insight into the precise sources of our GPU starvation (a sketch of this idea appears below). For the sake of brevity, we will skip ahead and apply the data processing optimizations discussed in our previous post, Solving Bottlenecks on the Data Input Pipeline with PyTorch Profiler and TensorBoard. Please see there for details.
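As an illustration of that per-stage approach, the following sketch (our own, not taken from the referenced post) caches the output of the first cache_after transforms and applies the remaining ones live; moving the split point one transform at a time reveals which stage introduces the slowdown:

all_transforms = [T.PILToTensor(),
                  T.RandomCrop(img_size),
                  RandomMask(),
                  ConvertColor(),
                  Scale()]
cache_after = 2  # cache the output of the first two transforms
cached_transform = T.Compose(all_transforms[:cache_after])
live_transform = T.Compose(all_transforms[cache_after:])


class PartiallyCachedDataset(VisionDataset):
    def __init__(self):
        super().__init__(root=None)
        self.size = 10000
        self.cache = None

    def __getitem__(self, index):
        if self.cache is None:
            img = Image.fromarray(np.random.randint(
                low=0, high=256,
                size=(input_img_size[0], input_img_size[1], 3),
                dtype=np.uint8))
            target = np.random.randint(low=0, high=num_classes,
                                       dtype=np.uint8).item()
            # apply and cache the first cache_after transforms
            self.cache = cached_transform(img), target
        # apply the remaining transforms live
        return live_transform(self.cache[0]), self.cache[1]

    def __len__(self):
        return self.size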
Following the data transformation optimizations, the throughput of the cached raw data experiment shoots up to 3.23 steps per second. We have eliminated the bottleneck in the data processing function.
However, our new baseline throughput (without caching) becomes 1.28 steps per second, indicating that a bottleneck remains in the raw data loading. This is similar to the end result we reached in our previous post.

Optimizing Raw Data Loading
To resolve the remaining bottleneck, we simulate the optimization demonstrated in part 5 of this series, Optimize Your DL Data-Input Pipeline with a Custom PyTorch Operator. We do this by reducing the size of our initial random image from 1024×1024 to 256×256. Following this change, the end-to-end (un-cached) training throughput increases to 3.23 steps per second. Problem solved!
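Assuming the synthetic image size is controlled by the input_img_size constant introduced earlier, the simulated optimization amounts to a one-line change:

# simulate faster raw data loading by shrinking the synthetic "raw" images
input_img_size = [256, 256]  # previously [1024, 1024]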
Important Caveats
We conclude this discussion with a few important notes and caveats.
- A drop in throughput resulting from the inclusion of a certain data-processing step in the data pipeline does not necessarily mean that it is that specific step that requires optimization. It is entirely possible that another step has already pushed CPU utilization near its limit, and the new step simply tipped it over.
- If your input data varies in size, the throughput measured on a single cached data sample or batch of samples may not reflect real-world performance.
- The same caveat applies if the AI model includes dynamic, data-dependent features, e.g., if portions of the model graph depend on the input data.
Tips, Tricks, and Techniques for Addressing Bottlenecks in the Data Input Pipeline
We conclude this post with a list of tips, tricks, and techniques for optimizing the data input pipeline of PyTorch-based AI models. This list is by no means exhaustive; numerous additional optimizations exist depending on your specific use case and infrastructure. We divide the optimizations into three categories:
- Optimizing Raw Data Access/Retrieval
- Optimizing Data Processing
- Optimizing Host-to-Device Data Transfer
Optimizing Raw Data Access/Retrieval
Efficient data loading begins with fast and reliable access to raw data. The following tips can help:
- Choose an instance type with sufficient network ingress bandwidth.
- Use a fast and cost-effective data storage solution. Local SSDs are fast but expensive. Cloud-based solutions like S3 offer scalability, but may introduce latency.
- Maximize storage network egress. Consider partitioning datasets in S3 or tuning parallel downloads to reduce throttling.
- Consider raw data compression. Compressing files can reduce transfer time, but watch out for the increased CPU cost of decompression.
- Group small samples into larger files. This can reduce the overhead associated with opening and closing many files.
- Use optimized data transfer tools. For example, s5cmd can significantly outperform the AWS CLI for bulk S3 downloads.
- Tune data retrieval parameters. Adjusting chunk size or concurrency settings can greatly impact read performance (see the sketch following this list).
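As an example of the last tip, the sketch below (with hypothetical bucket and key names) shows how chunk size and concurrency can be tuned when downloading from S3 with boto3:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# tune the multipart chunk size and number of concurrent download threads
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,   # switch to multipart above 8 MB
    multipart_chunksize=16 * 1024 * 1024,  # 16 MB parts
    max_concurrency=16                     # parallel download threads
)

# hypothetical bucket/key names, for illustration only
s3.download_file('my-dataset-bucket', 'train/shard-000.tar',
                 '/tmp/shard-000.tar', Config=config)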
Addressing Data Processing Bottlenecks
- Tune the number of data loading workers and the prefetch factor (see the sketch following this list).
- Whenever possible, offload data processing to the data preparation phase.
- Choose an instance type with an optimal CPU-to-GPU compute ratio.
- Optimize the order of transformations. For example, applying a crop before blurring will be faster than blurring the full-sized image and only then cropping.
- Leverage Python acceleration libraries. For example, Numba and JAX can speed up pure Python operations via JIT compilation.
- Create custom PyTorch CPU operators where appropriate (e.g., see here).
- Consider adding auxiliary CPUs (data servers) (e.g., see here).
- Move GPU-friendly transforms to the GPU graph. Some transforms (e.g., normalization) can be performed post-loading on the GPU for better overlap.
- Tune OS-level thread and memory configurations.
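As an illustration of the first tip, the DataLoader settings below sketch the kind of knobs worth experimenting with; the specific values are assumptions and should be tuned to your own CPU budget:

train_loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=256,
    num_workers=4,            # try values up to the number of available vCPUs
    prefetch_factor=4,        # batches prefetched per worker (default is 2)
    persistent_workers=True,  # avoid re-forking workers every epoch
    pin_memory=True,
)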
Optimizing the Host-to-Device Data Copy
- Use memory pinning and non-blocking data copies to prefetch data directly onto the GPU. Also see the dedicated CudaDataPrefetcher offered by TorchTNT.
- Postpone int8-to-float32 datatype conversions to the GPU to reduce the memory copy payload by a factor of four (see the sketch following this list).
- If your model uses lower-precision floats (e.g., fp16/bfloat16), cast the floats on the CPU to reduce the payload by half.
- Postpone the unpacking of one-hot vectors to the GPU, i.e., keep them as label ids until the last possible moment.
- If you have many binary values, consider using bitmasks to compress the payload. For example, if you have 8 binary maps, consider compressing them into a single uint8.
- If your input data is sparse, consider using sparse data representations.
- Avoid unnecessary padding. While zero-padding is a popular approach for dealing with variable-sized input samples, it can significantly increase the size of the memory copy. Consider alternative options (e.g., see here).
- Make sure you are not copying data that you do not actually need on the GPU!
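As an illustration of the second tip applied to our toy pipeline: rather than converting to float and scaling on the CPU (as the Scale transform does), one could keep the batch in uint8 through the host-to-device copy and convert on the GPU. The sketch below assumes the transform pipeline has been modified to emit uint8 tensors:

for step, data in enumerate(train_loader):
    # copy the compact uint8 batch to the GPU
    inputs = data[0].to(device=device, non_blocking=True)
    labels = data[1].to(device=device, non_blocking=True)
    # convert to float and scale on the GPU, after the (4x smaller) copy
    inputs = inputs.to(torch.get_default_dtype()).div(255)
    train_step(model, criterion, optimizer, inputs, labels)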
Summary
While GPUs are considered essential for modern-day AI/ML development, they come at a steep price. Once you have decided to make the required investment in their acquisition, you will want to make sure they are being used as much as possible. The last thing you want is for your GPU to sit idle, waiting for input data due to a preventable bottleneck elsewhere in the pipeline.
Unfortunately, such inefficiencies are all too common. In this post, we introduced a simple approach for diagnosing these issues by iteratively caching data at different stages of the input pipeline. By isolating the runtime impact of each pipeline component, this method helps identify specific bottlenecks, whether in raw data loading, preprocessing, or host-to-device transfer.
Of course, the exact implementation will vary across projects and pipelines, but we hope this strategy provides a useful framework for diagnosing and resolving performance issues in your own AI/ML workflows.