AI on Multiple GPUs: Understanding the Host and Device Paradigm

February 13, 2026


This post is part of a series about distributed AI across multiple GPUs:

  • Part 1: Understanding the Host and Device Paradigm (this article)
  • Part 2: Point-to-Point and Collective Operations (coming soon)
  • Part 3: How GPUs Communicate (coming soon)
  • Part 4: Gradient Accumulation & Distributed Data Parallelism (DDP) (coming soon)
  • Part 5: ZeRO (coming soon)
  • Part 6: Tensor Parallelism (coming soon)

Introduction

This guide explains the foundational concepts of how a CPU and a discrete graphics card (GPU) work together. It is a high-level introduction designed to help you build a mental model of the host-device paradigm. We'll focus specifically on NVIDIA GPUs, which are the most commonly used for AI workloads.

For integrated GPUs, such as those found in Apple Silicon chips, the architecture is slightly different, and it won't be covered in this post.

The Big Picture: The Host and the Device

The most important concept to grasp is the relationship between the Host and the Device.

  • The Host: This is your CPU. It runs the operating system and executes your Python script line by line. The Host is the commander; it is responsible for the overall logic and tells the Device what to do.
  • The Device: This is your GPU. It is a powerful but specialized coprocessor designed for massively parallel computations. The Device is the accelerator; it doesn't do anything until the Host gives it a job.

Your program always starts on the CPU. When you want the GPU to perform a task, like multiplying two large matrices, the CPU sends the instructions and the data over to the GPU.

The CPU-GPU Interaction

The Host talks to the Device through a queuing system.

  1. CPU Initiates Commands: Your script, running on the CPU, encounters a line of code meant for the GPU (e.g., tensor.to('cuda')).
  2. Commands Are Queued: The CPU doesn't wait. It simply places this command onto a special to-do list for the GPU called a CUDA Stream; more on this in the next section.
  3. Asynchronous Execution: The CPU doesn't wait for the actual operation to be completed by the GPU; the host moves on to the next line of your script. This is called asynchronous execution, and it is key to achieving high performance. While the GPU is busy crunching numbers, the CPU can work on other tasks, like preparing the next batch of data (see the sketch after this list).
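You can observe asynchronous execution directly. The sketch below (a minimal illustration, assuming a CUDA-capable GPU is available) times how quickly a matrix multiplication call returns to the CPU versus how long the GPU actually takes to produce the result:

import time
import torch

device = torch.device('cuda')
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

_ = a @ b                     # warm-up so one-time setup costs don't skew the timing
torch.cuda.synchronize()

start = time.perf_counter()
c = a @ b                     # enqueued on the GPU; this call returns almost immediately
launch_time = time.perf_counter() - start

torch.cuda.synchronize()      # block the CPU until the GPU has actually finished
total_time = time.perf_counter() - start

print(f"launch returned after {launch_time * 1e3:.2f} ms")
print(f"result ready after    {total_time * 1e3:.2f} ms")

The launch typically returns in a fraction of a millisecond, while the full result only becomes available after the GPU finishes the computation.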

CUDA Streams

A CUDA Stream is an ordered queue of GPU operations. Operations submitted to a single stream execute in order, one after another. However, operations across different streams can execute concurrently; the GPU can juggle multiple independent workloads at the same time.

By default, every PyTorch GPU operation is enqueued on the current active stream (usually the default stream, which is created automatically). This is simple and predictable: every operation waits for the previous one to finish before starting. For most code, you never notice this. But it leaves performance on the table when you have work that could overlap.
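You can check which stream PyTorch is currently using. A small sketch (assuming a CUDA GPU is available):

import torch

# By default, newly enqueued operations go to the automatically created default stream.
print(torch.cuda.current_stream() == torch.cuda.default_stream())   # True

side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    # Inside the context manager, the current stream is the one we created.
    print(torch.cuda.current_stream() == side_stream)               # True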

Multiple Streams: Concurrency

The classic use case for multiple streams is overlapping computation with data transfers. While the GPU processes batch N, you can concurrently copy batch N+1 from CPU RAM to GPU VRAM:

Stream 0 (compute): [process batch 0]────[process batch 1]───
Stream 1 (data):    ────[copy batch 1]────[copy batch 2]───

This pipeline is possible because compute and data transfers happen on separate hardware units inside the GPU, enabling true parallelism. In PyTorch, you create streams and schedule work onto them with context managers:

import torch

compute_stream = torch.cuda.Stream()
transfer_stream = torch.cuda.Stream()

with torch.cuda.stream(transfer_stream):
    # Enqueue the copy on transfer_stream
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)

with torch.cuda.stream(compute_stream):
    # This runs concurrently with the transfer above
    output = model(current_batch)

Note the non_blocking=True flag on .to(). Without it, the transfer would still block the CPU thread even when you intend it to run asynchronously.
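One related caveat: a host-to-device copy can only truly overlap with other GPU work when the source CPU tensor lives in pinned (page-locked) memory. A minimal sketch of pinning a tensor before an asynchronous transfer (the tensor shape here is just an example):

import torch

transfer_stream = torch.cuda.Stream()

# Pin the CPU tensor so the GPU's copy engine can read it directly.
next_batch_cpu = torch.randn(256, 3, 224, 224).pin_memory()

with torch.cuda.stream(transfer_stream):
    # With pinned source memory, this copy can run asynchronously on transfer_stream.
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)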

Synchronization Between Streams

Since streams are independent, you need to explicitly signal when one depends on another. The blunt tool is:

torch.cuda.synchronize()  # waits for ALL streams on the device to finish

A more surgical approach uses CUDA Events. An event marks a specific point in a stream, and another stream can wait on it without halting the CPU thread:

event = torch.cuda.Event()

with torch.cuda.stream(transfer_stream):
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)
    event.record()  # mark: the transfer is done

with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(event)  # don't start until the transfer completes
    output = model(next_batch)

This is more efficient than stream.synchronize() because it only stalls the dependent stream on the GPU side; the CPU thread stays free to keep queuing work.

For day-to-day PyTorch training code you won't need to manage streams manually. But features like DataLoader(pin_memory=True) and prefetching rely heavily on this mechanism under the hood. Understanding streams helps you recognize why these settings exist and gives you the tools to diagnose subtle performance bottlenecks when they appear.
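To see how this surfaces in everyday training code, here is a hedged sketch of the common pattern: pinned host memory from the DataLoader plus non_blocking copies inside the loop. The dataset, model, and optimizer are placeholders chosen only to make the pattern runnable:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model, just to make the pattern concrete.
dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
model = torch.nn.Linear(32, 1).to('cuda')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

# pin_memory=True makes the DataLoader hand out page-locked CPU tensors,
# so the non_blocking copies below can overlap with GPU compute.
loader = DataLoader(dataset, batch_size=64, pin_memory=True, num_workers=2)

for x_cpu, y_cpu in loader:
    x = x_cpu.to('cuda', non_blocking=True)
    y = y_cpu.to('cuda', non_blocking=True)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()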

PyTorch Tensors

PyTorch is a powerful framework that abstracts away many details, but this abstraction can sometimes obscure what is happening under the hood.

When you create a PyTorch tensor, it has two parts: metadata (like its shape and data type) and the actual numerical data. So when you run something like t = torch.randn(100, 100, device=device), the tensor's metadata is stored in the host's RAM, while its data is stored in the GPU's VRAM.

This distinction is important. When you run print(t.shape), the CPU can access this information immediately because the metadata is already in its own RAM. But what happens if you run print(t), which requires the actual data residing in VRAM?
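A short sketch makes the split concrete (assuming a CUDA GPU is available):

import torch

t = torch.randn(100, 100, device='cuda')

# Metadata lives in host RAM: these calls never need to touch the data in VRAM.
print(t.shape)    # torch.Size([100, 100])
print(t.dtype)    # torch.float32
print(t.device)   # cuda:0

# The values themselves live in VRAM; printing them requires a device-to-host copy,
# which is exactly the situation the next section discusses.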

Host-Device Synchronization

Accessing GPU data from the CPU can trigger a Host-Device Synchronization, a common performance bottleneck. This occurs whenever the CPU needs a result from the GPU that isn't yet available in the CPU's RAM.

For example, consider the line print(gpu_tensor), which prints a tensor that is still being computed by the GPU. The CPU cannot print the tensor's values until the GPU has finished all the calculations needed to obtain the final result. When the script reaches this line, the CPU is forced to block, i.e. it stops and waits for the GPU to finish. Only after the GPU completes its work and copies the data from its VRAM to the CPU's RAM can the CPU proceed.
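A few common calls that force this kind of wait are sketched below (assuming a CUDA GPU; the shapes are arbitrary):

import torch

x = torch.randn(4096, 4096, device='cuda')
y = x @ x                      # enqueued asynchronously; the CPU moves on

value = y.sum().item()         # .item() needs a real Python number on the CPU -> sync
y_cpu = y.cpu()                # explicit device-to-host copy -> sync
print(y[0, 0])                 # printing values also has to wait for the GPU result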

As another example, what is the difference between torch.randn(100, 100).to(device) and torch.randn(100, 100, device=device)? The first method is less efficient because it creates the data on the CPU and then transfers it to the GPU. The second method is more efficient because it creates the tensor directly on the GPU; the CPU only sends the creation command.
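A rough way to see the difference for yourself is sketched below (timings will vary by hardware; this is an illustration, not a benchmark):

import time
import torch

device = torch.device('cuda')

def timed(fn):
    torch.cuda.synchronize()
    start = time.perf_counter()
    fn()
    torch.cuda.synchronize()   # wait for the GPU so the timing is honest
    return time.perf_counter() - start

# Created on the CPU, then copied over the PCIe bus to the GPU.
t_copy = timed(lambda: torch.randn(4096, 4096).to(device))

# Created directly in VRAM; only the creation command crosses the bus.
t_direct = timed(lambda: torch.randn(4096, 4096, device=device))

print(f"CPU then copy: {t_copy * 1e3:.1f} ms, direct on GPU: {t_direct * 1e3:.1f} ms")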

These synchronization points can severely impact performance. Effective GPU programming involves minimizing them so that both the Host and Device stay as busy as possible. After all, you want your GPUs to go brrrrr.

Image by author: generated with ChatGPT

Scaling Up: Distributed Computing and Ranks

Training large models, such as Large Language Models (LLMs), often requires more compute power than a single GPU can offer. Coordinating work across multiple GPUs brings you into the world of distributed computing.

In this context, a new and important concept emerges: the Rank.

  • Each rank is a CPU process that gets assigned a single device (GPU) and a unique ID. If you launch a training script across two GPUs, you'll create two processes: one with rank=0 and another with rank=1.

This means you are launching two separate instances of your Python script. On a single machine with multiple GPUs (a single node), these processes run on the same CPU but remain independent, without sharing memory or state. Rank 0 commands its assigned GPU (cuda:0), while Rank 1 commands another GPU (cuda:1). Although both ranks run the same code, you can use a variable holding the rank ID to assign different tasks to each GPU, like having each one process a different portion of the data (we'll see examples of this in the next blog post of this series).
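As a preview, here is a hedged sketch of how each process typically discovers its rank and claims its GPU when launched with torchrun. The environment variables are set by the launcher; the script is an illustration of the idea, not a full training setup:

# Launch with e.g.: torchrun --nproc_per_node=2 this_script.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])

    # Each rank binds itself to one GPU: rank 0 -> cuda:0, rank 1 -> cuda:1, ...
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Same code, different behaviour per rank, e.g. loading a different data shard.
    print(f"Rank {rank} is using cuda:{local_rank}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()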

Conclusion

Congratulations on reading all the way to the end! In this post, you learned about:

  • The Host/Device relationship
  • Asynchronous execution
  • CUDA Streams and how they enable concurrent GPU work
  • Host-Device synchronization

In the next blog post, we'll dive deeper into Point-to-Point and Collective Operations, which enable multiple GPUs to coordinate complex workflows such as distributed neural network training.
