with calculating an application’s performance is that the real-world performance and theoretical performance can differ. With a growing ecosystem of products that have high performance needs, such as High Performance Computing (HPC), gaming, or, in the current landscape, Large Language Models (LLMs), it is essential to calculate the performance of an application accurately.
Simply measuring theoretical GFLOP/s (giga floating-point operations per second) is not enough, as applications rarely reach these maximums in the real world. This is where the Roofline Model comes in, offering a clear visual method to estimate an application’s performance and highlighting the critical role of hardware-specific optimizations.
Why simple metrics aren’t enough
When we think about measuring performance, a few metrics come to mind:
- Execution time: This tells you how long a task took but offers no insight into why.
- Cycles per Instruction (CPI): This only measures the processor’s compute performance.
- Serial vs. parallel execution: Measures compute performance while overlooking any hardware optimizations.
- Floating-Point Operations Per Second (FLOP/s): This only represents a theoretical maximum, which is often not achievable in a real-world scenario.
While these are good metrics, they often don’t provide enough information. For instance, FLOP/s is a theoretical limit that is rarely achieved, so using it as the only metric is not enough: it ignores a common performance limiter, data movement.
Roofline Modeling
The Roofline Model is a powerful tool that visually maps an application’s performance against the capabilities of a specific hardware architecture, such as a CPU or GPU. The model gets its name from the shape of the graph it produces, which features a “roof” composed of a slanted line and a flat, horizontal line. This shape represents the ultimate performance limits imposed by the hardware.
In this modeling technique, two parameters define the achievable limits of the hardware:
- Data movement: The time it takes to move data, calculated as the total data size divided by the system’s peak memory bandwidth.
- Computation: The time required for calculations, determined by dividing the total number of floating-point operations by the system’s peak compute performance (commonly measured in GFLOP/s).
The total execution time of an application is determined by the greater of these two values: max(data_movement, computation).
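As a minimal sketch of this estimate, the two times can be computed and compared directly. The function name and all workload and hardware numbers below are hypothetical, chosen only for illustration.

```python
def estimated_time_s(total_bytes, total_flops, peak_bw_bytes_per_s, peak_flops_per_s):
    """Naive roofline time estimate: the slower of data movement and computation."""
    data_movement = total_bytes / peak_bw_bytes_per_s  # seconds spent moving data
    computation = total_flops / peak_flops_per_s       # seconds spent computing
    return max(data_movement, computation)


# Hypothetical workload and hardware figures (not measurements).
t = estimated_time_s(
    total_bytes=4e9,           # 4 GB moved to and from memory
    total_flops=2e9,           # 2 GFLOP of work
    peak_bw_bytes_per_s=1e12,  # 1 TB/s peak memory bandwidth
    peak_flops_per_s=1e13,     # 10 TFLOP/s peak compute
)
print(f"estimated time: {t:.4f} s")
```

With these numbers, data movement takes longer than computation, so it sets the estimate.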
Even when the hardware has high compute performance, data movement can often become the bottleneck. Roofline Modeling introduces the concept of Arithmetic Intensity (AI). AI is the ratio of floating-point operations performed for every byte of data moved from memory.
- An algorithm with high Arithmetic Intensity is considered compute-hungry. Its performance is limited by how quickly calculations can be performed.
- An algorithm with low Arithmetic Intensity is considered data-hungry. Its performance is limited by how quickly data can be moved.
Understanding the graph
Figure: Roofline graph (image licensed under Creative Commons Attribution-Share Alike 4.0 International)
A Roofline graph plots the Attainable FLOP/s (y-axis) against the Arithmetic Intensity (x-axis). The “roof” itself shows the hardware’s limitations. The slanted part of the roof represents the peak data bandwidth (in GB/s), while the flat part represents the peak computational performance (in GFLOP/s). Note that everything in the image is on a logarithmic scale.
- Points below the roof: Indicate suboptimal performance and room for improvement.
- Points hitting the slanted line: Data-hungry application. Its performance is limited by data bandwidth.
- Points hitting the flat line: Compute-hungry application. It is using the full computational power of the processor.
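The roof can also be written as a small function: at a given Arithmetic Intensity, the attainable FLOP/s is the lower of the flat ceiling and the slanted bandwidth line. The sketch below assumes hypothetical peak values, and the function names are ours, not part of any library.

```python
def attainable_flops(ai, peak_flops_per_s, peak_bw_bytes_per_s):
    """Height of the roof at arithmetic intensity `ai` (FLOP/byte)."""
    return min(peak_flops_per_s, ai * peak_bw_bytes_per_s)


def bound_type(ai, peak_flops_per_s, peak_bw_bytes_per_s):
    """Classify a kernel by which side of the ridge point it falls on."""
    ridge = peak_flops_per_s / peak_bw_bytes_per_s  # where the two lines meet
    return "compute-bound" if ai >= ridge else "memory-bound"


# Hypothetical hardware: 10 TFLOP/s and 1 TB/s, so the ridge is at 10 FLOP/byte.
PEAK_FLOPS = 1e13
PEAK_BW = 1e12

for ai in (0.5, 10.0, 64.0):
    roof = attainable_flops(ai, PEAK_FLOPS, PEAK_BW)
    print(ai, roof, bound_type(ai, PEAK_FLOPS, PEAK_BW))
```

A measured point is then compared against `attainable_flops` at the same intensity; the gap between the two is the room for improvement.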
Why is Roofline Modeling important?
Roofline Modeling provides a visual, intuitive way to understand application performance, showing key characteristics like Operational Intensity, GPU capabilities, and attainable FLOP/s. This kind of modeling helps the programmer make targeted optimizations to their application for the hardware on which better results can be obtained.
- Bottleneck analysis: Having a visual aid makes it easy for the developer to identify where the bottleneck is: memory or compute. If the application is memory intensive, a developer can focus on improving data locality with techniques like caching or loop tiling. If it’s compute intensive, the focus can shift to enabling more parallel computations or leveraging compiler optimizations.
- Hardware and software design: Software engineers shouldn’t fear the underlying hardware. Instead, the hardware design should be embraced and optimized for. Software engineers can use insights from Roofline Modeling to embrace and optimize for the specific architecture they are using.
Roofline Modeling in Action
To perform Roofline Modeling, we need to profile the application to understand its performance. From profiling, we can get metrics such as floating-point operations (FLOPs) and memory bandwidth usage, both of which are required for Roofline Modeling. This article explores two of these tools: NVIDIA’s ncu, the Nsight Compute CLI for GPU analysis, and PyTorch’s profiler, specifically for applications using PyTorch.
For detailed CUDA kernel optimization and precise FLOP/byte calculations, ncu provides direct GPU hardware counter information. In contrast, torch.profiler.profile offers a higher-level perspective within PyTorch, helping in the understanding of operator-level performance, tensor memory usage, and the overall application behavior encompassing both CPU and GPU activities.
Profiling with ncu
ncu is the command-line interface used for profiling CUDA kernels [2]. It can display results directly in the terminal or save them to a log file for later analysis. To build a Roofline model, we need to capture the specific metrics that will allow us to calculate Arithmetic Intensity.
We’ll use the PyTorch ImageNet repository [3] as our example. It’s a good choice because it’s easy to understand, well-documented by PyTorch, and works with their profiler, so we can really dig into the performance.
Step 1: Run the ncu command to collect metrics
The first step is to run the application through ncu to collect the required hardware-level data. The command looks like this:
ncu --log-file <log_file> --metrics <metrics> --target-processes all python3 <application>
- log-file: The log file in which we want to store the results.
- metrics: This is the most important parameter and specifies the metrics that we want to capture. For calculating Arithmetic Intensity, we consider:
  - dram__sectors_write.sum: sum of DRAM sectors written
  - dram__sectors_read.sum: sum of DRAM sectors read
  - smsp__sass_thread_inst_executed_op_fadd_pred_on.sum: sum of floating-point additions
  - smsp__sass_thread_inst_executed_op_fmul_pred_on.sum: sum of floating-point multiplications
  - smsp__sass_thread_inst_executed_op_ffma_pred_on.sum: sum of floating-point fused multiply-add operations
- target-processes: The all flag ensures that we profile the entire application.
Our ncu command becomes:
ncu --log-file logs_example \
    --metrics dram__sectors_write.sum,dram__sectors_read.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum \
    --target-processes all \
    python3 main.py /imagenet --arch resnet50 --epochs 1 --batch-size 10 --print-freq 10 --seed 42
Step 2: Calculating FLOPs from the metrics
Once the profiler has run, we can aggregate the collected metrics to calculate the total floating-point operations. The formula is:
[FLOPs = 2 * FMA_count + FADD_count + FMUL_count]
- FLOPs: Count of floating-point operations.
- FMA_count: Fused Multiply-Add (FMA) operations typically count as 2 FLOPs (one multiplication and one addition). This is represented by the smsp__sass_thread_inst_executed_op_ffma_pred_on.sum metric.
- FADD_count: This is represented by the smsp__sass_thread_inst_executed_op_fadd_pred_on.sum metric.
- FMUL_count: This is represented by the smsp__sass_thread_inst_executed_op_fmul_pred_on.sum metric.
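The aggregation above can be sketched in a few lines of Python. This assumes the per-metric totals have already been extracted from the ncu log into a dictionary; the counts below are made up for illustration. Note that these three counters cover single-precision instructions, so work done in other precisions would need additional metrics.

```python
FFMA = "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum"
FADD = "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum"
FMUL = "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum"


def total_flops(metrics):
    """FLOPs = 2 * FMA_count + FADD_count + FMUL_count."""
    # Each fused multiply-add counts as two floating-point operations.
    return 2 * metrics[FFMA] + metrics[FADD] + metrics[FMUL]


# Hypothetical totals summed over all profiled kernels.
example_metrics = {FFMA: 1_000_000, FADD: 250_000, FMUL: 50_000}
flops = total_flops(example_metrics)
print(flops)
```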
Step 3: Calculate the bytes transferred
Next, we calculate the total data transferred to and from DRAM. The ncu metrics provide the number of DRAM sectors read and written. Assuming a typical sector size of 32 bytes for modern GPUs:
[Total_DRAM_bytes = (dram__sectors_read.sum + dram__sectors_write.sum) * 32]
Step 4: Calculate the Arithmetic Intensity
With FLOPs and total bytes, we can now calculate the Arithmetic Intensity:
[AI = FLOPs / Total_DRAM_bytes]
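Steps 3 and 4 can be sketched together as follows. The sector counts and FLOP total are hypothetical, and the 32-byte sector size is the same assumption made in the text.

```python
SECTOR_BYTES = 32  # assumed DRAM sector size, as in the text


def dram_bytes(sectors_read, sectors_written, sector_bytes=SECTOR_BYTES):
    """Total bytes moved to and from DRAM."""
    return (sectors_read + sectors_written) * sector_bytes


def arithmetic_intensity(flops, total_bytes):
    """FLOPs performed per byte of DRAM traffic."""
    if total_bytes == 0:
        raise ValueError("no DRAM traffic recorded; AI is undefined")
    return flops / total_bytes


# Hypothetical counter values (not measurements).
total_bytes = dram_bytes(sectors_read=40_000, sectors_written=10_000)
ai = arithmetic_intensity(flops=2_300_000, total_bytes=total_bytes)
print(total_bytes, ai)
```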
Step 5: Calculate execution time
To find the application’s performance in FLOP/s, we also need the execution time. For this, we can use NVIDIA Nsight Systems (nsys), a system-wide profiler that can accurately measure the runtime of application segments. We run our application again, this time with nsys, to generate a time-based report. From this report, we can extract the total GPU running time.
nsys profile -f true -o <output_file> python3 <application>
Our nsys command becomes:
nsys profile -f true -o time.qdrep python3 main.py /imagenet \
    --arch resnet50 --epochs 1 --batch-size 10 --print-freq 10 \
    --seed 42
After running this command, we can get the GPU_RUNNING_TIME.
Step 6: Calculate the application performance
Finally, we calculate the achieved performance in FLOP/s by dividing the total FLOPs by the execution time:
[FLOP/s = FLOPs / GPU_RUNNING_TIME]
This value gives us the “attainable FLOP/s” that we can plot on our Roofline graph.
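As a rough sketch of this last step, the achieved FLOP/s can be computed and compared with the roof at the same Arithmetic Intensity. Every number below, including the hardware peaks, is hypothetical, and the helper names are ours.

```python
def achieved_flops_per_s(flops, gpu_running_time_s):
    """FLOP/s = FLOPs / GPU_RUNNING_TIME."""
    return flops / gpu_running_time_s


def fraction_of_roof(achieved, ai, peak_flops_per_s, peak_bw_bytes_per_s):
    """How close a measured point is to the roof at its arithmetic intensity."""
    roof = min(peak_flops_per_s, ai * peak_bw_bytes_per_s)
    return achieved / roof


# Hypothetical measurements and hardware peaks.
perf = achieved_flops_per_s(flops=2.3e12, gpu_running_time_s=2.0)
frac = fraction_of_roof(perf, ai=1.4375, peak_flops_per_s=1e13, peak_bw_bytes_per_s=1e12)
print(f"{perf:.3e} FLOP/s, {frac:.0%} of the roof")
```

The pair (AI, FLOP/s) is the point that gets plotted; the fraction indicates how much headroom remains below the roof.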
Profiling with torch
For applications written in PyTorch, the built-in torch.profiler.profile offers a user-friendly way to gather performance data. Two options are available to developers:
- Use the Profiler Context Manager
- Targeted profiling for specific neural network layers
Profiler Context Manager
The part of the code that we want to profile can be wrapped within the with torch.profiler.profile() context manager. In the with statement, you can define the activities to trace (CPU, CUDA, or both), set a schedule to profile specific training steps, and choose whether to record tensor shapes, memory usage, or FLOPs. Once inside the context, you must call prof.step() at the end of each iteration to signal the profiler to advance, especially when a schedule is used.
with profile(
    activities=<activities>,
    schedule=torch.profiler.schedule(<schedule arguments>),
    record_shapes=<True|False>,
    profile_memory=<True|False>,
    with_flops=<True|False>
) as prof:
    ....
    prof.step()
- activities: Specify whether to profile the CPU, CUDA, or both.
- schedule: Useful for profiling multiple steps in the training loop. If the schedule parameter is used, the code must call prof.step() to move the profiler to the next step.
- record_shapes: Whether to record the shapes of the tensors.
- profile_memory: Whether to capture memory usage.
- with_flops: This is experimental but is used to estimate FLOPs for operators.
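Our profiler call becomes:
import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    record_shapes=True,
    profile_memory=True,
    with_flops=True
) as prof:
    ...  # training loop; call prof.step() at the end of each iteration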
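To show the whole loop end to end, here is a minimal, CPU-only sketch that runs without a GPU. The tiny linear model, the input shape, and the on_trace_ready callback that sums the profiler’s FLOP estimates are our own illustrative choices, not part of the ImageNet example.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Linear(256, 256)  # hypothetical stand-in for a real network
inputs = torch.randn(32, 256)

flops_per_cycle = []  # one entry per completed profiling cycle


def on_trace_ready(prof):
    # Sum the profiler's FLOP estimates over all recorded operators.
    total = sum(evt.flops for evt in prof.key_averages() if evt.flops)
    flops_per_cycle.append(total)


with profile(
    activities=[ProfilerActivity.CPU],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=on_trace_ready,
    record_shapes=True,
    with_flops=True,
) as prof:
    for _ in range(10):  # 2 cycles x (1 wait + 1 warmup + 3 active) steps
        model(inputs)
        prof.step()      # advance the schedule after every iteration

print(flops_per_cycle)
```

Only the active steps of each cycle are recorded, so the totals reflect three iterations per cycle rather than the whole loop. Since with_flops is experimental, the estimates only cover the operators the profiler knows how to count.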
Targeted profiling for specific neural network layers
The profiler can also be used in a more targeted way to analyze specific layers of a neural network. This is useful to check whether some specific layer is contributing more to the runtime than the other layers, giving the developer the option of modifying specific layers. While this approach is very easy to use, in most cases the first option works better. The PyTorch profiler results can also be exported and visualized in TensorBoard.
profiler.start()  # begin recording
self.conv2(x)     # only this layer's forward pass is profiled
profiler.stop()   # end recording
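A self-contained version of this pattern might look like the sketch below. The convolution layer and input shape are hypothetical, and the run is CPU-only so that it works without a GPU.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Hypothetical layer and input, standing in for self.conv2 and x in a real model.
conv2 = torch.nn.Conv2d(8, 16, kernel_size=3, padding=1)
x = torch.randn(4, 8, 32, 32)

profiler = profile(
    activities=[ProfilerActivity.CPU],
    record_shapes=True,
    with_flops=True,
)

profiler.start()  # begin recording
y = conv2(x)      # only this call is captured
profiler.stop()   # end recording

# Show the most expensive operators recorded for this layer.
print(profiler.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```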
LLMs and Roofline Modeling
Coming to the topic everyone has been waiting for: does Roofline Modeling help with LLM performance calculation? The short answer is yes.
LLMs are complex neural network architectures with billions of parameters, and they process massive datasets. While training is a very resource-intensive task, inference and fine-tuning also need to be efficient.
- Bottlenecks: LLMs during inference can suffer from bottlenecks due to the sheer number of parameters they work with. These parameters are the weights of the model, and they cause memory bandwidth issues. Using Roofline Modeling, the exact layers can be profiled for bottlenecks.
- Hardware selection: As most organizations fine-tune existing models rather than training them from scratch, choosing the right infrastructure is crucial for managing costs. This underscores the importance of choosing optimal infrastructure for training. For example, choosing the hardware according to your LLM architecture or optimizing your model to run on a specific architecture can cut training and inference costs.
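To see why weights tend to create a memory bandwidth bottleneck, consider a back-of-the-envelope estimate of the Arithmetic Intensity of a single linear layer. The sketch below is a simplified model under stated assumptions: weights and activations are each moved once, values are 2 bytes (16-bit), and caches are ignored. The layer size and batch sizes are hypothetical.

```python
def linear_layer_ai(batch_tokens, d_in, d_out, bytes_per_value=2):
    """Estimated FLOP/byte for y = x @ W with x of shape (batch_tokens, d_in)."""
    flops = 2 * batch_tokens * d_in * d_out             # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_value       # weights read once
    activation_bytes = batch_tokens * (d_in + d_out) * bytes_per_value  # inputs read, outputs written
    return flops / (weight_bytes + activation_bytes)


# Hypothetical 4096 x 4096 projection at different batch sizes.
for batch in (1, 16, 256):
    print(batch, round(linear_layer_ai(batch, 4096, 4096), 2))
```

Under these assumptions, a single token yields roughly one FLOP per byte, which sits far to the left on a Roofline graph, while larger batches reuse the same weights and move the layer toward the compute-bound region.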
Conclusion
The Roofline Model offers a powerful visual analysis for application performance optimization. By visualizing application performance across memory and compute, it provides clear guidance on the best way to approach optimizations. While this article only considered naive Roofline Models, there are more advanced techniques, such as Hierarchical Roofline Models or adding ceilings for specific compute optimizations.
References
[1] https://docs.nersc.gov/tools/performance/roofline/
[2] https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html
[3] https://github.com/pytorch/examples/tree/main/imagenet