Answer
Empirically, the first_sum implementation is the slowest and the third_sum implementation is the fastest.
PyTorch profiler trace for first_sum, second_sum and third_sum
The trace above shows the duration of each function and highlights the numerous device-to-host memory copy kernels triggered in the first_sum implementation, the various vector ops taking place in the second_sum function, and the single device-to-host copy in third_sum. The trace file is available here.
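For reference, a trace like this can be generated with the PyTorch profiler. The snippet below is a minimal sketch, assuming a_tensor is a 4096-element float32 tensor resident on the GPU (consistent with the $4 \times 4096$ bytes mentioned later) and that first_sum is the function defined in the next section; the output file name is illustrative.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Assumed input: a 4096-element float32 tensor already on the GPU.
a_tensor = torch.randn(4096, device="cuda")

# Record both CPU-side runtime events and GPU kernels.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    first_sum(a_tensor)

# Export a Chrome-trace file that can be opened in Perfetto or chrome://tracing.
prof.export_chrome_trace("first_sum_trace.json")
```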
Analyzing first_sum - numerous device-to-host copies
```python
def first_sum(cuda_tensor):
    total = 0.0
    for i in range(cuda_tensor.size()[0]):
        total += cuda_tensor[i].cpu()
    return total
```
Zoomed view - first_sum
In the first_sum function the number of copies from the device to the host is equal to the size of the tensor. These are incurred due to the .cpu() call inside the for loop. Each .cpu() call moves a small amount of data from the GPU to the CPU and takes about $1$ microsecond. Additionally, the gap between consecutive kernels is about $40$ microseconds. These repeated calls, highlighted in the zoomed trace above, make the first_sum implementation the slowest one.
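To confirm how many device-to-host copies a function triggers, the profiler's aggregated statistics can be printed. This is a sketch assuming the profiler run shown earlier (not part of the original post); the Memcpy DtoH row should report roughly one call per tensor element for first_sum.

```python
# Aggregate the events recorded above and sort by total CUDA time;
# for first_sum the Memcpy DtoH row shows ~4096 calls of ~1 us each.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```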
Analyzing second_sum - no device-to-host copies
```python
def second_sum(cuda_tensor):
    total = torch.zeros(1, device='cuda')
    for i in range(cuda_tensor.size()[0]):
        total += cuda_tensor[i]
    return total
```
Zoomed view - second_sum
The second_sum function initializes the total tensor and places it on the GPU. Adding each element of a_tensor to total triggers a vector op kernel. The addition is slowed down by the launch overhead of these kernels. The arithmetic intensity of vector ops involving small tensors is low, and such operations can be done on the CPU, when reasonable. Even though there are no device-to-host copies or synchronizations in the second_sum implementation, the GPU is considerably underutilized. This can be seen from the $\sim 20$ microsecond gaps between consecutive kernels on stream $7$.
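Launch overhead is easy to expose with a small timing experiment. The sketch below is illustrative (not from the original post): it compares launching one tiny addition kernel per element, as second_sum does, against a single reduction kernel over the same data.

```python
import time
import torch

a_tensor = torch.randn(4096, device="cuda")

def time_fn(fn):
    torch.cuda.synchronize()               # start from an idle GPU
    start = time.perf_counter()
    fn()
    torch.cuda.synchronize()               # wait for all launched kernels
    return (time.perf_counter() - start) * 1e3

def many_small_kernels():
    total = torch.zeros(1, device="cuda")
    for i in range(a_tensor.size()[0]):    # one tiny kernel per element
        total += a_tensor[i]

def one_reduction_kernel():
    a_tensor.sum()                         # a single reduction kernel

print(f"loop of small kernels: {time_fn(many_small_kernels):.2f} ms")
print(f"single reduction:      {time_fn(one_reduction_kernel):.2f} ms")
```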
Analyzing third_sum - addition on the CPU
```python
def third_sum(cuda_tensor):
    total = 0.0
    tensor_on_cpu = cuda_tensor.cpu()
    for i in range(tensor_on_cpu.size()[0]):
        total += tensor_on_cpu[i]
    return total
```
Finally, the third_sum implementation copies the entire tensor to the CPU and pays a small one-time cost to transfer the data. Precisely speaking, $4 \times 4096$ bytes are transferred in $2$ microseconds, so the achieved PCIe bandwidth is approximately $8$ GB/sec. The summation is done on the CPU since total and the elements of the tensor are both on the CPU. This element-by-element summation is slow due to the lack of instruction-level and core-level parallelism, and the additional PyTorch overhead makes it even slower.
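As a quick check of the bandwidth figure above:

$$\frac{4 \times 4096 \text{ bytes}}{2\ \mu\text{s}} = \frac{16384 \text{ bytes}}{2 \times 10^{-6}\ \text{s}} \approx 8.2 \text{ GB/s}$$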
Discussion
What is the fastest way to add in PyTorch?
torch.sum has better performance than the above implementations. The functions above are contrived examples to demonstrate device-to-host synchronization and launch overhead. The table below summarizes the time taken by the three functions above and by torch.sum.
| function | CPU time (ms) | GPU time (ms) |
|---|---|---|
| first_sum | 183 | 181 |
| second_sum | 87 | 86 |
| third_sum | 25 | 0.001 |
| torch.sum() (tensor on GPU) | 0.161 | 0.009 |
| torch.sum() (tensor on CPU) | 1.8 | NA |
torch.sum() uses the Intel Math Kernel Library (MKL) to sum tensors on the CPU, which fully utilizes the cores and instruction-level parallelism to speed up the computation. Intel MKL is written in C++, so there is no Python overhead either.
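For completeness, the last two rows of the table correspond to calls along the following lines (a sketch; the exact benchmarking harness used for the table is not shown here):

```python
# torch.sum() with the tensor on the GPU: a single CUDA reduction kernel.
gpu_result = torch.sum(cuda_tensor)

# torch.sum() with the tensor on the CPU: one bulk device-to-host copy,
# then a vectorized (MKL-backed) reduction on the host.
cpu_result = torch.sum(cuda_tensor.cpu())
```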
What is synchronization?
There are three levels of synchronization: Device synchronization, Stream synchronization and Event synchronization.
A stream is a sequence of operations that are performed in order on the device. Operations in different streams can be interleaved and in some cases overlapped - a property that can be used to hide data transfers between the host and the device. For more details on streams check out Swimming in Streams.
A CUDA event is a synchronization marker that can be used to monitor the device’s progress, to accurately measure timing, and to synchronize CUDA streams.
- Using torch.cuda.synchronize() leads to device synchronization, which blocks execution on the CPU thread until the device has completed all preceding tasks. It waits for the GPU to finish kernel execution on all streams (see the timing sketch after this list).
- CUDA streams can be synchronized using torch.cuda.Stream.synchronize(). Execution is blocked on the CPU thread until the device has executed all kernels on the specified stream.
- Event synchronization can be triggered using torch.cuda.Event.synchronize(). It prevents the CPU thread from proceeding until the event is completed.
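As a concrete illustration of why device synchronization matters when timing GPU work (a sketch, not from the original post): kernel launches return control to the CPU immediately, so a host-side timer only reflects the real duration after a synchronize.

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")

start = time.perf_counter()
y = x @ x                       # asynchronous launch; returns immediately
launch_only = time.perf_counter() - start

torch.cuda.synchronize()        # device synchronization: wait for all streams
with_sync = time.perf_counter() - start

print(f"time until launch returned:  {launch_only * 1e3:.3f} ms")
print(f"time until the GPU finished: {with_sync * 1e3:.3f} ms")
```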
What are some naturally occurring synchronization points in PyTorch?
- Explicit call to torch.cuda.synchronize().
- Implicitly, the following calls trigger synchronization: .item(), .cpu(), torch.nonzero, torch.masked_select (see the sketch after this list).
- Logging statements from the GPU.
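A common way these implicit points show up in practice is logging a GPU scalar every iteration (an illustrative sketch, not taken from the post): each .item() call forces a device-to-host synchronization, whereas accumulating on the GPU and transferring once avoids most of them.

```python
import torch

losses = [torch.rand(1, device="cuda") for _ in range(100)]

# Implicit synchronization on every iteration: .item() waits for the GPU.
running = 0.0
for loss in losses:
    running += loss.item()

# Fewer synchronization points: accumulate on the GPU, transfer once.
running = torch.stack(losses).sum().item()   # single copy + sync at the end
```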
Why are there thousands of calls to cudaStreamSynchronize in the trace?
- Iterating over a for loop as in first_sum causes synchronization and leads to performance degradation. These synchronization points are visible and avoidable. A better implementation would parallelize the computation by launching a kernel that uses multiple threads rather than iterating over the elements.
- As seen in second_sum, an absence of synchronization points does not guarantee good performance. Executing many small kernels iteratively does not utilize the GPU fully.
- Synchronization points can stall the CUDA launch queue, which can make the job CPU bound. More about this in the next post.
How can I find synchronization points in my program?
Use torch.cuda.set_sync_debug_mode(), when possible. Currently, this feature does not support the torch.distributed and torch.sparse namespaces.
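A minimal usage sketch (the tensor and the operation are illustrative):

```python
import torch

# Warn whenever an operation forces the CPU to synchronize with the GPU.
torch.cuda.set_sync_debug_mode("warn")

x = torch.randn(4096, device="cuda")
value = x.sum().item()   # .item() triggers a sync and emits a warning
```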
Analyzing traces with Holistic Trace Analysis
Looking at the trace, you may note that the cudaMemcpyAsync event on the CPU has a longer duration than the corresponding operation (MemcpyDtoH) on the GPU. This may not be easy to find in general. The Holistic Trace Analysis (HTA) library provides a convenient way to identify this behavior using the Cuda Kernel Launch Statistics feature. Using the PyTorch profiler traces, HTA provides insights for performance debugging. The get_cuda_kernel_launch_stats function displays the distribution of GPU kernels (in particular, cudaLaunchKernel, cudaMemcpyAsync and cudaMemsetAsync) whose duration is less than the corresponding CPU event.
Profiling each function as an independent Python program, we generated three traces and analyzed them with the get_cuda_kernel_launch_stats feature. One of the outputs from the get_cuda_kernel_launch_stats function is the set of graphs below, which show that there are thousands of GPU events with a duration shorter than the corresponding CPU event, thus highlighting the issues with the first_sum and second_sum implementations.
Cuda Kernel Launch Stats - first_sum
Cuda Kernel Launch Stats - second_sum
The graphs above were generated using the following code snippet:
```python
from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
kernel_stats = analyzer.get_cuda_kernel_launch_stats()
```
Here are the traces for the first_sum, second_sum and third_sum functions.
What should you remember in years to come?
- Multiple small repeated kernel calls or device-to-host copies make your code perform poorly. They can often be replaced with a single large compute/copy kernel.
- Be aware of device-to-host synchronization points since they can often be avoided.
Explore more
- Find the kernels taking the most time in your model using the Kernel Breakdown feature in Holistic Trace Analysis.
- Build a roofline model to find if the kernels are compute bound or memory bandwidth bound.