GPU Puzzlers

Accounting for FLOPS
Is the GPU faster at addition or multiplication?
The Faster Way to Add?
This puzzle presents three different ways to add elements of a tensor. Can you figure out the fastest implementation?
Order of Kernels
The order of operations matters on the GPU. Can you find the faster ordering?
Quantization Quirks
When is matrix multiplication compute-bound, and when is it memory-bandwidth-bound, on a GPU?
Memorable Mysteries
What is the optimal way to do a matrix transpose on a GPU?
Swimming in Streams
Can GPUs communicate and compute at the same time?
To Fuse or Not to Fuse?
Can the arithmetic intensity of a program be increased?
Communication is the Key to Success
Data can be transmitted in many ways, but can you find the most efficient one?
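
Several of the puzzles above (Quantization Quirks and To Fuse or Not to Fuse? in particular) turn on arithmetic intensity, the ratio of floating-point operations to bytes moved through memory. As a rough warm-up, the sketch below estimates that ratio for a matrix multiplication and compares it with a device's ratio of peak compute to peak bandwidth. The function names, the hardware numbers, and the assumption that each matrix crosses memory exactly once are illustrative simplifications, not values or methods taken from the puzzles.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_element):
    """FLOPs per byte for C = A @ B with A of shape (m, k) and B of shape (k, n).

    Assumes (as a simplification) that A and B are each read once and C is
    written once, with no extra traffic from caching or tiling effects.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_compute_bound(intensity, peak_flops, peak_bandwidth):
    """True if the intensity exceeds the device's FLOPs-per-byte ridge point."""
    return intensity > peak_flops / peak_bandwidth


# Hypothetical device: 100 TFLOP/s of compute and 1 TB/s of memory bandwidth,
# which puts the ridge point at 100 FLOPs per byte.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12

# A large square matmul versus a matrix-vector-like product, both with
# 2-byte (16-bit) elements.
large = matmul_arithmetic_intensity(4096, 4096, 4096, bytes_per_element=2)
skinny = matmul_arithmetic_intensity(1, 4096, 4096, bytes_per_element=2)

print(f"large:  {large:.1f} FLOPs/byte, compute bound: "
      f"{is_compute_bound(large, PEAK_FLOPS, PEAK_BANDWIDTH)}")
print(f"skinny: {skinny:.2f} FLOPs/byte, compute bound: "
      f"{is_compute_bound(skinny, PEAK_FLOPS, PEAK_BANDWIDTH)}")
```

Under these assumptions the same operation lands on opposite sides of the ridge point depending only on its shape. Element width enters through `bytes_per_element`, so changing precision shifts the intensity as well.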