Nvidia GPU Terminology

Block: a 1D/2D/3D collection of threads.

CUDA: Compute Unified Device Architecture.

CUDA Core: a single-precision (fp32) floating-point unit.

CUDA Stream: a sequence of operations that execute in issue-order on the GPU.
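
As a minimal sketch (the kernel and buffer names below are illustrative, not from the original text), operations issued into a stream run on the GPU in issue order while the host thread proceeds asynchronously:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to illustrate stream ordering.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float* h;
    cudaMallocHost(&h, n * sizeof(float));  // pinned host memory, required for truly async copies
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The three operations below execute on the GPU in issue order,
    // asynchronously with respect to the host thread.
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, 2.0f, n);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // wait for the stream to drain
    printf("h[0] = %f\n", h[0]);    // prints 2.0
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```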

CUDA Thread: a lightweight thread that executes a sequential program. It is the smallest unit of execution in the CUDA programming model.

Device: an alias for GPU.

Execution Configuration: a kernel invocation must specify the grid dimension and block dimension, and may optionally specify the dynamic shared memory size (default 0 bytes) and the stream (default: the null stream). These parameters are called the execution configuration and are specified within the angle brackets <<< >>>, as in the sketch below.
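
A minimal sketch showing all four parameters (the kernel, names, and sizes are illustrative, not from the original text):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel that uses dynamically sized shared memory,
// declared with `extern __shared__` and sized at launch time.
__global__ void copyThroughShared(const float* in, float* out) {
    extern __shared__ float tile[];  // size set by the execution configuration
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x];
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 grid(n / 256);                  // grid dimension (required)
    dim3 block(256);                     // block dimension (required)
    size_t shmem = 256 * sizeof(float);  // shared memory size (optional, default 0)
    cudaStream_t s;
    cudaStreamCreate(&s);                // stream (optional, default: null stream)

    // <<<grid dimension, block dimension, shared memory size, stream>>>
    copyThroughShared<<<grid, block, shmem, s>>>(in, out);
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```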

Grid: a 1D/2D/3D collection of thread blocks.

Host: alias for CPU.

Kernel: a function executed on the GPU. Its arguments are passed by value (primitive types and pointers), and their total size is limited to 4 KB in most CUDA versions.

NVLink: an interconnection technology used to connect GPUs within a server.

NVSwitch: a switch used to connect GPUs within a single server. The connection between the switch and device is NVLink.

PTX Instruction: an instruction in NVIDIA's PTX (Parallel Thread Execution) virtual instruction set; each CUDA thread executes a stream of PTX instructions, which the toolchain translates to native machine instructions (SASS).

SIMT (Single instruction, multiple threads): an execution model used in parallel computing in which single instruction, multiple data (SIMD) is combined with multithreading. It differs from SPMD in that the threads of a warp execute in lock-step: at each step, every active thread of the warp issues the same instruction (threads that have diverged are masked off), rather than each thread progressing independently.

SM (Streaming Multiprocessor): a multithreaded SIMT/SIMD processor which executes warps of CUDA threads.

Warp: a collection of parallel CUDA threads within a thread block (32 threads on NVIDIA hardware) that are scheduled together and execute in SIMT lock-step, as sketched below.
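
As a sketch of what lock-step execution enables (the kernel is illustrative, not from the original text; `__shfl_down_sync` is a real CUDA intrinsic), the threads of a warp can exchange register values directly:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: the 32 threads of a warp sum their values by
// exchanging registers with __shfl_down_sync, relying on the fact
// that the warp executes in lock-step.
__global__ void warpSum(const float* in, float* out) {
    float v = in[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    if (threadIdx.x == 0) *out = v;  // lane 0 ends up with the warp total
}
```

Launched as `warpSum<<<1, 32>>>(in, out)`, lane 0 writes the sum of all 32 inputs, with no shared memory needed.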

GPU Memory

Global memory: the GPU's off-chip DRAM. It is accessible by all CUDA threads in any block in any grid.

Local memory: a region of DRAM that is accessible only to a specific CUDA thread; used for register spills and per-thread arrays.

Registers: private registers for a CUDA thread.

Shared memory: on-chip memory shared by the CUDA threads of a block. Since shared memory is on chip and built from SRAM, it has higher bandwidth and lower latency than local/global memory; see the sketch below.
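
A sketch of a common use (the kernel is illustrative, not from the original text): each block stages a tile in shared memory and reduces it cooperatively:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each block sums 256 elements through on-chip
// shared memory. The tile is re-read many times during the reduction,
// which is cheap because shared memory has much higher bandwidth and
// lower latency than global memory.
__global__ void blockSum(const float* in, float* out) {
    __shared__ float buf[256];                 // visible to this block only
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                           // tile fully loaded

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];    // one partial sum per block
}
```

Launched with 256-thread blocks, e.g. `blockSum<<<n / 256, 256>>>(in, out)`.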

Computer Architecture

Amdahl’s Law: the performance improvement obtained by using a faster mode of execution is limited to the fraction of time the faster mode can be used.
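
In formula form (my rendering of the standard statement; the symbols are not from the original text): if a fraction $P$ of the execution time can be sped up by a factor $S$, the overall speedup is

\[\text{Speedup} = \frac{1}{(1-P) + P/S}\]

For example, speeding up 90% of a program by $10\times$ ($P = 0.9$, $S = 10$) yields an overall speedup of only $1/0.19 \approx 5.3\times$.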

Architecture: the part of a processor that’s visible to a programmer - analogous to the API a data structure library presents.

Arithmetic Intensity: the ratio of the number of floating point operations executed by a piece of code to the number of bytes of memory it reads and writes. For example, single-precision SAXPY ($y_i \leftarrow a x_i + y_i$) performs 2 flops while moving 12 bytes per element (read $x_i$, read $y_i$, write $y_i$), for an arithmetic intensity of $1/6$ flop/byte.

Cache coherence: the property that all views of the data at an address are the same, i.e., every core observes the same value for a given memory location.

Core: a processing unit that executes a single instruction stream. A multicore processor consists of a set of cores, each executing its own instruction stream.

Gustafson’s Law: as the problem size scales with the number of processors, the maximum speedup $S$ a program can achieve is

\[S = N + (1-P)(1-N)\]

where $N$ is the number of processors and $P$ is the fraction of the total (scaled) execution time spent in the parallel part of the program. For example, with $N = 8$ and $P = 0.9$: $S = 8 + 0.1 \times (1-8) = 7.3$.

Microarchitecture: hardware techniques used to implement an architecture efficiently. E.g., cache, pipeline, branch-predictor, prefetcher, parallel dispatch, etc.

Memory Controller: the hardware that reads and writes DRAM and performs DRAM maintenance events (for example, memory refresh).

Load/Store Architecture: an architecture in which instructions either move data between registers and memory (loads and stores) or perform operations on registers. This is the most prevalent computer architecture today.

Instruction Level Parallelism (ILP): the parallelism obtained by executing multiple independent instructions simultaneously; exploited by microarchitectural techniques such as pipelining, superscalar dispatch, and out-of-order execution.

PCIe (Peripheral Component Interconnect Express): the technology used to connect the CPU to devices (GPU, network card, hard drive, etc.) - both the interconnect and the switch.

Roofline Analysis: a graphical representation of the performance bounds of a processor in terms of floating-point throughput (flop/s) and arithmetic intensity.
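
In formula form (the standard roofline model; the symbols are mine, not from the original text), the attainable throughput is

\[\text{Flop/s}_{\text{attainable}} = \min\left(\text{Flop/s}_{\text{peak}},\; I \times B_{\text{peak}}\right)\]

where $I$ is the arithmetic intensity of the code (flops/byte) and $B_{\text{peak}}$ is the peak memory bandwidth. Plotted on log-log axes, the two bounds form the "roofline".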

Strong Scaling: a measure of how, for a fixed overall problem size, the time to solution decreases as more processors are added to a system.

Translation Lookaside Buffer (TLB): a hardware cache of the page table, i.e., the mapping from virtual to physical addresses.

Weak Scaling: a measure of how the time to solution changes as more processors are added to a system with a fixed problem size per processor; i.e., where the overall problem size increases as the number of processors is increased.

Miscellaneous

Kineto: the library that traces GPU kernel calls in PyTorch programs - because GPU kernels execute asynchronously on the GPU, special NVIDIA libraries (e.g., CUPTI) are needed to do the tracing.

PyTorch Profiler: the library that uses Kineto to generate host- and device-side timeline traces for PyTorch programs.

References

  1. D. B. Kirk and W. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Third Ed., Morgan Kaufmann, 2017.
  2. J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Sixth Ed., Morgan Kaufmann, 2019.