GPU Memory
Logical memory hierarchy diagram
Device code can (see the kernel sketch after this list):
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block L1 cache/shared memory
- R/W per-grid global memory
- Read-only per-grid constant memory (an on-device, low-latency, high-bandwidth read-only cache that stores constants and kernel arguments)
- Read-only per-grid texture memory (an on-device, low-latency read-only cache for 2D/3D textures)
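A minimal kernel sketch of these access paths, assuming a block size of 256 threads; the kernel name, the kCoeffs constant array, and the array sizes are illustrative only:

```cuda
#include <cuda_runtime.h>

// Hypothetical per-grid constant-memory array (name and size are illustrative).
__constant__ float kCoeffs[64];

// Automatic scalars live in registers (or spill to local memory), the
// __shared__ tile lives in per-block shared memory, and the pointer
// arguments refer to per-grid global memory.
__global__ void scaleAndCopy(const float* in, float* out, int n) {
    __shared__ float tile[256];                       // per-block shared memory
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // held in a register

    if (idx < n)
        tile[threadIdx.x] = in[idx] * kCoeffs[threadIdx.x % 64];  // global + constant reads
    __syncthreads();                                  // every thread in the block reaches this
    if (idx < n)
        out[idx] = tile[threadIdx.x];                 // global write
}
```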
Host code can:
- Transfer data to/from per-grid global, constant, and texture memory (see the host-side sketch below)
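A host-side sketch of those transfers, assuming the scaleAndCopy kernel and the kCoeffs symbol from the sketch above live in the same file; error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> hostIn(n, 1.0f), hostOut(n);
    float hostCoeffs[64];
    for (int i = 0; i < 64; ++i) hostCoeffs[i] = 2.0f;

    float *devIn = nullptr, *devOut = nullptr;
    cudaMalloc(&devIn, n * sizeof(float));   // per-grid global memory
    cudaMalloc(&devOut, n * sizeof(float));

    // Host -> device: global memory via cudaMemcpy, constant memory via cudaMemcpyToSymbol.
    cudaMemcpy(devIn, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(kCoeffs, hostCoeffs, sizeof(hostCoeffs));

    scaleAndCopy<<<(n + 255) / 256, 256>>>(devIn, devOut, n);

    // Device -> host: copy the result back from global memory.
    cudaMemcpy(hostOut.data(), devOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(devIn);
    cudaFree(devOut);
    return 0;
}
```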
Memory features and sizes in the A100
Memory | Location | Access | Scope | Lifetime | Amount in A100 SXM |
---|---|---|---|---|---|
Register | On chip | Read/Write | 1 thread | Thread | 256 KB per SM |
Local* | Off chip | Read/Write | 1 thread | Thread | – |
Shared** | On chip | Read/Write | All threads in a block | Block | Up to 164 KB per SM |
Global | Off chip | Read/Write | All threads + host | Host allocation | 40 GB or 80 GB |
Constant | Off chip | Read only | All threads + host | Host allocation | 64 KB |
Texture | Off chip | Read only | All threads + host | Host allocation | Depends on textures used |
* Local memory is not a physical type of memory but an abstraction over global memory. It holds only automatic variables; the compiler places a variable there when it determines that there is not enough register space to hold it (register spilling).
** The shared memory carve-out is configurable: up to 164 KB per SM on the A100, with the maximum depending on the compute capability (e.g., up to 228 KB per SM on Hopper).
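A sketch of the shared-memory opt-in on the host side, using a hypothetical kernel with dynamically sized shared memory; requesting more than the default 48 KB per block requires cudaFuncSetAttribute:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel whose shared-memory tile is sized at launch time.
__global__ void usesLargeTile(float* data, int n) {
    extern __shared__ float tile[];                   // dynamic shared memory
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        tile[threadIdx.x] = data[idx];
        data[idx] = tile[threadIdx.x] * 2.0f;
    }
}

int main() {
    // Ask for 100 KB of dynamic shared memory per block, above the default 48 KB limit.
    // The achievable maximum depends on the compute capability (see footnote ** above).
    const int sharedBytes = 100 * 1024;
    cudaFuncSetAttribute(usesLargeTile,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         sharedBytes);

    const int n = 256;
    float* devData = nullptr;
    cudaMalloc(&devData, n * sizeof(float));
    usesLargeTile<<<1, n, sharedBytes>>>(devData, n);  // third launch parameter = dynamic shared bytes
    cudaDeviceSynchronize();
    cudaFree(devData);
    return 0;
}
```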
Caches
Type | Access | Amount in A100 SXM |
---|---|---|
L1 data cache | Read/Write | 192 KB per SM (combined L1/shared memory capacity) |
L2 cache | Read/Write | 40 MB |
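These sizes can be confirmed at runtime with cudaGetDeviceProperties; a minimal sketch for device 0 (the L1 data cache size is not reported directly):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                // query device 0

    // regsPerMultiprocessor counts 32-bit registers, i.e. 4 bytes each.
    printf("Registers per SM:              %d KB\n", prop.regsPerMultiprocessor * 4 / 1024);
    printf("Shared memory per SM:          %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("Shared mem per block (opt-in): %zu KB\n", prop.sharedMemPerBlockOptin / 1024);
    printf("Global memory:                 %zu GB\n", prop.totalGlobalMem >> 30);
    printf("Constant memory:               %zu KB\n", prop.totalConstMem / 1024);
    printf("L2 cache:                      %d MB\n", prop.l2CacheSize / (1024 * 1024));
    return 0;
}
```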