Tags
Arithmetic Intensity, Bank Conflicts, CUDA Caching Allocator, CUDA Graphs, CUDA Launch Queue, CUDA Streams, Collectives, Device Synchronization, Device to Host Synchronization, GPU Utilization, GPUDirect, HBM, Horizontal Fusion, Memory Bandwidth, NCCL, NCU, NVLink, NVSwitch, Overlapping Streams, PCIe, Pinned Memory, Precision Formats, PyTorch Dispatch Overhead, Quantization, Shared Memory, Stream Synchronization, Tensor Cores, Vertical Fusion, torch.compile