Tags

Arithmetic Intensity, Bank Conflicts, CUDA Caching Allocator, CUDA Graphs, CUDA Launch Queue, CUDA Streams, Collectives, Device Synchronization, Device to Host Synchronization, GPU Utilization, GPUDirect, HBM, Horizontal Fusion, Memory Bandwidth, NCCL, NCU, NVLink, NVSwitch, Overlapping Streams, PCIe, Pinned Memory, Precision Formats, PyTorch Dispatch Overhead, Quantization, Shared Memory, Stream Synchronization, Tensor Cores, Vertical Fusion, torch.compile

Posts by Tags

Arithmetic Intensity

To Fuse or Not to Fuse?

Accounting for FLOPS

Bank Conflicts

Memorable Mysteries

CUDA Caching Allocator

Memorable Mysteries

CUDA Graphs

To Fuse or Not to Fuse?

Order of Kernels

CUDA Launch Queue

Order of Kernels

CUDA Streams

Swimming in Streams

Collectives

Communication is the Key to Success

Device Synchronization

Swimming in Streams

Device to Host Synchronization

The Faster Way to Add?

GPU Utilization

The Faster Way to Add?

GPUDirect

Communication is the Key to Success

HBM

Memorable Mysteries

Horizontal Fusion

To Fuse or Not to Fuse?

Memory Bandwidth

Accounting for FLOPS

NCCL

Communication is the Key to Success

NCU

Accounting for FLOPS

NVLink

Communication is the Key to Success

NVSwitch

Communication is the Key to Success

Overlapping Streams

Swimming in Streams

PCIe

Communication is the Key to Success

Pinned Memory

Memorable Mysteries

Precision Formats

Quantization Quirks

PyTorch Dispatch Overhead

Order of Kernels

The Faster Way to Add?

Quantization

Quantization Quirks

Shared Memory

Memorable Mysteries

Stream Synchronization

Swimming in Streams

Tensor Cores

Quantization Quirks

Vertical Fusion

To Fuse or Not to Fuse?

torch.compile

To Fuse or Not to Fuse?