Order of Kernels
In the code snippet below, the small_large function performs ten $32 \times 32$ matrix multiplications followed by one $1024 \times 1024$ matrix multiplication, while the large_small function reverses that order. Which function is faster, and why? (A timing sketch follows the snippet.)
import torch

# Both operands live on the GPU so that each matmul launches a CUDA kernel.
small_matrix = torch.rand((2**5, 2**5), device=torch.device('cuda:0'))
large_matrix = torch.rand((2**10, 2**10), device=torch.device('cuda:0'))

def small_large():
    # Ten small (32 x 32) multiplications, then one large (1024 x 1024) one.
    for _ in range(10):
        torch.matmul(small_matrix, small_matrix)
    torch.matmul(large_matrix, large_matrix)

def large_small():
    # One large multiplication first, then ten small ones.
    torch.matmul(large_matrix, large_matrix)
    for _ in range(10):
        torch.matmul(small_matrix, small_matrix)
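To compare the two empirically, one can time each function while synchronizing the GPU before starting and before stopping the clock, since CUDA kernel launches are asynchronous and return to Python before the work actually finishes. The sketch below is one way to do this; the benchmark helper, its warm-up call, and the repeat count are illustrative choices, not part of the original snippet.

import time

def benchmark(fn, repeats=100):
    # Warm up so one-time CUDA initialization does not skew the measurement.
    fn()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        fn()
    # Wait for all queued kernels to finish before reading the clock.
    torch.cuda.synchronize()
    return (time.time() - start) / repeats

print(f"small_large: {benchmark(small_large):.6f} s per call")
print(f"large_small: {benchmark(large_small):.6f} s per call")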