Order of Kernels

In the code snippet below, the small_large function performs ten $32 \times 32$ matrix multiplications followed by one $1024 \times 1024$ matrix multiplication. The large_small function performs the same multiplications in the reverse order: the large one first, then the ten small ones. Which function is faster, and why?

import torch

# 32 x 32 and 1024 x 1024 matrices, both allocated on the first GPU
small_matrix = torch.rand((2**5, 2**5), device=torch.device('cuda:0'))
large_matrix = torch.rand((2**10, 2**10), device=torch.device('cuda:0'))

def small_large():
    for _ in range(10):
        torch.matmul(small_matrix, small_matrix)
    torch.matmul(large_matrix, large_matrix)

def large_small():
    torch.matmul(large_matrix, large_matrix)
    for _ in range(10):
        torch.matmul(small_matrix, small_matrix)
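
One way to check a guess empirically is to time both functions. The sketch below is an illustration and not part of the original exercise. Because CUDA kernels are launched asynchronously, it calls torch.cuda.synchronize() before starting and after stopping the clock, so that each measurement covers the GPU work and not just the kernel launches. The helper names sync and mean_seconds, the warmup and repeats defaults, and the CPU fallback are all assumptions made for this example.

```python
import time
import torch

# Assumption: fall back to the CPU when no GPU is present so the sketch still runs.
# On the CPU the timings say nothing about GPU kernel ordering.
use_cuda = torch.cuda.is_available()
device = torch.device('cuda:0' if use_cuda else 'cpu')

small_matrix = torch.rand((2**5, 2**5), device=device)
large_matrix = torch.rand((2**10, 2**10), device=device)

def small_large():
    for _ in range(10):
        torch.matmul(small_matrix, small_matrix)
    torch.matmul(large_matrix, large_matrix)

def large_small():
    torch.matmul(large_matrix, large_matrix)
    for _ in range(10):
        torch.matmul(small_matrix, small_matrix)

def sync():
    # CUDA kernels run asynchronously; block until all queued work has finished.
    if use_cuda:
        torch.cuda.synchronize()

def mean_seconds(fn, warmup=5, repeats=50):
    """Hypothetical helper: mean wall-clock seconds per call to fn."""
    for _ in range(warmup):
        fn()  # warm-up calls absorb one-time initialization costs
    total = 0.0
    for _ in range(repeats):
        sync()  # start each measurement from an idle device
        start = time.perf_counter()
        fn()
        sync()  # wait until every kernel launched by fn has completed
        total += time.perf_counter() - start
    return total / repeats

if __name__ == "__main__":
    for fn in (small_large, large_small):
        print(f"{fn.__name__}: {mean_seconds(fn) * 1e6:.1f} us per call")
```

Synchronizing around every call, instead of once around the whole loop, keeps consecutive calls from overlapping on the GPU, which would otherwise blur the effect of the ordering within a single call.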
