The code snippet below sums the elements of a 1D CUDA tensor of size $4096$ in three different ways. Which implementation is the fastest, which is the slowest, and why?

import torch

def first_sum(cuda_tensor):
    # Pulls one scalar from GPU to CPU on every iteration,
    # triggering a device-to-host transfer (and a sync) per element.
    total = 0.0
    for i in range(cuda_tensor.size(0)):
        total += cuda_tensor[i].cpu()
    return total

def second_sum(cuda_tensor):
    # Keeps the accumulator on the GPU, but each += launches
    # a separate CUDA kernel for a single-element addition.
    total = torch.zeros(1, device='cuda')
    for i in range(cuda_tensor.size(0)):
        total += cuda_tensor[i]
    return total

def third_sum(cuda_tensor):
    # Copies the whole tensor to the CPU once, then loops there.
    total = 0.0
    tensor_on_cpu = cuda_tensor.cpu()
    for i in range(tensor_on_cpu.size(0)):
        total += tensor_on_cpu[i]
    return total
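
To check your answer empirically, you can time each function on the same input. Below is a minimal timing sketch, assuming a CUDA-capable machine; the time_fn helper is our own name, not part of the question. The torch.cuda.synchronize() calls matter: CUDA kernel launches are asynchronous, so without them the timer could stop before the GPU work has actually finished.

import time
import torch

def time_fn(fn, tensor, warmup=1, repeats=3):
    # Hypothetical helper for rough wall-clock timing.
    for _ in range(warmup):
        fn(tensor)
    torch.cuda.synchronize()  # wait for any pending GPU work
    start = time.perf_counter()
    for _ in range(repeats):
        fn(tensor)
    torch.cuda.synchronize()  # ensure all launched kernels finished
    return (time.perf_counter() - start) / repeats

x = torch.randn(4096, device='cuda')
for fn in (first_sum, second_sum, third_sum):
    print(f"{fn.__name__}: {time_fn(fn, x):.4f} s")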

See answer and discussion