Marketing literature for GPUs stresses their high FLOPS. An A100 is advertised as capable of $19.5$ FP32 TFLOPS. The code below performs a range of numerical operations on tensors: addition, scalar multiplication, transcendental functions, and matrix multiplication.

import torch
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

def flop_ops():
    SIZE = 2**12
    ones = torch.ones((SIZE, SIZE), device=torch.device('cuda:0'))

    result_mm = torch.matmul(ones, ones) # matrix multiplication (out= must not alias an input)
    ones.mul_(0.5)                       # in-place multiplication by a scalar
    result_mul = ones.mul(0.5)           # out-of-place multiplication by a scalar
    total = ones + result_mul            # adding tensors
    result_sum = torch.sum(ones)         # adding elements of a tensor
    result_sqrt = torch.sqrt(ones)       # sqrt takes 7 ops
    result_sin = torch.sin(ones)         # sin takes 17 ops (14 fp64, 3 fp32)
    result_sigmoid = torch.sigmoid(ones) # sigmoid takes 24 ops
    result_log10 = torch.log10(ones)     # log10 takes 24 ops
    result_pow = torch.pow(ones, 3.14159) # pow takes 142 ops

# warmup: run once so one-time CUDA initialization costs don't show up in the trace
flop_ops()
torch.cuda.synchronize()  # make sure the warmup kernels have finished before profiling

trace_handler = tensorboard_trace_handler(dir_name="./flops_trace", use_gzip=True)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             on_trace_ready=trace_handler,
             record_shapes=True,
             with_stack=True) as prof:
    flop_ops()
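
If you just want per-kernel times without opening the trace in TensorBoard, the profiler object can also print a summary table to the console. This is a small optional addition to the snippet above, assuming a recent PyTorch where the CUDA time columns use the sort key shown:

# Print an aggregated per-operator summary of the profiled run.
# sort_by="cuda_time_total" sorts by total CUDA time per op
# (the exact key name may differ slightly across PyTorch versions).
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))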

The trace shown below indicates that, apart from matrix multiplication, every operation achieves only a tiny fraction (~$1\%$) of the advertised FLOPS. Furthermore, simple operations like addition take just as long as complex ones like sine and log. Why?
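
To turn a kernel time from the trace into an achieved-FLOPS number, a back-of-the-envelope calculation like the sketch below works. Here `kernel_time_us` is whatever duration the trace (or the summary table) reports for the kernel, and the per-element op counts are the ones assumed in the comments of `flop_ops()` above:

SIZE = 2**12

# Total floating-point work per kernel, using the per-element costs assumed
# in the comments of flop_ops() above (assumed counts, not measured values).
matmul_flop = 2 * SIZE**3      # ~137 GFLOP for a 4096x4096 GEMM
add_flop    = SIZE**2          # 1 FLOP per element
sin_flop    = 17 * SIZE**2     # 17 ops per element, per the comment above

def achieved_tflops(flop_count, kernel_time_us):
    """Convert a kernel time in microseconds (read from the trace) to TFLOPS."""
    return flop_count / (kernel_time_us * 1e-6) / 1e12

Comparing the result against the advertised $19.5$ TFLOPS is how the ~$1\%$ utilization figure above can be checked.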

See answer and discussion