Per-Op Layer Breakdown

Per-operation analysis of a single Transformer layer: FLOPs, IO bytes, arithmetic intensity, and bottleneck type.
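As a rough illustration of how these four quantities relate, the sketch below classifies one operation with a simple roofline model. It is not taken from this tool: the function name `classify_op`, the `OpEstimate` fields, and the device numbers (100 TFLOP/s peak, 1 TB/s memory bandwidth) are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class OpEstimate:
    flops: float
    io_bytes: float
    intensity: float  # arithmetic intensity, FLOPs per byte moved
    bound: str        # "compute" or "memory"
    time_s: float     # roofline time estimate in seconds


def classify_op(flops, io_bytes, peak_flops, peak_bw):
    """Classify one op with a simple roofline model (hypothetical helper)."""
    intensity = flops / io_bytes
    ridge = peak_flops / peak_bw  # intensity at which both limits are equal
    bound = "compute" if intensity >= ridge else "memory"
    # The slower of the two limits sets the estimated time.
    time_s = max(flops / peak_flops, io_bytes / peak_bw)
    return OpEstimate(flops, io_bytes, intensity, bound, time_s)


def gemm_cost(m, k, n, bytes_per_elem=2):
    """FLOPs and bytes (read + write) for an (m x k) @ (k x n) matmul."""
    flops = 2 * m * k * n
    io_bytes = (m * k + k * n + m * n) * bytes_per_elem
    return flops, io_bytes


# Assumed device: 100 TFLOP/s, 1 TB/s (placeholder values, not a real part).
PEAK_FLOPS, PEAK_BW = 100e12, 1e12

# One 4096 x 4096 projection in 16-bit precision (hidden size is assumed).
prefill = classify_op(*gemm_cost(512, 4096, 4096), PEAK_FLOPS, PEAK_BW)
decode = classify_op(*gemm_cost(1, 4096, 4096), PEAK_FLOPS, PEAK_BW)
print(prefill.bound, round(prefill.intensity, 1))
print(decode.bound, round(decode.intensity, 1))
```

With these placeholder numbers the 512-token prefill projection has an intensity of about 410 FLOPs/byte against a ridge point of 100, so it is compute-bound, while the single-token decode step sits near 1 FLOP/byte and is memory-bound.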

Vendor Device

Device FLOPS Utilization (%)

Device Memory Utilization (%)

Device Network BW Utilization (%)

Quantization

KV Cache Precision

Prompt Length: 512

Output Length: 256

Batch: 1

TP: 1

FlashAttn

Operation	FLOPs	Bytes (R+W)	AI	Bound	Time	%
Attention (Q/K/V Proj + Attn + O Proj)	-
FFN / MLP (Gate, Up, Down + Activation)	-
Norm + Residual (RMSNorm, Residual connections)	-
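The three operation groups above could be estimated along the following lines. This is a minimal sketch under stated assumptions, not the tool's actual formulas: the model dimensions (`d_model=4096`, `d_ff=11008`), the 16-bit element size, and the device peaks are hypothetical, full multi-head attention is assumed (no grouped-query sharing), the attention core is treated as a fused kernel that never writes the score matrix to memory, and causal-mask savings and activation-function FLOPs are ignored.

```python
def gemm(m, k, n, bytes_per_elem):
    """FLOPs and bytes (read + write) for an (m x k) @ (k x n) matmul."""
    flops = 2 * m * k * n
    io_bytes = (m * k + k * n + m * n) * bytes_per_elem
    return flops, io_bytes


def layer_breakdown(seq_len=512, batch=1, d_model=4096, d_ff=11008,
                    bytes_per_elem=2, peak_flops=100e12, peak_bw=1e12):
    """Per-group estimate for one layer during prefill (all values assumed)."""
    tokens = batch * seq_len
    act = tokens * d_model * bytes_per_elem  # bytes in one activation tensor

    # Attention: Q, K, V, O projections plus the QK^T and scores @ V matmuls.
    proj_f, proj_b = gemm(tokens, d_model, d_model, bytes_per_elem)
    core_f = 2 * 2 * batch * seq_len * seq_len * d_model
    core_b = 4 * act  # fused kernel: read Q, K, V and write the output
    groups = {"Attention": (4 * proj_f + core_f, 4 * proj_b + core_b)}

    # FFN / MLP: gate and up (d_model -> d_ff), then down (d_ff -> d_model).
    up_f, up_b = gemm(tokens, d_model, d_ff, bytes_per_elem)
    down_f, down_b = gemm(tokens, d_ff, d_model, bytes_per_elem)
    groups["FFN / MLP"] = (2 * up_f + down_f, 2 * up_b + down_b)

    # Norm + Residual: two RMSNorms (about 4 FLOPs per element, read + write)
    # and two residual adds (1 FLOP per element, two reads + one write).
    norm_f = 2 * 4 * tokens * d_model + 2 * tokens * d_model
    norm_b = 2 * 2 * act + 2 * 3 * act
    groups["Norm + Residual"] = (norm_f, norm_b)

    ridge = peak_flops / peak_bw
    rows = {}
    for name, (flops, io_bytes) in groups.items():
        ai = flops / io_bytes
        rows[name] = {
            "flops": flops,
            "bytes": io_bytes,
            "ai": ai,
            "bound": "compute" if ai >= ridge else "memory",
            "time_s": max(flops / peak_flops, io_bytes / peak_bw),
        }
    total = sum(r["time_s"] for r in rows.values())
    for r in rows.values():
        r["pct"] = 100 * r["time_s"] / total
    return rows


rows = layer_breakdown()
for name, r in rows.items():
    print(f"{name}\t{r['flops']:.3g}\t{r['bytes']:.3g}\t{r['ai']:.1f}"
          f"\t{r['bound']}\t{r['time_s'] * 1e6:.1f} us\t{r['pct']:.1f}")
```

Under these assumptions the two matmul-heavy groups land well above the ridge point and come out compute-bound, while the normalization and residual group moves far more bytes than it computes and comes out memory-bound. Different model sizes, precisions, or device peaks will shift those results.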

See overall performance estimates