LLM Inference Performance Estimator

Estimate prefill latency, decode throughput, memory usage, and TTFT for LLMs on various GPUs using Roofline analysis.

🧠 Model

🖥️ Device

Vendor Device

🧮 Quantization

Quantization

KV Cache Precision

📊 Device Utilization

Device FLOPS Utilization (%)

Device Memory Utilization (%)

Device Network BW Utilization (%)

⚙️ Runtime Configuration

Prompt Length (prefill): 512

Output Length (decode): 256

Batch Size: 1

Tensor Parallel (GPUs): 1

FlashAttention

IO-aware tiling

Prefill Latency: — (Time to First Token)

Decode Speed: — tokens/sec

Total Time: — (Prefill + Decode)

Model Memory: — (Weights + KV Cache)

📊 Performance Breakdown

📈 Roofline Analysis

💾 Memory Breakdown

Model Weights —

KV Cache (per request) —

Activation Memory (est.) —

Memory Usage —


🔀 Multi-Device Comparison

Compare the current model and settings across all devices (only devices with enough VRAM are shown)

🔄 Multi-GPU Scaling

Tensor Parallel performance scaling with interconnect-aware communication modeling

Config	Decode	Prefill	Speedup	Efficiency	VRAM/GPU

📐 Modeling Methodology & Formulas

Prefill (Compute-Bound): The whole prompt is processed in one pass, so prefill is dominated by matrix multiplies. FLOPs ≈ 2 × Params × SeqLen for the linear layers, plus an attention term that grows as O(SeqLen²). For MoE models, only the active parameters count.

Prefill Time = (Linear FLOPs + Attn FLOPs) / (Effective TFLOPS × 10¹²)
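The prefill formula can be sketched in a few lines of Python. This is a minimal illustration, not the estimator's actual code: the function name, its parameters, and the attention constant (4 × layers × SeqLen² × d_model, counting QKᵀ and scores·V with no causal-mask saving) are assumptions made for the example.

```python
def prefill_time_s(active_params, seq_len, n_layers, d_model,
                   peak_tflops, flops_util, batch=1):
    """Estimate prefill time in seconds from the formula above.

    All names here are hypothetical; flops_util is a fraction in (0, 1].
    """
    # Linear layers: 2 FLOPs (multiply + add) per active parameter per token.
    linear_flops = 2 * active_params * seq_len * batch
    # Attention (assumed constant): QK^T and scores*V each cost
    # 2 * seq_len^2 * d_model FLOPs per layer.
    attn_flops = 4 * n_layers * seq_len ** 2 * d_model * batch
    # Effective TFLOPS = peak TFLOPS scaled by the FLOPS utilization setting.
    effective_tflops = peak_tflops * flops_util
    return (linear_flops + attn_flops) / (effective_tflops * 1e12)


# Illustrative values only: a 7e9-parameter dense model, 512-token prompt,
# 312 peak TFLOPS at 50% utilization.
t = prefill_time_s(7e9, 512, 32, 4096, peak_tflops=312, flops_util=0.5)
print(f"prefill ≈ {t * 1e3:.1f} ms")
```

At a 512-token prompt the quadratic attention term is under 2% of the total in this example; it only becomes significant at much longer sequence lengths.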

Decode (Memory-Bandwidth-Bound): Each generated token reads all weights plus the KV cache from VRAM. Per-token time = max(compute time, memory time); the memory term, given below, usually dominates.

Decode Time/Token = Model Size (bytes) / (Memory BW × TP × BW Utilization)
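A minimal Python sketch of the decode formula follows. The function names and the optional `kv_cache_bytes` argument are assumptions for illustration; the formula above lists only the model size in the numerator, so the KV cache term defaults to zero.

```python
def decode_time_per_token_s(model_bytes, mem_bw_gbs, tp=1, bw_util=1.0,
                            kv_cache_bytes=0):
    """Memory-bound decode time per token, in seconds.

    mem_bw_gbs is per-device memory bandwidth in GB/s; bw_util is a
    fraction in (0, 1]. kv_cache_bytes is a hypothetical extension
    (default 0 reproduces the formula as written).
    """
    effective_bw = mem_bw_gbs * 1e9 * tp * bw_util  # bytes per second
    return (model_bytes + kv_cache_bytes) / effective_bw


def decode_tokens_per_s(model_bytes, mem_bw_gbs, tp=1, bw_util=1.0):
    """Decode speed is the reciprocal of the per-token time."""
    return 1.0 / decode_time_per_token_s(model_bytes, mem_bw_gbs, tp, bw_util)


# Illustrative values only: 14e9 bytes of weights, 2000 GB/s, 80% utilization.
t = decode_time_per_token_s(14e9, 2000, tp=1, bw_util=0.8)
print(f"{t * 1e3:.2f} ms/token, {1 / t:.0f} tokens/sec")
```

Note that this sketch, like the formula, treats tensor parallelism as a pure bandwidth multiplier and ignores interconnect communication time.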

KV Cache (per request): 2 × layers × kv_heads × head_dim × seq_len × bytes per element, where the factor 2 covers keys and values. GQA and MLA significantly reduce the KV cache.
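The KV cache formula translates directly into code. The sketch below uses hypothetical names and an illustrative 32-layer configuration to show how reducing the number of KV heads (as GQA does) shrinks the cache; the `batch` multiplier is an added assumption, since the formula above is per request.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """KV cache size in bytes; the leading 2 covers keys and values."""
    return (2 * n_layers * n_kv_heads * head_dim * seq_len
            * bytes_per_elem * batch)


# Illustrative configuration: 32 layers, head_dim 128, 16-bit cache,
# 512 prompt tokens + 256 generated tokens.
seq_len = 512 + 256
mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, seq_len=seq_len)
gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
print(f"MHA: {mha / 2**20:.0f} MiB, GQA (8 KV heads): {gqa / 2**20:.0f} MiB")
```

With these values the cache drops from 384 MiB to 96 MiB, a 4× reduction that matches the 32-to-8 cut in KV heads.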

Arithmetic Intensity: FLOPs per byte moved, which determines whether a workload is compute-bound or memory-bound. Prefill AI ≈ SeqLen (high), Decode AI ≈ 1 (low).
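The roofline classification can be sketched as follows: a workload is compute-bound when its arithmetic intensity is at or above the device's ridge point (peak FLOPS divided by memory bandwidth), and memory-bound otherwise. The function names and device numbers are assumptions for illustration.

```python
def ridge_point(peak_tflops, mem_bw_gbs):
    """Arithmetic intensity (FLOPs/byte) where the two roofs meet."""
    return (peak_tflops * 1e12) / (mem_bw_gbs * 1e9)


def attainable_tflops(ai, peak_tflops, mem_bw_gbs):
    """Roofline: performance is capped by compute or by bandwidth * AI."""
    return min(peak_tflops, ai * mem_bw_gbs * 1e9 / 1e12)


def classify(ai, peak_tflops, mem_bw_gbs):
    """Label a workload by comparing its AI against the ridge point."""
    if ai >= ridge_point(peak_tflops, mem_bw_gbs):
        return "compute-bound"
    return "memory-bound"


# Hypothetical device: 312 peak TFLOPS, 2000 GB/s memory bandwidth.
peak, bw = 312, 2000
print("ridge point:", ridge_point(peak, bw), "FLOPs/byte")
print("prefill (AI ≈ 512):", classify(512, peak, bw))
print("decode  (AI ≈ 1):  ", classify(1, peak, bw))
```

On this hypothetical device the ridge point is 156 FLOPs/byte, so a 512-token prefill sits on the compute roof while decode at AI ≈ 1 can reach only about 2 TFLOPS of the 312 available.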

FlashAttention: IO-aware tiling keeps the N×N attention scores in on-chip SRAM. Without it, attention incurs O(N²) HBM traffic and runs at lower utilization (~40%).