LLM Inference Performance Estimator
Estimate prefill latency, decode throughput, memory usage, and TTFT for LLMs on various GPUs using Roofline analysis.
🧠 Model
🖥️ Device
🧮 Quantization
📊 Device Utilization
⚙️ Runtime Configuration
📊 Performance Breakdown
📈 Roofline Analysis
💾 Memory Breakdown
🔀 Multi-Device Comparison
Compare current model + settings across all devices (only showing devices with enough VRAM)
🔄 Multi-GPU Scaling
Tensor Parallel performance scaling with interconnect-aware communication modeling
| Config | Decode | Prefill | Speedup | Efficiency | VRAM/GPU |
|---|---|---|---|---|---|
📐 Modeling Methodology & Formulas
Prefill (Compute-Bound): Processing the entire prompt in one pass is compute-intensive. FLOPs ≈ 2 × Params × SeqLen for the linear layers, plus an attention term that grows as O(SeqLen²). For MoE models, only the active parameters count toward Params.
Prefill Time = (Linear FLOPs + Attn FLOPs) / (Effective TFLOPS × 10¹²)
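As a rough illustration of the formula above, the sketch below computes prefill time in Python. The function name `prefill_time_s`, the `utilization` parameter, and the attention cost of 4 × layers × SeqLen² × d_model (no causal-mask savings) are assumptions made for this example, not necessarily what the estimator uses internally.

```python
def prefill_time_s(active_params, seq_len, n_layers, d_model,
                   peak_tflops, utilization=0.5):
    """Estimate prefill latency in seconds for one prompt (hypothetical helper)."""
    # Linear layers: 2 FLOPs (multiply + add) per active parameter per token.
    linear_flops = 2 * active_params * seq_len
    # Attention (assumed form): QK^T and scores @ V each cost about
    # 2 * seq_len^2 * d_model FLOPs per layer.
    attn_flops = 4 * n_layers * seq_len ** 2 * d_model
    # Effective throughput = peak TFLOPS scaled by an assumed utilization factor.
    effective_flops_per_s = peak_tflops * 1e12 * utilization
    return (linear_flops + attn_flops) / effective_flops_per_s


if __name__ == "__main__":
    # Illustrative numbers only: 7e9 active params, 2048-token prompt.
    t = prefill_time_s(7e9, 2048, n_layers=32, d_model=4096,
                       peak_tflops=312, utilization=0.5)
    print(f"prefill time: {t * 1000:.1f} ms")
```

For short prompts the linear term dominates; the quadratic attention term only becomes comparable once SeqLen is large relative to the model width.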
Decode (Memory-Bandwidth-Bound): Each generated token reads all weights plus the KV cache from VRAM. Time per token = max(compute time, memory time); at small batch sizes the memory term dominates.
Decode Time/Token = Model Size (bytes) / (Memory BW × TP × BW Utilization)
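The decode formula can be sketched the same way. The helper names below are hypothetical, the optional `kv_cache_bytes` term is an addition for illustration (the formula above counts weights only), and the sketch ignores both the compute term and any tensor-parallel communication cost.

```python
def decode_time_per_token_s(model_bytes, mem_bw_bytes_per_s, tp=1,
                            bw_utilization=0.8, kv_cache_bytes=0.0):
    """Memory-bound decode time per token, in seconds (hypothetical helper)."""
    # Assumption: data is sharded evenly across TP GPUs, so the aggregate
    # bandwidth scales linearly with tp.
    effective_bw = mem_bw_bytes_per_s * tp * bw_utilization
    return (model_bytes + kv_cache_bytes) / effective_bw


def decode_tokens_per_s(model_bytes, mem_bw_bytes_per_s, **kwargs):
    """Decode throughput as the inverse of the per-token time."""
    return 1.0 / decode_time_per_token_s(model_bytes, mem_bw_bytes_per_s, **kwargs)


if __name__ == "__main__":
    # Illustrative numbers only: 14 GB of weights, 2 TB/s memory bandwidth.
    t = decode_time_per_token_s(14e9, 2e12, tp=1, bw_utilization=0.7)
    print(f"{t * 1000:.2f} ms/token, {1 / t:.0f} tokens/s")
```

Doubling `tp` halves the estimate here, which is an upper bound on real scaling because interconnect time is not modeled in this sketch.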
KV Cache: 2 × layers × kv_heads × head_dim × seq_len × bytes per element, per sequence (the factor 2 covers keys and values). GQA and MLA significantly reduce the KV cache.
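A minimal sketch of the KV cache formula follows. The function name, the `batch` parameter, and the model shape in the example are illustrative assumptions. MLA stores a compressed latent rather than per-head keys and values, so this per-head formula does not cover it.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """KV cache size in bytes (hypothetical helper).

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch


if __name__ == "__main__":
    # Hypothetical 32-layer model, head_dim 128, 4096-token context, FP16 cache.
    mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
    gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
    print(f"MHA: {mha / 2**30:.2f} GiB, GQA (8 KV heads): {gqa / 2**30:.2f} GiB")
```

Going from 32 to 8 KV heads cuts the cache by 4× in this example, which is the GQA saving the formula predicts.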
Arithmetic Intensity: FLOPs per byte moved; determines whether a workload is compute-bound or memory-bound. For 16-bit weights at batch size 1, Prefill AI ≈ SeqLen (high) and Decode AI ≈ 1 (low).
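The roofline classification can be sketched as below. The device figures (312 TFLOPS peak, 2.0 TB/s bandwidth) are illustrative placeholders and the function names are hypothetical. Attainable throughput is the lower of the compute roof and the bandwidth roof at a given arithmetic intensity.

```python
def ridge_point(peak_tflops, mem_bw_tbps):
    """Arithmetic intensity (FLOPs/byte) at which the two roofs meet."""
    return peak_tflops / mem_bw_tbps


def attainable_tflops(ai, peak_tflops, mem_bw_tbps):
    """Roofline: min(compute roof, AI x bandwidth roof), in TFLOPS."""
    # (FLOPs/byte) x (TB/s) gives TFLOP/s, so the units line up directly.
    return min(peak_tflops, ai * mem_bw_tbps)


def bound(ai, peak_tflops, mem_bw_tbps):
    """Classify a workload by comparing its AI to the ridge point."""
    if ai >= ridge_point(peak_tflops, mem_bw_tbps):
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    peak, bw = 312.0, 2.0  # illustrative device numbers, not a specific GPU spec
    for name, ai in [("decode", 1.0), ("prefill, SeqLen=2048", 2048.0)]:
        print(name, bound(ai, peak, bw), attainable_tflops(ai, peak, bw), "TFLOPS")
```

With these placeholder numbers the ridge point is 156 FLOPs/byte, so decode at AI ≈ 1 sits far down the bandwidth slope while a 2048-token prefill reaches the compute roof.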
FlashAttention: IO-aware tiling keeps the N×N score matrix in on-chip SRAM instead of writing it to HBM. Without it, attention incurs O(N²) HBM traffic and lower utilization (~40%).