LLM Perf Model

LLM Inference Performance Estimator

Estimate prefill latency, decode throughput, memory usage, and TTFT for LLMs on various GPUs using Roofline analysis.

🧠 Model

🖥️ Device

🧮 Quantization

📊 Device Utilization

⚙️ Runtime Configuration

Prefill Latency (Time to First Token)
Decode Speed (tokens/sec)
Total Time (Prefill + Decode)
Model Memory (Weights + KV Cache)

📊 Performance Breakdown

📈 Roofline Analysis

💾 Memory Breakdown

Model Weights
KV Cache (per request)
Activation Memory (est.)
Memory Usage

🔀 Multi-Device Comparison

Compare current model + settings across all devices (only showing devices with enough VRAM)

🔄 Multi-GPU Scaling

Tensor Parallel performance scaling with interconnect-aware communication modeling

Config | Decode | Prefill | Speedup | Efficiency | VRAM/GPU

📐 Modeling Methodology & Formulas

Prefill (Compute-Bound): Processing the entire prompt in one pass is compute-intensive. FLOPs ≈ 2 × Params × SeqLen for the linear layers, plus an attention term that grows as O(SeqLen²). For MoE models, only the active parameters count toward Params.

Prefill Time = (Linear FLOPs + Attn FLOPs) / (Effective TFLOPS × 10¹²)
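As a rough illustration, the prefill formula can be sketched in Python. The function name, its parameters, and the input numbers are hypothetical. The attention term (4 × layers × SeqLen² × d_model, non-causal) is one common approximation and an assumption here, since the text above only states that attention scales as O(n²).

```python
def prefill_time_s(active_params, seq_len, n_layers, d_model,
                   peak_tflops, utilization):
    """Estimate prefill time in seconds (hypothetical helper, not the tool's code)."""
    # Linear layers: 2 FLOPs (multiply + add) per active parameter per token.
    linear_flops = 2 * active_params * seq_len
    # Assumed attention cost: QK^T and scores @ V, each about
    # 2 * seq_len^2 * d_model FLOPs per layer (no causal-mask savings).
    attn_flops = 4 * n_layers * seq_len ** 2 * d_model
    # Effective TFLOPS = peak TFLOPS scaled by the assumed utilization.
    effective_flops_per_s = peak_tflops * 1e12 * utilization
    return (linear_flops + attn_flops) / effective_flops_per_s


# Illustrative inputs only: a 7B dense model, 2048-token prompt,
# 300 peak TFLOPS at 50% utilization.
t = prefill_time_s(7e9, 2048, n_layers=32, d_model=4096,
                   peak_tflops=300.0, utilization=0.5)
print(f"estimated prefill time: {t * 1000:.1f} ms")
```

For an MoE model, pass the active parameter count rather than the total parameter count.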

Decode (Memory-Bandwidth-Bound): Each generated token reads all weights + KV cache from VRAM. Per-token time = max(compute time, memory time); at small batch sizes the memory term dominates.

Decode Time/Token = Model Size (bytes) / (Memory BW × TP × BW Utilization)
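A minimal sketch of the decode formula, assuming memory bandwidth is given in GB/s. The function names are hypothetical. The optional `kv_bytes` argument is an assumption added here to match the statement that the KV cache is also read each step; with the default of 0 the function reduces to the formula as written.

```python
def decode_time_per_token_s(model_bytes, mem_bw_gbps, tp=1,
                            bw_utilization=0.8, kv_bytes=0.0):
    """Memory-bound decode time per token in seconds (hypothetical helper)."""
    # Aggregate effective bandwidth across TP ranks, in bytes per second.
    effective_bw = mem_bw_gbps * 1e9 * tp * bw_utilization
    # kv_bytes is an assumed extension; the formula above counts weights only.
    return (model_bytes + kv_bytes) / effective_bw


def decode_tokens_per_s(model_bytes, mem_bw_gbps, tp=1, bw_utilization=0.8):
    """Decode speed as the reciprocal of the per-token time."""
    return 1.0 / decode_time_per_token_s(model_bytes, mem_bw_gbps,
                                         tp, bw_utilization)


# Illustrative inputs only: 14 GB of weights, 2000 GB/s memory bandwidth.
print(f"{decode_tokens_per_s(14e9, 2000.0):.1f} tokens/sec")
```

This ignores tensor-parallel communication cost, so the estimate for TP > 1 is optimistic.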

KV Cache: 2 × layers × kv_heads × head_dim × seq_len × bytes_per_element, where the factor 2 covers keys and values. GQA and MLA significantly reduce the KV cache.
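The KV cache formula translates directly into code. The sketch below uses a hypothetical function name and illustrative model shapes, and it compares full multi-head attention against a GQA configuration with fewer KV heads. MLA is not modeled here, since it compresses the cache in a different way.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-request KV cache size in bytes (hypothetical helper)."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3
# Illustrative shapes: 32 layers, head_dim 128, 4096 tokens, FP16 (2 bytes).
mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(f"MHA: {mha / GIB:.2f} GiB, GQA (8 KV heads): {gqa / GIB:.2f} GiB")
```

The cache scales linearly with the number of KV heads, so cutting 32 heads to 8 shrinks it by 4×.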

Arithmetic Intensity: FLOPs per byte moved to or from memory. It determines whether a workload is compute-bound or memory-bound. Prefill AI ≈ SeqLen (high), Decode AI ≈ 1 (low).
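To illustrate how arithmetic intensity decides the regime, here is a small roofline sketch. The function names and device numbers are hypothetical, and bandwidth is assumed to be in GB/s. A workload is treated as memory-bound when its intensity falls below the ridge point (peak FLOPs/s divided by memory bandwidth).

```python
def ridge_point(peak_tflops, mem_bw_gbps):
    """Arithmetic intensity (FLOPs/byte) where the two roofline limits meet."""
    return (peak_tflops * 1e12) / (mem_bw_gbps * 1e9)


def attainable_tflops(ai, peak_tflops, mem_bw_gbps):
    """Roofline bound: min(compute peak, AI * memory bandwidth), in TFLOPS."""
    bandwidth_bound = ai * mem_bw_gbps * 1e9 / 1e12
    return min(peak_tflops, bandwidth_bound)


def regime(ai, peak_tflops, mem_bw_gbps):
    """Classify a workload by comparing its intensity with the ridge point."""
    if ai < ridge_point(peak_tflops, mem_bw_gbps):
        return "memory-bound"
    return "compute-bound"


# Illustrative device: 300 peak TFLOPS, 2000 GB/s.
for name, ai in [("decode", 1.0), ("prefill, SeqLen=2048", 2048.0)]:
    print(name, regime(ai, 300.0, 2000.0),
          f"{attainable_tflops(ai, 300.0, 2000.0):.1f} TFLOPS attainable")
```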

FlashAttention: IO-aware tiling computes attention in blocks that fit in on-chip SRAM, so the N×N score matrix is never written to HBM. Without it, attention incurs O(N²) HBM traffic and lower utilization (~40%).
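To give a sense of scale for the O(N²) term, the sketch below computes the size of the score matrix that a non-tiled attention implementation would materialize for one layer and one sequence. The function name and shapes are hypothetical. It counts only the size of the matrix, not the total HBM traffic, which is larger because the matrix is written and read back.

```python
def score_matrix_bytes(n_heads, seq_len, bytes_per_elem=2):
    """Bytes needed to hold the N x N attention scores for all heads of one layer."""
    return n_heads * seq_len * seq_len * bytes_per_elem


GIB = 1024 ** 3
# Illustrative shapes: 32 heads, FP16 scores.
for n in (2048, 8192):
    size = score_matrix_bytes(32, n)
    print(f"N={n}: {size / GIB:.2f} GiB per layer")
```

Quadrupling the sequence length multiplies this footprint by 16.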