LLM Perf Model

🤖 AI Bot Sizer

Can your device run an LLM chatbot (like OpenClaw) smoothly? Select your model and device to find out.

💬 Runtime Configuration

🧠 Model

🖥️ Device

🧮 Quantization

📊 Device Utilization

Human reading: ~4-5 tok/s
Barely usable: ~100 tok/s
TTFT (Time to First Token)
Decode (tokens/sec)
Total (full response time)
Memory (— / — GB)
Max Conc. (concurrent conversations)

Memory Breakdown

Weights: — GB
KV Cache: — GB
Activations: — GB
Free: — GB

📊 All Models on This Device

🖥️ All Devices for This Model

📈 Decode Speed Comparison

📐 Methodology

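TTFT (Time to First Token): The prefill time, i.e. the time to process the full input context in a single forward pass. Prefill is compute-bound, so TTFT = FLOPs / (device_TFLOPS × 10^12 × utilization), where FLOPs is the total prefill compute and utilization is the fraction of peak throughput the device actually achieves.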
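A minimal Python sketch of the TTFT formula. It assumes the common approximation that a dense transformer's prefill costs about 2 × parameter_count FLOPs per prompt token and ignores the attention term; the function name and the example numbers are hypothetical, not the tool's own code.

```python
def estimate_ttft_seconds(
    n_params: float,
    prompt_tokens: int,
    device_tflops: float,
    utilization: float,
) -> float:
    """Estimate prefill time (TTFT) for a dense transformer.

    Assumption: prefill FLOPs ~ 2 * n_params * prompt_tokens
    (one multiply-add per weight per token; attention FLOPs ignored).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    prefill_flops = 2 * n_params * prompt_tokens
    # Convert TFLOPS to FLOP/s and derate by the achievable utilization.
    effective_flops_per_s = device_tflops * 1e12 * utilization
    return prefill_flops / effective_flops_per_s


# Hypothetical example: 8B-parameter model, 2,048-token prompt,
# 100 TFLOPS device running at 50% utilization.
ttft = estimate_ttft_seconds(8e9, 2048, 100.0, 0.5)
print(f"TTFT ~ {ttft:.3f} s")
```

Because prefill cost grows linearly with prompt length under this approximation, doubling the context roughly doubles TTFT.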

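Decode Speed (TPS): Tokens per second during generation. Decode is memory-bandwidth-bound: generating each token reads all model weights plus the KV cache from memory, so TPS ≈ memory_bandwidth / (weight_bytes + KV_cache_bytes).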
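A sketch of the bandwidth-bound decode estimate for a single conversation. It assumes decimal units (1 GB = 10^9 bytes) and that every byte of weights and KV cache is streamed exactly once per generated token; the function name and example figures are hypothetical.

```python
def estimate_decode_tps(
    weight_bytes: float,
    kv_cache_bytes: float,
    mem_bandwidth_gb_s: float,
) -> float:
    """Upper-bound decode speed when generation is memory-bandwidth-bound.

    Each generated token streams all weights plus the current KV cache
    from memory once, so tokens/s ~ bandwidth / bytes_read_per_token.
    """
    bytes_per_token = weight_bytes + kv_cache_bytes
    if bytes_per_token <= 0:
        raise ValueError("bytes read per token must be positive")
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical example: 4 GB of quantized weights, 1 GB of KV cache,
# and a device with 500 GB/s of memory bandwidth.
tps = estimate_decode_tps(4e9, 1e9, 500.0)
print(f"Decode ~ {tps:.0f} tok/s")
```

Note that compute speed does not appear at all: under this model, halving the weight size through quantization roughly doubles decode speed, while a faster GPU with the same memory bandwidth does not help.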

Memory: Total = model weights + KV_cache_per_user × concurrent_users + activation memory. Quantization reduces the size of both the weights and the KV cache.
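The memory total can be sketched as below. The per-user KV cache size uses the standard layout (a K and a V tensor per layer, per KV head, per cached token); the `ModelConfig` class, its fields, and the 8B-class configuration are hypothetical illustrations, and activation memory is treated as a fixed input rather than derived.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    """Hypothetical architecture description used for sizing."""
    n_params: float   # total parameter count
    n_layers: int
    n_kv_heads: int   # KV heads (fewer than query heads under grouped-query attention)
    head_dim: int


def weight_bytes(cfg: ModelConfig, weight_bits: int) -> float:
    """Bytes needed to store the weights at the given quantization width."""
    return cfg.n_params * weight_bits / 8


def kv_cache_bytes_per_user(cfg: ModelConfig, context_len: int, kv_bits: int) -> float:
    """K and V tensors for every layer, KV head, and cached token."""
    return 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * context_len * kv_bits / 8


def total_memory_bytes(
    cfg: ModelConfig,
    weight_bits: int,
    kv_bits: int,
    context_len: int,
    concurrent_users: int,
    activation_bytes: float,
) -> float:
    """Weights + KV cache x concurrent users + activations."""
    return (
        weight_bytes(cfg, weight_bits)
        + kv_cache_bytes_per_user(cfg, context_len, kv_bits) * concurrent_users
        + activation_bytes
    )


# Hypothetical 8B-class model with grouped-query attention.
cfg = ModelConfig(n_params=8e9, n_layers=32, n_kv_heads=8, head_dim=128)
w = weight_bytes(cfg, weight_bits=4)                             # 4.0e9 bytes
kv = kv_cache_bytes_per_user(cfg, context_len=8192, kv_bits=16)  # 2**30 bytes
total = total_memory_bytes(cfg, 4, 16, 8192, concurrent_users=4, activation_bytes=0.5e9)
print(f"weights={w / 1e9:.1f} GB, kv/user={kv / 1e9:.2f} GB, total={total / 1e9:.2f} GB")
```

Separate `weight_bits` and `kv_bits` parameters show how quantization shrinks each term independently: dropping `kv_bits` from 16 to 8 halves the KV cache for every concurrent user.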

Concurrent Users: Max conversations = floor((VRAM - weights - activations) / KV_cache_per_user).
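A direct transcription of the concurrency formula. As an assumption not stated on the page, this sketch returns 0 instead of a negative count when the weights and activations alone exceed device memory; the function name and example values are hypothetical.

```python
import math


def max_concurrent_conversations(
    device_mem_bytes: float,
    weight_bytes: float,
    activation_bytes: float,
    kv_cache_per_user_bytes: float,
) -> int:
    """floor((VRAM - weights - activations) / KV_cache_per_user), clamped at 0."""
    if kv_cache_per_user_bytes <= 0:
        raise ValueError("KV cache per user must be positive")
    free_for_kv = device_mem_bytes - weight_bytes - activation_bytes
    if free_for_kv <= 0:
        return 0  # the model itself does not fit; no conversations can be served
    return math.floor(free_for_kv / kv_cache_per_user_bytes)


# Hypothetical example: 24 GB device, 4 GB weights, 0.5 GB activations,
# 1 GB of KV cache per conversation.
print(max_concurrent_conversations(24e9, 4e9, 0.5e9, 1e9))
```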

Note: Estimates are theoretical upper bounds. Real performance is typically 60-80% of estimates due to framework overhead.
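The 60-80% note can be applied as a simple derating step. This sketch assumes the "Total / Full response" figure is TTFT plus output_tokens / TPS (the page does not spell that formula out) and applies the note's range to throughput, so times stretch by the inverse factor; all names and numbers are hypothetical.

```python
def total_response_seconds(ttft_s: float, output_tokens: int, decode_tps: float) -> float:
    """Assumed end-to-end time: prefill, then output_tokens at the decode rate."""
    return ttft_s + output_tokens / decode_tps


def realistic_range(
    ttft_s: float,
    output_tokens: int,
    decode_tps: float,
    low: float = 0.6,
    high: float = 0.8,
) -> dict[str, tuple[float, float]]:
    """Derate theoretical throughput to 60-80%; times grow by 1/factor."""
    ideal_total = total_response_seconds(ttft_s, output_tokens, decode_tps)
    return {
        "decode_tps": (decode_tps * low, decode_tps * high),
        # Lower efficiency means longer waits, so divide times by the factor.
        "total_s": (ideal_total / high, ideal_total / low),
    }


# Hypothetical example: 0.5 s TTFT, 300 output tokens, 100 tok/s theoretical decode.
r = realistic_range(0.5, 300, 100.0)
print(f"decode: {r['decode_tps'][0]:.0f}-{r['decode_tps'][1]:.0f} tok/s, "
      f"total: {r['total_s'][0]:.1f}-{r['total_s'][1]:.1f} s")
```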