🤖 AI Bot Sizer
Can your device run an LLM chatbot (like OpenClaw) smoothly? Select your model and device to find out.
💬 Runtime Configuration
🧠 Model
🖥️ Device
🧮 Quantization
📊 Device Utilization
Human reading: ~4-5 tok/s
Barely usable: ~100 tok/s
TTFT: Time to First Token
Decode: tokens/sec
Total: time for the full response
Memory: used / total (GB)
Max Conc.: maximum concurrent conversations
Memory Breakdown: Weights, KV Cache, Activations, Free (GB)
📊 All Models on This Device
🖥️ All Devices for This Model
📈 Decode Speed Comparison
📐 Methodology
TTFT (Time To First Token): Prefill time, i.e. the time to process the full prompt context in one forward pass. Compute-bound. TTFT = prefill_FLOPs / (device_TFLOPS × 10^12 × utilization).
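A minimal Python sketch of the TTFT formula, assuming a dense transformer where prefill costs roughly 2 FLOPs per parameter per prompt token (a common approximation that ignores the attention-score term). The function name, parameter count, prompt length, TFLOPS figure, and utilization below are hypothetical inputs, not values taken from the tool.

```python
def ttft_seconds(n_params: float, prompt_tokens: int,
                 device_tflops: float, utilization: float) -> float:
    """Estimate time to first token (prefill time) in seconds.

    Assumption: prefill FLOPs ~= 2 * n_params * prompt_tokens, i.e. one
    multiply-add per weight per prompt token; attention scores are ignored.
    """
    prefill_flops = 2.0 * n_params * prompt_tokens
    # device_tflops is in TFLOP/s, so scale by 1e12 to match FLOPs.
    effective_flops_per_s = device_tflops * 1e12 * utilization
    return prefill_flops / effective_flops_per_s


# Hypothetical inputs: 8B parameters, 4,096-token prompt,
# a 100 TFLOPS device running at 50% utilization.
print(f"TTFT ~ {ttft_seconds(8e9, 4096, 100.0, 0.5):.2f} s")  # → TTFT ~ 1.31 s
```

Because prefill is compute-bound in this model, TTFT grows linearly with prompt length and shrinks linearly with effective TFLOPS.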
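Decode Speed (TPS): Tokens per second during generation. Memory-bandwidth-bound: generating each token requires reading all model weights plus the KV cache from memory, so TPS ≈ memory_bandwidth / (weight_bytes + KV_cache_bytes).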
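The bandwidth-bound decode estimate can be sketched the same way. This assumes every generated token streams all weight bytes plus the current KV cache once from memory, and ignores compute limits and on-chip caching; the function name, the 4-bit weights (0.5 bytes per parameter), the 1 GB KV cache, and the 400 GB/s bandwidth are hypothetical example values.

```python
def decode_tps(n_params: float, bytes_per_param: float,
               kv_cache_bytes: float, bandwidth_gb_s: float) -> float:
    """Upper-bound decode speed (tokens/sec) for a bandwidth-bound model.

    Assumption: each generated token reads every weight byte plus the
    current KV cache exactly once from device memory.
    """
    weight_bytes = n_params * bytes_per_param
    bytes_per_token = weight_bytes + kv_cache_bytes
    # bandwidth_gb_s uses decimal GB/s (1 GB = 1e9 bytes).
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical inputs: 8B parameters at 4 bits (0.5 bytes each),
# a 1 GB KV cache, and 400 GB/s of memory bandwidth.
print(f"{decode_tps(8e9, 0.5, 1e9, 400.0):.1f} tok/s")  # → 80.0 tok/s
```

Halving the bytes per parameter (e.g. 8-bit to 4-bit) roughly doubles this bound when the weights dominate the bytes read per token.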
Memory: model weights + KV cache per user × concurrent_users + activation memory. Quantization reduces the size of both the weights and the KV cache.
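A sketch of the memory total, assuming the common per-user KV cache layout in which keys and values are stored for every layer, KV head, and context token (2 × layers × KV heads × head dimension × tokens × bytes per element). The helper names, model shape, context length, user count, and activation size below are hypothetical; a quantized KV cache corresponds to a smaller bytes_per_elem.

```python
def kv_cache_bytes_per_user(n_layers: int, n_kv_heads: int, head_dim: int,
                            context_tokens: int, bytes_per_elem: float) -> float:
    """KV cache for one conversation: keys + values (the factor of 2),
    stored for every layer, KV head, and context token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem


def total_memory_bytes(weight_bytes: float, kv_per_user: float,
                       concurrent_users: int, activation_bytes: float) -> float:
    """Weights + KV cache per user x concurrent users + activation memory."""
    return weight_bytes + kv_per_user * concurrent_users + activation_bytes


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128,
# an 8,192-token context, and an FP16 (2-byte) KV cache.
kv = kv_cache_bytes_per_user(32, 8, 128, 8192, 2)
# 8B parameters at 0.5 bytes each, 4 users, 0.5 GB of activations.
total = total_memory_bytes(8e9 * 0.5, kv, 4, 0.5e9)
print(f"KV cache per user: {kv / 1e9:.2f} GB")  # → KV cache per user: 1.07 GB
print(f"Total: {total / 1e9:.2f} GB")           # → Total: 8.79 GB
```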
Concurrent Users: Max conversations = floor((VRAM - weights - activations) / KV_cache_per_user).
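The concurrency formula translates directly into code. The only addition in this sketch is clamping to zero when weights and activations already exceed device memory; the function name, the 24 GB device, and the per-user KV cache of 2^30 bytes (about 1.07 GB) are hypothetical inputs.

```python
import math


def max_concurrent(device_mem_bytes: float, weight_bytes: float,
                   activation_bytes: float, kv_per_user_bytes: float) -> int:
    """floor((device memory - weights - activations) / KV cache per user)."""
    if kv_per_user_bytes <= 0:
        raise ValueError("KV cache per user must be positive")
    # Clamp at zero: if weights + activations don't fit, no conversation does.
    free_for_kv = max(0.0, device_mem_bytes - weight_bytes - activation_bytes)
    return math.floor(free_for_kv / kv_per_user_bytes)


# Hypothetical inputs: 24 GB device, 4 GB weights, 0.5 GB activations,
# 2**30 bytes (~1.07 GB) of KV cache per conversation.
print(max_concurrent(24e9, 4e9, 0.5e9, 2**30))  # → 18
```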
Note: Estimates are theoretical upper bounds. Real performance is typically 60-80% of estimates due to framework overhead.
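One way to apply the 60-80% derating, assuming it describes effective throughput: rates such as tokens/sec are multiplied by 0.6-0.8, while latencies such as TTFT are divided by the same factors, since lower effective throughput means longer waits. The helper names and the example estimates (80 tok/s, 1.31 s) are hypothetical.

```python
def realistic_rate(estimate: float, low: float = 0.6,
                   high: float = 0.8) -> tuple[float, float]:
    """Derate a throughput estimate (e.g. tokens/sec) to 60-80% of theory."""
    return estimate * low, estimate * high


def realistic_time(estimate_s: float, low: float = 0.6,
                   high: float = 0.8) -> tuple[float, float]:
    """Stretch a latency estimate (e.g. TTFT): lower throughput means longer waits."""
    return estimate_s / high, estimate_s / low


lo, hi = realistic_rate(80.0)
print(f"Decode: {lo:.1f}-{hi:.1f} tok/s")  # → Decode: 48.0-64.0 tok/s
t_fast, t_slow = realistic_time(1.31072)
print(f"TTFT: {t_fast:.2f}-{t_slow:.2f} s")  # → TTFT: 1.64-2.18 s
```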