🤖 AI Bot Sizer

Can your device run an LLM chatbot (like OpenClaw) smoothly? Select your model and device to find out.

💬 Runtime Configuration

Context Length: 20480

Response Length: 256

Concurrent Users: 1

🧠 Model

🖥️ Device

Vendor Device

🧮 Quantization

Quantization

KV Cache Precision

📊 Device Utilization

Device FLOPS Utilization (%)

Device Memory Utilization (%)

Device Network BW Utilization (%)

FlashAttention

Human reading: ~4-5 tok/s

Barely usable: ~100 tok/s

Results (shown once a model and device are selected): TTFT (Time to First Token), Decode (tokens/sec), Total (full response), Memory (GB), Max Conc. (conversations).

Memory Breakdown: Weights (GB), KV Cache (GB), Activations (GB), Free (GB).

📊 All Models on This Device

🖥️ All Devices for This Model

📈 Decode Speed Comparison

📐 Methodology

TTFT (Time To First Token): Prefill time, i.e. the time to process the full context in one forward pass. Prefill is compute-bound, so TTFT = prefill_FLOPs / (device_TFLOPS × 10^12 × FLOPS_utilization), where utilization is a fraction between 0 and 1.
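As a rough sketch of the TTFT formula, the snippet below assumes prefill FLOPs can be approximated as 2 × parameter count × context tokens. That is a common rule of thumb for dense transformers that ignores attention cost; the page does not state how it counts FLOPs. The function name, parameter names, and example figures (an 8B-parameter model, a 100 TFLOPS device, 50% utilization) are illustrative assumptions, not values taken from the tool.

```python
def estimate_ttft_seconds(n_params, context_tokens, device_tflops, flops_utilization):
    """Estimate prefill time (TTFT) in seconds for a compute-bound prefill."""
    # Assumption: ~2 FLOPs per parameter per token (attention cost ignored).
    prefill_flops = 2.0 * n_params * context_tokens
    # Convert TFLOPS to FLOP/s and derate by utilization (a fraction in 0-1).
    effective_flops_per_s = device_tflops * 1e12 * flops_utilization
    return prefill_flops / effective_flops_per_s


# Hypothetical inputs: 8B parameters, 20480-token context, 100 TFLOPS, 50% utilization.
ttft = estimate_ttft_seconds(8e9, 20480, 100.0, 0.5)
print(f"TTFT ~ {ttft:.2f} s")  # prints "TTFT ~ 6.55 s"
```

Because the estimate scales linearly with context length, halving the context halves the predicted TTFT under this approximation.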

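Decode Speed (TPS): Tokens per second during generation. Decode is memory-bandwidth-bound: generating each token reads all weights plus the KV cache, so TPS is roughly the effective memory bandwidth divided by the bytes read per token.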
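The bandwidth-bound reasoning can be sketched as a one-line division. The snippet below is a minimal illustration, assuming decode speed equals effective memory bandwidth divided by the bytes read per token. The function name, the `bw_utilization` parameter, and the example figures (4 GB of weights, 1 GB of KV cache, 500 GB/s of bandwidth, 80% utilization) are hypothetical, not the tool's actual code or data.

```python
def estimate_decode_tps(weight_bytes, kv_cache_bytes, mem_bandwidth_gb_s, bw_utilization):
    """Estimate decode tokens/sec for a memory-bandwidth-bound decoder."""
    # Each generated token reads all weights plus the KV cache once.
    bytes_per_token = weight_bytes + kv_cache_bytes
    # Convert GB/s to bytes/s and derate by utilization (a fraction in 0-1).
    effective_bytes_per_s = mem_bandwidth_gb_s * 1e9 * bw_utilization
    return effective_bytes_per_s / bytes_per_token


# Hypothetical inputs: 4 GB weights, 1 GB KV cache, 500 GB/s bandwidth, 80% utilization.
tps = estimate_decode_tps(4e9, 1e9, 500.0, 0.8)
print(f"Decode ~ {tps:.1f} tok/s")  # prints "Decode ~ 80.0 tok/s"
```

Under this model, anything that shrinks the bytes read per token, such as lower-bit weights or a lower-precision KV cache, raises the estimated decode speed proportionally.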

Memory: Total = model weights + KV_cache_per_user × concurrent_users + activation memory. Weight quantization shrinks the weights, and a lower KV cache precision shrinks the KV cache.
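To make the memory terms concrete, here is a minimal sketch under stated assumptions. Weights are taken as parameter count × bits / 8, and the per-user KV cache uses the standard transformer layout of one key and one value vector per layer, per KV head, per token. The page does not spell out its own KV cache formula, so treat this as one plausible reading. All function names and the example model shape (32 layers, 8 KV heads, head dimension 128) are hypothetical.

```python
def weights_bytes(n_params, weight_bits):
    """Bytes needed for model weights at a given quantization width."""
    return n_params * weight_bits / 8


def kv_cache_bytes_per_user(n_layers, n_kv_heads, head_dim, tokens, kv_bits):
    """Bytes of KV cache for one conversation (assumed standard layout)."""
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * tokens * kv_bits / 8


def total_memory_bytes(weights, kv_per_user, concurrent_users, activations):
    """Total = weights + KV cache x concurrent users + activations."""
    return weights + kv_per_user * concurrent_users + activations


# Hypothetical 8B model at 4-bit weights, 16-bit KV cache, 20480-token context.
w = weights_bytes(8e9, 4)                              # 4.0e9 bytes
kv = kv_cache_bytes_per_user(32, 8, 128, 20480, 16)    # 2,684,354,560 bytes
total = total_memory_bytes(w, kv, 1, 0.5e9)            # 0.5 GB activations assumed
print(f"Weights {w / 1e9:.2f} GB, KV {kv / 1e9:.2f} GB, total {total / 1e9:.2f} GB")
```

Dropping the KV cache from 16-bit to 8-bit in this sketch halves `kv` while leaving `w` unchanged, which is why the two precision settings are worth tuning separately.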

Concurrent Users: Max conversations = floor((VRAM - weights - activations) / KV_cache_per_user).
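The concurrency formula translates directly into code. The sketch below follows the floor expression above; the function name and the example figures (24 GB of VRAM, 4 GB of weights, 0.5 GB of activations, 2.5 GB of KV cache per user) are illustrative assumptions. Returning 0 when the weights and activations alone exceed VRAM is an added guard, not something the page states.

```python
import math


def max_concurrent_users(vram_bytes, weight_bytes, activation_bytes, kv_per_user_bytes):
    """Max conversations = floor((VRAM - weights - activations) / KV_cache_per_user)."""
    free_for_kv = vram_bytes - weight_bytes - activation_bytes
    # Added guard (assumption): the model does not fit, so no conversation can run.
    if free_for_kv <= 0:
        return 0
    return math.floor(free_for_kv / kv_per_user_bytes)


# Hypothetical inputs: 24 GB VRAM, 4 GB weights, 0.5 GB activations, 2.5 GB KV per user.
print(max_concurrent_users(24e9, 4e9, 0.5e9, 2.5e9))  # prints 7
```

Here 19.5 GB remains for KV cache, which fits seven full 2.5 GB conversations with the remainder left unused.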

Note: Estimates are theoretical upper bounds. Real performance is typically 60-80% of estimates due to framework overhead.