
The Fastest AI Models in 2025: Tokens Per Second Benchmarked

Speed matters for interactive AI applications. We benchmarked tokens per second and first-token latency across 15+ models to rank the fastest LLMs and explain when to choose speed over quality.

Travis Johnson

Founder, Deepest

August 12, 2025 · 9 min read

Response speed varies by an order of magnitude across AI models. Gemini 2.0 Flash and Groq-hosted Llama models are the fastest at over 200 tokens per second; frontier models like GPT-4o and Claude average 100–120 tokens per second; reasoning models can be 10x slower due to extended computation.

Why Speed Matters

For interactive chat, speed affects user experience directly — a model that responds in 2 seconds feels snappier than one that takes 8 seconds. For production applications, tokens per second (TPS) determines throughput, which affects cost and scalability. For real-time streaming applications, first-token latency (how long until the first word appears) matters more than overall throughput.
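To make the "2 seconds vs 8 seconds" comparison concrete: perceived response time for a streamed answer is roughly first-token latency plus generation time (response length divided by throughput). Here is a minimal sketch of that arithmetic; the specific latency and TPS figures passed in are illustrative, in the range of the benchmarks below.

```python
def response_time_s(first_token_latency_ms: float, tokens: int, tps: float) -> float:
    """Total wall-clock time for a streamed response:
    time to first token plus generation time for the remaining tokens."""
    return first_token_latency_ms / 1000 + tokens / tps

# A 400-token answer on a fast model vs. a slower frontier model
fast = response_time_s(300, 400, 225)   # Flash-class: ~300ms TTFT, ~225 TPS
slow = response_time_s(1150, 400, 50)   # Opus-class: ~1,150ms TTFT, ~50 TPS
print(f"fast: {fast:.1f}s, slow: {slow:.1f}s")
```

The same response length lands around 2 seconds on the fast model and over 9 seconds on the slow one, which is the gap users actually feel.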

Speed Benchmarks: Tokens Per Second

These measurements reflect typical performance under normal load conditions, accessed via standard API. Performance varies by time of day, request complexity, and infrastructure load.

Model                          Tokens/Second (typical)  First Token Latency  Best For
Groq — Llama 3.3 70B           280–320 TPS              80–120ms             Real-time applications
Gemini 2.0 Flash               200–250 TPS              200–400ms            Fast interactive chat
Gemini 2.0 Flash Lite          230–270 TPS              150–300ms            High-volume applications
GPT-4o mini                    140–170 TPS              300–500ms            Cost-efficient interactive
Claude 3.5 Haiku               130–160 TPS              250–450ms            Fast Claude queries
GPT-4o                         100–130 TPS              400–700ms            Standard interactive use
Claude 3.5 Sonnet              90–120 TPS               450–800ms            Quality-focused tasks
Gemini 2.0 Pro                 80–110 TPS               500–900ms            Long-context tasks
Mistral Large 2                70–100 TPS               500–800ms            General use
Claude 3 Opus                  40–60 TPS                800–1,500ms          High-quality, non-time-sensitive
o3 (reasoning)                 15–30 TPS                2,000–10,000ms       Hard math/logic problems
DeepSeek R1 (reasoning)        20–35 TPS                1,500–8,000ms        Complex reasoning

Groq's Unique Approach: Groq uses custom LPU (Language Processing Unit) hardware specifically optimized for LLM inference. The result is dramatically higher tokens per second than GPU-based inference. The tradeoff is a smaller model selection — Groq primarily hosts open-weight models like Llama.

Speed vs Quality Tradeoffs

Faster models are almost always less capable. Gemini 2.0 Flash is significantly faster than Gemini 2.0 Pro but less accurate on complex tasks, and GPT-4o mini is faster than GPT-4o but scores lower on hard benchmarks.

The optimal choice depends on your task difficulty distribution:

  • Simple tasks (Q&A, summarization, formatting): Fast, cheaper models like GPT-4o mini or Gemini 2.0 Flash handle these well
  • Medium tasks (analysis, writing, code): GPT-4o and Claude 3.5 Sonnet offer the best quality-speed balance
  • Hard tasks (complex reasoning, architecture, research): Slower frontier models are worth the wait

First Token Latency vs Throughput

These are different metrics and matter for different use cases:

First token latency is how long before the first word appears. This determines how responsive the experience feels. For chat interfaces, first token latency under 500ms feels fast; above 1 second feels slow.

Throughput (tokens per second) is how fast the rest of the response generates. For long responses, throughput matters more than first token latency once the response has started.

Some models optimize for low first-token latency (streaming feels instant but may slow midway through). Others optimize for high throughput (starts slower but finishes fast). Gemini Flash excels at both.
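Both metrics can be measured from any streamed response with two timestamps: one when the first token arrives, one when the stream ends. The sketch below works against any iterator of tokens; the `fake_stream` generator stands in for a real provider's streaming API, with illustrative delays.

```python
import time

def measure_stream(token_iter):
    """Return (first-token latency in seconds, steady-state tokens/second)
    for any iterable that yields tokens as they are generated."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_iter:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
        count += 1
    end = time.perf_counter()
    ttft = first_token_at - start
    gen_time = end - first_token_at
    # Throughput over the generation phase, excluding the first token's wait
    tps = (count - 1) / gen_time if count > 1 and gen_time > 0 else 0.0
    return ttft, tps

# Demo with a simulated stream: ~50ms to first token, then ~10ms per token
def fake_stream(n=20, ttft=0.05, per_token=0.01):
    time.sleep(ttft)
    yield "tok"
    for _ in range(n - 1):
        time.sleep(per_token)
        yield "tok"

ttft, tps = measure_stream(fake_stream())
print(f"TTFT: {ttft * 1000:.0f}ms, throughput: {tps:.0f} TPS")
```

Swapping `fake_stream()` for a real streaming API call gives you the two numbers in the table above for your own workload and time of day.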

Speed in Production Applications

For developers building AI-powered applications, speed has direct cost and UX implications:

  • A chatbot handling 1,000 concurrent users needs throughput that scales
  • Streaming responses (showing text as it generates) requires good first-token latency
  • Batch processing jobs care more about throughput than latency
  • User-facing features need under 1-second first response to avoid noticeable loading

When to Accept Slower Speed

Speed matters less when:

  • The task is complex enough that accuracy gains from a better model outweigh the wait
  • The output is being processed asynchronously (user doesn't wait)
  • You're using reasoning models that need extended computation time
  • The task involves very long contexts where processing time is inherent

Speed Optimization Strategies

  • Choose the smallest capable model: Don't use GPT-4o for tasks that GPT-4o mini handles well
  • Use streaming: Stream API responses so users see output immediately rather than waiting for completion
  • Reduce context: Remove unnecessary context from prompts; shorter inputs are processed faster
  • Consider Groq: For open models where speed is critical, Groq's LPU inference is 3–5x faster than GPU inference
  • Caching: Cache responses to common queries to eliminate model call latency entirely

Frequently Asked Questions

Which is the fastest AI model available?

Groq-hosted Llama 3.3 70B is the fastest widely available model at 280–320 tokens per second. Among major commercial API models, Gemini 2.0 Flash reaches 200–250 TPS. GPT-4o mini and Claude Haiku are the fastest options from their respective providers.

Why are reasoning models so slow?

Reasoning models like o3 and DeepSeek R1 generate extended "thinking" tokens before producing their final answer. These thinking tokens aren't always shown to the user, but they require computation time. A complex math problem might require the model to generate 5,000–20,000 thinking tokens before producing its response.

Does context length affect speed?

Yes, significantly. Processing a 100K-token context requires much more computation than processing a 1K-token context. First token latency scales roughly linearly with context length. If speed matters, keep contexts as short as possible.
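The roughly linear relationship can be sketched as a back-of-envelope estimate: fixed overhead plus a per-token prefill cost. The `base_latency_ms` and `prefill_tps` figures below are illustrative assumptions, not measured provider numbers.

```python
def estimated_ttft_ms(context_tokens: int, base_latency_ms: float = 200,
                      prefill_tps: float = 5000) -> float:
    """Rough first-token latency estimate: fixed overhead plus
    linear prefill cost for the input context (illustrative constants)."""
    return base_latency_ms + context_tokens / prefill_tps * 1000

print(round(estimated_ttft_ms(1_000)))    # 1K-token context  → 400
print(round(estimated_ttft_ms(100_000)))  # 100K-token context → 20200
```

Under these assumptions a 100K-token context pushes first-token latency past 20 seconds, which is why trimming context is the single most effective latency fix.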

Is speed consistent across the day?

No. AI API performance varies with load. Peak hours (9 AM–5 PM Pacific for US providers) typically show higher latency. Some providers offer dedicated compute tiers that provide more consistent performance at higher cost.

AI speed, tokens per second, latency, Gemini Flash, GPT-4o mini

See it for yourself

Run any prompt across ChatGPT, Claude, Gemini, and 300+ other models simultaneously. Free to try, no credit card required.

Try Deepest free →
