
AI Reasoning Models Compared: o3, Gemini Thinking, and Claude Extended Thinking

Reasoning models think before they answer — and the quality difference on complex tasks is substantial. We compared o3, Gemini 2.0 Thinking, and Claude Extended Thinking on math, logic, and multi-step problems.

Travis Johnson

Founder, Deepest

August 20, 2025 · 13 min read

Reasoning models — AI systems that "think before they answer" — represent a distinct class of models. OpenAI's o3, Google's Gemini 2.0 Thinking, and Anthropic's Claude with Extended Thinking all dramatically outperform standard models on hard math, logic, and multi-step problems, but at 3–10x the cost and much slower response times.

What Is a Reasoning Model?

Standard AI models generate responses token by token without an explicit reasoning phase. Reasoning models generate a chain-of-thought "scratchpad" before producing their final answer — essentially showing their work. This extended computation allows them to catch errors, backtrack from wrong paths, and verify conclusions before stating them.

The thinking tokens aren't always visible to users (o3 hides them by default; Claude Extended Thinking shows a summary), but they happen regardless. A reasoning model solving a complex math problem might generate 5,000–20,000 thinking tokens before producing a 200-token answer.
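To see why thinking tokens dominate the bill, here is a small sketch of the arithmetic. The per-token price is hypothetical (real rates vary by model and provider), but the structure holds: on most APIs, thinking tokens are billed as output tokens even when they're hidden.

```python
# Hypothetical price; real rates vary by model and provider.
PRICE_PER_OUTPUT_TOKEN = 10.00 / 1_000_000  # $10 per million output tokens

def response_cost(thinking_tokens: int, answer_tokens: int) -> float:
    """Thinking tokens are typically billed as output tokens."""
    return (thinking_tokens + answer_tokens) * PRICE_PER_OUTPUT_TOKEN

visible = response_cost(0, 200)       # the 200-token answer the user sees
actual = response_cost(10_000, 200)   # what actually gets billed

print(f"answer only:   ${visible:.4f}")  # $0.0020
print(f"with thinking: ${actual:.4f}")   # $0.1020, roughly 50x more
```

At these (assumed) prices, a 10,000-token reasoning chain makes a 200-token answer roughly fifty times more expensive than its visible length suggests.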

Benchmark Comparison: Standard vs Reasoning Models

Model                      | MATH Score | GPQA Science | ARC-AGI | Cost Premium vs GPT-4o
o3 (OpenAI)                | 97.1%      | 87.7%        | 87.5%   | ~8x
DeepSeek R1                | 97.3%      | 71.5%        | n/a     | ~0.5x (cheaper than GPT-4o)
Gemini 2.0 Thinking        | 88.9%      | 75.1%        | 63.0%   | ~3x
Claude (Extended Thinking) | 78.2%      | 68.4%        | n/a     | ~4x
GPT-4o (standard)          | 76.6%      | 53.6%        | 5.0%    | 1x (baseline)
o4-mini (OpenAI)           | 96.4%      | 83.2%        | n/a     | ~2x

Key Finding: On MATH, the best reasoning models (o3, DeepSeek R1) score 97%+ versus GPT-4o's 76.6% — a 20-point gap. On ARC-AGI (an abstract reasoning benchmark), o3 scores 87.5% versus GPT-4o's 5% — a staggering difference that illustrates why reasoning models are genuinely a different category.

OpenAI o3: The Current Leader on Hard Reasoning

o3 is OpenAI's most capable reasoning model and the strongest performer on the hardest benchmarks. Its 87.5% on ARC-AGI — a benchmark designed to test "general intelligence" by requiring novel reasoning rather than pattern matching — is remarkable. GPT-4o scores 5% on the same test.

o3 is expensive: roughly 8x the cost of GPT-4o. A task that costs $0.50 with GPT-4o can cost $4.00 with o3. For most tasks, this premium isn't justified. For genuinely hard reasoning problems where accuracy is critical — mathematical proofs, complex code architecture, scientific analysis — it often is.

DeepSeek R1: Reasoning at Competitive Prices

DeepSeek R1 matches o3 on MATH (97.3% vs 97.1%) but at dramatically lower cost — actually cheaper than GPT-4o. This is the most remarkable pricing development in AI reasoning: frontier-level mathematical reasoning at commodity prices.

DeepSeek R1 has limitations: its chain-of-thought often drifts into Chinese, which can be disorienting for English-speaking users. Its GPQA science performance (71.5%) trails o3's (87.7%). And the same data sovereignty considerations that apply to DeepSeek V3 apply here.

Gemini 2.0 Flash Thinking: Google's Multimodal Reasoning

Gemini 2.0 Flash Thinking is Google's reasoning model entry, and it uniquely brings multimodal reasoning to this category — it can reason over images and diagrams, not just text. For scientific problems involving visual data or for analyzing charts and technical diagrams, Gemini Thinking has an advantage over text-only reasoning models.

Its benchmark scores are lower than o3 across the board, but it's available at a lower cost premium (~3x GPT-4o vs o3's ~8x). For mixed text/visual reasoning tasks, it's the best current option.

Claude Extended Thinking: Reasoning Transparency

Anthropic's approach to reasoning is distinctive: Claude with Extended Thinking shows users a summary of its reasoning process before the final answer. This transparency is valuable for auditing AI reasoning — you can see not just the conclusion but the logic that led there.

On benchmarks, Claude Extended Thinking is behind o3 and DeepSeek R1 on math, but performs well on nuanced reasoning tasks involving ambiguous or complex language. For legal reasoning, policy analysis, and complex argument evaluation, Claude's transparent reasoning process can be more useful than raw benchmark performance suggests.

When Reasoning Models Are Worth the Cost

Reasoning models are overkill for most everyday AI tasks. They're worth the cost when:

  • Mathematical proofs or complex calculations: Standard models make errors at ~4-5 steps; reasoning models can handle 10+ step derivations
  • Debugging complex code: When a bug requires tracing logic across many interdependent functions
  • Scientific analysis: When interpreting experimental data requires integrating multiple principles
  • Legal analysis: When statutes and precedents must be applied to novel situations
  • Hard logic puzzles: Any task where you need to reason from first principles rather than pattern match
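One way to operationalize this split is a simple router that escalates only the task types above to a reasoning model. A minimal sketch — the task categories and model names here are illustrative, not an official API:

```python
# Task types worth the reasoning-model premium (illustrative labels).
REASONING_TASKS = {
    "math_proof",
    "complex_debugging",
    "scientific_analysis",
    "legal_analysis",
    "logic_puzzle",
}

def pick_model(task_type: str) -> str:
    """Escalate to a reasoning model only for multi-step inference tasks;
    everything else goes to the cheaper standard model."""
    return "o3" if task_type in REASONING_TASKS else "gpt-4o"

print(pick_model("math_proof"))     # o3
print(pick_model("summarization"))  # gpt-4o
```

Even a crude router like this captures most of the value: the expensive model handles the handful of tasks where accuracy gains justify an 8x premium, and routine work stays at baseline cost.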

When to Stick with Standard Models

For the following tasks, standard models match reasoning models at a fraction of the cost:

  • Writing, summarization, and editing
  • Standard coding tasks (most software development)
  • General question answering
  • Data extraction and formatting
  • Research synthesis from provided documents

Prompting Reasoning Models Effectively

Reasoning models respond differently to prompts than standard models:

  • State the problem precisely: Vague questions waste expensive thinking tokens on scoping instead of solving
  • Provide relevant constraints upfront: The model should have all constraints before it starts reasoning, not discover them in follow-ups
  • Skip verification requests: "Check your answer" is redundant for reasoning models, which self-verify during their thinking phase
  • Trust the thinking: Don't interrupt or redirect the chain-of-thought mid-response if the model is working through visible reasoning
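The first two tips above can be captured in a small prompt-assembly sketch: put every constraint before the problem statement so the model reasons over them from the start. The helper function and its format are our own convention, not any provider's API:

```python
def build_reasoning_prompt(problem: str, constraints: list[str]) -> str:
    """Front-load all constraints so the model has them
    before its reasoning phase begins."""
    lines = ["Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines += ["", "Problem:", problem]
    return "\n".join(lines)

prompt = build_reasoning_prompt(
    "Schedule these 6 jobs to minimize total lateness.",
    [
        "Jobs 2 and 5 cannot run concurrently",
        "Job 1 must finish by t=10",
    ],
)
print(prompt)
```

Discovering a constraint in a follow-up message forces the model to re-reason from scratch, paying for a second round of thinking tokens; stating everything upfront spends the budget once.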

Frequently Asked Questions

Are reasoning models always better?

No — for straightforward tasks, reasoning models produce similar quality to standard models while costing more and taking longer. They excel specifically on tasks that require multi-step logical inference. For everyday writing and Q&A, use a standard model.

Why is o4-mini competitive with o3?

o4-mini is a smaller reasoning model optimized for cost-efficiency. Despite being "mini," it achieves 96.4% on MATH — within 0.7% of o3. For math-heavy tasks where you want reasoning model quality at lower cost, o4-mini is an excellent option.

Does extended thinking make every answer better?

Not necessarily. For simple questions, extended thinking is unnecessary overhead. For hard reasoning problems, it substantially improves accuracy. For tasks in between, the improvement is marginal — and may not justify the cost premium.

Can I see the reasoning chain?

DeepSeek R1 and Claude Extended Thinking expose the reasoning chain. OpenAI's o-series models do not show thinking tokens to users. Gemini Thinking shows a summary of the reasoning process.


See it for yourself

Run any prompt across ChatGPT, Claude, Gemini, and 300+ other models simultaneously. Free to try, no credit card required.

Try Deepest free →
