DeepSeek just dropped V4 Pro and V4 Flash — a 1.6-trillion-parameter flagship and its ultra-cheap sibling — and the benchmarks are genuinely startling. V4 Pro beats Claude Opus 4.6 and GPT-5.4 on coding tasks, matches them on reasoning, and costs a fraction of what either charges. Both models launched today with open weights and a million-token context window.
What DeepSeek V4 Actually Is
DeepSeek V4 is a Mixture-of-Experts (MoE) architecture that comes in two sizes:
- V4 Pro: 1.6 trillion total parameters, 49 billion active per token. This is the flagship — competitive with the best closed models from OpenAI and Anthropic.
- V4 Flash: 284 billion total parameters, 13 billion active. Designed for speed and cost-efficiency while retaining surprisingly strong performance.
Both models support a 1-million-token context window and a 384,000-token output limit — the largest output window of any production model right now. For comparison, Claude Opus 4.6 and GPT-5.4 max out at 128K output tokens. DeepSeek is offering three times that.
The Efficiency Story Is the Real News
Raw benchmark scores grab headlines, but the efficiency improvements in V4 are what matter for production use. DeepSeek introduced what they call a Hybrid Attention Architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA).
The practical impact: at the full 1M-token context length, V4 Pro uses just 27% of the inference FLOPs and 10% of the KV cache compared to V3.2. V4 Flash pushes even further — 10% of FLOPs and 7% of KV cache. This is how DeepSeek can offer a million-token context window at prices that would have been unthinkable a year ago.
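Taken at face value, those percentages translate directly into serving capacity. A quick back-of-envelope sketch, using only the figures quoted above (V3.2 as the 1.0 baseline):

```python
# Relative inference cost at the full 1M-token context length,
# normalized to DeepSeek V3.2 = 1.0, per the reported figures.
baseline = {"flops": 1.00, "kv_cache": 1.00}   # DeepSeek V3.2
v4_pro   = {"flops": 0.27, "kv_cache": 0.10}
v4_flash = {"flops": 0.10, "kv_cache": 0.07}

# How many concurrent 1M-token sessions fit in the KV-cache memory
# that previously held a single V3.2 session?
pro_sessions   = baseline["kv_cache"] / v4_pro["kv_cache"]
flash_sessions = baseline["kv_cache"] / v4_flash["kv_cache"]

print(f"V4 Pro:   {pro_sessions:.1f}x the sessions per GPU")
print(f"V4 Flash: {flash_sessions:.1f}x the sessions per GPU")
```

Roughly 10x and 14x the concurrent long-context sessions per unit of memory, which is the lever that makes million-token serving economical.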
Pricing That Puts Pressure on Everyone
Here's where things get uncomfortable for OpenAI and Anthropic:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| DeepSeek V4 Flash | $0.14 | $0.28 |
| DeepSeek V4 Pro | $1.74 | $3.48 |
| Claude Opus 4.6 | $5.00 | $25.00 |
| GPT-5.4 | $2.50 | $15.00 |
V4 Pro delivers comparable coding and reasoning performance to Claude Opus 4.6 at roughly a third of the input cost and one-seventh of the output cost. V4 Flash is in a different league entirely — it's cheaper than almost every model on the market while maintaining strong performance.
DeepSeek also offers a cache-hit discount: when input tokens share common prefixes with previous requests (system prompts, tool definitions, etc.), the input price drops to $0.145 per million tokens for V4 Pro — a 92% discount.
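To make the table concrete, here is a hypothetical daily workload costed at the list prices above. The request volume, token counts, and the assumption that the shared prefix hits the cache on every request are all illustrative, not from DeepSeek:

```python
# Hypothetical workload: 100 requests/day, each with a 20K-token shared
# system prompt (assumed to hit the prefix cache every time), 5K unique
# input tokens, and 2K output tokens. Prices are per 1M tokens; the
# $0.145 cache-hit rate applies only to V4 Pro.
def daily_cost(in_price, out_price, cached_price=None):
    requests, cached_in, fresh_in, out = 100, 20_000, 5_000, 2_000
    cache_rate = cached_price if cached_price is not None else in_price
    return requests * (
        cached_in * cache_rate + fresh_in * in_price + out * out_price
    ) / 1_000_000

print(f"V4 Pro:   ${daily_cost(1.74, 3.48, cached_price=0.145):.2f}")
print(f"Opus 4.6: ${daily_cost(5.00, 25.00):.2f}")
print(f"GPT-5.4:  ${daily_cost(2.50, 15.00):.2f}")
```

On this (invented) workload the cache-hit pricing does most of the work: the big shared prefix that dominates the Claude and GPT bills costs V4 Pro almost nothing.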
Math and Reasoning: Genuinely Strong
Coding isn't the only area where V4 Pro impresses. On IMOAnswerBench (competition-level math), V4 Pro scores 89.8 — ahead of Claude Opus 4.6 (75.3) and Gemini 3.1 Pro (81.0). Only GPT-5.4 edges it out at 91.4.
Both models support two reasoning modes, thinking and non-thinking, exposed through three per-request settings: high and max (thinking) and non-think. This gives you control over the speed/quality tradeoff on a per-request basis.
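As a sketch of what that per-request control could look like in practice: the model id and the `reasoning_effort` field name below are assumptions for illustration, not confirmed parameters; check DeepSeek's API reference for the real names.

```python
import json

def build_request(prompt: str, effort: str = "high") -> dict:
    """Build a chat request with a per-request reasoning setting.

    'reasoning_effort' and the model id are hypothetical field names,
    used here only to illustrate the three documented settings.
    """
    assert effort in ("high", "max", "non-think")
    return {
        "model": "deepseek-v4-pro",    # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,    # hypothetical parameter name
    }

# Cheap, fast call for a routine task; deep thinking for a hard one.
fast = build_request("Summarize this diff.", effort="non-think")
deep = build_request("Prove this invariant holds.", effort="max")
print(json.dumps(deep, indent=2))
```

The useful pattern is the routing itself: the same model serves both latency-sensitive and quality-sensitive traffic, switched per call rather than per deployment.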
The Context Window Is Real
A million-token context window is only useful if the model can actually attend to information throughout that window. DeepSeek's Hybrid Attention Architecture was designed specifically for this problem. The combination of compressed sparse and heavily compressed attention lets V4 process long documents without the degradation that plagued earlier long-context models.
With a 384K output limit on top of the 1M input window, V4 is uniquely suited for tasks that require both reading and generating large volumes of text — full-codebase refactoring, long-document analysis, or multi-step research workflows.
Open Weights, Open Access
Both V4 Pro and V4 Flash are available as open-weight models on Hugging Face. The weights are available for download and local deployment, which matters for enterprises with data sovereignty requirements or teams that want to run inference on their own hardware.
Both models are also available immediately via the DeepSeek API and through aggregators like OpenRouter. On Deepest, you can query V4 Pro and V4 Flash alongside Claude, GPT, Gemini, and every other model — which is exactly the kind of head-to-head comparison that reveals each model's real strengths.
Where V4 Falls Short
No model is perfect, and V4 has clear limitations worth noting:
- Knowledge tasks: V4 Pro trails Gemini 3.1 Pro on knowledge-heavy benchmarks. If your use case is primarily factual Q&A rather than coding or reasoning, Gemini still has the edge.
- Complex agentic workflows: V4 Flash's smaller parameter count (13B active) means it falls behind Pro on the most complex multi-step agent tasks. For heavy agentic work, Pro is worth the extra cost.
- Data sovereignty: DeepSeek is a Chinese company. For users or organizations with compliance requirements around data processing jurisdiction, this remains a consideration regardless of model quality.
The Bottom Line
DeepSeek V4 Pro is the strongest open-weight model ever released, and it's competitive with the best closed models at a fraction of the price. V4 Flash is arguably even more significant — it delivers strong performance at a price point ($0.14/$0.28 per million tokens) that makes AI accessible for use cases that were previously cost-prohibitive.
The gap between open and closed models continues to narrow, and V4 might be the release that closes it entirely for most practical applications.
Frequently Asked Questions
How does DeepSeek V4 Pro compare to GPT-5.4?
V4 Pro scores 93.5% on LiveCodeBench, ahead of GPT-5.4, and is competitive on IMOAnswerBench (89.8 vs 91.4). GPT-5.4 has an edge on some knowledge benchmarks, but V4 Pro costs significantly less. On Deepest, you can run the same prompt through both and compare the actual outputs side by side.
Is DeepSeek V4 Flash good enough for production use?
For most tasks, yes. V4 Flash with its "max" thinking effort mode achieves comparable reasoning performance to V4 Pro. The main gaps are on pure knowledge tasks and the most complex agentic workflows. At $0.14/$0.28 per million tokens, it's worth testing against more expensive models — you may find the quality difference is smaller than the price difference suggests.
What's the 384K output limit good for?
Most models cap output at 64K–128K tokens. The 384K limit means V4 can generate entire codebases, full book chapters, or comprehensive analysis reports in a single response without hitting truncation. Combined with the 1M input context, it's ideal for large-scale code refactoring or document transformation tasks.
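Some quick arithmetic on what a maxed-out response would cost at the list prices from the pricing table earlier in the article:

```python
# Cost of a single response that fills the output window,
# at the per-1M-token output prices listed above.
def output_cost(tokens: int, price_per_m: float) -> float:
    return round(tokens * price_per_m / 1_000_000, 2)

max_out = 384_000
print("V4 Flash, 384K out:", output_cost(max_out, 0.28))    # $0.11
print("V4 Pro,   384K out:", output_cost(max_out, 3.48))    # $1.34
print("Opus 4.6, 128K cap:", output_cost(128_000, 25.00))   # $3.20
```

A full 384K-token generation on Flash costs about a dime; even Pro's maxed-out response is cheaper than a third as much output from Claude.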