
Will AI Models Keep Getting Better? The Scaling Debate Explained

Scaling laws drove AI progress for years, but the debate has shifted toward inference-time compute and architectural innovations. We explain the current state of the scaling debate and what it means for users.

Travis Johnson

Founder, Deepest

January 7, 2026 · 11 min read

For years, AI progress followed a predictable pattern: more compute, more data, better models. That scaling law held from GPT-2 through GPT-4. The picture is now more complicated — scaling is slowing, inference-time compute has emerged as a new axis of improvement, and the debate over what comes next is genuine.

The Scaling Law Era (2018–2023)

The original AI scaling hypothesis, formalized by Kaplan et al. (2020) and refined in the Chinchilla paper (Hoffmann et al., 2022), held that model capability scales predictably with compute: loss falls as a smooth power law in model size and training data, with Chinchilla pinning down the compute-optimal ratio between the two. Double the compute and you get a reliably better model.
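The Chinchilla fit can be written down in a few lines. The sketch below uses the fitted constants reported in Hoffmann et al. (2022); treat them as illustrative of the power-law shape, not as exact predictions for any modern model.

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss from the Chinchilla parametric fit:
    L(N, D) = E + A / N^alpha + B / D^beta.

    Constants are the fitted values reported in Hoffmann et al. (2022).
    """
    E, A, B = 1.69, 406.4, 410.7   # irreducible loss and scale coefficients
    alpha, beta = 0.34, 0.28       # exponents for parameters and data
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling both parameters and tokens lowers predicted loss,
# but with diminishing returns as loss approaches the floor E.
base = chinchilla_loss(70e9, 1.4e12)      # roughly Chinchilla's own scale
bigger = chinchilla_loss(140e9, 2.8e12)   # 4x the training compute
assert bigger < base
```

The floor term E is why pure scaling eventually flattens: each doubling buys a smaller slice of the remaining gap.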

This held remarkably well. GPT-3 to GPT-4 was a significant capability jump. Llama 2 to Llama 3 was another. Each scaling step produced measurable improvements across benchmarks.

The scaling era had important characteristics:

  • Progress was expensive but predictable
  • Labs with more capital could simply train larger models
  • Benchmark scores improved nearly monotonically with scale
  • Well-resourced labs (OpenAI, Google, Anthropic) had structural advantages

The Slowdown and What Caused It

By 2024, several signals suggested the scaling curve was flattening:

  • GPT-4 to GPT-5 took longer and cost more than GPT-3 to GPT-4
  • Reports emerged that some labs were finding diminishing returns on pure scale increases
  • The gap between frontier models narrowed — a sign that all were hitting similar limits
  • High-quality training data became a constraint; the internet had largely been scraped

The data constraint is real and underappreciated. Models need high-quality text to train on, and there's a finite amount of high-quality human-written text on the internet. Synthetic data (AI-generated training data) is being explored as a solution, but current evidence suggests pure synthetic data degrades rather than improves model quality.

Inference-Time Compute: The New Scaling Axis

OpenAI's o-series reasoning models introduced a different kind of scaling: instead of using more compute at training time, use more compute at inference time. The model "thinks" more before answering, spending computation on generating and evaluating intermediate reasoning steps.

The results have been striking. o3 achieves 97.1% on MATH — close to human expert performance — while GPT-4o achieves 76.6%. The capability jump wasn't from training a fundamentally better model; it was from using the same kind of model differently, with extended computation.

Key Finding: OpenAI's internal research showed that inference-time compute scaling followed a similar curve to training-time scaling — doubling the computation used at inference time produced predictable, reliable improvement on hard tasks. This opened a new dimension of scaling.
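One simple way to spend more compute at inference time is best-of-n sampling: draw several candidate answers and let a verifier keep the best one. The toy simulation below assumes each sample is independently correct with some probability and the verifier is perfect; the numbers are illustrative, not measurements of any real model.

```python
import random

def best_of_n_accuracy(p_correct: float, n: int, trials: int = 20_000,
                       seed: int = 0) -> float:
    """Chance that at least one of n independent samples is correct,
    assuming a perfect verifier that always selects a correct sample."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < p_correct for _ in range(n))
        for _ in range(trials)
    )
    return hits / trials

# More samples (more inference compute) -> higher accuracy, with
# diminishing returns: analytically, accuracy = 1 - (1 - p)^n.
for n in (1, 2, 4, 8):
    print(n, round(best_of_n_accuracy(0.3, n), 3))
```

Real reasoning models are more sophisticated than best-of-n, but the curve has the same character: each doubling of inference compute buys a predictable, shrinking improvement.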

What This Means for Users

The inference-time compute shift has practical implications:

  • More expensive to get the best results: Reasoning models cost 3–10x more than standard models because they use more compute per query
  • Hard tasks became much more solvable: Problems that stumped GPT-4 are routinely solved by o3
  • Simple tasks unchanged: Everyday queries don't benefit from extended reasoning
  • Quality and cost become explicit tradeoffs: You can choose how much compute to spend per query
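The last bullet's tradeoff can be made concrete with a tiny routing sketch: pick the cheapest model tier whose expected quality clears the task's bar. All tier names, prices, and quality scores below are hypothetical placeholders, not real provider pricing.

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    cost_per_query: float  # hypothetical dollars per query
    quality: float         # hypothetical 0-1 score on hard tasks

# Illustrative tiers only; real prices and quality vary by provider.
TIERS = [
    Tier("fast", 0.001, 0.60),
    Tier("standard", 0.01, 0.80),
    Tier("reasoning", 0.05, 0.95),
]

def route(required_quality: float) -> Tier:
    """Cheapest tier meeting the quality bar; fall back to the best tier."""
    for tier in sorted(TIERS, key=lambda t: t.cost_per_query):
        if tier.quality >= required_quality:
            return tier
    return max(TIERS, key=lambda t: t.quality)
```

With these placeholder numbers, `route(0.5)` returns the fast tier, `route(0.7)` the standard tier, and `route(0.99)` falls back to the reasoning tier: the compute-per-query decision becomes an explicit knob rather than a fixed property of "the model."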

The Arguments: Will AI Keep Improving?

The Optimistic View

AI progress has repeatedly exceeded expectations. Every time researchers predicted a ceiling, the next architectural innovation broke through it. Transformers themselves were unexpected. Reasoning models were unexpected. The argument that "we've found all the tricks" has been wrong before.

Additionally, significant compute investments continue. NVIDIA's data center revenue, hyperscaler capital expenditure plans, and government AI infrastructure investments all suggest massive compute growth ahead. If new algorithmic innovations emerge, there's a lot of compute ready to apply them.

The Pessimistic View

Current models still fail at tasks that require genuine novelty, systematic reasoning about complex causal structures, and integrating knowledge across incompatible domains. These failures may not be solvable with current architectures regardless of scale.

The training data problem is real. Web text is finite, and the high-quality subset is a small fraction of it. Generating synthetic training data without introducing distribution shift remains an open research problem.

The Nuanced View

Progress will continue but may be less uniform. Specialized capabilities (math, code, specific scientific domains) will continue improving — reasoning models are still early. General-purpose capability improvements may slow as easy wins are exhausted. Novel architectural innovations could restart the scaling curve, but they can't be predicted in advance.

The Architectural Frontier

Several architectural directions are being actively explored as alternatives or supplements to transformer scaling:

  • Mixture of Experts (MoE): Reportedly used in GPT-4 and confirmed in DeepSeek V3 — activates only a subset of parameters per query, improving efficiency
  • State Space Models: Mamba and similar architectures are more efficient for long sequences than transformers
  • Retrieval-Augmented Generation (RAG): Separates world knowledge from reasoning — model doesn't need to memorize facts, just access them from a retrieved corpus
  • Neural Turing Machine-style architectures: External memory systems that could enable better long-horizon reasoning
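The MoE idea in the list above fits in a few lines: a gate scores the experts for each token and only the top-k experts actually run, so compute per token stays roughly constant as total parameters grow. This is a minimal numpy sketch of top-k gating, not any production router.

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Route one token vector x through the top-k of the experts.

    x:       (d,) token representation
    gate_w:  (d, n_experts) gating weights
    experts: list of callables, each mapping (d,) -> (d,)
    """
    logits = x @ gate_w                  # one gating score per expert
    top = np.argsort(logits)[-k:]        # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()             # softmax over the selected experts only
    # Only k experts execute; the remaining parameters stay idle this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.normal(size=d)
gate_w = rng.normal(size=(d, n_experts))
mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda v, m=m: m @ v for m in mats]   # toy linear "experts"
out = moe_layer(x, gate_w, experts, k=2)
```

The efficiency win is that total parameter count (all experts) and per-token compute (k experts) decouple, which is what lets MoE models scale capacity without proportionally scaling cost.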

Whether any of these restart the scaling curve is genuinely unknown.

What to Expect as a User

For practical purposes, AI capabilities will continue to improve in the near term. What's more uncertain is the rate and whether the improvements will be broad (all tasks get better) or narrow (specific capability types improve while others plateau).

The most reliable expectation: reasoning models will continue to improve on hard math and logic tasks. Code generation will continue improving. General text tasks are already quite good and may improve only incrementally. Multimodal capabilities remain a fast-moving area.

Frequently Asked Questions

Will AI reach human-level intelligence?

Current AI models exceed human performance on many specific tasks (math benchmarks, coding competitions, standardized tests) but fail at general intelligence tasks that humans handle easily. Whether transformer-based language models can achieve broad general intelligence is genuinely contested among researchers. There is no consensus answer.

What would cause AI progress to stop?

Plausible stopping factors include: training data exhaustion, lack of architectural innovation beyond transformers, regulatory restrictions on compute or training, and fundamental limits in what transformer-style systems can learn. None of these is clearly imminent.

Should I base business decisions on continued AI progress?

Plan for continued improvement in specific areas (coding assistance, document processing, analytical tasks) while remaining flexible about which specific capabilities improve and when. Don't plan for capabilities that don't currently exist — the history of AI prediction is full of overconfident timelines.

Are AI benchmarks an accurate predictor of future progress?

Benchmarks measure specific capabilities, not general capability potential. Models near saturation on easy benchmarks (HellaSwag) have no room to improve on those benchmarks regardless of underlying capability improvement. Focus on hard benchmarks (GPQA, ARC-AGI) as better indicators of genuine progress.

AI scaling · scaling laws · LLM progress · compute · future of AI

See it for yourself

Run any prompt across ChatGPT, Claude, Gemini, and 300+ other models simultaneously. Free to try, no credit card required.

Try Deepest free →
