Epoch AI Report Reveals Accelerated Model Growth Despite Benchmark Struggles

FrontierMath Benchmark Results
According to data reviewed by toolmesh.ai, Epoch AI’s latest analysis indicates that while artificial intelligence capabilities are accelerating, frontier models continue to struggle with expert-level mathematics. In tests conducted on the FrontierMath benchmark, open-weights models from Chinese developers largely failed to register scores on the most difficult problems.
The assessment found that on Levels 1 through 3 of the benchmark, leading Chinese open-source models lag behind global state-of-the-art systems by approximately seven months. On the more advanced Level 4, nearly all tested models received a score of zero. The sole exception was DeepSeek-V3.2 (Thinking), which answered a single question correctly for a score of approximately 2 percent.
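As a rough sanity check on that figure: assuming the Level 4 set contains on the order of 50 problems (an illustrative assumption; the exact count is not stated here), a single correct answer corresponds to

\[ \tfrac{1}{50} = 2\%. \]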
Western counterparts, including iterations of GPT and Gemini, also performed poorly on FrontierMath despite achieving high accuracy on traditional benchmarks such as GSM8K and MATH. While these top-tier models slightly outperformed open-source alternatives, the low scores across the board underscore the difficulty of the evaluation. FrontierMath consists of original problems designed by more than 60 experts, including Fields Medalists, covering advanced areas such as algebraic geometry and category theory. The results suggest that on research-level tasks requiring extended reasoning, current AI systems function less like reliable solvers and more like novices.

Accelerated Capability Growth
Despite these specific hurdles, Epoch AI’s year-end review challenges the narrative that AI development has stagnated. The organization’s Epoch Capabilities Index (ECI), which tracks the trajectory of frontier model performance, shows that the growth rate of AI capabilities has nearly doubled since April 2024.
The report attributes this surge to advances in reasoning models and an increased emphasis on reinforcement learning. Contrary to the perception that progress slowed following the release of GPT-4, the data suggests the industry has shifted focus from purely increasing parameter counts to refining core reasoning skills. The analysis characterizes the progression between model generations, in particular the leap from GPT-4 to the projected capabilities of successors such as GPT-5, as a significant step-change rather than an incremental improvement.

Hardware and Cost Dynamics
The review highlights a dramatic reduction in operational costs. Between April 2023 and March 2025, the price per token for inference fell by more than a factor of 10 at equivalent performance levels. This trend marks a shift from a technology affordable only to major corporations toward a utility viable for broad consumer adoption.
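As a back-of-envelope illustration of the stated figures (a greater-than-tenfold price drop over roughly 23 months), the implied annualized decline in inference price at fixed capability is

\[ r = 10^{12/23} \approx 3.3, \]

that is, prices falling by roughly a factor of three per year.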
Hardware efficiency has similarly improved. Top-tier open-source models can now run on consumer-grade GPUs with performance lagging behind proprietary frontier models by less than a year. Additionally, the report notes that Nvidia’s deployed compute capacity has doubled every 10 months since 2020, with new flagship chips rapidly dominating the installed base.
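If that doubling cadence holds, deployed capacity is a simple exponential in time: after t months it stands at

\[ C(t) = C_{2020} \cdot 2^{t/10}, \]

so roughly 60 months, or six doublings, implies about a 64-fold increase over the 2020 baseline.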
Regarding energy consumption, the report clarifies that while aggregate demand is rising exponentially, the cost of an individual query remains modest. The energy required for a typical query to a model like GPT-4o is estimated to be less than what a lightbulb consumes in five minutes.
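To make the comparison concrete under illustrative assumptions (roughly 0.3 Wh per query, in line with published per-query estimates for GPT-4o-class models, and a 10 W LED bulb), the bulb's five-minute consumption is

\[ 10\,\mathrm{W} \times \tfrac{5}{60}\,\mathrm{h} \approx 0.83\,\mathrm{Wh}, \]

which sits comfortably above the assumed per-query figure.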

Future Scaling and Economic Impact
Epoch AI’s analysis offers a nuanced view of future development constraints. The report notes that a significant portion of OpenAI’s 2024 compute budget was allocated to experimentation rather than training or deployment, underscoring the R&D-heavy nature of current progress.
Furthermore, the report discusses the "inference wall," citing statements from industry leaders suggesting that current scaling laws for reinforcement learning may hit infrastructure limits within one to two years. This implies that the explosive capability growth observed in 2024 and 2025 could eventually plateau.

On the economic front, Epoch AI presents a divergence from the views held by leaders such as Sam Altman and Demis Hassabis, who argue that AI will drive growth through automated scientific breakthroughs. Instead, the report suggests that the primary economic value will likely stem from the widespread automation of routine tasks across the broader economy. Historical data cited in the review indicates that R&D contributions to productivity have been limited over the last three decades, supporting the theory that AI’s impact will be a gradual, diffuse integration rather than an immediate technological singularity.
