OpenAI Releases GPT-5.2, Surpassing Google Gemini 3 Pro in Performance Benchmarks

Victor Zhang
[Image: A futuristic digital brain with glowing circuits, representing advanced AI, with 'GPT-5.2' subtly integrated into the design.]

OpenAI has launched the GPT-5.2 series, including GPT-5.2 Thinking and GPT-5.2 Pro, an advance in AI capabilities aimed particularly at practical applications. The release coincides with OpenAI's tenth anniversary.

The new models demonstrate improvements across various domains, including table creation, presentation generation, code writing, long document comprehension, tool utilization, and complex multi-step project handling. Visual understanding has also been enhanced, with GPT-5.2 capable of accurately labeling more components on a motherboard.

GPT-5.2's capabilities were highlighted in a scenario involving flight delays and medical seating requirements, where the model reportedly managed rebooking, special seating arrangements, and compensation.

Performance Benchmarks and Economic Value

The ARC-AGI test results indicate that GPT-5.2 Pro achieved a new state-of-the-art score of 90.5%, with an average task cost of $11.64. This represents an efficiency improvement of approximately 390 times compared to a year prior, when a previous model scored 88% with an average task cost of $4500. The new model also surpassed Google Gemini 3 Pro in this evaluation.

In the GDPval test, which assesses performance across 44 professional fields in the top nine U.S. GDP industries, GPT-5.2 Thinking achieved a 71% win rate against human experts on tasks that typically take humans 4-8 hours. GPT-5.2 Pro is expected to perform even higher. The model completes these tasks more than 11 times faster than human experts, at less than 1% of the cost.

For investment banking analyst spreadsheet modeling tasks, GPT-5.2 Thinking's average score per task increased by 9.3 percentage points over GPT-5.1, rising from 59.1% to 68.4%. These tasks include building three-statement financial models for Fortune 500 companies and constructing leveraged buyout models. A GDPval judge noted the "significant leap in output quality" and said the deliverables appeared to come from a "professional firm."

Access to the new table and presentation creation features in ChatGPT requires a Plus, Pro, Business, or Enterprise subscription, and users must select the GPT-5.2 Thinking or GPT-5.2 Pro model. Complex content generation may take several minutes.

Technical Advancements

GPT-5.2 scored 80% on the SWE-bench Verified coding benchmark. On SWE-Bench Pro, a more challenging software engineering evaluation covering Python, JavaScript, TypeScript, and Go, GPT-5.2 Thinking scored 55.6%. Early testers observed improvements in front-end development and complex UI work, particularly with 3D elements.

For long document processing, GPT-5.2 Thinking achieved nearly 100% accuracy on the 4-needle variant at a 256k context length in OpenAI's "Needle in a Haystack" MRCRv2 evaluation, although performance on the 8-needle variant declined as context length grew. The model also supports a concise response mode intended for tool-intensive, long-running workflows that would otherwise exceed the maximum context window.
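For a sense of what a multi-needle retrieval evaluation measures, the sketch below hides four key-value "needles" in a long filler document and checks how many the model returns. This is an illustrative reconstruction, not OpenAI's MRCRv2 harness; the model ID, prompt format, and filler text are assumptions.

```python
# Minimal sketch of a 4-needle long-context retrieval check.
# Not OpenAI's MRCRv2 harness; the model ID "gpt-5.2-thinking" is an assumption.
import random
from openai import OpenAI

client = OpenAI()

NEEDLES = {"alpha": "7312", "bravo": "0458", "charlie": "9926", "delta": "1187"}

def build_haystack(n_filler_lines: int = 20000) -> str:
    """Interleave filler lines with four hidden key-value needles."""
    lines = [f"Filler sentence number {i} with no useful content." for i in range(n_filler_lines)]
    positions = sorted(random.sample(range(n_filler_lines), len(NEEDLES)))
    for pos, (key, value) in zip(positions, NEEDLES.items()):
        lines.insert(pos, f"The secret code for {key} is {value}.")
    return "\n".join(lines)

def run_eval() -> float:
    prompt = (
        build_haystack()
        + "\n\nList the secret code for each of: alpha, bravo, charlie, delta."
    )
    response = client.chat.completions.create(
        model="gpt-5.2-thinking",  # hypothetical model ID
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content or ""
    # Score the fraction of needles the model recovered from the long context.
    return sum(value in answer for value in NEEDLES.values()) / len(NEEDLES)

if __name__ == "__main__":
    print(f"Needle recall: {run_eval():.0%}")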

Visual understanding improvements include roughly halving the error rate on scientific paper charts and a stronger grasp of the spatial position of elements in images. On reasoning tests over high-resolution graphical screen captures, the model scored 86.3% when paired with Python tools, and OpenAI recommends enabling tools for such visual tasks.
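As a rough illustration of pairing image input with a Python tool, the sketch below sends a chart screenshot to the model with the code interpreter tool enabled via OpenAI's Responses API. The model ID and image URL are placeholders, and the exact tool configuration available for GPT-5.2 may differ.

```python
# Sketch: asking the model to reason over a high-resolution chart screenshot
# with a Python (code interpreter) tool enabled. Model ID and URL are placeholders;
# tool options follow the Responses API format at the time of writing.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.2-thinking",  # hypothetical model ID
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": "Read the axis labels and report the peak value in this chart.",
                },
                {
                    "type": "input_image",
                    "image_url": "https://example.com/chart-screenshot.png",  # placeholder
                },
            ],
        }
    ],
    # A Python sandbox lets the model crop, zoom, and measure parts of the image.
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
)

print(response.output_text)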

Tool calling also advanced: GPT-5.2 Thinking scored 98.7% on the Tau2-bench Telecom evaluation, a multi-turn interactive phone customer-service scenario, and 82% on the Tau2-bench Retail scenario. These results suggest improved end-to-end workflows for tasks like customer support case resolution and data extraction.
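To make the kind of workflow these benchmarks exercise concrete, here is a minimal, hypothetical function-calling loop in the style of the Chat Completions API. The lookup_order tool and the model ID are inventions for illustration and are not part of Tau2-bench.

```python
# Minimal sketch of a multi-turn tool-calling loop, similar in spirit to the
# customer-service scenarios Tau2-bench exercises. The "lookup_order" tool and
# the model ID are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order",
            "description": "Fetch the status of a retail order by ID.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }
]

def lookup_order(order_id: str) -> dict:
    # Stand-in for a real order-management system.
    return {"order_id": order_id, "status": "shipped", "eta": "2 days"}

messages = [{"role": "user", "content": "Where is my order 48123?"}]

while True:
    reply = client.chat.completions.create(
        model="gpt-5.2-thinking",  # hypothetical model ID
        messages=messages,
        tools=tools,
    ).choices[0].message

    if not reply.tool_calls:
        print(reply.content)  # final answer to the customer
        break

    # Execute each requested tool call and feed the results back to the model.
    messages.append(reply)
    for call in reply.tool_calls:
        args = json.loads(call.function.arguments)
        result = lookup_order(**args)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )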

Scientific and Factual Improvements

OpenAI states that GPT-5.2 Pro and GPT-5.2 Thinking are suitable models for assisting scientific research. In the GPQA Diamond graduate-level Q&A evaluation, GPT-5.2 Pro scored 93.2%, with GPT-5.2 Thinking at 92.4%. On the expert-level mathematics evaluation FrontierMath (Tier 1-3), GPT-5.2 Thinking achieved a 40.3% problem-solving rate.

Researchers reportedly used GPT-5.2 Pro to explore a statistical learning theory problem, where the model proposed a proof that was later verified and peer-reviewed. On factual accuracy, GPT-5.2 Thinking's hallucination rate fell to 6.2%, down from 8.8% for GPT-5.1. OpenAI advises that critical content still requires manual review.

Key Contributors

While OpenAI typically attributes research progress to the organization, several core team members for GPT-5.2 have been identified through public acknowledgments. These individuals often have backgrounds in mathematics and joined OpenAI in 2023 or 2024, with one joining in 2025.

Notable contributors include:

  • Yu Bai, a Peking University math alumnus and Stanford Statistics Ph.D., who joined OpenAI in May 2024.

  • Yaodong Yu, a UC Berkeley Ph.D. graduate, who joined OpenAI in September 2024.

  • Yufeng Zhang, a mathematics undergraduate from USTC and Northwestern University Ph.D., formerly a researcher at ByteDance, who joined OpenAI in late 2024.

  • Mei Song, a Peking University math alumnus and Stanford Computational and Mathematical Engineering Ph.D., and Assistant Professor at UC Berkeley, who is temporarily leaving academia to join OpenAI in May 2025.

  • Ofir Nachum, an MIT CS Master's graduate and former Google Brain researcher, who joined OpenAI in 2023.