OpenAI Introduces GPT-5.2 Models, Achieving "Human Expert Level" in Knowledge Work

Victor Zhang

OpenAI has unveiled its GPT-5.2 series, including GPT-5.2 Instant, GPT-5.2 Thinking, and GPT-5.2 Pro, marking what the company describes as its most powerful models to date for professional knowledge work. The announcement coincides with OpenAI's tenth anniversary.

The new models show significant advances across a range of benchmarks. GPT-5.2 Thinking scored 100% on AIME 2025 (mathematics), compared with 95% for Gemini 3 Pro. In abstract reasoning (ARC-AGI-2), GPT-5.2 Thinking scored 52.9%, surpassing Gemini 3 Pro's 31.1%. In coding (SWE-Bench Pro), it reached 55.6% against Gemini 3 Pro's 43.3%.

GPT-5.2 scored 74.1% on GDPval, a knowledge-work benchmark, which OpenAI says is the first time an AI model has reached "human expert level." OpenAI CEO Sam Altman called GPT-5.2 the company's most substantial upgrade in quite some time.

Enhanced Professional Capabilities

GPT-5.2 Thinking is designed for real-world professional scenarios. On GDPval, which covers specific knowledge-work tasks across 44 occupations, GPT-5.2 Thinking set a new state-of-the-art (SOTA) result. Professional reviewers found that its output matched or outperformed that of top industry experts in 70.9% of comparisons, on tasks such as creating presentations and spreadsheets.
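
The 70.9% figure is a win-or-tie rate over pairwise comparisons against human experts. As a minimal sketch, assuming each comparison is recorded with one of three hypothetical labels ("win", "tie", "loss"), the metric reduces to a simple count. This illustrates the arithmetic only; it is not OpenAI's grading pipeline.

```python
from collections import Counter


def win_or_tie_rate(verdicts: list[str]) -> float:
    """Fraction of pairwise comparisons in which the model's deliverable
    matched ("tie") or beat ("win") the human expert's deliverable.

    The "win"/"tie"/"loss" labels are a hypothetical encoding of grader verdicts.
    """
    if not verdicts:
        raise ValueError("need at least one comparison")
    counts = Counter(verdicts)
    unknown = set(counts) - {"win", "tie", "loss"}
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    return (counts["win"] + counts["tie"]) / len(verdicts)


if __name__ == "__main__":
    sample = ["win", "tie", "loss", "win", "loss"]  # made-up grader verdicts
    print(f"{win_or_tie_rate(sample):.1%}")  # 60.0%
```

Counting ties as successes is what "matched or outperformed" implies; a stricter "win rate" would count only "win".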

On GDPval tasks, the model produced outputs more than 11 times faster than human experts, at less than 1% of the cost. The tasks include creating sales presentations, accounting statements, emergency room schedules, manufacturing flowcharts, and short videos.

On an internal benchmark of spreadsheet-modeling tasks of the kind given to junior investment banking analysts, GPT-5.2 Thinking's average task score rose by 9.3 percentage points over GPT-5.1, from 59.1% to 68.4%. OpenAI says this reflects more professional content and better layout in the spreadsheets and presentations it generates. In one workforce planning model prompt, for instance, GPT-5.2 completed all calculations correctly, whereas GPT-5.1 made errors in liquidation preferences and formula placement.
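
Moving from 59.1% to 68.4% is a gain of 9.3 percentage points; measured relative to the starting score, it is roughly a 15.7% improvement. The two calculations are easy to conflate, so here they are side by side (the helper names are just for illustration):

```python
def point_change(old_pct: float, new_pct: float) -> float:
    """Absolute change between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_change(old_pct: float, new_pct: float) -> float:
    """Change relative to the starting value, returned as a fraction."""
    return (new_pct - old_pct) / old_pct


if __name__ == "__main__":
    old, new = 59.1, 68.4  # GPT-5.1 vs GPT-5.2 Thinking average task scores
    print(f"{point_change(old, new):.1f} percentage points")    # 9.3 percentage points
    print(f"{relative_change(old, new):.1%} relative increase")  # 15.7% relative increase
```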

The new spreadsheet- and presentation-generation features in ChatGPT require a Plus, Pro, Business, or Enterprise plan, with GPT-5.2 Thinking or GPT-5.2 Pro selected as the model.

Coding and Reduced Hallucinations

GPT-5.2 Thinking set a new SOTA score of 55.6% on SWE-Bench Pro, a benchmark of real-world software engineering that spans four programming languages. On SWE-bench Verified, it reached 80%. According to OpenAI, this translates into more reliable debugging of production code, implementation of feature requests, refactoring of large codebases, and end-to-end fixes with less human intervention. The model also shows stronger front-end engineering capabilities, particularly for complex UI designs.

OpenAI reports that, on a set of de-identified real user queries from ChatGPT, GPT-5.2 Thinking gave 30% fewer incorrect answers than GPT-5.1 Thinking. The goal is greater reliability for professionals who use it for research, writing, analysis, and decision support.

Context and Visual Understanding

GPT-5.2 Thinking demonstrates advanced long-context reasoning, achieving leading performance on OpenAI MRCRv2, an evaluation for integrating scattered information from long documents. It is OpenAI's first model to achieve nearly 100% accuracy on the 4-needle MRCR variant (up to 256k tokens), enabling professionals to process ultra-long documents while maintaining coherence and accuracy.
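
MRCR-style tests hide several near-identical requests (the "needles") in a long conversation and ask the model to reproduce the response to a specific one, such as the third. The grader below is modeled on the scoring approach published with OpenAI's open MRCR dataset: the answer must start with a random prefix given in the prompt, and the rest is scored with a `difflib` similarity ratio. Treat these details as an assumption, not as the exact evaluation used for GPT-5.2; the function name and sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def grade_mrcr_response(response: str, answer: str, prefix: str) -> float:
    """Score one retrieval: 0.0 if the required random prefix is missing,
    otherwise the difflib similarity ratio between response and answer."""
    if not response.startswith(prefix):
        return 0.0
    # Compare only the content after the prefix (str.removeprefix needs Python 3.9+).
    response = response.removeprefix(prefix)
    answer = answer.removeprefix(prefix)
    return SequenceMatcher(None, response, answer).ratio()


if __name__ == "__main__":
    prefix = "k3Xq9"  # hypothetical random tag the prompt asks the model to prepend
    expected = prefix + "Tapirs wander through the misty glade."
    print(grade_mrcr_response(expected, expected, prefix))          # 1.0
    print(grade_mrcr_response("Tapirs wander.", expected, prefix))  # 0.0
```

The prefix check penalizes a model that retrieves the right kind of content but ignores the instruction, while the similarity ratio gives partial credit for a near-verbatim reproduction.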

In visual understanding, GPT-5.2 Thinking roughly halves error rates on chart reasoning and software interface understanding. This allows more accurate interpretation of dashboards, product screenshots, technical diagrams, and visual reports. The model also better grasps the positional relationships between elements within an image, which is crucial for tasks where relative layout matters.

Tool Calling and Scientific Research

GPT-5.2 Thinking achieved a SOTA score of 98.7% on Tau2-bench Telecom, a sign that it can reliably use tools across long, multi-turn tasks. This supports more capable end-to-end workflows, such as resolving customer support cases, extracting data, and generating outputs. For example, it can handle a complex customer service request that requires multiple steps, such as rebooking a flight and arranging special seating.
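
Multi-step tool use of this kind boils down to a loop: the model emits a tool call, the application executes it, and the result goes back into the conversation. The sketch below shows only the application side, with a scripted list standing in for the model's calls. The tools `rebook_flight` and `assign_seat` are hypothetical, and the `{"name", "arguments"}` shape only loosely mirrors how chat APIs encode function calls.

```python
import json
from typing import Any, Callable


# Hypothetical tools; a real deployment would call booking and seating back ends.
def rebook_flight(booking_id: str, new_flight: str) -> dict[str, Any]:
    return {"booking_id": booking_id, "flight": new_flight, "status": "rebooked"}


def assign_seat(booking_id: str, seat_type: str) -> dict[str, Any]:
    return {"booking_id": booking_id, "seat_type": seat_type, "status": "assigned"}


TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "rebook_flight": rebook_flight,
    "assign_seat": assign_seat,
}


def run_tool_calls(calls: list[dict[str, str]]) -> list[dict[str, Any]]:
    """Execute requested tool calls in order and collect each result so it
    could be returned to the model. An unknown tool raises KeyError; malformed
    JSON or wrong argument names raise instead of being silently skipped."""
    transcript = []
    for call in calls:
        fn = TOOLS.get(call["name"])
        if fn is None:
            raise KeyError(f"model requested unknown tool {call['name']!r}")
        result = fn(**json.loads(call["arguments"]))
        transcript.append({"tool": call["name"], "result": result})
    return transcript


if __name__ == "__main__":
    # Scripted stand-in for the calls a model might emit over several turns.
    planned = [
        {"name": "rebook_flight", "arguments": '{"booking_id": "ABC123", "new_flight": "UA 200"}'},
        {"name": "assign_seat", "arguments": '{"booking_id": "ABC123", "seat_type": "wheelchair-accessible"}'},
    ]
    for step in run_tool_calls(planned):
        print(step["tool"], step["result"]["status"])
```

Failing loudly on an unknown tool or bad arguments is a deliberate choice here: in a long chain, a silently skipped step tends to surface much later as a confusing final answer.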

OpenAI states that GPT-5.2 Pro and GPT-5.2 Thinking are currently the best models for assisting scientific research. GPT-5.2 Pro scored 93.2% on GPQA Diamond (a graduate-level question-answering benchmark), with GPT-5.2 Thinking at 92.4%. In FrontierMath (Tier 1–3), GPT-5.2 Thinking solved 40.3% of expert-level mathematics problems.

General Reasoning and Availability

On ARC-AGI-1 (Verified), GPT-5.2 Pro is the first model to exceed 90%, and on ARC-AGI-2 (Verified), GPT-5.2 Thinking achieved 52.9%, with GPT-5.2 Pro reaching 54.2%. These improvements reflect stronger multi-step reasoning, higher quantitative accuracy, and more reliable problem-solving for complex technical tasks.

OpenAI is gradually rolling out GPT-5.2 (Instant, Thinking, and Pro) in ChatGPT, starting with paying users. In the API, GPT-5.2 Thinking is available as gpt-5.2, GPT-5.2 Instant as gpt-5.2-chat-latest, and GPT-5.2 Pro as gpt-5.2-pro. Developers can set the reasoning parameter for GPT-5.2 Pro, and a new 'xhigh' reasoning effort level is available for tasks where output quality matters most.
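
The model identifiers above plug directly into an API request. The sketch below assumes the official `openai` Python SDK and its Responses API (`client.responses.create` with a `reasoning={"effort": ...}` argument); which models accept `"xhigh"` should be confirmed against OpenAI's API reference. The `build_request` and `send` helpers and the variant names are hypothetical conveniences, not part of the SDK.

```python
from typing import Any, Optional

# Model identifiers as listed in the announcement; the dictionary keys are hypothetical.
MODELS = {
    "thinking": "gpt-5.2",
    "instant": "gpt-5.2-chat-latest",
    "pro": "gpt-5.2-pro",
}


def build_request(variant: str, prompt: str, effort: Optional[str] = None) -> dict[str, Any]:
    """Assemble keyword arguments for client.responses.create().

    `effort` is passed through unchanged (e.g. "high" or "xhigh"); the API,
    not this helper, decides whether a given model accepts it.
    """
    if variant not in MODELS:
        raise ValueError(f"unknown variant {variant!r}")
    request: dict[str, Any] = {"model": MODELS[variant], "input": prompt}
    if effort is not None:
        request["reasoning"] = {"effort": effort}
    return request


def send(request: dict[str, Any]) -> str:
    """Send the request with the official SDK (requires `pip install openai`
    and an OPENAI_API_KEY environment variable)."""
    from openai import OpenAI  # imported lazily so build_request works offline

    client = OpenAI()
    response = client.responses.create(**request)
    return response.output_text


if __name__ == "__main__":
    req = build_request("pro", "Summarize the risk factors in this filing.", effort="xhigh")
    print(req)
    # print(send(req))  # uncomment to make a real (billed) API call
```

Keeping request construction separate from the network call makes it easy to inspect or log exactly what is sent before spending tokens at the highest effort level.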

OpenAI states that while the cost per token for GPT-5.2 is higher, its increased token efficiency results in a lower total cost to achieve a specific quality level. ChatGPT subscription prices remain unchanged. OpenAI plans to release a Codex-optimized version of GPT-5.2 in the coming weeks.
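
The trade-off OpenAI describes is simple arithmetic: total cost is the per-token price multiplied by the tokens used, so a model with a higher per-token price can still be cheaper overall if it needs fewer tokens to reach the same quality. The prices and token counts below are invented for illustration and are not GPT-5.2's actual rates.

```python
from dataclasses import dataclass


@dataclass
class Run:
    price_per_million: float  # USD per 1M tokens (hypothetical)
    tokens_used: int          # tokens needed to reach the target quality (hypothetical)

    def cost(self) -> float:
        """Total cost in USD for this run."""
        return self.price_per_million * self.tokens_used / 1_000_000


if __name__ == "__main__":
    # Made-up figures: the newer model costs 40% more per token
    # but needs half as many tokens for the same result.
    older = Run(price_per_million=10.0, tokens_used=200_000)
    newer = Run(price_per_million=14.0, tokens_used=100_000)
    print(f"older: ${older.cost():.2f}, newer: ${newer.cost():.2f}")  # older: $2.00, newer: $1.40
```

In general, the pricier model wins whenever its price ratio is smaller than the ratio by which it cuts token usage (here 1.4 versus 2).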

Safety and Anniversary Reflections

GPT-5.2 builds on OpenAI's "safe completions" research, improving how the model responds in sensitive conversations and reducing undesirable responses on topics such as suicide and self-harm. The model can also automatically apply content protections for users under 18.

On OpenAI's tenth anniversary, CEO Sam Altman reflected on the company's journey, stating that its achievements have far exceeded initial expectations. He highlighted the evolution from early research in reinforcement learning and language models to the widespread adoption of ChatGPT and GPT-4. Altman expressed optimism for future advancements, anticipating the creation of superintelligence within the next decade.

GPT-5.2 was developed in collaboration with NVIDIA and Microsoft, using Azure data centers and NVIDIA GPUs, including H100, H200, and GB200 NVL72 systems.