OpenAI Introduces FrontierScience Benchmark to Evaluate AI Scientific Reasoning
OpenAI has launched FrontierScience, a new benchmark designed to assess AI systems' scientific reasoning capabilities in physics, chemistry, and biology using problems typically encountered at the Ph.D. level. The initiative aims to move beyond evaluating AI's ability to recall facts, focusing instead on its capacity for expert-level scientific thought.
The company describes scientific work as a process of continuous trial and error, involving hypothesis generation, experimental design, and the synthesis of information from various fields. OpenAI seeks to determine if AI can apply deep reasoning to contribute to scientific advancement.
Benchmark Rationale and Design
OpenAI noted that its systems have achieved gold medal performance in the International Mathematical Olympiad and the International Olympiad in Informatics over the past year. These models are increasingly used by researchers for interdisciplinary literature searches, cross-language paper analysis, and complex proof generation, shortening tasks that once took days or weeks to mere hours.
The need for FrontierScience stems from the rapid improvement of AI models on existing benchmarks. In November 2023, for instance, GPT-4 scored 39% on GPQA, a scientific question bank written by Ph.D. experts. Two years later, GPT-5.2 achieved 92% on the same benchmark, surpassing the 74% expert baseline. As older question banks are "solved," new and more challenging metrics are required to gauge further progress.
FrontierScience is structured around two types of scientific challenges. One type is competition-oriented, assessing precise reasoning under constraints. The other is research-oriented, requiring navigation of open-ended problems where a single standard answer may not exist.
The benchmark includes over 700 text-based questions, with a "Gold Set" of 160 spanning both tracks. The competition portion of the Gold Set comprises 100 questions that emphasize short answers for straightforward verification; the research portion features 60 original sub-tasks, each developed by Ph.D. students or senior researchers and scored on a 10-point scale, with 7 points required for a passing grade.
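The split and scoring rules can be summarized in a short sketch. This is not OpenAI's code: the item fields (`reference_answer`, `rubric`) and function names are hypothetical, and only the track sizes and the 7-of-10 pass threshold come from the article's description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Track(Enum):
    COMPETITION = "competition"  # 100 Gold Set questions, short exact answers
    RESEARCH = "research"        # 60 Gold Set sub-tasks, rubric-scored out of 10

@dataclass
class GoldSetItem:
    track: Track
    question: str
    reference_answer: Optional[str] = None  # competition items: exact short answer
    rubric: Optional[str] = None            # research items: 10-point grading rubric

RESEARCH_PASS_THRESHOLD = 7  # points out of 10 needed for a research sub-task to pass

def research_subtask_passes(score: int) -> bool:
    """A research sub-task counts as passed at 7 or more of 10 rubric points."""
    if not 0 <= score <= 10:
        raise ValueError("score must lie on the 0-10 rubric scale")
    return score >= RESEARCH_PASS_THRESHOLD
```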
Question Quality and Evaluation
Question quality was assured through collaboration with experts. The competition track involved 42 former international medalists or national-team coaches, who collectively hold 109 Olympiad medals. The research track engaged 45 qualified scientists and domain experts in fields such as quantum electrodynamics, synthetic organic chemistry, and evolutionary biology.
OpenAI stated that questions its internal models could already answer correctly were deliberately excluded from both sets, potentially making the evaluation more stringent for its own systems. The "Gold Set" questions for both tracks have been open-sourced, while the remaining questions are retained to monitor data contamination.
For grading, OpenAI used GPT-5 as a model grader for short-answer items, acknowledging that expert grading for every question would be impractical at scale. Grading rules were designed to be objective and machine-checkable, with a verification process to calibrate difficulty and correctness.
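OpenAI has not published its grading prompts or harness, but a model-graded short-answer check of the kind described typically looks like the minimal sketch below, written here against the OpenAI Python SDK. The prompt wording, the binary CORRECT/INCORRECT protocol, and the `gpt-5` model identifier are assumptions for illustration, not the benchmark's actual setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric: the article only says the grading rules were designed to be
# objective and machine-checkable; the actual prompt used by OpenAI is not public.
GRADER_PROMPT = (
    "You are grading a short-answer science question.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def grade_short_answer(question: str, reference: str, candidate: str,
                       grader_model: str = "gpt-5") -> bool:
    """Ask a grader model for a binary verdict on a competition-track item."""
    response = client.chat.completions.create(
        model=grader_model,
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return verdict == "CORRECT"
```

In practice, such a grader would itself be checked against expert judgments, which is consistent with the calibration and verification process the article mentions.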
Initial Findings and Limitations
Initial evaluations showed GPT-5.2 scoring 77% on competition questions and 25% on research questions, leading other models. Gemini 3 Pro followed closely with 76% on competition questions.
Analysis of failures indicated that frontier models still make reasoning, logic, and computation errors, struggle with obscure concepts, and show factual biases. Notably, longer processing time correlated with higher accuracy.
OpenAI acknowledged the boundaries of FrontierScience, noting that it breaks down scientific research into controllable questions, offering a standardized evaluation but not a comprehensive view of the scientific process. The benchmark does not assess models' ability to propose novel hypotheses or interact with multimodal data and real experimental systems.
OpenAI plans to iterate on the question bank, expand domain coverage, and incorporate more real-world evaluations to understand how these systems empower scientists. The company concluded that while AI excels at problem-solving, it still has a long way to go before it can act as an independent, first-class scientist.
