Three AI Teams Win IMO Gold; OpenAI Talks About How They Did the Math
OpenAI announced first, but DeepMind and Harmonic also reached gold level this year, each in their own way.
In just two months, a scrappy three-person team at OpenAI sprinted to a goal the entire AI field has been chasing for years: gold-level performance on International Mathematical Olympiad problems. Alex Wei, Sheryl Hsu and Noam Brown discuss their approach of applying general-purpose reinforcement learning to hard-to-verify tasks rather than relying on formal verification tools. The model showed surprising self-awareness by admitting it couldn’t solve Problem 6, and the conversation highlights the humbling gap between solving competition problems and genuine mathematical research breakthroughs.
AI math has made amazing progress in the past year, in large part through new approaches to augmenting LLMs with both scale and symbols. Here’s a bonus essay on how OpenAI, DeepMind and Harmonic performed at the 95th percentile of human contestants this year (up from last year, when a single team, DeepMind, reached the 83rd percentile and silver).
The 2025 IMO Winner's Circle: How Three AI Teams Got Gold in Very Different Ways
OpenAI, DeepMind and Harmonic each found their own path to mathematical gold.
Jul 30, 2025
Post methodology: @Claude-4-Sonnet via Dust: please read [training data transcript, OpenAI x thread, Noam Brown LinkedIn post, DeepMind blog, Harmonic blog, harmonic livestream transcript, IMO proofs (3)] and write an essay on the topic: "The IMO Winner’s Circle: How Three AI Teams Got Gold This Year in Very Different Ways."; discuss the differences in how the teams approached this year's competition; how they solved each problem (and what tripped them all up on problem 6); and what this means for the future of AI math. (Note: GPT-4 and Gemini Pro prompted but not used) Light editing and formatting for Substack editor.
The 2025 International Mathematical Olympiad marked a historic moment in artificial intelligence: for the first time, not one but three AI systems achieved gold medal-level performance in the world's most prestigious mathematics competition. OpenAI, Google DeepMind, and Harmonic each took radically different approaches to reach this milestone, offering compelling insights into the diverse paths toward mathematical superintelligence.
Three Roads to Gold
OpenAI: The General-Purpose Reasoning Revolution
OpenAI's achievement was perhaps the most surprising in terms of timeline. As Alexander Wei noted on X, just 15 months earlier their models scored only 12% on AIME problems. Their breakthrough came through what they call "general-purpose reinforcement learning and test-time compute scaling."
The OpenAI team, consisting of just three core researchers (Alexander Wei, Sheryl Hsu, and Noam Brown), focused on developing techniques that could work across domains rather than building a math-specific system. Their approach prioritized what Noam Brown described as "general purpose techniques" that could be applied beyond mathematics.
Key to their success was scaling inference-time compute—allowing the model to "think" for extended periods. As the team explained on the Training Data podcast, they moved from problems taking seconds (GSM8K) to those requiring 100+ minutes of sustained reasoning (IMO level). Their model solved 5 out of 6 problems, earning 35 points and achieving clear gold medal status.
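For intuition, here is a minimal sketch of what test-time compute scaling can look like: sample several long reasoning attempts within a thinking budget and keep the highest-scoring one. The `generate_trace` and `score_trace` functions are hypothetical stand-ins, not OpenAI’s actual pipeline.

```python
import random

# Hypothetical stand-ins for a reasoning model and a grader; not OpenAI's system.
def generate_trace(problem: str, max_thinking_tokens: int) -> str:
    """Sample one long reasoning attempt within a thinking-token budget."""
    return f"attempt {random.random():.3f} on {problem!r} (budget={max_thinking_tokens})"

def score_trace(trace: str) -> float:
    """Score an attempt, e.g. with a verifier or a self-consistency check."""
    return random.random()

def solve_with_test_time_compute(problem: str, attempts: int, budget: int) -> str:
    """More attempts and larger budgets trade extra compute for better answers."""
    traces = [generate_trace(problem, budget) for _ in range(attempts)]
    return max(traces, key=score_trace)

print(solve_with_test_time_compute("IMO 2025, Problem 1", attempts=8, budget=100_000))
```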
Notably, OpenAI's model operated entirely in natural language, producing multi-page proofs that, while stylistically unusual ("atrocious" in Wei's words), were mathematically sound. The team deliberately avoided formal verification tools like Lean, focusing instead on the harder challenge of reasoning in natural language—a capability that transfers more broadly to other domains.
Related: OpenAI IMO Team’s first interview after the competition
Google DeepMind: From Formal to Natural Language
DeepMind's journey represents a fascinating evolution in approach. In 2024, they achieved silver medal performance using AlphaProof and AlphaGeometry 2—specialized systems that required human experts to translate problems into formal languages like Lean, then translate the AI's solutions back to English.
This year marked what Thang Luong called a "big paradigm shift." Their advanced version of Gemini with Deep Think operated entirely end-to-end in natural language, solving 5 out of 6 problems for 35 points—matching OpenAI's score.
DeepMind's Deep Think system incorporated "parallel thinking," allowing the model to simultaneously explore multiple solution paths before converging on a final answer. This represented a significant advancement over linear chain-of-thought reasoning. The system was enhanced with reinforcement learning techniques specifically designed for multi-step reasoning and theorem-proving.
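One way to picture parallel thinking, in contrast to a single linear chain of thought, is to develop several candidate solution paths concurrently and then converge on the strongest one. The sketch below is an illustrative assumption; Gemini Deep Think’s internals are not public.

```python
from concurrent.futures import ThreadPoolExecutor
import random

# Illustrative placeholders only; not DeepMind's implementation.
def explore_path(problem: str, seed: int) -> tuple[float, str]:
    """Develop one candidate line of attack and return (confidence, sketch)."""
    rng = random.Random(seed)
    return rng.random(), f"solution path {seed} for {problem!r}"

def parallel_think(problem: str, n_paths: int = 4) -> str:
    """Explore several solution paths at once, then keep the most promising."""
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        candidates = list(pool.map(lambda s: explore_path(problem, s), range(n_paths)))
    confidence, best = max(candidates)
    return f"converged on: {best} (confidence {confidence:.2f})"

print(parallel_think("IMO 2025, Problem 3"))
```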
Crucially, DeepMind's solutions completed within the official 4.5-hour competition time limit and were officially graded by IMO coordinators—the same judges who evaluated human participants. IMO President Prof. Dr. Gregor Dolinar praised their solutions as "clear, precise and most of them easy to follow."
Related: DeepMind’s Pushmeet Kohli on AlphaEvolve
Harmonic: The Formal Verification Advantage
Harmonic took perhaps the most technically rigorous approach with their Aristotle system. While the other teams focused on natural language solutions that required human verification, Harmonic built formal verification directly into their system using the Lean4 proof assistant.
This approach addresses what Vlad Tenev calls "the verification problem": as AI generates more mathematical content, human verification becomes a bottleneck. Aristotle's solutions were automatically verified down to foundational axioms, providing machine-checkable guarantees of correctness without requiring human mathematicians to spend time checking proofs.
Harmonic's system also solved 5 out of 6 problems, but uniquely, their solutions came with formal proofs that were immediately verifiable. As Tudor Achim noted in their livestream announcement on X, "when our system outputs the proof, nobody has to look at it. You know it's correct."
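To make the contrast concrete, here is a toy Lean 4 statement, unrelated to the actual IMO proofs, that the Lean kernel checks mechanically; Aristotle’s outputs carry this kind of machine-checked guarantee at much larger scale.

```lean
-- Toy example of a machine-checked statement in Lean 4 (not one of Aristotle's
-- IMO proofs): 2025 is a perfect square, witnessed by 45, since 45 * 45 = 2025.
theorem imo_year_is_square : ∃ n : Nat, n * n = 2025 :=
  ⟨45, rfl⟩
```

If the proof term were wrong, Lean would simply refuse to accept it, which is exactly the property Harmonic is counting on at scale.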
The company's approach extended beyond competition mathematics—they're building toward verified AI that can work across physics, software engineering, and other domains where verification is crucial.
Related: Tudor Achim and Vlad Tenev on Training Data in 2024
The Problem That Stumped Them All: Problem 6
Interestingly, all three teams failed to solve Problem 6, a combinatorial construction problem. This wasn't random—Problem 6 highlighted specific limitations in current AI mathematical reasoning.
As OpenAI's Alex Wei explained, Problem 6 was "a really tough problem" requiring "leaps of insight" that are still challenging for current models, which are more adept at problems that can be broken down into a series of smaller, more manageable steps.
The problem involved recognizing that 2025 is a perfect square (45²) and then constructing a specific tiling pattern. Harmonic's team noted that the solution was literally printed on the floor of the Sunshine Coast Airport where the competition was held—highlighting how visual and constructive insights remain challenging for AI systems.
Problem 6 also revealed the continued difficulty AI has with combinatorics. Gemini’s symbolic engine explored an enormous search tree, but the branching factor was simply too high; even with neural pruning, it could not find a proof within the allowed timeframe. Similarly, with Aristotle, the Lean formalization forced the model to explicitly justify every inference, slowing progress and causing the system to run out of time before a complete proof emerged. OpenAI's model generated promising partial progress but struggled to construct a fully general proof, eventually hitting a "wall" of combinatorial explosion.
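A back-of-the-envelope calculation shows why this kind of search blows up; the branching factor and depth below are hypothetical, since the real figures are not public.

```python
# Back-of-the-envelope illustration of combinatorial explosion; the branching
# factors in the real search are not public, so these numbers are hypothetical.
branching_factor = 10   # assumed choices available at each construction step
depth = 20              # assumed number of steps to a complete construction
states = branching_factor ** depth
print(f"~{states:.1e} candidate states to consider")  # ~1.0e+20
```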
The fact that all three models failed on the same problem suggests that while AI has made enormous strides in logical reasoning, the kind of creative, out-of-the-box thinking that characterizes the most brilliant human mathematicians remains a frontier to be conquered.
Yet, there was a silver lining in this failure. OpenAI's model, rather than generating a flawed or nonsensical answer, indicated that it could not solve the problem. This display of self-awareness is a significant advancement, moving AI from a state of "convincing but wrong" to a more honest and reliable collaborator.
What This Means for the Future of AI Math
The Scaling of Reasoning Time
All three approaches demonstrated the power of scaling inference-time computation. OpenAI's progression from seconds to hours of reasoning time, DeepMind's parallel thinking, and Harmonic's iterative formal verification all point toward a future where AI systems can engage in extended mathematical reasoning.
As Noam Brown noted, this creates new challenges: "if you have the model thinking for like 1,500 hours, then in order to eval it, you have to have it think for 1,500 hours." The evaluation bottleneck may become a limiting factor as reasoning times scale.
The Verification Challenge
The three teams' different approaches to verification highlight a fundamental question about AI mathematics. OpenAI and DeepMind achieved impressive results with natural language reasoning that requires human verification. Harmonic chose the more technically demanding path of formal verification.
As mathematical AI systems become more capable, the verification problem will become increasingly critical. Human mathematicians already struggle to verify complex proofs—Andrew Wiles' proof of Fermat's Last Theorem took two years to fully verify. As AI generates more mathematical content, automated verification may become essential.
Generality vs. Specialization
The approaches also reflect different philosophies about AI development. OpenAI explicitly prioritized general-purpose techniques that could transfer to other domains. DeepMind evolved from specialized formal systems toward more general natural language reasoning. Harmonic focused on the specific challenge of verified mathematical reasoning.
This tension between generality and specialization will likely continue as AI systems tackle increasingly complex mathematical problems. The most successful future systems may combine the best of all approaches—general reasoning capabilities with domain-specific optimizations and formal verification where needed.
The Path to Research Mathematics
All three teams acknowledged that significant challenges remain before AI can contribute to research mathematics. The gap between 1.5-hour competition problems and research problems requiring months or years of work remains vast.
However, the rapid progress is undeniable. As Wei noted, they progressed from 12% on AIME to IMO gold in just 15 months. The mathematical reasoning capabilities demonstrated this year suggest that AI assistance for research mathematicians may arrive sooner than expected.
Conclusion
The 2025 IMO results represent three viable but distinct paths toward mathematical AI. OpenAI's general-purpose reasoning, DeepMind's evolved natural language approach, and Harmonic's formal verification each offer unique advantages and face different challenges.
The failure of all three systems on Problem 6 reminds us that significant challenges remain, particularly in combinatorial reasoning and creative mathematical insight. However, the diversity of successful approaches suggests that the field is rapidly maturing, with multiple promising directions for future development.
As these systems continue to improve, they're likely to become invaluable tools for mathematicians, scientists, and researchers. The question is no longer whether AI will achieve mathematical superintelligence, but which combination of approaches will prove most effective for advancing human mathematical knowledge.
The race to mathematical superintelligence has begun in earnest, and 2025 will be remembered as the year when AI definitively entered the winner's circle.