We Released the First Benchmark Dedicated to Category Theory and Formal Verification!
Published:
This page has a Chinese version.
Happy New Year everyone! š At the very start of the new year, Iād like to share our recent release LeanCat-1 (arXiv:2512.24796). To the best of our knowledge, this is the first benchmark suite dedicated specifically to formal proof and category theory.
LeanCat contains 100 statement-level tasks, covering 8 topic clusters such as adjunctions / limitsācolimits / abelian categories / monads, and is designed to pressure-test āabstract interface-layer reasoning + mathlib navigation (library-mediated reasoning)ā. The baseline results are quite brutal: even the strongest model only reaches 8.25% pass@1 / 12% pass@4; Medium/High are essentially near-zero, directly exposing the natural-to-formal bottleneck. We also evaluated LeanBridge: a retrieveāgenerateāverify loop that combines LeanExplore retrieval with Lean compilation/verification feedback, bringing stable improvements on a subset of tasks.
LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories) https://www.alphaxiv.org/abs/2512.24796
GitHub repository https://github.com/sciencraft/LeanCat
Since I barely know faculty and peers in AI for Math, and know even fewer in AI for Science, any shares and comments would be extremely helpful and very welcome.
I believe this work has some ābridgingā significance for the AI for Math and formal proof communities. So what follows is mainly a personal perspective.
During a break after a quantum matter conference this summer, Prof. Luodi remarked to us that doing frontier quantum-physics research does not necessarily require being in an academic instituteāindustry and companies can also provide excellent research environments. Many outstanding researchers at the frontiers of quantum computing and AI hold similar views. Meanwhile, companies often command financial resources far beyond what universities can access.
That sentence has stayed with me ever since. In areas like AI4S, why should scientists still work on scientific problems, instead of handing everything over to industry?
In my view, it is because there are certain things that only scientists will reliably do. Researchers in academic institutions and commercial decision-makers in big tech operate under different value systems. For instance, in practice, big tech cares a great deal about establishing benchmarks based on competitions like IMO and IPhO, training AIs to āclimb leaderboards,ā and then claiming this demonstrates their AIās scientific literacy and intelligence. But competition problems usually reflect very mature scientific knowledge and focus heavily on technique; they help relatively little with frontier scientific questions. Why do companies choose competitions as the yardstick for intelligence? There are many reasons: competition scores are directly rankable and easy for investors and the public to accept; and within the AI industry there are many former IMOers / IOIers, for whom such problems are naturally familiar.
By contrast, research on real scientific problems requires long-term academic accumulation. A typical scenario is that, when discussing industryāacademia collaboration in AI for Science, company representatives (e.g., from ByteDance, Huawei, etc.) and physicists can end up not understanding each otherās internal discussions at all. Academic results are often hard to interpret intuitively, and building high-quality datasets is difficult. For these reasons, industry tends to invest less in basic mathematical and physical sciences, and instead prioritizes topics with clear, observable returns: in biopharmaceuticals, for example, optimizing each step in pipelines such as large-molecule DFT can translate into enormous profits; today, Tsinghua University published its first Science paper of the new year titled (in Chinese) āAI enables million-fold acceleration of virtual drug screening.ā Similar stories hold for materials, energy, and chemical engineeringārather than basic science, which often appears to have limited immediate economic payoff.
Yet, as Professor Shing-Tung Yau has said, āThe foundation of technology is science.ā The most cutting-edge breakthroughs in technology ultimately rely on basic science. Precisely for that reason, teaching AI what scientific research is becomes especially important. This is what I mean by āthings only scientists will doāātraining and educating AI to serve the scientific enterprise.
Along this direction, I have seen many seniors and peers whom I deeply admire devote themselves to related efforts, and I feel a strong resonance with their work:
- Humanityās Last Exam (arXiv:2501.14249)
- PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models (arXiv:2504.16074)
- Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark (arXiv:2509.26574)
- CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers (arXiv:2510.05228)
- QMBench: A Research Level Benchmark for Quantum Materials Research (arXiv:2512.19753)
⦠and so on. A closely related area is the development of scientific agents. In these directions, there are not only seniors, but also peers and even younger researchers who are extraordinarily impressive:
- physmaster: Building an Autonomous AI Physicist for Theoretical and Computational Physics Research (arXiv:2512.19799)
I have long advocated the view that āyoung scholars should do things that are of our era.ā That belief is the backdrop of much of my advice to younger students. Then, as a young student myself, a natural question is: what can I do for AI4S? That question ultimately led to the birth of this work. Seeing the paper successfully appear online makes all the intense preparation and day-and-night effort feel worthwhile.
Why Category Theory?
Compared with mature mathematical subjects like linear algebra and calculusāfields with abundant data and relatively straightforward verificationācategory theory, especially higher category theory, not only stands at the frontier of formal mathematical sciences, but in some sense even exceeds what unaided human cognition can comfortably manage.
Elementary category theory attracts many young people with its concise elegance. If one only cares about functoriality, it can feel almost like working with finite sets and maps. But the value of category theory goes far beyond that. To borrow Master Kong Liangās phrasing, the essence of category theory is ācomputing with structure.ā This makes it naturally compatible with many branches of mathematics. But deeper research demands far more computation and far more delicate lemmas.
For physicists, the rapid progress in topological phases of matter over the past decade has brought in a great deal of new mathematicsāamong which category theory is one of the most representative. To understand topological order in four-dimensional spacetime, one essentially needs 2-categories. We learned angular momentum in elementary quantum mechanics; advanced quantum mechanics sometimes mentions its relation to classical 3j symbols. Three-dimensional topological order requires quantum 6j symbols, and four-dimensional topological order can even require quantum 15j symbols and 20j symbols. The associated computations are massiveāoften beyond what a human brain can hold.
Faced with this challenge at the edge of human intellectual capacity in frontier mathematical physics, a natural thought arises: can we let AI do the computation? This is one major motivation behind our work.
It is also worth noting that, over the past decade, the introduction and application of new mathematics in topological physics has been striking. Over the previous fifty years, string theory (and before that, perturbative QFT) introduced vast new mathematical machinery into formal theoretical physics. Now, might arenas such as the bootstrap, topological phases, symmetry, CFT, phase transitions, and critical phenomena become the forward outpost of new mathematics for our era?
(An intriguing viewpoint is that gravitational physics is deeply intertwined with topological physics. This perspective first arose from the loop-quantum-gravity community, and in recent developments in topological physics it has gained many implicationsāespecially in low dimensions, where fairly rigorous relationships have almost been established.)
Why Lean and Formal Proof?
Accordingly, we turned to Leanāone of the most mature platforms for formalized proof. The basic idea is: if the code type-checks and runs, the proof is correct, so proofs can be verified entirely at the machine level. The significance of this for mathematical proof is self-evident.
Historically, verifying major theorems and conjectures has consumed immense community effort: Perelmanās proof of the 3D PoincarĆ© conjecture, Wilesā proof of Fermatās Last Theorem, Mochizukiās claimed proof of the ABC conjecture via IUT theory, the Four Color Theorem, and so on. Even recently, there have been major claims still under intense scrutinyāe.g., verification efforts organized by institutions such as BICMR for certain announced resultsāand to this day the verification status of the classification of finite simple groups still feels, in practice, nontrivial to treat as completely āclosed.ā
Objectively, this reflects the importance of the Lean ecosystem itself and what formal verification can mean for mathematics. Subjectively, it also reflects our own prior accumulation in Lean and formal proof. Our team includes experts in category theory and mathematical physics; and this summer, Ronggeās team won the sole first prize at an AI4S hackathon hosted by Tsinghua Universityās IIIS (Institute for Interdisciplinary Information Sciences) and IASTU (Institute for Advanced Study, Tsinghua University). The project was called āLeanBridgeā, building a bridge between LLMs and Lean for formal proof. Some introductions (in Chinese) can be found here:
Next Steps�
Over the past three years, the revolution of large language models has reshaped society. From everyday conversations to the ivory tower, everyone asks whether AI can add value to their work.
I strongly agree with a remark by Andrew Ng: deep learning is one circle, oneās own career is another circle, and their intersection (a Venn diagram) is where AI can truly help. For scientific research, where are the boundaries of AIās assistance? Or is that intersection continually expanding?
I believe exploring the boundary of the intersection between AI and the scientific enterprise will be one of the most important themes of this era. A year ago, I might have thought AIās contributions to science would mainly come from LLMs, agent systems, and symbolic reasoning. But the past yearās progress has gone far beyond that. Papers that are almost independently completed by AI have begun to appear: yesterday, the renowned Harvard particle physicist Matthew Schwartz uploaded to arXiv a preprint that was generated and written via AI-driven reasoning and computation, with human physicists mainly serving as supervisors. More and more theorists now explicitly acknowledge AI assistants in the acknowledgments of their papers.
So what is the next AI hotspot? My guess is embodied intelligence, world models, and MCP. All three are, without exception, about how to teach AI to understand the physical world. In this way, AI can step out of the screen and genuinely participate in scholarly and industrial practice in the physical world, further realizing the idea that tools are extensions of human agency. Prof. Xiaoliang holds a similar view: since training AI models has become relatively mature, the next step is to deploy them into real production and daily life through MCP.
Acknowledgments and Closing
We thank the team of Prof. Jian Li at the School of Science, Westlake University, for hosting us: the final push of this work was completed there. I genuinely like Hangzhou as a city, and this was my first visit to Westlake University. As a small but elite emerging private university, Westlake left a deep impression on me in many ways: an outstanding faculty, responsible administration, excellent students and mentors, abundant research resources, and a strong academic atmosphere.
I personally love gentle sunshine, and winters in Jiangnan always leave me with warm memories. On my first visit to West Lake, discussions with Prof. Jian Li, Prof. Wei Zhu, Prof. Guang Hong, Prof. Chong Wang, and senior Jianhao benefited me greatly; what I learned will stay with me for a lifetime. I also thank the wonderful students at Westlakeās School of Science for their hospitalityāspending time with them is a very pleasant memory.
Finally, we owe our deepest thanks to Rongge, whose dedication accounts for the vast majority of this workās credit. Rongge is a student of Prof. Jian Li and Master Kong Liang, rigorously trained in topological phases of matter and category theory, and is now an outstanding postdoc at the Qiu Center. She is technically solid and full of ideas; without her sustained effort and relentless drive, this paper would not have seen the light of day.
I would like to close with a passage from the end of The Princeton Companion to Mathematicsāa passage that appears before Atiyahās advice to young mathematicians. I recommend that all students working in mathematical physics read this book:
When computers first appeared in the mathematiciansā world, the nearly unanimous reaction was that they would never be useful for proving theorems⦠Yet the other end of the rainbow may be a far deeper role for computers. In a small number of areas of mathematics, we have already been able to do this⦠The road to this magnificent new world is still long, and no one has yet explored it.
Jan 9, 2026