What Stanford’s HAI Report Says About AI in Science
Progress in artificial intelligence continues to accelerate across a range of expert disciplines, according to the latest AI Index report published this week by Stanford University’s Human-Centered Artificial Intelligence (HAI) center. When it comes to science, math, and reasoning, several frontier AI models now meet or exceed human baselines on PhD-level questions. However, gaps remain in what these models can do, and limitations persist in how they can be applied in the real world.
The Stanford HAI Center’s AI Index reports are valuable because they gather hard data about AI models running in the real world, as opposed to only asking people for their opinions (which HAI also does). For 2026, HAI looked into published benchmark results for a range of AI models and found that they continue to improve at an astounding rate.
For instance, the researchers found that frontier models gained 30 percentage points in a single year on Humanity’s Last Exam, a benchmark composed of questions from nearly 1,000 subject-matter experts, primarily professors, researchers, and graduate degree holders. Humanity’s Last Exam was designed to really put AI models through their paces, but the models are getting so good that evaluations intended to remain challenging for years are instead being saturated within months, HAI says in its report (which you can access here).
The size of AI models continues to grow (Source: Stanford HAI AI Index 2026)
The top six AI models, which come from Anthropic, xAI, Google, OpenAI, Alibaba, and DeepSeek, have converged in capability in early 2026, per the Arena Leaderboard, HAI reports. Meta now resides outside the top tier of models and has shown no improvement on that benchmark over the past 22 months. In general, open models like Meta’s Llama are doing worse than closed models like OpenAI’s, according to HAI. The spread between the top closed model and the top open model widened from 0.3% in August 2024 to 3.3% in March 2026.
“AI capability is not plateauing. It is accelerating and reaching more people than ever,” the authors of the AI Index wrote. “Industry produced over 90% of notable frontier models in 2025, and several of those models now meet or exceed human baselines on PhD-level science questions, multimodal reasoning, and competition mathematics. On a key coding benchmark–SWE-bench Verified–performance rose from 60% to near 100% of meeting the human baseline in a single year. Organizational adoption reached 88%, and 4 in 5 university students now use generative AI.”
When it comes to science, AI models continue to rack up big gains. However, their overall usefulness is mixed.
For instance, HAI says frontier models generally now outperform human chemists, as shown by ChemBench, a benchmark designed to evaluate the chemical knowledge and reasoning capabilities of LLMs. According to HAI, the best AI models now surpass human expert averages across more than 2,700 chemistry questions on ChemBench. HAI also mentioned the launch of Polymathic’s AION-1, the first foundation model for astronomy, and pointed out the big advances made in weather forecasting with the FourCastNet 3 global weather model and Aardvark Weather, an AI forecaster developed by the University of Cambridge.
HAI also points out that the first fully AI-generated paper was accepted at a peer-reviewed workshop in 2025. Sakana’s AI Scientist-v2 model produced a paper that was accepted at an ICLR workshop without any human-coded templates; that paper has since been accepted for publication in the journal Nature. Google’s AI Co-Scientist was validated in three biomedical areas, HAI says.
Despite these advances, there are still holes in AI’s scientific repertoire, including the ability to replicate scientific studies.
AI models meet or exceed PhD-level performance on a range of general tasks… (Source: Stanford HAI AI Index 2026)
HAI points out that frontier models score below 20% on paper-scale replication in astrophysics on ReplicationBench, a framework introduced in 2025 by Stanford and University of Toronto researchers to evaluate the validity of AI-assisted scientific research in astrophysics. LLM agents also answer earth observation questions with only 33% accuracy on UnivEarth, a benchmark created for measuring the reliability of AI-assisted research in Earth Observation (EO) and geospatial analysis. What’s more, the agents’ code fails 58% of the time on UnivEarth.
Science LLM agents’ ability to handle end-to-end tasks is also not up to par. HAI points out that the best agent reaches 38.8% accuracy on PaperArena, an evaluation tool introduced last year by Cornell University researchers, versus a PhD expert baseline of 83.5%. Frontier models achieve roughly 17% accuracy on real-world bioinformatics analysis as measured by BixBench, a benchmark for computational biology introduced last year.
AI is also making gains in medicine, which occupies a full chapter in the AI Index. Thanks to broad improvements in AI transcription accuracy, physicians are spending up to 83% less time writing patient notes after visits. That’s having a meaningful impact in reducing burnout, the report notes. AI is also showing some skill in diagnosing disease, as demonstrated by Microsoft’s AI Diagnostic Orchestrator, which utilizes OpenAI’s o3 and scored 85.5% accuracy on a test of complex published case studies. By comparison, “unaided physicians” (meaning they did not have access to their “usual tools”) scored only 20%.
…but AI models are not meeting human-level baselines on benchmarks like PaperArena that measure end-to-end scientific workflows (Source: Stanford HAI AI Index 2026)
There is a shift to smaller models in molecular biology, according to the AI Index. HAI points to MSA Pairformer, a 111-million-parameter protein language model that outperformed the previous leaders on the ProteinGym benchmark despite having two orders of magnitude fewer parameters. It also pointed out that GPN-Star, a 200-million-parameter genomics model, outperformed a model with 40 billion parameters.
While AI has come a long way, there are still some gaps, which contribute to the “jagged frontier” problem with AI. For instance, AI models still can’t reliably tell time. According to Stanford HAI, the top model reads analog clocks correctly just 50.1% of the time.
And hallucinations continue to be a problem. On one such evaluation, GPT-4o’s accuracy dropped from 98.2% to 64.4%, while DeepSeek R1 fell from around 90% to 14.4%. “When a false statement is presented as something another person believes, models handle it well,” the AI Index authors write. “When the same false statement is presented as something a user believes, performance collapses.”
You can download a copy of Stanford HAI AI Index 2026 here.
This article first appeared on HPCwire.