Covering Scientific & Technical AI

ANL’s Rick Stevens Puts AI Agents to the AI-for-Science Test

Have we reached a point where AI agents can reliably function as scientific collaborators? Can they go one step further and work as autonomous scientists?

Stevens is an Associate Laboratory Director for the Computing, Environment and Life Sciences (CELS) Directorate at Argonne National Laboratory (ANL) and a Distinguished Fellow at the laboratory. He is also a Professor of Computer Science at the University of Chicago.

At his TPC26 keynote, he outlined how he used large-scale experiments involving scientific paper replication and model benchmarking to better understand what it would take to accelerate scientific discovery using autonomous AI agents.

Stevens wanted not only to measure model performance but also to gauge the practical requirements of deploying AI agents at scale. That includes testing the coordination mechanisms and resources needed to support increasingly complex scientific workflows.

Rick Stevens delivers his address at TPC26

One of the most ambitious parts of his experiment was teaching AI agents how to replicate scientific papers. While reproducing existing research may sound less exciting than making new discoveries, Stevens argued that replication provides a practical way to measure both the capabilities and limitations of today’s AI systems.

“The basic goal here is to hand the paper to the agent and tell it to do everything it can to replicate the paper. So read the paper, build a table of what the principal ideas were, the principal tools, the hypotheses, the assumptions, and then do a parallel implementation,” Stevens said.

“And this was this is pretty interesting to see how this fails, how it works and how it fails, but it’s also a basic building block of doing science, right? And we’re trying to collect information on the throughput and resources needed to do this and what kind of resources are needed.”

The project now includes approximately 100 papers and requires agents to understand scientific methods, identify the necessary tools and datasets, and reproduce published findings. Along the way, the agents are generating new research questions and helping Stevens estimate what it might take to eventually scale AI-driven science beyond replication and toward original discovery.
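The reading step Stevens describes (a table of principal ideas, tools, hypotheses and assumptions, followed by a parallel implementation) could be captured in a simple record. The following is a minimal sketch, not the project's actual data model: the class name `ReplicationPlan`, its fields and the `missing_sections` check are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReplicationPlan:
    """Hypothetical table an agent might fill in after reading a paper."""

    title: str
    principal_ideas: List[str] = field(default_factory=list)
    tools: List[str] = field(default_factory=list)
    hypotheses: List[str] = field(default_factory=list)
    assumptions: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of table sections that are still empty."""
        sections: Dict[str, List[str]] = {
            "principal_ideas": self.principal_ideas,
            "tools": self.tools,
            "hypotheses": self.hypotheses,
            "assumptions": self.assumptions,
        }
        return [name for name, items in sections.items() if not items]


# Placeholder values for illustration only.
plan = ReplicationPlan(
    title="Example paper",
    principal_ideas=["idea A"],
    tools=["open source solver"],
)
print(plan.missing_sections())
```

A check like this would flag papers whose assumptions were never written down before any implementation work starts.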

One of the most important findings from the project was that AI agents proved capable of reproducing a meaningful portion of scientific work. Stevens’ experiment evaluated each replication attempt using measures such as coverage and agreement with the original results. Across the papers evaluated so far, agents achieved average scores of roughly 7.5 for coverage and 8 for agreement. More than half scored above 8 on both measures.
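As a rough illustration of how such a summary could be computed, the sketch below averages coverage and agreement and reports the share of papers scoring above 8 on both. The score values are made-up placeholders, not data from the project, and the function name `summarize` is hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarize(scores: List[Tuple[float, float]], threshold: float = 8.0) -> Dict[str, float]:
    """Summarize (coverage, agreement) pairs, one pair per replicated paper."""
    coverage = [c for c, _ in scores]
    agreement = [a for _, a in scores]
    # Count papers that exceed the threshold on both measures.
    both_high = sum(1 for c, a in scores if c > threshold and a > threshold)
    return {
        "mean_coverage": mean(coverage),
        "mean_agreement": mean(agreement),
        "share_above_threshold": both_high / len(scores),
    }


# Placeholder scores for four imaginary papers (not real results).
example_scores = [(9.0, 9.5), (8.5, 9.0), (6.0, 7.0), (8.5, 8.5)]
summary = summarize(example_scores)
print(summary)
```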

Performance varied significantly depending on the type of research. Mathematical papers, theoretical derivations, and studies built around open source software and accessible datasets generally produced the strongest results. In some cases, agents were even able to improve upon published findings by achieving lower error rates than those reported in the original work.

Stevens said the strongest predictor of successful replication was whether authors made their code publicly available.

The project also revealed important limitations. Agents struggled when papers relied on proprietary software or inaccessible datasets, and they fared poorly with poorly documented methods and physical experiments.

Stevens observed that many scientific papers contain tacit assumptions that are never explicitly documented, making them difficult to reproduce accurately.

Despite those challenges, the results were encouraging enough to push the project beyond simple replication. The agents are now generating follow-up research questions from the papers they analyze and laying the groundwork for future experiments focused on original scientific discovery.

The results also allowed Stevens to begin estimating the resources required to scale AI-driven scientific workflows. What started as an experiment in paper replication quickly evolved into a broader effort to understand the infrastructure needed to support large numbers of scientific agents.

“That’s really interesting because it allows us to project the resource requirements that we’re gaining from the replication project into, if you wanted to accelerate science to new and open problems, how much resource might be needed,” Stevens said.

Stevens also put numbers on what scaling agent-based science further would take. Replicating 1,000 scientific papers in 10 days would require hundreds of parallel agents, roughly 200,000 GPU hours, millions of CPU hours and hundreds of terabytes of storage. He described the exercise as a way to understand the infrastructure requirements for future AI-driven scientific discovery.
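To put the GPU figure in perspective, the back-of-the-envelope sketch below derives per-paper and sustained-usage numbers from the 1,000-paper, 10-day, 200,000 GPU-hour estimate. It assumes the work is spread evenly across the full 10 days, which the talk did not specify, and the function name `campaign_profile` is hypothetical.

```python
from typing import Dict


def campaign_profile(papers: int, days: int, gpu_hours: float) -> Dict[str, float]:
    """Derive simple averages from a replication campaign estimate."""
    wall_clock_hours = days * 24  # assumes round-the-clock operation
    return {
        "gpu_hours_per_paper": gpu_hours / papers,
        "papers_per_day": papers / days,
        # Average number of GPUs busy at any moment under an even load.
        "avg_gpus_in_use": gpu_hours / wall_clock_hours,
    }


profile = campaign_profile(papers=1_000, days=10, gpu_hours=200_000)
print(profile)
```

Under these assumptions, each paper consumes about 200 GPU hours, and the campaign keeps roughly 830 GPUs busy on average while finishing 100 papers per day.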

Stevens said the team is using replication as a baseline to estimate the effort required for original research. Early results suggest that pursuing new discoveries may require 10 to 30 times more resources than reproducing existing work, depending on the complexity of the problem.
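Applied to the roughly 200,000 GPU hours estimated for replicating 1,000 papers, that multiplier gives a sense of scale. The sketch below assumes the 10x to 30x range applies uniformly to GPU hours for a comparably sized workload, which is an illustrative simplification rather than something Stevens stated. The function name `scale_range` is hypothetical.

```python
from typing import Tuple


def scale_range(baseline: float, low: float = 10, high: float = 30) -> Tuple[float, float]:
    """Return the (low, high) resource range implied by a multiplier band."""
    return baseline * low, baseline * high


REPLICATION_GPU_HOURS = 200_000  # replication estimate for 1,000 papers
discovery_low, discovery_high = scale_range(REPLICATION_GPU_HOURS)
print(f"{discovery_low:,.0f} to {discovery_high:,.0f} GPU hours")
```

Under that assumption, a discovery effort of similar scope would land between 2 million and 6 million GPU hours.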

If replication is the first step toward autonomous discovery, Stevens’ project offers an early glimpse into both the promise and the hurdles of building AI systems capable of accelerating science.
