ANL’s Rick Stevens Puts AI Agents to the AI-for-Science Test
Have we reached a point where AI agents can reliably function as scientific collaborators? Can they go one step further and work as autonomous scientists?
Stevens is an Associate Laboratory Director for the Computing, Environment and Life Sciences (CELS) Directorate at ANL and a Distinguished Fellow at the laboratory. He is also a Professor of Computer Science at the University of Chicago.
At his TPC26 keynote, he outlined how he used large-scale experiments involving scientific paper replication and model benchmarking to better understand what it would take to accelerate scientific discovery using autonomous AI agents.
Stevens wanted not only to measure model performance but also to gauge the practical requirements of deploying AI agents at scale. That includes testing the coordination mechanisms and the resources needed to support increasingly complex scientific workflows.
Rick Stevens delivers his address at TPC26
One of the most ambitious parts of his experiment was teaching AI agents how to replicate scientific papers. While reproducing existing research may sound less exciting than making new discoveries, Stevens argued that replication provides a practical way to measure both the capabilities and limitations of today’s AI systems.
“The basic goal here is to hand the paper to the agent and tell it to do everything it can to replicate the paper. So read the paper, build a table of what the principal ideas were, the principal tools, the hypotheses, the assumptions, and then do a parallel implementation,” Stevens said.
“And this was this is pretty interesting to see how this fails, how it works and how it fails, but it’s also a basic building block of doing science, right? And we’re trying to collect information on the throughput and resources needed to do this and what kind of resources are needed.”
The project now includes approximately 100 papers and requires agents to understand scientific methods, identify the necessary tools and datasets, and reproduce published findings. Along the way, the agents are generating new research questions and helping Stevens estimate what it might take to eventually scale AI-driven science beyond replication and toward original discovery.
One of the most important findings from the project was that AI agents proved capable of reproducing a meaningful portion of scientific work. Stevens’ experiment evaluated each replication attempt using measures such as coverage and agreement with the original results. Across the papers evaluated so far, agents achieved average scores of roughly 7.5 for coverage and 8 for agreement. More than half scored above 8 on both measures.
Performance varied significantly depending on the type of research. Mathematical papers, theoretical derivations, and studies built around open-source software and accessible datasets generally produced the strongest results. In some cases, agents even improved on published findings by achieving lower error rates than those reported in the original work.
Stevens said the strongest predictor of successful replication was whether authors made their code publicly available.
The project also revealed important limitations. Agents struggled when papers relied on proprietary software and inaccessible datasets. They also did not do well with poorly documented methods or physical experiments.
Stevens observed that many scientific papers contain tacit assumptions that are never explicitly documented, making them difficult to reproduce accurately.
Despite those challenges, the results were encouraging enough to push the project beyond simple replication. The agents are now generating follow-up research questions from the papers they analyze and laying the groundwork for future experiments focused on original scientific discovery.
The results also allowed Stevens to begin estimating the resources required to scale AI-driven scientific workflows. What started as an experiment in paper replication quickly evolved into a broader effort to understand the infrastructure needed to support large numbers of scientific agents.
“That’s really interesting because it allows us to project the resource requirements that we’re gaining from the replication project into, if you wanted to accelerate science to new and open problems, how much resource might be needed.”
Stevens also used the project to estimate what it would take to scale agent-based science further. Replicating 1,000 scientific papers in 10 days would require hundreds of parallel agents, roughly 200,000 GPU hours, millions of CPU hours and hundreds of terabytes of storage. Stevens described the exercise as a way to understand the infrastructure requirements for future AI-driven scientific discovery.
Stevens said the team is using replication as a baseline to estimate the effort required for original research. Early results suggest that pursuing new discoveries may require 10 to 30 times more resources than reproducing existing work, depending on the complexity of the problem.
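To put those figures in perspective, the short Python sketch below works through the arithmetic behind them. It uses only the numbers quoted above (1,000 papers, 10 days, roughly 200,000 GPU hours, and a 10 to 30 times multiplier). It makes two assumptions that are not from Stevens' talk: that GPUs are used evenly around the clock, and that the multiplier applies directly to GPU hours. The function names are illustrative only.

```python
# Back-of-envelope arithmetic on the figures quoted in the article.
# Assumptions (not from the talk): GPU use is spread evenly over
# 24 hours a day, and the 10-30x multiplier applies to GPU hours.

PAPERS = 1_000                    # papers to replicate
DAYS = 10                         # wall-clock target
GPU_HOURS = 200_000               # "roughly 200,000 GPU hours"
DISCOVERY_MULTIPLIER = (10, 30)   # "10 to 30 times more resources"


def replication_estimates(papers=PAPERS, days=DAYS, gpu_hours=GPU_HOURS):
    """Return average throughput and GPU demand for the replication run."""
    wall_clock_hours = days * 24
    return {
        "papers_per_day": papers / days,
        "gpu_hours_per_paper": gpu_hours / papers,
        # Average number of GPUs busy at any moment under even utilization.
        "avg_concurrent_gpus": gpu_hours / wall_clock_hours,
    }


def discovery_gpu_hours(gpu_hours=GPU_HOURS, multiplier=DISCOVERY_MULTIPLIER):
    """Scale the replication GPU budget by the low and high multipliers."""
    low, high = multiplier
    return gpu_hours * low, gpu_hours * high


if __name__ == "__main__":
    for name, value in replication_estimates().items():
        print(f"{name}: {value:,.1f}")
    low, high = discovery_gpu_hours()
    print(f"discovery GPU hours: {low:,} to {high:,}")
```

Under those assumptions, the replication run averages 100 papers a day, about 200 GPU hours per paper, and roughly 830 GPUs busy at any given time. Applying the same multiplier to an equivalent set of original research problems would imply 2 million to 6 million GPU hours.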
If replication is the first step toward autonomous discovery, Stevens’ project offers an early glimpse into both the promise and the hurdles of building AI systems capable of accelerating science.