If Reproducibility in AI Is Important, Try Model Flows
Reproducibility is absolutely critical in science, but it’s a troublesome characteristic when it comes to AI. Frontier models developed by Big AI may deliver superior accuracy and reasoning capabilities, but they do so largely as black boxes with little regard for reproducibility. If AI is going to turbo-charge scientific productivity, it must do so without compromising reproducibility. The question, then, becomes how to achieve it.
This was the topic of a presentation at the TPC26 conference last week by Noah Smith, a computer scientist at the University of Washington and senior director of NLP research at the Allen Institute for Artificial Intelligence. Smith discussed why it’s important for scientists to have AI tools that meet their needs when it comes to reproducibility, and how model flows can help to deliver them.
Image courtesy Noah Smith, Ai2
“Scientists need to be able to inspect and control their tools. A big part of science is your tools–the engineering, the systems that are going to help you answer questions,” Smith said. “At the Allen Institute for AI and with our collaborators at the University of Washington and other universities, we’ve taken the position that the way to get to this fine-grained control and inspectability is through what we call model flows.”
What exactly is a “model flow”?
“We use this term ‘model flow’ to refer to a kind of full openness,” he continued. “Everything that you need to reproduce the work from the very beginning: all of the data, the model weights…and intermediate checkpoints. We describe the entire recipe. I’ll give you all the code that you need to reproduce any stage so that you can go back and change anything. All of our evaluations are careful and open, and we richly document and analyze the capabilities of the models.”
Clearly, many frontier models fail to check even some of these boxes. Claude, Gemini, and GPT, from Anthropic, Google, and OpenAI respectively, are all extremely capable models that deliver stellar results on many general-purpose topics, but they are closed source and don’t offer the full model flows that are critical for reproducibility. Scientists receiving funding from government institutions, including the Department of Energy and National Science Foundation, can use these proprietary frontier models, although they must meet strict privacy and security guarantees.
There are other challenges with using frontier models from Big AI. For starters, they’re optimized for consumer and enterprise usage, not necessarily for science (although some Big AI providers, like Google, are offering science packages). They also tend to be quite expensive to use at scale, which is why much of the discussion of AI for science and engineering, at least in the public sphere, tends to take place around fully open models.
The Allen Institute for AI (Ai2), which received $152 million in funding last August from the NSF and Nvidia, is developing the Olmo 3 family of fully open models, intended primarily for use by scientists and engineers. Olmo 3, available in 7B and 32B sizes, delivers the full model flows that scientists need, but at a fraction of the data budget of something like Qwen 3, according to Smith.
The Ai2 lineup also includes Molmo, a vision-language model designed to generate textual descriptions from visual input, and MolmoPoint, which adds support for pointing commands. Vision-language models are important for bridging the gap between AI models and the agents and robots that are going to act in the real world, Smith said. Molmo2, which was recently released, adds support for video.
There is also DR Tulu, a reinforcement learning (RL) model designed to power deep research agents. The DR Tulu stack gives scientists the ability to create agents that search and browse literature, evaluate relevance, integrate evidence, write answers with attribution, and evaluate precision and recall. It uses RL to create rubrics that evolve based on what the agent discovers. DR Tulu-8B performs comparably to GPT-5 Search, OpenAI DR, and Claude Sonnet, but at a cost that is 100X to 1,000X lower.
Noah Smith, Ai2 director of NLP research and computer science professor at University of Washington
Olmo Hybrid, meanwhile, melds the precise recall of transformers with the superior state tracking of linear recurrent neural networks (RNNs) to create a hybrid model that excels at both. Olmo Hybrid delivers superior performance in math, coding, and other categories compared to Olmo 3-7B, as well as offering better scaling, according to Smith.
While the AI models from Ai2 can deliver comparable performance to proprietary frontier models, they do so with full reproducibility as a result of their open model flows. They’re also more adaptable than frontier models, which Smith cited as another factor in their favor. If scientists value reproducibility, adaptability, and the ability to control their own AI models, then fully open models should be where they are putting their chips, he said.
“I think reproducing commercial AI is too small a goal for those of us working in the open space,” Smith said. “I think building infrastructure for science needs to enable scientific communities to do things that the market is just never going to prioritize: Inspect the internals of the system, adapt it to local scientific requirements, study every aspect of its development so we can make improvements, [and] control the costs and specialize for long-tail domains.”