Stony Brook Study Stress-Tests Neural Networks on Thousands of Tiny Rule Systems
STONY BROOK, N.Y., Feb. 20, 2026 — In his office lined with hand-drawn diagrams and alphabet-like symbols, Stony Brook researcher Jeffrey Heinz is trying to answer a deceptively simple question: How well, exactly, can today’s neural networks learn, and where do they fail?
Heinz, a professor with a joint appointment in the Department of Linguistics and the Institute of Advanced Computational Science, usually studies the sound patterns of human language. In his latest project, he and his collaborators have built something that looks less like a traditional linguistics study and more like a stress test for modern AI.
Their work, called MLRegTest, is a carefully designed benchmark for neural networks (and other machine learning methods), built not to ask a model to write articles or poems, but to pose thousands upon thousands of tiny yes-or-no questions about simple symbol patterns and to watch very closely what happens.
“We’re trying to understand the learning capacities of neural networks from a controlled experimental point of view,” Heinz said. “It’s an endeavor to map their performance on kind of a big scale.”
1,800 Tiny ‘Languages’ for Machines
At the heart of MLRegTest is a huge collection of what computer scientists call formal languages, not English or Spanish or French, but small, made-up rule systems over simple symbols.
Think of them like this:
- There’s a small alphabet of symbols (like A, B, C, D).
- You line them up into short sequences.
- A hidden rule decides which sequences “belong” and which don’t.
A model’s job is to look at lots of examples and learn to answer: Does this sequence follow the rule, yes or no?
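A toy rule system in this spirit can be sketched in a few lines of Python. The rule below is hypothetical, made up for illustration, and not drawn from the benchmark's 1,800 languages: over the alphabet {A, B, C, D}, a sequence "belongs" only if no B ever appears after a C.

```python
# A tiny made-up rule system in the spirit of MLRegTest's formal languages.
# Hypothetical rule (not one of the benchmark's 1,800): a sequence over
# {A, B, C, D} belongs only if no B ever appears after a C.

def belongs(sequence: str) -> bool:
    """Return True if the sequence follows the rule, False otherwise."""
    seen_c = False
    for symbol in sequence:
        if symbol == "C":
            seen_c = True
        elif symbol == "B" and seen_c:
            return False  # a B after a C breaks the rule
    return True

print(belongs("ABCA"))  # True: the only B comes before the C
print(belongs("ACAB"))  # False: a B appears after a C
```

A learner never sees the rule itself, only labeled examples like these, and must recover the underlying pattern from them.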
Heinz and his team constructed 1,800 different rule systems of this kind. Then, for each one, they generated a large set of example sequences, some that obeyed the rule and others that didn’t. “This was a large-scale effort because we wanted to create a benchmark that would be around for a while,” Heinz added.
Each rule is built using a compact mathematical description called a grammar — like an equation that defines which sequences are in and which are out. The team also made sure that every language in the collection was genuinely distinct, not just a disguised version of another rule.
That careful construction matters. Because the patterns are so tightly defined, they give researchers a way to “look inside” the black box of neural networks without actually opening it up.
Pushing AI to the Edge of the Rule
Once MLRegTest was built, the team used it to probe how different neural network architectures behaved when they tried to learn these pattern rules. Two parts of the test design were especially revealing.
First, the researchers checked how well models could generalize to longer sequences than the ones they saw during training. That’s a long-standing worry in natural language processing: a system may see mostly short sentences in training, then stumble when faced with something much longer and more complex. MLRegTest confirmed that performance tends to drop as sequences get longer.
But the most striking results came from a more delicate probe: what Heinz calls the “border” of a language.
For each rule system, the team built special test sets made of pairs of almost-identical sequences. Each pair is the same length, with the same symbols in nearly the same positions. The only difference might be a single symbol. According to the underlying rule, one sequence belongs, and the other doesn’t.
To humans, those edge cases are often where the rule becomes clearest: change one small thing, and suddenly the pattern breaks.
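Under the same kind of hypothetical rule as before (a made-up example, not one of the benchmark's actual languages), a border pair looks like this: two strings of the same length that differ in exactly one position, yet fall on opposite sides of the rule.

```python
# Sketch of a "border" test pair under a hypothetical rule (not from the
# benchmark): a sequence belongs only if no B appears after a C.

def belongs(sequence: str) -> bool:
    seen_c = False
    for symbol in sequence:
        if symbol == "C":
            seen_c = True
        elif symbol == "B" and seen_c:
            return False
    return True

inside = "ACADAD"   # belongs: no B appears anywhere after the C
outside = "ACABAD"  # differs only at position 3 (D -> B), and that B follows a C

assert belongs(inside) and not belongs(outside)
diffs = [i for i, (x, y) in enumerate(zip(inside, outside)) if x != y]
print(diffs)  # prints [3]: the pair differs at exactly one position
```

MLRegTest's border test sets are built from many such pairs, so a model cannot score well by exploiting coarse statistics; it has to track the exact rule.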
The networks struggled.
“We found that the networks were much worse at correctly classifying those strings along the border of the language,” Heinz said. “While they could learn generally pretty well, they weren’t learning the same thing as the underlying pattern.”
In other words, on everyday examples, the models often looked fine. But zoom in on the edge cases, and they reveal that the models have learned a rough approximation of the rule, not the rule itself.
In high-stakes domains — law, medicine, or engineering — that distinction matters. The “border cases” are exactly where we most need a system to behave reliably.
Not Just ‘More Data, More Compute’
MLRegTest is also a quiet challenge to the dominant trend in training AI models: just feed them more data.
For every one of the 1,800 languages, the team trained models in three conditions: small, medium, and large training sets. From his linguistics background, Heinz is especially interested in systems that can learn from limited data — more like kids who generalize quickly from relatively few examples.
“I think it’s possible to design learning systems that can generalize accurately and quickly from small amounts of data,” he said. “The trend in AI has not been in that direction. It’s been, let’s consume more data, more compute, more energy.”
This is not only a scientific concern but a practical one. In applications like robotic medical assistance or self-driving cars, many of the most serious situations are rare: a particular combination of weather, road design, and other drivers might occur only one in a million times. A model that only works well on patterns it has seen thousands of times isn’t enough.
Heinz imagines MLRegTest as a kind of challenge problem for anyone who wants to push AI in a different direction.
The benchmark makes it possible to say: here is a wide variety of clearly defined pattern rules; here are strict tests on their edge cases; here are three tiers of training data. How far can your system go?
“If any system can do well across the board on the small data set, that would be a truly remarkable accomplishment,” he said.
A Long View on AI’s Limits
MLRegTest belongs to a long tradition. In the 1940s, early neural networks were analyzed using symbolic tools that later became the regular expressions many programmers use today. Heinz’s work loops back to that history, using modern mathematical tools to study today’s much larger networks.
He’s realistic about how quickly it will influence commercial models.
“I don’t think Big Tech cares about this at all,” he said with a laugh. “But for researchers who want to understand AI systems, not just deploy them, the benchmark offers something rare: a way to ask precise questions about what a model has really learned.” Still, there is growing interest in generative AI for formal languages, which can be used to develop verifiable code for applications such as critical infrastructure.
Heinz’s work is a reminder that good performance on a handful of benchmarks doesn’t necessarily mean deep understanding. If we want AI systems we can trust, we’ll need to keep inventing new, sharper ways to test them.
Source: Ankita Nagpal, Stony Brook University