Researchers at UC San Diego and MIT have found a new method of influencing the way large language models think and respond. Published last week in the journal Science, the work describes the new tool and how it can improve model behavior, while also warning about its potential safety risks.
Despite advances in model scale and alignment techniques, the internal activation space of LLMs remains only partially understood. Developers can tune behavior at the surface, but the underlying mathematical representations that drive responses have been hard to access directly. Typical ways to improve model behavior, such as prompt tuning and reinforcement learning from human feedback, operate on a model’s inputs and outputs rather than its internals. They do not fundamentally change what the model “knows” or how it internally represents information.
This new method works by identifying mathematical representations of specific concepts inside a model’s activation space and then using those representations to influence how the model responds. In the paper, the authors show that many concepts, ranging from fears to moods to locations, can be represented as linear directions, or vectors, in the model’s internal activation geometry. Once these vectors are isolated, they can be added to or subtracted from a model’s activations to steer responses toward or away from the corresponding concept for a given prompt. This lets the researchers push the model’s outputs in predictable directions without retraining the entire network.
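At its core, this kind of steering is simple vector arithmetic on hidden activations. The sketch below uses NumPy with a made-up 64-dimensional layer and a random stand-in for a unit-norm “concept” vector; none of the names or values come from the paper, which works on real model internals:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension of a toy model layer

# Hypothetical concept direction (e.g., a "fear" vector), normalized to unit length.
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)

def steer(activations, direction, alpha):
    """Shift hidden activations along a concept direction.

    alpha > 0 steers toward the concept; alpha < 0 steers away from it.
    """
    return activations + alpha * direction

h = rng.normal(size=d)                 # activations for one token at one layer
h_toward = steer(h, concept, alpha=4.0)
h_away = steer(h, concept, alpha=-4.0)

# Because the direction is unit norm, the projection onto it moves by exactly alpha.
print(round(float((h_toward - h) @ concept), 3))  # 4.0
print(round(float((h_away - h) @ concept), 3))    # -4.0
```

In a real model, the same addition would be applied to the residual-stream activations at a chosen layer during the forward pass, which is what makes steering cheap relative to retraining.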
According to a UC San Diego Today report, the study, led by Mikhail Belkin at UC San Diego and Adit Radhakrishnan at MIT, applied this steering approach to widely used open-source models such as Llama and DeepSeek, identifying and influencing 512 concepts across five classes. The work builds on the researchers’ 2024 Science paper introducing Recursive Feature Machines, predictive algorithms designed to detect semantic patterns within a model’s internal layers. Once those patterns are isolated, they can be adjusted directly, almost like turning a control knob inside the network, allowing researchers to steer behavior without retraining the entire system.
A notable feature of this method is how efficiently it can be deployed. The team reports finding and manipulating concept patterns with fewer than 500 training samples and in under a minute on a single high-end GPU, a sharp contrast with fine-tuning large models, which often requires massive datasets and extensive compute infrastructure.
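The paper’s Recursive Feature Machines extract these patterns with a dedicated algorithm, but a simpler stand-in gives the intuition for why a few hundred samples can suffice: with labeled examples of a concept, even a difference of class means recovers a usable direction. Everything below is simulated toy data, not the authors’ method:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 100  # toy hidden dimension; n samples per class (200 total, under 500)

# Ground-truth direction the simulation hides in the activations.
true_dir = np.zeros(d)
true_dir[0] = 1.0

# Simulated activations: "concept present" examples are shifted along true_dir.
pos = rng.normal(size=(n, d)) + 3.0 * true_dir  # concept present
neg = rng.normal(size=(n, d))                   # concept absent

# Difference-of-means estimate of the concept direction.
direction = pos.mean(axis=0) - neg.mean(axis=0)
direction /= np.linalg.norm(direction)

# Cosine similarity with the hidden direction should be close to 1.
print(float(abs(direction @ true_dir)))
```

With only 200 samples the estimate aligns closely with the hidden direction, which mirrors the sample efficiency the researchers report, even though their actual extraction technique is more sophisticated.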
The researchers say the method has several potential benefits. Steering can make a model perform better on narrow, specialized tasks, such as code translation from Python to C++. It can also help identify hallucinations, or outputs in which the model confidently gives false or fabricated information. For those who rely on LLMs for real-world applications, identifying the source of hallucinations could translate into more capable and reliable systems with fewer surprises.
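If a direction associated with hallucination can be isolated, one plausible way to use it for detection is as a linear probe: project a response’s activations onto the direction and flag outputs that exceed a threshold. This is an illustrative sketch with random stand-in data, not the detection procedure described in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64

# Stand-in for an already-extracted "hallucination" direction, unit norm.
h_dir = rng.normal(size=d)
h_dir /= np.linalg.norm(h_dir)

def flag_hallucination(activations, direction, threshold=2.0):
    """Flag an output whose activations project strongly onto the direction."""
    return float(activations @ direction) > threshold

normal_act = rng.normal(size=d)            # ordinary activations
suspect_act = normal_act + 10.0 * h_dir    # strongly expresses the concept

print(flag_hallucination(suspect_act, h_dir))
```

The threshold and the shift magnitude here are arbitrary; in practice they would have to be calibrated against activations from known-good and known-hallucinated outputs.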
“Our instinct as humans is to control and monitor AI models through natural language. However, neural networks natively deal with information through their internal mathematical processes. Our work demonstrates what you can gain by operating directly on these processes,” said Daniel Beaglehole, a computer science Ph.D. student at UC San Diego who is a co-author of this study, in the UC San Diego Today report.
These experiments suggest that mechanisms designed to steer internal representations could also be used to bypass safeguards. By weakening the internal representation of refusal, the part of the model that says no to harmful requests, the researchers were able to elicit policy-violating responses, including instructions on drug use and endorsements of conspiracy theories like a flat Earth cover-up. This raises further questions about how these steering techniques might be misused and how such misuse could be prevented, since this type of internal jailbreaking is much harder to stop with surface-level guardrails.
One caveat is that the team tested their steering method only on open-source models and did not evaluate it on closed proprietary models from major AI vendors, so it remains unknown how well the same techniques would work on models with restricted internal access. Within the open-source models they did examine, the researchers found that newer and larger models tended to be easier to steer toward concepts of interest. Given the technique’s computational efficiency, they suggested that similar steering might be possible for much smaller open-source models that could run on a laptop, and they expect future work to refine the approach so it can adapt to specific inputs and practical applications.
The deeper significance of internal steering techniques is what they could reveal about the structure of AI models. If researchers can map how models encode and organize semantic information, then guiding those internal representations could become as fundamental to AI development as scaling parameters or refining training data. Tools like these could be the key to building more trustworthy and capable systems in the future.
The post Researchers Demonstrate New Internal Steering Technique for LLMs appeared first on AIwire.

