Training Isn’t Enough: Reasoning Models and LLMs Need Reinforcement Learning
Most people familiar with generative models know that LLMs are trained on vast swaths of the internet's content. Many regard their billions of parameters, which dwarf the parameter counts of the previous generation of neural networks, as a considerable advancement.
However, few people realize that these language models are nowhere near ready for production settings, especially not for enterprise users, without reinforcement learning.
This form of machine learning is necessary to finalize, if not optimize, LLMs and the next progression of these models: reasoning models. Reasoning models return answers to questions along with the steps, or reasoning, the model invoked to generate the output. In contrast, LLMs are primarily designed to predict which words occur next in a sequence, not necessarily to understand their meaning or their relationship to a question (which accounts for their ‘hallucinations’). In general, they lack a concrete knowledge base.
According to Jorge Silva, director of AI and machine learning at SAS, “Where does reinforcement learning come into play for all of this? Because you need to know at every step of training, which is iterative and passes through the data many, many times over, so you see how well you’re doing. And, the one thing you want to be sure that you’re doing is predicting the right word.”
Reinforcement learning provides rewards for the training steps Silva mentioned while building an effective policy to optimize the predictions of language models. For reasoning models, reinforcement learning goes a step beyond just predicting the next word that occurs in a sequence. It ensures models instantiate concrete reasons for their responses—giving them much more of a cosmology or worldview than LLMs have.
“You don’t just want to predict the next word when you’re training a reasoning model,” Silva said. “When training something like DeepSeek-R1, you want to also provide well-formatted and well-substantiated reasoning sections.”
Deconstructing Reinforcement Learning
Unlike supervised learning and self-supervised learning, reinforcement learning doesn’t require training data. Instead, the model’s learning stems from an agent dynamically interacting with an environment. Reasoning models and LLMs, which are primarily trained via self-supervised learning, provide such an environment.
“Because reinforcement learning is about making a decision sequentially in the presence of uncertainty, it is ideal to have a reinforcement learning policy as your agent, essentially,” Silva said. Reinforcement learning policies provide the logical basis for agents to either reward or penalize the outputs of the various steps models undergo in training. For example, responses that contain explanations for the outputs of reasoning models are rewarded higher than outputs without them.
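The rewarding rule Silva describes, in which responses containing explanations score higher than those without, can be sketched as a toy reward function. The keyword markers below are illustrative assumptions; a production system would use a learned or carefully engineered reward model rather than string matching:

```python
def policy_reward(output: str) -> float:
    """Toy reward rule: outputs that contain an explanation are rewarded
    higher than outputs without one. The marker words are stand-ins for
    a real rule-based or learned reward model."""
    base = 1.0  # reward for producing any answer at all
    explained = any(
        marker in output.lower() for marker in ("because", "therefore", "step")
    )
    return base + (1.0 if explained else 0.0)
```

An output like "x = 2 because 2 + 2 = 4" would earn the explanation bonus, while a bare "x = 2" would not.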
Group Relative Policy Optimization
Group Relative Policy Optimization is the specific reinforcement learning approach used for reasoning models such as DeepSeek-R1. With this technique, “It’s essentially generating a variety of candidate answers,” Silva explained. “If you literally give it the same prompt 10 times over, it gets 10 different completions. Then, it says which one of these completions has the best advantage.” There are multiple meanings for the term completions in the broader context of machine learning. However, when specifically applied to reinforcement learning, completions “means it’s generating its own labels to avoid having to rely on supervised learning,” Silva said.
With this approach, the completion with the greatest advantage for the task at hand factors into the computation of the reward—which informs the building of the reinforcement learning policy. “Then, that completion serves as the label for training the overall LLM,” Silva said. Thus, the reward (which is numerically based) informs how the policy is developed to best facilitate the overall objective for the model. Moreover, when specifically applied to reasoning models, “This is part of the optimization for LLMs for just turning DeepSeek-V3 into DeepSeek-R1, so that you go from a foundation model to a reasoning model,” Silva commented.
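Silva's description of scoring ten completions of the same prompt against one another can be sketched as a group-relative advantage computation, in which each completion's reward is standardized against its group's mean and spread. The reward values below are illustrative placeholders; the real reward function would be task-specific:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: score each completion relative to its group.

    rewards: one numeric reward per candidate completion for the same prompt.
    Returns the standardized advantage (reward - group mean) / group std dev.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against division by zero
    return [(r - mean) / std for r in rewards]

# Ten completions for the same prompt, each scored by some reward function:
rewards = [0.1, 0.8, 0.3, 0.9, 0.2, 0.5, 0.7, 0.4, 0.6, 0.3]
advantages = group_relative_advantages(rewards)
best = max(range(len(rewards)), key=lambda i: advantages[i])
```

The completion with the highest advantage (index 3 here, reward 0.9) is the one that would serve as the label for further training, as described above.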
Candidate Completions
Completions are an integral facet of Group Relative Policy Optimization because they represent a variety of potential responses to a step in the logic that reasoning models might perform. Although reinforcement learning doesn’t involve training data, the self-supervised learning approaches for training LLMs and reasoning models do. With this form of machine learning, “you take the document and only give the LLM, while it’s training, a little bit of the document at a time and it has to guess the rest,” Silva said. The different candidate completions involved in Group Relative Policy Optimization are based in part on the training the model has already undergone. According to Silva, “Each completion will provide an alternative reasoning explanation, and each one will have a different reward value, and that will be very much based on matching what the correct explanation would’ve been and making sure the correct answer is there.”
There are different methods for computing the rewards of the varying completions. Sometimes, reinforcement learning agents can scrutinize the search and response strings accompanying, for example, the steps to solve a particular quadratic equation. In this case, they give higher rewards to completions involving particular mathematical operations, such as taking a square root with a plus-or-minus sign. Other times, rewards might be based on information inside delimiters, such as think tags. With this approach, there’s a fleshed-out reasoning section in the completion to base the reward on. As always, “You need to know the correct answer, but you need to make sure that there is a reasoning [section],” Silva said. “So, we have a way to compute the reward for all the candidate completions.”
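The think-tag approach can be sketched as a toy reward that checks for a non-empty reasoning section inside delimiters and for the correct final answer. The tag names, weights, and substring matching here are illustrative assumptions, not a documented implementation:

```python
import re

def format_and_answer_reward(completion: str, correct_answer: str) -> float:
    """Toy reward combining the two checks described above:
    a well-formed, non-empty <think>...</think> reasoning section,
    and the presence of the known-correct answer."""
    reward = 0.0
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if think and think.group(1).strip():   # fleshed-out reasoning section
        reward += 0.5
    if correct_answer in completion:       # correct final answer present
        reward += 1.0
    return reward

good = "<think>x^2 = 4, so x = ±2</think> The answer is x = ±2."
bad = "The answer is x = 3."
```

A completion with both reasoning and the right answer (`good`) outscores one with neither (`bad`), which is exactly the gradient the policy needs.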
The Reason for Reasoning Models
The import of reasoning models is multifaceted. They encode the logic and steps taken for their outputs, which reinforcement learning techniques like Group Relative Policy Optimization help optimize. Moreover, by being built to provide explanations, they may prove to be more human-like in their interactions with people. “If you look at the DeepSeek paper, the way they proposed it, it says things like ‘wait a minute, I need to tell the user this’,” Silva said. “It’s like a human; all of a sudden, something occurs to the LLM. They stop their outputs and they maybe even say no, let’s do this instead. So, it’s more of a back and forth.”
About the Author
Jelani Harper has worked as a research analyst, research lead, information technology editorial consultant, and journalist for over 10 years. During that time, he has helped myriad vendors and publications in the data management space strategize, develop, compose, and place content in a variety of outlets. He has produced an assortment of technical content, including analyst reports, white papers, solutions briefs, contract proposals, marketing materials, thought leadership articles, bylines, and blogs for clients specializing in nearly every facet of data management. As such, Jelani has focused extensively on the numerous dimensions of cognitive computing, cloud computing, analytics, data governance, data engineering, and data integration. His work has enabled him to conduct a number of substantive interviews with both established and progressive vendors in the arenas of high performance computing, semantic technologies, quantum computing, cybersecurity, the Internet of Things, blockchain, and more. He’s spent the last couple of years working with Blue Badge Insights.