Training Isn’t Enough: Reasoning Models and LLMs Need Reinforcement Learning
Most people familiar with generative models know that LLMs are trained on vast swaths of the internet's content. Many regard their billions of parameters, which dwarf the parameter counts of the previous generation of neural networks, as a considerable advancement.
However, few people realize that these language models are nowhere near ready for production settings, especially not for enterprise users, without reinforcement learning.
This form of machine learning is necessary to finalize, if not optimize, LLMs and the next progression of these models: reasoning models. Reasoning models return answers to questions along with the steps, or reasoning, the model invoked to generate the output. In contrast, LLMs are primarily designed to predict which words occur next in a sequence, not necessarily to understand their meaning or their relationship to a question (which accounts for their ‘hallucinations’). In general, they lack a concrete knowledge base.
According to Jorge Silva, director of AI and machine learning at SAS, “Where does reinforcement learning come into play for all of this? Because you need to know at every step of training, which is iterative and passes through the data many, many times over, so you see how well you’re doing. And, the one thing you want to be sure that you’re doing is predicting the right word.”
Reinforcement learning provides rewards for the training steps Silva mentioned while building an effective policy to optimize the predictions of language models. For reasoning models, reinforcement learning goes a step beyond just predicting the next word that occurs in a sequence. It ensures models instantiate concrete reasons for their responses—giving them much more of a cosmology or worldview than LLMs have.
“You don’t just want to predict the next word when you’re training a reasoning model,” Silva said. “When training something like DeepSeek-R1, you want to also provide well-formatted and well-substantiated reasoning sections.”
Deconstructing Reinforcement Learning
Unlike supervised learning and self-supervised learning, reinforcement learning doesn’t require training data. Instead, the model’s learning stems from an agent dynamically interacting with an environment. Reasoning models and LLMs, which are primarily trained via self-supervised learning, provide such an environment.
“Because reinforcement learning is about making a decision sequentially in the presence of uncertainty, it is ideal to have a reinforcement learning policy as your agent, essentially,” Silva said. Reinforcement learning policies provide the logical basis for agents to either reward or penalize the outputs of the various steps models undergo in training. For example, responses that contain explanations for the outputs of reasoning models are rewarded higher than outputs without them.
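The rewarding rule Silva describes, in which responses containing explanations score higher than those without, can be sketched as a toy reward function. The keyword markers below are illustrative assumptions; a production system would use a learned or carefully engineered reward model rather than string matching:

```python
def policy_reward(output: str) -> float:
    """Toy reward rule: outputs that contain an explanation are rewarded
    higher than outputs without one. The marker words are stand-ins for
    a real rule-based or learned reward model."""
    base = 1.0  # reward for producing any answer at all
    explained = any(
        marker in output.lower() for marker in ("because", "therefore", "step")
    )
    return base + (1.0 if explained else 0.0)
```

An output like "x = 2 because 2 + 2 = 4" would earn the explanation bonus, while a bare "x = 2" would not.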
Group Relative Policy Optimization
Group Relative Policy Optimization is the specific reinforcement learning approach used for reasoning models such as DeepSeek-R1. With this technique, “It’s essentially generating a variety of candidate answers,” Silva explained. “If you literally give it the same prompt 10 times over, it gets 10 different completions. Then, it says which one of these completions has the best advantage.” There are multiple meanings for the term completions in the broader context of machine learning. However, when specifically applied to reinforcement learning, completions “means it’s generating its own labels to avoid having to rely on supervised learning,” Silva said.
With this approach, the completion with the greatest advantage for the task at hand factors into the computation of the reward—which informs the building of the reinforcement learning policy. “Then, that completion serves as the label for training the overall LLM,” Silva said. Thus, the reward (which is numerically based) informs how the policy is developed to best facilitate the overall objective for the model. Moreover, when specifically applied to reasoning models, “This is part of the optimization for LLMs for just turning DeepSeek-V3 into DeepSeek-R1, so that you go from a foundation model to a reasoning model,” Silva commented.
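Silva's description of scoring ten completions of the same prompt against one another can be sketched as a group-relative advantage computation, in which each completion's reward is standardized against its group's mean and spread. The reward values below are illustrative placeholders; the real reward function would be task-specific:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: score each completion relative to its group.

    rewards: one numeric reward per candidate completion for the same prompt.
    Returns the standardized advantage (reward - group mean) / group std dev.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against division by zero
    return [(r - mean) / std for r in rewards]

# Ten completions for the same prompt, each scored by some reward function:
rewards = [0.1, 0.8, 0.3, 0.9, 0.2, 0.5, 0.7, 0.4, 0.6, 0.3]
advantages = group_relative_advantages(rewards)
best = max(range(len(rewards)), key=lambda i: advantages[i])
```

The completion with the highest advantage (index 3 here, reward 0.9) is the one that would serve as the label for further training, as described above.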
Candidate Completions
Completions are an integral facet of Group Relative Policy Optimization because they represent a variety of potential responses to a step in the logic that reasoning models might perform. Although reinforcement learning doesn’t involve training data, the self-supervised learning approaches for training LLMs and reasoning models do. With this form of machine learning, “you take the document and only give the LLM, while it’s training, a little bit of the document at a time and it has to guess the rest,” Silva said. The different candidate completions involved in Group Relative Policy Optimization are based in part on the training the model has already undergone. According to Silva, “Each completion will provide an alternative reasoning explanation, and each one will have a different reward value, and that will be very much based on matching what the correct explanation would’ve been and making sure the correct answer is there.”
There are different methods for computing the rewards of the varying completions. Sometimes, reinforcement learning agents can scrutinize the search and response strings accompanying, for example, the steps to solve a particular quadratic equation. In this case, they give higher rewards to completions involving particular mathematical operations, such as taking a square root with a plus-or-minus sign. Other times, rewards might be based on information inside delimiters, such as think tags. With this approach, there’s a fleshed-out reasoning section in the completion to base the reward on. As always, “You need to know the correct answer, but you need to make sure that there is a reasoning [section],” Silva said. “So, we have a way to compute the reward for all the candidate completions.”
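The think-tag approach can be sketched as a toy reward that checks for a non-empty reasoning section inside delimiters and for the correct final answer. The tag names, weights, and substring matching here are illustrative assumptions, not a documented implementation:

```python
import re

def format_and_answer_reward(completion: str, correct_answer: str) -> float:
    """Toy reward combining the two checks described above:
    a well-formed, non-empty <think>...</think> reasoning section,
    and the presence of the known-correct answer."""
    reward = 0.0
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if think and think.group(1).strip():   # fleshed-out reasoning section
        reward += 0.5
    if correct_answer in completion:       # correct final answer present
        reward += 1.0
    return reward

good = "<think>x^2 = 4, so x = ±2</think> The answer is x = ±2."
bad = "The answer is x = 3."
```

A completion with both reasoning and the right answer (`good`) outscores one with neither (`bad`), which is exactly the gradient the policy needs.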
The Reason for Reasoning Models
The import of reasoning models is multifaceted. They encode the logic and steps taken for their outputs, which reinforcement learning techniques like Group Relative Policy Optimization help optimize. Moreover, by being built to provide explanations, they may prove to be more human-like in their interactions with people. “If you look at the DeepSeek paper, the way they proposed it, it says things like ‘wait a minute, I need to tell the user this’,” Silva said. “It’s like a human; all of a sudden, something occurs to the LLM. They stop their outputs and they maybe even say no, let’s do this instead. So, it’s more of a back and forth.”
About the Author
Jelani Harper has worked as a research analyst, research lead, information technology editorial consultant, and journalist for over 10 years. During that time, he has helped myriad vendors and publications in the data management space strategize, develop, compose, and place content in a variety of outlets. He has produced an assortment of technical content, including analyst reports, white papers, solutions briefs, contract proposals, marketing materials, thought leadership articles, bylines, and blogs for clients specializing in nearly every facet of data management. As such, Jelani has focused extensively on the numerous dimensions of cognitive computing, cloud computing, analytics, data governance, data engineering, and data integration. His work has enabled him to conduct a number of substantive interviews with both established and progressive vendors in the arenas of high performance computing, semantic technologies, quantum computing, cybersecurity, the Internet of Things, blockchain, and more. He’s spent the last couple of years working with Blue Badge Insights.