RENT: Reinforcement Learning via Entropy Minimization is a fully unsupervised reinforcement learning method that improves reasoning performance by using the model's own confidence as a reward. Given an input problem \(\mathbf{x}\), the model generates a reasoning chain \(\mathbf{y} = \pi(\mathbf{x})\) and receives a reward based on the negative entropy of its token predictions: \(R(\mathbf{y}) = -H(\pi(\mathbf{x}))\). This encourages the model to produce more confident predictions. We find that minimizing entropy over tokens near the end of the reasoning chain correlates most strongly with improved accuracy. RENT requires no external reward or ground-truth answers and consistently improves performance across diverse reasoning benchmarks including GSM8K, MATH500, AMC, AIME, and GPQA.
Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier language models to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT: Reinforcement Learning via Entropy Minimization – a fully unsupervised RL method that requires no external reward or ground-truth answers and instead uses the entropy of the model's underlying distribution as an intrinsic reward. We find that by reinforcing the chains of thought on which the model is most confident in its generated answers, the model improves its reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly used reasoning benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, and on models of varying sizes from the Qwen, Mistral, and LLaMA families. The generality of our unsupervised learning method lends itself to a wide range of domains where external supervision is limited or unavailable.
For a given prompt \(\mathbf{x}\), the model generates a response \(\mathbf{y}_{pred} = y_{pred,1}, \ldots, y_{pred,T} = \pi(\mathbf{x})\), where \(T\) is the number of tokens in the response. At each token \(t \in \{1, \ldots, T\}\), the model outputs a probability distribution \(p_t\) over the vocabulary \(\mathcal{V}\). The entropy of this distribution measures the model's uncertainty:
\[H(p_t) = -\sum_{v \in \mathcal{V}} p_t(v) \log p_t(v)\]
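As a concrete illustration, this per-token entropy can be computed directly from the logits the model emits at each step. Below is a minimal PyTorch sketch; the function and variable names are ours, not from the paper.

```python
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy H(p_t) at each position, given logits of shape (T, |V|)."""
    log_probs = torch.log_softmax(logits, dim=-1)  # log p_t(v)
    probs = log_probs.exp()                        # p_t(v)
    return -(probs * log_probs).sum(dim=-1)        # shape (T,)
```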
We use the negative entropy of the predicted token distribution as a reward signal:
\[R(\mathbf{y}_{pred}) = -H(\pi(\mathbf{x})) = \sum_{t=1}^T \sum_{v \in \mathcal{V}} p_t(v) \log p_t(v)\]
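The following sketch shows how this reward might be computed for a single sampled response with Hugging Face `transformers`, reusing `token_entropy` from above. The model name is an arbitrary placeholder, and scoring only the generated tokens (summed over positions, per the equation; averaging is a common alternative normalization) is our reading of the formula rather than a detail confirmed here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model choice; any causal LM works the same way.
name = "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def entropy_reward(prompt: str, max_new_tokens: int = 256) -> float:
    """R(y_pred) = -sum_t H(p_t), computed over the generated tokens only."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        return_dict_in_generate=True,
        output_scores=True,  # per-step scores over the vocabulary
    )
    logits = torch.stack(out.scores, dim=0)[:, 0, :]  # (T, |V|)
    return -token_entropy(logits).sum().item()
```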
This reward encourages the model to produce more confident, peaked distributions over the vocabulary. We optimize it with Group Relative Policy Optimization (GRPO), which stabilizes training by comparing each sampled response against other responses to the same prompt rather than against a learned value baseline. For each prompt \(\mathbf{x}\), we sample a group of \(K\) responses \(\{\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(K)}\}\) from the current policy and compute a group-normalized advantage for each:
\[\hat{A}_i = \frac{R(\mathbf{y}^{(i)}) - \operatorname{mean}\big(\{R(\mathbf{y}^{(j)})\}_{j=1}^{K}\big)}{\operatorname{std}\big(\{R(\mathbf{y}^{(j)})\}_{j=1}^{K}\big)}\]
These advantages are then used in a PPO-style clipped policy-gradient objective.
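Below is a minimal sketch of the group-relative advantage computation described above; the full GRPO update additionally applies PPO-style clipping and a KL penalty toward a reference policy, which we omit here, and the array names are ours.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of K responses to the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: K = 4 responses to one prompt, scored with the entropy reward.
rewards = torch.tensor([-12.3, -8.7, -15.1, -9.9])
print(group_advantages(rewards))  # less negative reward -> positive advantage
```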
We evaluate RENT across multiple model families (Mistral, LLaMA, and Qwen) and sizes on diverse reasoning benchmarks; the table below reports accuracy on each benchmark. Our method consistently improves performance across all tested configurations without requiring any external supervision or ground-truth answers.
| Model | GSM8K | MATH500 | AMC | AIME | GPQA |
|---|---|---|---|---|---|
| Mistral-7B Instruct | 0.381 | 0.147 | 0.049 | 0.002 | 0.179 |
| Mistral-7B + RENT | 0.492 | 0.168 | 0.068 | 0.033 | 0.267 |
| Qwen2.5-1.5B Instruct | 0.745 | 0.556 | 0.251 | 0.026 | 0.247 |
| Qwen2.5-1.5B + RENT | 0.748 | 0.597 | 0.298 | 0.072 | 0.267 |
| Qwen2.5-Math-1.5B Instruct | 0.852 | 0.744 | 0.452 | 0.092 | 0.244 |
| Qwen2.5-Math-1.5B + RENT | 0.863 | 0.810 | 0.509 | 0.145 | 0.285 |
| Qwen2.5-7B Instruct | 0.906 | 0.762 | 0.423 | 0.110 | 0.312 |
| Qwen2.5-7B + RENT | 0.911 | 0.823 | 0.518 | 0.270 | 0.365 |
| Qwen2.5-Math-7B Instruct | 0.956 | 0.834 | 0.495 | 0.143 | 0.225 |
| Qwen2.5-Math-7B + RENT | 0.961 | 0.882 | 0.591 | 0.172 | 0.400 |
| LLaMA3.1-8B Instruct | 0.857 | 0.496 | 0.221 | 0.061 | 0.206 |
| LLaMA3.1-8B + RENT | 0.859 | 0.548 | 0.339 | 0.082 | 0.332 |
We investigate which response tokens matter most for entropy minimization. Empirically, we find that minimizing entropy over tokens near the end of the reasoning chain – especially those corresponding to the final answer – correlates most strongly with improved accuracy. The "last chunk" strategy shows a higher correlation with accuracy than the "first chunk" strategy, suggesting that the primary source of improvement is increased confidence on the final tokens the model generates.
We track both accuracy and confidence throughout RENT training to demonstrate that as the model improves its confidence via entropy minimization, the accuracy also improves. This validates our core hypothesis that optimizing confidence leads to better reasoning performance.
@article{prabhudesai2025rent,
  title={Maximizing Confidence Alone Improves Reasoning},
  author={Prabhudesai, Mihir and Chen, Lili and Ippoliti, Alex and Fragkiadaki, Katerina and Liu, Hao and Pathak, Deepak},
  journal={arXiv preprint arXiv:2505.22660},
  year={2025}
}