RENT: Reinforcement Learning via Entropy Minimization is a fully unsupervised reinforcement learning method that improves reasoning performance by using the model's own confidence as a reward. Given an input problem \(\mathbf{x}\), the model generates a reasoning chain \(\mathbf{y} = \pi(\mathbf{x})\) and receives a reward based on the negative entropy of its token predictions: \(R(\mathbf{y}) = -H(\pi(\mathbf{x}))\). This encourages the model to produce more confident predictions. We find that minimizing entropy over tokens near the end of the reasoning chain correlates most strongly with improved accuracy. RENT requires no external reward or ground-truth answers and consistently improves performance across diverse reasoning benchmarks including GSM8K, MATH500, AMC, AIME, and GPQA.
Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier language models to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT: Reinforcement Learning via Entropy Minimization – a fully unsupervised RL method that requires no external reward or ground-truth answers and instead uses the entropy of the model's underlying distribution as an intrinsic reward. We find that by reinforcing the chains of thought on which the model is most confident in its generated answers, the model improves its reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly used reasoning benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, and on models of varying sizes from the Qwen, Mistral, and LLaMA families. The generality of our unsupervised learning method lends itself to a wide range of domains where external supervision is limited or unavailable.
For a given prompt \(\mathbf{x}\), the model generates a response \(\mathbf{y}_{pred} = y_{pred,1}, \ldots, y_{pred,T} = \pi(\mathbf{x})\), where \(T\) is the number of tokens in the response. At each token \(t \in \{1, \ldots, T\}\), the model outputs a probability distribution \(p_t\) over the vocabulary \(\mathcal{V}\). The entropy of this distribution measures the model's uncertainty:
\[H(p_t) = -\sum_{v \in \mathcal{V}} p_t(v) \log p_t(v)\]
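As a concrete illustration, this per-token entropy can be computed directly from the logits the model emits at each step. Below is a minimal PyTorch sketch; the function and variable names are ours, not from the paper.

```python
import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy H(p_t) at each position, given logits of shape (T, |V|)."""
    log_probs = torch.log_softmax(logits, dim=-1)  # log p_t(v)
    probs = log_probs.exp()                        # p_t(v)
    return -(probs * log_probs).sum(dim=-1)        # shape (T,)
```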
We use the negative entropy of the predicted token distribution as a reward signal:
\[R(\mathbf{y}_{pred}) = -H(\pi(\mathbf{x})) = \sum_{t=1}^T \sum_{v \in \mathcal{V}} p_t(v) \log p_t(v)\]
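The following sketch shows how this reward might be computed for a single sampled response with Hugging Face `transformers`, reusing `token_entropy` from above. The model name is an arbitrary placeholder, and scoring only the generated tokens (summed over positions, per the equation; averaging is a common alternative normalization) is our reading of the formula rather than a detail confirmed here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model choice; any causal LM works the same way.
name = "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

def entropy_reward(prompt: str, max_new_tokens: int = 256) -> float:
    """R(y_pred) = -sum_t H(p_t), computed over the generated tokens only."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        return_dict_in_generate=True,
        output_scores=True,  # per-step scores over the vocabulary
    )
    logits = torch.stack(out.scores, dim=0)[:, 0, :]  # (T, |V|)
    return -token_entropy(logits).sum().item()
```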
This reward encourages the model to produce more confident, peaked distributions over the vocabulary. We optimize it with Group Relative Policy Optimization (GRPO), which stabilizes training by comparing each sampled response against other responses to the same prompt rather than against a learned value baseline. For each prompt \(\mathbf{x}\), we sample a group of \(K\) responses \(\{\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(K)}\}\) from the current policy and compute a group-normalized advantage for each:
\[\hat{A}_i = \frac{R(\mathbf{y}^{(i)}) - \operatorname{mean}\big(\{R(\mathbf{y}^{(j)})\}_{j=1}^{K}\big)}{\operatorname{std}\big(\{R(\mathbf{y}^{(j)})\}_{j=1}^{K}\big)}\]
These advantages are then used in a PPO-style clipped policy-gradient objective.
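Below is a minimal sketch of the group-relative advantage computation described above; the full GRPO update additionally applies PPO-style clipping and a KL penalty toward a reference policy, which we omit here, and the array names are ours.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of K responses to the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: K = 4 responses to one prompt, scored with the entropy reward.
rewards = torch.tensor([-12.3, -8.7, -15.1, -9.9])
print(group_advantages(rewards))  # less negative reward -> positive advantage
```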
We evaluate RENT across multiple model families (Mistral, LLaMA, and Qwen) and sizes on diverse reasoning benchmarks; the table below reports accuracy on each benchmark. Our method consistently improves performance across all tested configurations without requiring any external supervision or ground-truth answers.
| Model | GSM8K | MATH500 | AMC | AIME | GPQA |
|---|---|---|---|---|---|
| Mistral-7B Instruct | 0.381 | 0.147 | 0.049 | 0.002 | 0.179 |
| Mistral-7B + RENT | 0.492 | 0.168 | 0.068 | 0.033 | 0.267 |
| Qwen2.5-1.5B Instruct | 0.745 | 0.556 | 0.251 | 0.026 | 0.247 |
| Qwen2.5-1.5B + RENT | 0.748 | 0.597 | 0.298 | 0.072 | 0.267 |
| Qwen2.5-Math-1.5B Instruct | 0.852 | 0.744 | 0.452 | 0.092 | 0.244 |
| Qwen2.5-Math-1.5B + RENT | 0.863 | 0.810 | 0.509 | 0.145 | 0.285 |
| Qwen2.5-7B Instruct | 0.906 | 0.762 | 0.423 | 0.110 | 0.312 |
| Qwen2.5-7B + RENT | 0.911 | 0.823 | 0.518 | 0.270 | 0.365 |
| Qwen2.5-Math-7B Instruct | 0.956 | 0.834 | 0.495 | 0.143 | 0.225 |
| Qwen2.5-Math-7B + RENT | 0.961 | 0.882 | 0.591 | 0.172 | 0.400 |
| LLaMA3.1-8B Instruct | 0.857 | 0.496 | 0.221 | 0.061 | 0.206 |
| LLaMA3.1-8B + RENT | 0.859 | 0.548 | 0.339 | 0.082 | 0.332 |
We investigate which response tokens matter most for entropy minimization. Empirically, we find that minimizing entropy over tokens near the end of the reasoning chain – especially those corresponding to the final answer – correlates most strongly with improved accuracy. The "last chunk" strategy shows a higher correlation with accuracy than the "first chunk" strategy, suggesting that the primary source of improvement is increased confidence on the final tokens the model generates.
We track both accuracy and confidence throughout RENT training to demonstrate that as the model improves its confidence via entropy minimization, the accuracy also improves. This validates our core hypothesis that optimizing confidence leads to better reasoning performance.
@article{prabhudesai2025rent,
  title={Maximizing Confidence Alone Improves Reasoning},
  author={Prabhudesai, Mihir and Chen, Lili and Ippoliti, Alex and Fragkiadaki, Katerina and Liu, Hao and Pathak, Deepak},
  journal={arXiv preprint arXiv:2505.22660},
  year={2025}
}