Intuitions by Hamidreza Saghir

machine learning · unified views · reinforcement learning

Supervised learning and reinforcement learning are the same objective

Both fit a distribution over outputs conditioned on an input. Both minimize a KL divergence between their model and an optimal target. The only differences are which distribution you sample from and which direction of the KL. Entropy regularization bridges them.

*Close-up of black and white stones on a Go board mid-game. AlphaGo learned the same objective twice: first by supervised learning on expert games, then by self-play reinforcement learning.*

Note (2026): This started as a 2017 note on a Berkeley lecture by Mohammad Norouzi. I held it because the 2017 RL literature was fragmented and the unification felt premature. Nine years later the methods that actually scaled to frontier models (RLHF, DPO, GRPO, reasoning-model fine-tuning) turned out to be instantiations of exactly this framing, so here it is, rewritten.

Supervised learning and reinforcement learning look like different subjects in a textbook. They are the same objective written two ways. Both fit a distribution over outputs conditioned on an input. Both can be cast as minimizing a KL divergence to an implicit “optimal” target distribution. The only differences are which distribution you sample from and which direction of the KL you use. Entropy regularization is the knob that turns one into the other. (This post is a close cousin of the similarity post: similarity-between-distributions is another name for what we are doing here.)

The shared setup

Both paradigms want to learn a mapping from inputs $x$ to outputs or actions $y$, parameterized as a conditional distribution $p_\theta(y \mid x)$. The output may be a single label, a caption, a sequence of tokens, or a trajectory of actions; the shape of the output is irrelevant to the argument.

What does change is the training signal.

  • In supervised learning, every input $x$ comes with a target output $y^*$, and you maximize $\log p_\theta(y^* \mid x)$ averaged over the dataset.
  • In reinforcement learning, every output $y$ gets a scalar reward $r(y)$, possibly sparse or delayed, and you maximize the expected reward $\mathbb{E}_{y \sim p_\theta(y \mid x)}[r(y)]$.

The objectives read as two separate problems. They are not.
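To make the surface-level contrast concrete, here is a toy sketch (illustrative names, exact expectations instead of samples) of the two gradient signals on one categorical model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: one input, five possible outputs, a categorical model
# p_theta(y) = softmax(theta).
theta = rng.normal(size=5)
p = np.exp(theta - theta.max())
p /= p.sum()

# Supervised signal: a single target output y* = 2.
# Gradient of log p_theta(y*) w.r.t. the logits: one-hot(y*) - p.
y_star = 2
grad_sl = -p.copy()
grad_sl[y_star] += 1.0

# RL signal: a scalar reward per output. The policy-gradient identity
# grad E[r] = E_y[ r(y) * grad log p_theta(y) ], computed exactly here.
r = np.array([0.0, 0.1, 1.0, 0.2, 0.0])
grad_rl = np.zeros(5)
for y in range(5):
    g = -p.copy()
    g[y] += 1.0
    grad_rl += p[y] * r[y] * g
```

Both gradients live in the same space, the logits of the same model; the only difference is how outputs are weighted.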

The optimal policy is a Boltzmann distribution

Fix the reward function $r(y)$. The optimal entropy-regularized policy at temperature $\tau$ is

$$q^*_\tau(y \mid x) = \frac{\exp\big(r(y)/\tau\big)}{\sum_{y'} \exp\big(r(y')/\tau\big)}.$$

This is the softmax / Boltzmann distribution over outputs, with high-reward outputs getting probability mass in proportion to $\exp(r(y)/\tau)$. Two limits make it intuitive. As $\tau \to 0$, $q^*_\tau$ concentrates on the argmax (greedy exploitation). As $\tau \to \infty$, $q^*_\tau$ becomes uniform (pure exploration). In between, $\tau$ trades off exploration against exploitation in a principled way.

Supervised learning has a similar implicit target. Given a labelled dataset $\{(x_i, y^*_i)\}$, define $r(y) = \mathbb{1}[y = y^*]$. The optimal policy at $\tau \to 0$ is a delta on $y^*$. This is just saying: the “correct answer” is the Boltzmann distribution at zero temperature, with reward being the indicator of correctness.
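A minimal numerical sketch of the target and its two limits, assuming a finite output space (the `boltzmann` helper is illustrative):

```python
import numpy as np

def boltzmann(r, tau):
    """Target q*_tau(y) ∝ exp(r(y)/tau) over a finite output space."""
    z = (r - r.max()) / tau      # shift by the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

r = np.array([0.0, 0.5, 2.0, 1.0])

cold = boltzmann(r, tau=1e-3)   # near-greedy: almost all mass on the argmax
warm = boltzmann(r, tau=1.0)    # graded by reward
hot = boltzmann(r, tau=1e3)     # near-uniform
```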

Once both paradigms have an optimal target distribution, the question is how the model gets close to it.

Both are KL divergences, in opposite directions

Supervised learning’s cross-entropy objective, written out, is

$$\mathcal{L}_{\mathrm{SL}}(\theta) = \mathbb{E}_{(x,\,y^*) \sim \mathcal{D}}\big[-\log p_\theta(y^* \mid x)\big] = \mathrm{KL}\big(p_{\mathrm{data}}(y \mid x) \,\big\|\, p_\theta(y \mid x)\big) + \mathrm{const.}$$

You sample from $p_{\mathrm{data}}$ (the data distribution) and push $p_\theta$ toward it. This is the mode-covering direction of the KL: the model is penalized whenever $p_{\mathrm{data}}$ puts mass somewhere that $p_\theta$ does not, so $p_\theta$ learns to cover every mode of the data.
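The mode-covering behavior can be seen numerically. In this small sketch (hypothetical distributions over four outputs), the forward KL is far larger for a model that drops one of the data modes than for one that spreads mass over both:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) over a finite output space (with 0 log 0 = 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Bimodal "data" distribution over four outputs.
p_data = np.array([0.5, 0.0, 0.5, 0.0])

covers_both = np.array([0.45, 0.05, 0.45, 0.05])  # spreads over both modes
misses_one = np.array([0.90, 0.05, 0.01, 0.04])   # drops the second mode
```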

Reinforcement learning’s entropy-regularized objective $-\mathbb{E}_{y \sim p_\theta}[r(y)] - \tau H(p_\theta)$, written with the same Boltzmann $q^*_\tau$, becomes

$$-\mathbb{E}_{y \sim p_\theta}[r(y)] - \tau H(p_\theta) = \tau\,\mathrm{KL}\big(p_\theta(y \mid x) \,\big\|\, q^*_\tau(y \mid x)\big) - \tau \log Z.$$

The $\tau \log Z$ term does not depend on $\theta$, so minimizing this objective is equivalent to minimizing the KL. Plain expected-reward maximization (without the entropy term) is not a KL by itself; the entropy term is exactly what turns the inner product $\mathbb{E}_{p_\theta}[r]$ into a divergence to a target distribution. Almost every RL method that works at scale (RLHF, maximum-entropy RL, soft actor-critic) carries this term, which is why the unification is practically, and not just nominally, useful.

You sample from $p_\theta$ (the policy) and push it toward $q^*_\tau$. This is the mode-seeking direction: the model is penalized whenever it puts mass where $q^*_\tau$ does not, so $p_\theta$ learns to concentrate on high-reward regions.
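The identity above can be checked numerically on a small categorical example. This sketch (arbitrary toy numbers) verifies that the entropy-regularized objective and the KL form differ only by the constant involving the partition function:

```python
import numpy as np

tau = 0.5
r = np.array([0.0, 1.0, 2.0, 0.5])

# Boltzmann target q*_tau and its partition function Z.
w = np.exp(r / tau)
Z = w.sum()
q_star = w / Z

# An arbitrary policy p_theta.
p = np.array([0.1, 0.2, 0.6, 0.1])

entropy = -(p * np.log(p)).sum()
objective = -(p * r).sum() - tau * entropy                    # -E[r] - tau*H(p)
kl_form = tau * (p * np.log(p / q_star)).sum() - tau * np.log(Z)
```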

Once entropy regularization is in the picture, the only structural differences between supervised learning and reinforcement learning are:

  1. Which distribution you sample from at training time: $p_{\mathrm{data}}$ (data) for SL, $p_\theta$ (policy) for RL.
  2. Which direction of the KL you optimize: mode-covering $\mathrm{KL}(p_{\mathrm{data}} \,\|\, p_\theta)$ for SL, mode-seeking $\mathrm{KL}(p_\theta \,\|\, q^*_\tau)$ for RL.

Everything else (sample efficiency, variance, off-policy corrections, actor-critic, baselines) is engineering around those two choices.

Entropy regularization bridges them

Once you see that both are KL objectives, a family of intermediate methods falls out.

Reward-augmented maximum likelihood (Norouzi et al., 2016) samples proposals from $q^*_\tau$ at a positive temperature and treats them as soft targets. You get supervised-style stable training with access to the full reward landscape, not just the argmax.
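A toy sketch of the RAML recipe, assuming a small discrete output space and a distance-based reward (all names and numbers illustrative): sample from the exponentiated-payoff target and apply the ordinary maximum-likelihood update to each sample, as if it were a label.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab, y_star, tau = 6, 3, 0.5

# Task reward: negative distance to the reference output (a stand-in
# for, e.g., negative edit distance on sequences).
r = -np.abs(np.arange(vocab) - y_star).astype(float)

# Exponentiated-payoff target q*_tau(y) ∝ exp(r(y)/tau).
q = np.exp(r / tau)
q /= q.sum()

# RAML step: sample proposals from q, treat each as a supervised label.
theta = np.zeros(vocab)
for _ in range(2000):
    y = rng.choice(vocab, p=q)
    p = np.exp(theta - theta.max())
    p /= p.sum()
    grad = -p
    grad[y] += 1.0          # one-hot(y) - p: the usual log-likelihood gradient
    theta += 0.1 * grad

p_final = np.exp(theta - theta.max())
p_final /= p_final.sum()
```

The stationary point of this supervised-style loop is the Boltzmann target itself, so the model sees the full reward landscape without ever sampling from its own policy.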

Entropy-regularized policy gradients add a $\tau H(p_\theta)$ term to the RL objective. This is exactly the KL-to-$q^*_\tau$ at positive temperature, which prevents the policy from collapsing to a narrow mode and keeps exploration alive.
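The fixed point can be checked directly: exact gradient ascent on the entropy-regularized objective for a categorical policy converges to the Boltzmann target. This is a toy sketch, not any particular paper's implementation:

```python
import numpy as np

tau = 0.5
r = np.array([0.0, 1.0, 2.0, 0.5])

theta = np.zeros(4)
for _ in range(20000):
    p = np.exp(theta - theta.max())
    p /= p.sum()
    # Exact gradient of E_p[r] + tau*H(p) w.r.t. the logits:
    # the softmax Jacobian applied to g = r - tau*(log p + 1).
    g = r - tau * (np.log(p) + 1.0)
    theta += 0.1 * p * (g - (p * g).sum())

p_final = np.exp(theta - theta.max())
p_final /= p_final.sum()

# The Boltzmann target at the same temperature.
q_star = np.exp(r / tau)
q_star /= q_star.sum()
```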

UREX (Under-appreciated Reward Exploration) mixes the two KL directions so the model benefits from both mode-covering (to avoid forgetting good solutions) and mode-seeking (to concentrate on the best ones).

In all three cases the knob is the same. It is the temperature of the Boltzmann target, or equivalently the coefficient of the entropy regularizer.

What this predicted about 2024-2026

This framing sat in a Berkeley lecture in 2017 and mostly waited. What happened over the next eight years was that the methods that actually scaled to frontier models turned out to be instantiations of it.

RLHF is exactly the entropy-regularized RL objective at a positive temperature, with the reward model playing the role of the (learned) reward. The KL penalty against the base model that every RLHF paper includes is the entropy-regularization term, written against the base-model prior $p_{\mathrm{base}}$ rather than the uniform distribution.
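This has a standard closed form. A sketch, assuming a finite output space and illustrative numbers: the optimum of $\mathbb{E}_p[r] - \beta\,\mathrm{KL}(p \,\|\, p_{\mathrm{base}})$ is the base model reweighted by $\exp(r/\beta)$.

```python
import numpy as np

beta = 0.1
p_base = np.array([0.4, 0.3, 0.2, 0.1])   # reference ("base") model
r = np.array([0.0, 1.0, 0.2, 0.5])        # reward-model scores

def rlhf_objective(p):
    """E_p[r] - beta * KL(p || p_base), the standard RLHF objective."""
    return float((p * r).sum() - beta * (p * np.log(p / p_base)).sum())

# Closed-form optimum: the base model reweighted by exp(r/beta).
w = p_base * np.exp(r / beta)
q_star = w / w.sum()
```

Plugging the optimum back in gives $\beta \log \sum_y p_{\mathrm{base}}(y)\,e^{r(y)/\beta}$, which is the same partition-function constant as in the uniform-prior case.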

Direct Preference Optimization (Rafailov et al., 2023) starts from the same Boltzmann target. The authors noticed that if you plug the optimal policy form back into the Bradley-Terry preference likelihood, the reward cancels out and you can optimize the policy directly against preference pairs as a logistic regression on the log-ratios $\log \frac{p_\theta(y \mid x)}{p_{\mathrm{ref}}(y \mid x)}$. The resulting loss is supervised-style (no sampling from $p_\theta$, no gradient-variance concerns) even though it is derived from the same entropy-regularized RL objective, which is why DPO is often described as “RLHF done as supervised learning.”
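A minimal sketch of the resulting loss on a single preference pair (log-probabilities as plain floats; names illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * (winner log-ratio - loser log-ratio))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.log1p(np.exp(-margin)))  # numerically stable -log sigmoid
```

At a zero margin (policy and reference agree) the loss sits at $\log 2$, chance level; it falls as the policy prefers the winner more strongly than the reference does.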

GRPO and its descendants (the post-training methods behind the 2024-2025 reasoning models) are online entropy-regularized RL with group baselines instead of value functions, but the objective is structurally the same.
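A sketch of the group baseline, assuming the common standardize-within-group form (details vary across papers): rewards for a group of samples drawn for the same prompt are centered and scaled against each other, replacing a learned value function.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantage: standardize rewards within a group of
    completions sampled for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Four completions for one prompt; two pass the verifier, two fail.
adv = group_advantages([0.0, 1.0, 1.0, 0.0])
```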

Reasoning-model fine-tuning (OpenAI o-series, DeepSeek-R1, and the open replications) computes a verifiable reward on math and code problems and trains against the mode-seeking KL. The verifiable reward makes $\tau$ effectively small, so the optimal policy concentrates sharply on correct traces.

None of these methods required a conceptual breakthrough. The scaffolding was already in place. What was missing was a big enough base model to make the Boltzmann target well-defined on useful tasks, and a cheap enough reward signal (preference labels, unit tests, verifiers) to estimate the gradient of the KL.

Takeaways

A few things are worth remembering from this framing.

The split between supervised and reinforcement learning is about which distribution you sample from, not about what you are optimizing. If your reward is sparse and delayed, you sample from the policy and pay the variance cost. If your reward is dense (a label, a unit test), you can sample from the data and get a lower-variance gradient.

The KL direction is not a cosmetic choice. Mode-covering makes training stable but leaves mass on bad outputs. Mode-seeking sharpens the policy but risks collapse. Real methods that work at scale tend to mix both.

Temperature is the same knob as the entropy coefficient, the KL penalty, and the preference-likelihood scaling. Every post-training paper has it; different papers give it different names.

A method that does not fit in this frame is a signal that something is genuinely new, and worth paying attention to. Most methods fit.

Cover image: Go stones on a goban, by Dietmar Rabich, CC BY-SA 4.0.