RLHF: How AI Learns Human Preferences

Reinforcement Learning from Human Feedback (RLHF) is the technique that turned raw language models into helpful, harmless assistants like ChatGPT. It bridges the gap between "predict the next word" and "be useful."

Why Raw LLMs Need Alignment

A base language model trained on internet text can generate coherent text, but it has only learned to continue text, not to be helpful, harmless, or honest. Given a question, it might ignore the request and simply keep writing, generate toxic content, or confidently state falsehoods.

The Three‑Step RLHF Pipeline

Step 1: Supervised Fine‑Tuning (SFT). Collect high‑quality human demonstrations of ideal responses and fine‑tune the base model on them.

Step 2: Reward Model Training. Have humans rank multiple model responses to the same prompt, then train a reward model to predict these preferences.

Step 3: PPO Optimization. Use Proximal Policy Optimization to fine‑tune the SFT model so that it maximizes the reward model's score.
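To make the data flow concrete, here is a minimal sketch of the records each stage might consume. The class names (Demonstration, PreferencePair) and the helper ranking_to_pairs are hypothetical, chosen only for illustration; real pipelines store this data in many different ways. The helper shows one common way to turn a human ranking from Step 2 into the pairwise comparisons a reward model is trained on.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Demonstration:
    """Step 1 (SFT): a prompt with a human-written ideal response."""
    prompt: str
    response: str


@dataclass
class PreferencePair:
    """Step 2: for one prompt, the response humans preferred and the one they did not."""
    prompt: str
    chosen: str
    rejected: str


def ranking_to_pairs(prompt: str, ranked_responses: list[str]) -> list[PreferencePair]:
    """Expand a best-to-worst ranking into every (better, worse) pair.

    Assumption: ranked_responses is ordered from most to least preferred.
    """
    # combinations() keeps input order, so the first element of each pair
    # always ranks higher than the second.
    return [
        PreferencePair(prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]


pairs = ranking_to_pairs("Explain RLHF briefly.", ["answer A", "answer B", "answer C"])
```

A ranking of three responses yields three pairs, so a single ranking task gives the reward model several training examples.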

The Reward Model

The reward model is typically initialized from the language model, with the token‑prediction head replaced by one that outputs a single scalar score for a prompt and response. It's trained on pairwise comparison data: given two responses to the same prompt, it learns to assign a higher score to the one humans preferred.
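The text does not name a specific loss, but a common choice for pairwise comparisons is the Bradley-Terry logistic loss, -log(sigmoid(r_chosen - r_rejected)). The sketch below computes it for a single pair of scalar scores; the function name pairwise_loss is an assumption for illustration, and a real implementation would operate on batched tensors in a deep learning framework.

```python
import math


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss for one comparison: -log(sigmoid(chosen - rejected)).

    The loss is small when the preferred response scores well above the
    rejected one, and large when the ordering is wrong.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


correct_order = pairwise_loss(2.0, 0.0)   # reward model agrees with the human
wrong_order = pairwise_loss(0.0, 2.0)     # reward model disagrees
```

Only the difference between the two scores matters, which is why reward model outputs are meaningful relative to each other rather than on an absolute scale.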

PPO: Proximal Policy Optimization

PPO is the reinforcement learning algorithm that updates the model's policy to maximize the reward model's score. In RLHF, the objective also includes a KL divergence penalty that keeps the policy close to the original model it started from. This limits "reward hacking": exploiting loopholes in the reward model to get high scores without being genuinely helpful.
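The two ideas in this section can be sketched in a few lines. The first function shows one common way to apply the KL penalty: each token is penalized by beta times the log-probability gap between the current policy and the original (reference) policy, and the reward model's score is added on the final token. The second shows PPO's standard clipped surrogate objective for a single action. The names shaped_rewards and clipped_surrogate, and the default values for beta and clip_eps, are assumptions for illustration rather than settings taken from any particular system.

```python
import math


def shaped_rewards(rm_score: float,
                   policy_logprobs: list[float],
                   ref_logprobs: list[float],
                   beta: float = 0.1) -> list[float]:
    """Per-token rewards: a KL-style penalty everywhere, plus the RM score at the end."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference log-probs must align token by token")
    # Penalize each token where the policy assigns more probability than the reference.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    if rewards:
        rewards[-1] += rm_score  # the reward model scores the whole response once
    return rewards


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped objective for one action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


no_drift = shaped_rewards(1.0, [-1.0, -2.0], [-1.0, -2.0])
drifted = shaped_rewards(1.0, [-0.5, -0.5], [-1.0, -2.0])
```

With identical log-probabilities the penalty vanishes and the total reward equals the reward model's score; once the policy drifts toward higher-probability tokens than the reference, the total drops below that score.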

Beyond RLHF: DPO & Alternatives

Direct Preference Optimization (DPO) eliminates the need for a separate reward model and RL loop by optimizing the language model directly on preference pairs. Constitutional AI has the model critique and revise its own outputs against a written set of principles. The field is rapidly evolving toward simpler, more stable alignment methods.
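As a rough illustration of how DPO works without a reward model, the sketch below computes the standard DPO loss for one preference pair from four sequence log-probabilities: the chosen and rejected responses under the policy being trained and under a frozen reference model. The function name dpo_loss and the default beta are assumptions for illustration; in practice these values come from batched model forward passes.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_gain - rejected_gain)))."""
    # How much more likely each response became relative to the reference model.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_gain - rejected_gain)
    # Stable form of -log(sigmoid(logits)).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy identical to reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)    # chosen up, rejected down
```

The loss has the same logistic form as a pairwise reward model loss, except that the "score" of a response is the policy's log-probability gain over the reference, so the language model plays the role of its own reward model.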
