RLHF (Reinforcement Learning from Human Feedback)

A training technique that uses human preference data to align language model behavior with human values — the key method behind ChatGPT, Claude, and Gemini's instruction-following and safety properties.

RLHF is the training recipe that turned raw language models (good at predicting text) into aligned assistants (good at following instructions helpfully and safely). It was popularized by InstructGPT (2022), and RLHF or a close variant of it underlies most major chat LLMs.

The Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

Start with a pre-trained base LLM. Fine-tune it on a dataset of (prompt, ideal response) pairs, where the ideal responses are written by human annotators following quality guidelines. This produces an SFT model that can follow instructions but isn't yet preference-optimized.
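As a rough illustration of the SFT objective, the sketch below computes the mean negative log-likelihood of the annotator-written response tokens. It assumes the prompt tokens are masked out of the loss, which is a common choice but not something stated above. The function name `sft_loss` and the sample probabilities are hypothetical.

```python
import math

def sft_loss(token_log_probs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    response_mask:   1 for tokens in the ideal response, 0 for prompt tokens
                     (masking the prompt is an assumption of this sketch).
    """
    picked = [lp for lp, m in zip(token_log_probs, response_mask) if m]
    if not picked:
        raise ValueError("no response tokens to train on")
    return -sum(picked) / len(picked)

# Hypothetical sequence: two prompt tokens followed by two response tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8), math.log(0.9)]
mask = [0, 0, 1, 1]
loss = sft_loss(log_probs, mask)
```

Only the last two tokens contribute here, so the loss falls as the model assigns higher probability to the annotator's response.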

Stage 2: Reward Model Training

Collect comparison data: for a given prompt, show annotators N responses from the SFT model and ask them to rank the responses (or to pick the better of two, A vs. B). Train a separate reward model that maps a (prompt, response) pair to a scalar score, so that responses humans preferred score higher than the ones they rejected. This reward model is the "judge" for stage 3.
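One common way to train on pairwise comparisons is a Bradley-Terry style loss, -log sigmoid(r_chosen - r_rejected), which pushes the preferred response's score above the rejected one's. The sketch below assumes that formulation. The function name `pairwise_reward_loss` is hypothetical and the scores are made-up numbers.

```python
import math

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way.

    Uses the identity -log(sigmoid(d)) = softplus(-d)
                                       = max(-d, 0) + log1p(exp(-|d|)).
    """
    d = r_chosen - r_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

# Reward model agrees with the annotator: small loss.
agree = pairwise_reward_loss(2.0, 0.0)
# Reward model disagrees with the annotator: large loss.
disagree = pairwise_reward_loss(0.0, 2.0)
# No preference expressed by the scores: loss is log(2).
tie = pairwise_reward_loss(1.0, 1.0)
```

A full ranking of N responses can be broken into its pairwise comparisons and each pair fed through the same loss.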

Stage 3: RL Fine-Tuning (PPO)

Use the reward model's scores as the reinforcement signal. The SFT model (now the "policy") generates responses; the reward model scores them; PPO (Proximal Policy Optimization) updates the policy to generate responses that score higher. A KL divergence penalty keeps the policy from drifting too far from the SFT baseline, which limits reward hacking (exploiting flaws in the reward model instead of genuinely improving the responses).
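The two ingredients of this stage can be sketched in a few lines: a reward with the KL penalty subtracted, and PPO's clipped surrogate objective for a single action. This is a minimal sketch, not a training loop. The KL term is estimated as the summed log-probability ratio over the sampled tokens (one common estimator), and the names `kl_shaped_reward` and `ppo_clip_objective` and the defaults `beta=0.1` and `eps=0.2` are illustrative assumptions.

```python
import math

def kl_shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward model score minus a KL penalty against the SFT reference.

    logp_policy / logp_ref: per-token log-probs of the sampled response under
    the current policy and the frozen SFT model. Their summed difference is a
    sample-based estimate of the KL divergence (an assumption of this sketch).
    """
    kl_estimate = sum(p - r for p, r in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_estimate

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate for one action (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Taking the minimum removes the incentive to move the ratio
    # outside [1 - eps, 1 + eps] in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)

# Policy has drifted from the reference, so the raw score of 1.0 is reduced.
shaped = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
# A large probability increase on a good action is capped by the clip.
capped = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)
```

Raising `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement; lowering it does the opposite.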

DPO: A Simpler Alternative

Direct Preference Optimization (DPO) skips the separate reward model and RL loop. It optimizes the policy directly on preference pairs with a simple classification-style loss, using a frozen copy of the starting model (typically the SFT model) as a reference. DPO is simpler to implement, generally more stable to train, and often produces comparable results, so it is increasingly the preferred approach for smaller teams.
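For a single preference pair, the DPO loss can be sketched as -log sigmoid(beta * margin), where the margin measures how much more the policy favors the chosen response over the rejected one than the reference model does. The function name `dpo_loss`, the default `beta=0.1`, and the log-probability values below are illustrative assumptions.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    trainable policy (pol_*) or the frozen reference model (ref_*).
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: margin is 0, loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Here `beta` plays a role similar to the KL penalty weight in the PPO stage: larger values penalize departures from the reference more strongly.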

Why RLHF Matters

Without RLHF, LLMs can be sycophantic (agree with the user regardless), harmful (generate dangerous content), or unhelpful (verbosely dodge questions). RLHF aims to instill values: helpfulness, harmlessness, and honesty. The specific balance of these properties is determined by the humans doing the preference annotation and the guidelines they follow, which is one reason different models have different "personalities."

LoRA and RLHF

LoRA (Low-Rank Adaptation) is frequently used to make RLHF practical: fine-tuning only small adapter weights instead of all parameters reduces memory and compute costs for both the SFT and RL stages.
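To give a sense of scale, the sketch below counts trainable parameters for one weight matrix under full fine-tuning versus LoRA. LoRA freezes the original d_out x d_in matrix and trains two low-rank factors of shapes d_out x r and r x d_in. The 4096 x 4096 layer size and rank 8 are illustrative assumptions, not figures from any particular model.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out

def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)

d = 4096   # hypothetical hidden size
r = 8      # hypothetical LoRA rank

full = full_params(d, d)               # 16_777_216
lora = lora_trainable_params(d, d, r)  # 65_536
reduction = full // lora               # 256x fewer trainable parameters
```

The savings come mostly from gradients and optimizer state, which are only kept for the adapter weights; the frozen base weights still have to be held in memory for the forward pass.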
