By Sahil Kapoor - 18 Apr 2026

RLHF (Reinforcement Learning from Human Feedback)

RLHF is the training recipe that turned raw language models (good at predicting text) into aligned assistants (good at following instructions helpfully and safely). It was popularized by InstructGPT (2022), and it or a closely related preference-tuning method underlies most major chat LLMs.

The Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

Start with a pre-trained base LLM. Fine-tune it on a dataset of (prompt, ideal response) pairs, where the ideal responses are written by human annotators following quality guidelines. This produces an SFT model that can follow instructions but isn't yet preference-optimized.
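
As a rough illustration of the SFT objective, the sketch below computes the average negative log-likelihood of the reference response tokens. It assumes the common (but not universal) practice of masking out prompt tokens so only the response contributes to the loss. The function name, the mask, and the probabilities are made up for illustration.

```python
import math

def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigns to each reference token.
    is_response_token: True for tokens of the ideal response, False for prompt
    tokens (assumption: prompt tokens are excluded from the loss).
    """
    picked = [-lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)

# Hypothetical values: two prompt tokens followed by two response tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]
loss = sft_loss(log_probs, mask)  # averages -log(0.5) and -log(0.25)
```

Lowering this loss pushes the model to assign higher probability to the annotator-written response, which is all that stage 1 does.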

Stage 2: Reward Model Training

Collect comparison data: for a given prompt, show annotators N responses from the SFT model and ask them to rank the responses (or simply pick the better of two, A vs B). Train a separate reward model that maps a (prompt, response) pair to a scalar score, so that responses humans prefer score higher. This reward model is the "judge" for stage 3.
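
One common way to train on such comparisons is a pairwise (Bradley-Terry style) loss, which InstructGPT also used. The sketch below is a minimal version under that assumption. The helper names are hypothetical, and the scores stand in for the outputs of a real reward model.

```python
import math
from itertools import combinations

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_reward_loss(score_chosen, score_rejected):
    """Loss is small when the preferred response gets the higher score."""
    return -log_sigmoid(score_chosen - score_rejected)

def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, dispreferred) pairs."""
    return list(combinations(ranked_responses, 2))

# A ranking of N = 4 responses yields 4 * 3 / 2 = 6 training pairs.
pairs = ranking_to_pairs(["a", "b", "c", "d"])
```

Only the difference between the two scores matters here, so the reward model's absolute scale is arbitrary. That is one reason stage 3 pairs the reward with a penalty for drifting from the SFT model.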

Stage 3: RL Fine-Tuning (PPO)

Use the reward model's scores as the reinforcement signal. The SFT model (now the "policy") generates responses; the reward model scores them; PPO (Proximal Policy Optimization) updates the policy to generate responses that score higher. A KL divergence penalty keeps the policy from drifting too far from the SFT baseline, which limits reward hacking (exploiting flaws in the reward model to get high scores for poor responses).
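
The two ingredients of this stage can be sketched in a few lines. This is a simplified sequence-level view, not a full PPO implementation: it omits the value function, advantage estimation, and batching. The names `beta` and `clip_eps` and their default values are illustrative assumptions.

```python
def kl_penalized_reward(reward_score, policy_logprob, ref_logprob, beta=0.1):
    """Reward model score minus a penalty for drifting from the SFT reference.

    policy_logprob - ref_logprob is a single-sample estimate of the KL
    divergence between the policy and the frozen SFT model for this response.
    """
    kl_estimate = policy_logprob - ref_logprob
    return reward_score - beta * kl_estimate

def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO's clipped surrogate objective for a single sample (to be maximized).

    ratio is pi_new(response) / pi_old(response). Clipping removes the
    incentive to move the ratio outside [1 - clip_eps, 1 + clip_eps].
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical numbers: the policy likes this response more than the SFT model does.
shaped = kl_penalized_reward(reward_score=1.0, policy_logprob=-10.0, ref_logprob=-12.0)
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement.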

DPO: A Simpler Alternative

Direct Preference Optimization (DPO) skips the separate reward model and the RL loop. It optimizes the policy directly on preference pairs with a simple classification-style loss, using a frozen copy of the SFT model as a reference. DPO is simpler to implement, generally more stable to train, and often produces comparable results; it is increasingly the preferred approach for smaller teams.
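
For a single preference pair, the DPO loss can be sketched as below. The inputs are sequence-level log-probabilities of the chosen and rejected responses under the policy and under the frozen reference model. The function and argument names are hypothetical, and `beta=0.1` is only an illustrative default.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each log-ratio measures how much more likely the policy makes a response
    than the reference does; the loss rewards widening the gap between the
    chosen and rejected log-ratios.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# At initialization the policy equals the reference, so the loss is log(2).
initial_loss = dpo_loss(-12.0, -15.0, -12.0, -15.0)
```

Because the reference log-probabilities appear inside the loss, the pull toward the SFT model is built in rather than added as a separate KL term.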

Why RLHF Matters

Without preference tuning such as RLHF, LLMs can be sycophantic (agreeing with the user regardless of accuracy), harmful (generating dangerous content), or unhelpful (verbosely dodging questions). RLHF aims to instill three values: helpfulness, harmlessness, and honesty. The specific balance of these properties is determined by the humans doing the preference annotation and the guidelines they follow, which is why different models have different "personalities."

LoRA and RLHF

LoRA (Low-Rank Adaptation) is frequently used to make RLHF practical: fine-tuning only small adapter weights instead of all parameters reduces memory and compute costs for both the SFT and RL stages.
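
To give a sense of the savings, the sketch below compares trainable parameter counts for one weight matrix and shows the LoRA forward pass, W x + (alpha / r) * B (A x), on tiny list-based matrices. The layer size (4096 x 4096) and rank (8) are illustrative assumptions, not values from any particular model.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full fine-tuning vs. a LoRA adapter (A and B)."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, adapter

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)               # W stays frozen during training
    update = matvec(B, matvec(A, x))  # only A and B are trained
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

full, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=8)
# full = 16,777,216 and adapter = 65,536, i.e. about 0.39% of the full count.
```

In the LoRA formulation B is initialized to zero, so the adapter starts as a no-op and the model initially behaves exactly like the frozen base.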

LoRA (Low-Rank Adaptation), parameter-efficient fine-tuning used within RLHF pipelines
Prompt Engineering, techniques for guiding model behavior at inference time (alternative to training)
Inference Endpoint, where the RLHF-trained model is ultimately served