LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning technique that trains only a small set of additional weight matrices instead of all model parameters, cutting the number of trainable parameters by orders of magnitude and substantially reducing training memory.
LoRA (Low-Rank Adaptation) is a fine-tuning method introduced by Hu et al. at Microsoft in 2021. Instead of updating all billions of parameters in a large model, LoRA freezes the original weights and injects trainable low-rank matrices into each transformer layer. The insight: weight updates during fine-tuning have low "intrinsic rank" — most of the useful signal lives in a much smaller subspace.
The Math
For a weight matrix W (d×k), LoRA learns two small matrices: B (d×r) and A (r×k), where r ≪ min(d,k). The adapted weight is W + BA, with W kept frozen. At inference, BA can be merged into W, so there is no extra latency. Trainable parameters per adapted matrix = r×(d+k) instead of d×k. With r=8 on a 7B parameter model, you train roughly 0.1% of the parameters; the exact fraction depends on which matrices are adapted.
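The arithmetic above can be checked with a small sketch. It assumes the convention from the LoRA paper (B is d×r, A is r×k, so BA has the same shape as W) and uses tiny hand-picked matrices; the helper names `matmul`, `add`, and `lora_param_count` are illustrative, not part of any library.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices with the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def lora_param_count(d, k, r):
    """Trainable parameters for one adapted d x k matrix: B (d x r) plus A (r x k)."""
    return r * (d + k)


# Toy sizes (assumed for illustration): d=3, k=2, r=1.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # frozen base weight, d x k
B = [[1.0], [0.0], [2.0]]                 # trainable, d x r
A = [[0.5, -1.0]]                         # trainable, r x k

# Merging: fold the low-rank update into the base weight once.
W_merged = add(W, matmul(B, A))

x = [[1.0], [2.0]]  # input as a k x 1 column

# Unmerged path: base output plus the low-rank correction.
y_separate = add(matmul(W, x), matmul(B, matmul(A, x)))
# Merged path: a single matrix multiply, same cost as the original layer.
y_merged = matmul(W_merged, x)

# Parameter count for one 4096 x 4096 projection at r=8.
full = 4096 * 4096
lora = lora_param_count(4096, 4096, 8)
print(f"{lora} trainable vs {full} full ({100 * lora / full:.2f}% of this matrix)")
```

Both paths give the same output, which is why merging adds no inference latency. The per-matrix fraction printed here is higher than the whole-model figure because in practice only some of the model's matrices receive adapters.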
QLoRA
QLoRA (Quantized LoRA) extends LoRA by quantizing the frozen base model to 4-bit precision (NF4) before fine-tuning, then training LoRA adapters in 16-bit on top of it. This lets you fine-tune a 65B parameter model on a single 48GB GPU; the QLoRA paper puts regular 16-bit fine-tuning of the same model at more than 780GB of GPU memory. QLoRA is a standard approach for fine-tuning large models on consumer or academic GPU budgets.
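A rough back-of-the-envelope sketch shows where the savings come from. It counts only the memory for the base model's weights and deliberately ignores quantization constants, activations, gradients, optimizer state, and the adapters themselves, so real usage is higher. The function name `weight_memory_gb` and the 65B example size are assumptions for illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 65e9  # example model size (assumed)

weights_16bit = weight_memory_gb(num_params, 16)  # 16-bit base weights
weights_4bit = weight_memory_gb(num_params, 4)    # 4-bit quantized base weights

print(f"16-bit weights: {weights_16bit:.1f} GB")
print(f"4-bit weights:  {weights_4bit:.1f} GB")
```

At 4 bits the weights alone drop to a quarter of their 16-bit size, which leaves headroom on a single large GPU for the 16-bit adapters and their training state.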
When to Use LoRA
- Domain adaptation — teach a general model the vocabulary and style of a specific domain (legal, medical, code)
- Instruction following — train a base model to follow chat-style instructions
- Format control — reliable output formatting (JSON schema, specific response structures)
- Behavior adjustment — reduce refusals, change tone, instill specific personas
LoRA vs Prompt Engineering
Before investing in LoRA fine-tuning, exhaust Prompt Engineering options. A well-crafted System Prompt with few-shot examples can often get most of the way to what fine-tuning achieves, with no training cost. LoRA makes sense when the task requires knowledge or behavior that prompting cannot elicit from the base model, when you need a consistent output format across millions of calls, or when you want to run a smaller, cheaper model that has been fine-tuned to approach a larger model's quality.
RLHF and LoRA
RLHF (Reinforcement Learning from Human Feedback) pipelines often use LoRA for both the supervised fine-tuning (SFT) stage and the reinforcement learning stage, because training adapters is more practical than full fine-tuning at scale.
Inference with LoRA Adapters
LoRA adapters are small files (megabytes, versus gigabytes for full weights). vLLM can load them dynamically and select one per request, and Ollama can attach an adapter to a base model. This enables "adapter serving": one base model kept in memory, with multiple task-specific adapters applied on top of it.
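The idea behind adapter serving can be sketched with a toy single-layer model. This is a conceptual illustration only: the `AdapterServer` class, its methods, and the adapter names are hypothetical and do not reflect the API of any real inference engine. It assumes each adapter is a pair (B, A) applied unmerged, so the shared base weight never changes.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class AdapterServer:
    """Toy one-layer 'model': a shared frozen weight plus named LoRA adapters."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # frozen d x k, shared by every request
        self.adapters = {}              # name -> (B, A), each pair is small

    def load_adapter(self, name, B, A):
        self.adapters[name] = (B, A)

    def unload_adapter(self, name):
        self.adapters.pop(name, None)

    def forward(self, x, adapter=None):
        y = matvec(self.base_weight, x)
        if adapter is None:
            return y  # plain base model
        B, A = self.adapters[adapter]
        update = matvec(B, matvec(A, x))  # low-rank correction B(Ax)
        return [a + b for a, b in zip(y, update)]


server = AdapterServer(base_weight=[[1.0, 0.0], [0.0, 1.0]])
server.load_adapter("legal", B=[[1.0], [0.0]], A=[[1.0, 1.0]])
server.load_adapter("medical", B=[[0.0], [2.0]], A=[[1.0, -1.0]])

x = [3.0, 4.0]
base_out = server.forward(x)
legal_out = server.forward(x, adapter="legal")
medical_out = server.forward(x, adapter="medical")
```

Each request picks an adapter by name while the base weight stays loaded once. Keeping the update unmerged costs a little extra compute per request, in exchange for switching tasks without reloading the model.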
Related Terms
- RLHF — fine-tuning paradigm that often uses LoRA internally
- vLLM — inference engine with LoRA adapter support
- Ollama — can load custom LoRA-adapted models in GGUF format
- Prompt Engineering — first thing to try before investing in fine-tuning