LoRA (Low-Rank Adaptation)

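A parameter-efficient fine-tuning technique that trains only a small set of additional low-rank weight matrices instead of all model parameters. The number of trainable parameters typically drops by several orders of magnitude, and training memory falls with it because gradients and optimizer state are kept only for the added matrices.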

LoRA (Low-Rank Adaptation) is a fine-tuning method introduced by Hu et al. at Microsoft in 2021. Instead of updating all billions of parameters in a large model, LoRA freezes the original weights and injects trainable low-rank matrices into each transformer layer. The insight: weight updates during fine-tuning have low "intrinsic rank" — most of the useful signal lives in a much smaller subspace.

The Math

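For a weight matrix W (d×k), LoRA learns two small matrices: B (d×r) and A (r×k), where r ≪ min(d,k). The adapted weight is W + BA (the original paper also scales BA by a constant α/r). B is initialized to zero, so training starts from the unmodified model. At inference, BA can be merged into W, so there is no extra latency. Trainable parameters per adapted matrix = r×(d+k) instead of d×k. With r=8 on a 7B parameter model, you typically train on the order of 0.1% of the parameters, depending on which layers get adapters.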
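The parameter count and the merge step can be checked on a toy layer. The sketch below is a minimal NumPy illustration, not a training recipe: it assumes the shape convention B (d×r), A (r×k), omits the α/r scaling, and uses made-up toy sizes, a synthetic regression target, and plain gradient descent. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4                      # toy sizes; real layers are far larger

W = rng.normal(size=(d, k))              # frozen pretrained weight
B = np.zeros((d, r))                     # trainable, zero init so BA = 0 at the start
A = rng.normal(scale=0.01, size=(r, k))  # trainable, small random init

full_params = d * k                      # 3072
lora_params = r * (d + k)                # 448

# Synthetic task (an assumption for illustration): the "fine-tuned" weight
# differs from W by a rank-r update, which LoRA should be able to recover.
x = rng.normal(size=(256, k))
W_target = W + 0.1 * rng.normal(size=(d, r)) @ rng.normal(size=(r, k))
y = x @ W_target.T

def loss(B, A):
    err = x @ (W + B @ A).T - y
    return 0.5 * np.mean(np.sum(err ** 2, axis=1))

W_before = W.copy()
initial_loss = loss(B, A)
lr = 0.01
for _ in range(300):
    err = x @ (W + B @ A).T - y          # (n, d)
    grad_delta = err.T @ x / len(x)      # gradient w.r.t. the product BA, (d, k)
    grad_B = grad_delta @ A.T            # (d, r)
    grad_A = B.T @ grad_delta            # (r, k)
    B -= lr * grad_B                     # only B and A are updated; W stays frozen
    A -= lr * grad_A
final_loss = loss(B, A)

# Merging: folding BA into W gives the same outputs as running the adapter path.
W_merged = W + B @ A
x_new = rng.normal(size=(5, k))
adapter_out = x_new @ W.T + (x_new @ A.T) @ B.T
merged_out = x_new @ W_merged.T

print(f"trainable fraction: {lora_params / full_params:.3f}")
print(f"loss went down: {final_loss < initial_loss}")
print(f"merged == adapter path: {np.allclose(adapter_out, merged_out)}")
```

On a toy layer this small the trainable fraction is still around 15%; the savings grow with d and k because r×(d+k) scales linearly while d×k scales quadratically.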

QLoRA

QLoRA (Quantized LoRA) extends LoRA by quantizing the frozen base model to 4-bit precision (NF4) and training LoRA adapters in 16-bit on top of it. The QLoRA paper reports fine-tuning a 65B parameter model on a single 48GB GPU, whereas full 16-bit fine-tuning of a model that size needs hundreds of gigabytes spread across many GPUs. QLoRA is a common approach for fine-tuning large models on consumer or academic GPU budgets.
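A back-of-the-envelope estimate shows why 4-bit quantization changes what fits on one GPU. The sketch below counts weight memory only and ignores activations, quantization constants, and the (small) adapter and optimizer state. The 16 bytes per parameter figure for full fine-tuning is a rough rule of thumb for mixed-precision Adam and is an assumption here, as are the function names.

```python
GB = 1e9  # decimal gigabytes, matching how GPU memory is usually quoted

def weights_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone at a given precision."""
    return n_params * bits_per_param / 8 / GB

def full_finetune_gb(n_params: float, bytes_per_param: float = 16.0) -> float:
    """Rough estimate for weights + gradients + Adam state (assumed 16 B/param)."""
    return n_params * bytes_per_param / GB

budget_gb = 48
for name, n in {"65B": 65e9, "70B": 70e9}.items():
    w16 = weights_gb(n, 16)
    w4 = weights_gb(n, 4)
    print(
        f"{name}: 16-bit weights {w16:.1f} GB, 4-bit weights {w4:.1f} GB, "
        f"4-bit fits in {budget_gb} GB: {w4 < budget_gb}, "
        f"full fine-tune estimate {full_finetune_gb(n):.0f} GB"
    )
```

The remaining headroom under the 48 GB budget is what the adapters, their optimizer state, and activations have to share, so batch size and sequence length still matter in practice.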

When to Use LoRA

  • Domain adaptation — teach a general model the vocabulary and style of a specific domain (legal, medical, code)
  • Instruction following — train a base model to follow chat-style instructions
  • Format control — reliable output formatting (JSON schema, specific response structures)
  • Behavior adjustment — reduce refusals, change tone, instill specific personas

LoRA vs Prompt Engineering

Before investing in LoRA fine-tuning, exhaust Prompt Engineering options. A well-crafted System Prompt with few-shot examples often gets you most of the way to what fine-tuning achieves, with no training cost. LoRA makes sense when the task requires knowledge not in the base model, when you need a consistent output format across millions of calls, or when you want to fine-tune a smaller, cheaper model to match a larger model's quality on your task.

RLHF and LoRA

RLHF (Reinforcement Learning from Human Feedback) pipelines often use LoRA for the supervised fine-tuning (SFT), reward-model, and reinforcement learning stages, because it is more practical than full fine-tuning at scale.

Inference with LoRA Adapters

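LoRA adapters are small files (MBs vs GBs for full weights) that can be loaded on top of a shared base model at serving time. vLLM can serve multiple LoRA adapters against one base model, and Ollama can apply an adapter when a model is built. This enables "adapter serving": one base model, with multiple task-specific adapters selected per request.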
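The idea behind adapter serving can be sketched with a single toy layer: the base weight is shared and never modified, and each request names the adapter to apply. This is an illustration of the concept only. `AdapterServer` and its methods are hypothetical names, not the API of vLLM, Ollama, or any other serving engine, and real engines batch and cache adapters far more carefully.

```python
import numpy as np

class AdapterServer:
    """Toy single-layer 'server': one frozen base weight, many named LoRA adapters."""

    def __init__(self, base_weight):
        self.W = base_weight   # shared across all requests, never modified
        self.adapters = {}     # adapter name -> (B, A)

    def load_adapter(self, name, B, A):
        d, k = self.W.shape
        if B.shape[0] != d or A.shape[1] != k or B.shape[1] != A.shape[0]:
            raise ValueError("adapter shapes must be B (d, r) and A (r, k)")
        self.adapters[name] = (B, A)

    def unload_adapter(self, name):
        self.adapters.pop(name, None)

    def forward(self, x, adapter=None):
        out = x @ self.W.T
        if adapter is not None:
            B, A = self.adapters[adapter]   # KeyError if the adapter is not loaded
            out = out + (x @ A.T) @ B.T     # low-rank path added on top of the base
        return out

rng = np.random.default_rng(0)
d, k, r = 32, 16, 2
server = AdapterServer(rng.normal(size=(d, k)))
for name in ("legal", "medical"):
    server.load_adapter(name, rng.normal(size=(d, r)), rng.normal(size=(r, k)))

x = rng.normal(size=(3, k))
base_out = server.forward(x)
legal_out = server.forward(x, adapter="legal")
medical_out = server.forward(x, adapter="medical")

# Each adapter stores r * (d + k) numbers versus d * k for the base weight.
adapter_fraction = r * (d + k) / (d * k)
print(f"adapter size relative to base layer: {adapter_fraction:.4f}")
```

Because the adapters are kept unmerged, switching tasks is a dictionary lookup rather than a model reload; the trade-off is a small extra matrix multiply per request compared with a merged weight.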

  • RLHF — fine-tuning paradigm that often uses LoRA internally
  • vLLM — inference engine with LoRA adapter support
  • Ollama — can load custom LoRA-adapted models in GGUF format
  • Prompt Engineering — first thing to try before investing in fine-tuning
