By Sahil Kapoor - 16 Apr 2026

LoRA (Low-Rank Adaptation)

LoRA (Low-Rank Adaptation) is a fine-tuning method introduced by Hu et al. at Microsoft in 2021. Instead of updating every parameter in a large model (often billions), LoRA freezes the original weights and injects trainable low-rank matrices into each transformer layer. The insight: weight updates during fine-tuning have low "intrinsic rank", so most of the useful signal lives in a much smaller subspace.

The Math

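For a weight matrix W (d×k), LoRA learns two small matrices: B (d×r) and A (r×k), where r ≪ min(d,k), so that the product BA has the same shape as W. The adapted weight is W + BA. At inference, BA can be merged into W, so there is no extra latency. Trainable parameters per adapted matrix = r×(d+k) instead of d×k. With r=8 on a 7B parameter model, you train roughly 0.1% of the parameters; the exact figure depends on which weight matrices are adapted.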
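As a rough illustration of the shapes involved, the NumPy sketch below builds a toy frozen matrix W and a low-rank update, checks that the merged and unmerged forward passes agree, and counts parameters. The sizes, the random stand-ins for "trained" values, and the helper name `lora_param_count` are assumptions made for the demo, not part of any library. It takes B as d×r and A as r×k so that BA has the same shape as W.

```python
import numpy as np

def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in a rank-r update of a d x k matrix."""
    return r * (d + k)

rng = np.random.default_rng(0)
d, k, r = 64, 48, 8                       # toy sizes, chosen only for the demo

W = rng.normal(size=(d, k))               # frozen pretrained weight
B = rng.normal(scale=0.01, size=(d, r))   # stand-in for a trained B (d x r)
A = rng.normal(scale=0.01, size=(r, k))   # stand-in for a trained A (r x k)
x = rng.normal(size=k)                    # one input vector

# During training: frozen path plus low-rank path; W itself is never modified.
y_unmerged = W @ x + B @ (A @ x)

# For inference: fold the update into a single matrix of the original shape.
W_merged = W + B @ A
y_merged = W_merged @ x

print(np.allclose(y_unmerged, y_merged))  # compares the two forward passes
print(lora_param_count(d, k, r), "trainable vs", d * k, "in W")

# Same count for one 4096 x 4096 projection at r=8 (an assumed layer size).
ratio = lora_param_count(4096, 4096, 8) / (4096 * 4096)
print(f"{ratio:.4%} of that matrix")
```

The per-matrix fraction printed at the end is typically higher than the whole-model fraction, since in practice only some of the weight matrices in each layer get an adapter.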

QLoRA

QLoRA (Quantized LoRA) extends LoRA by quantizing the frozen base model to 4-bit precision (NF4) before fine-tuning, then training LoRA adapters in 16-bit on top of it. The QLoRA paper reports fine-tuning a 65B parameter model on a single 48GB GPU, hardware that is otherwise limited to full fine-tuning of far smaller models. QLoRA is the standard approach for fine-tuning large models on consumer or academic GPU budgets.
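The memory saving can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts the base weights only; it deliberately ignores quantization constants, adapter weights, gradients, optimizer state, and activations, so real usage will be higher. The helper name `weight_memory_gb` and the model sizes are illustrative assumptions.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Illustrative model sizes; weights only, nothing else is counted.
for n_params in (7e9, 65e9, 70e9):
    sixteen_bit = weight_memory_gb(n_params, 16)
    four_bit = weight_memory_gb(n_params, 4)
    print(f"{n_params / 1e9:.0f}B params: "
          f"{sixteen_bit:.1f} GB at 16-bit, {four_bit:.1f} GB at 4-bit")
```

At 4 bits, the weights of a 65B to 70B model come to roughly 33 to 35 GB, which leaves headroom on a 48 GB card, whereas the same weights at 16-bit (130 to 140 GB) do not fit at all.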

When to Use LoRA

Domain adaptation: teach a general model the vocabulary and style of a specific domain (legal, medical, code)
Instruction following: train a base model to follow chat-style instructions
Format control: reliable output formatting (JSON schema, specific response structures)
Behavior adjustment: reduce refusals, change tone, instill specific personas

LoRA vs Prompt Engineering

Before investing in LoRA fine-tuning, exhaust your prompt engineering options. A well-crafted system prompt with few-shot examples often achieves most of what fine-tuning does at zero training cost. LoRA makes sense when: the task requires knowledge not in the base model, you need consistent output format across millions of calls, or you need to run a smaller/cheaper model for cost reasons after fine-tuning it to match a larger model's quality.

RLHF and LoRA

RLHF (Reinforcement Learning from Human Feedback) is often implemented with LoRA in both the supervised fine-tuning (SFT) stage and the reinforcement learning stage, because training adapters is more practical than full fine-tuning at scale.

Inference with LoRA Adapters

LoRA adapters are small files (MBs vs GBs for full weights) that inference servers such as vLLM and Ollama can load on top of a base model. This enables "adapter serving": one base model, multiple task-specific adapters loaded dynamically.
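The idea behind adapter serving can be sketched with a toy single-layer "model" in NumPy: the base weight is held once, and each request picks which low-rank correction to add. This is only a conceptual sketch, not how vLLM or Ollama implement it; the class `AdapterServer`, its methods, and the adapter names "legal" and "sql" are all hypothetical.

```python
import numpy as np

class AdapterServer:
    """Toy one-layer 'model': a shared frozen weight plus named LoRA adapters."""

    def __init__(self, W: np.ndarray):
        self.W = W              # base weight, shape (d, k), shared by all tasks
        self.adapters = {}      # adapter name -> (B, A)

    def load_adapter(self, name: str, B: np.ndarray, A: np.ndarray) -> None:
        d, k = self.W.shape
        if B.shape[0] != d or A.shape[1] != k or B.shape[1] != A.shape[0]:
            raise ValueError("adapter shapes must be (d, r) and (r, k)")
        self.adapters[name] = (B, A)

    def unload_adapter(self, name: str) -> None:
        self.adapters.pop(name, None)

    def forward(self, x: np.ndarray, adapter=None) -> np.ndarray:
        y = self.W @ x                      # base model path
        if adapter is not None:
            B, A = self.adapters[adapter]   # pick the adapter for this request
            y = y + B @ (A @ x)             # add its low-rank correction
        return y

rng = np.random.default_rng(0)
d, k, r = 32, 32, 4                         # toy sizes for the demo
server = AdapterServer(rng.normal(size=(d, k)))
for name in ("legal", "sql"):               # hypothetical task adapters
    server.load_adapter(name, rng.normal(size=(d, r)), rng.normal(size=(r, k)))

x = rng.normal(size=k)
y_base = server.forward(x)                  # no adapter: plain base model
y_legal = server.forward(x, adapter="legal")
y_sql = server.forward(x, adapter="sql")

adapter_params = r * (d + k)
print(adapter_params, "parameters per adapter vs", d * k, "in the base weight")
```

Keeping the adapters unmerged is what makes switching cheap here: the base weight is never copied or modified, and loading or dropping a task only touches the small B and A matrices.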

RLHF: fine-tuning paradigm that often uses LoRA internally
vLLM: inference engine with LoRA adapter support
Ollama: can load custom LoRA-adapted models in GGUF format
Prompt Engineering: first thing to try before investing in fine-tuning