Fine-Tuning a Small LLM on Your Own Data Without Running Out of VRAM
You have a small language model, a dataset you care about, and a GPU that tops out at 12 GB of VRAM. The moment you kick off a full fine-tuning run, your process dies with a CUDA out-of-memory error before the first training step completes. Sound familiar?
The good news is that full fine-tuning is rarely what you actually need. A handful of memory-efficient techniques, used together, let you adapt a capable 7-billion-parameter model to your specific task on hardware you already own.
What you'll learn
- Why full fine-tuning is so memory-hungry and what actually consumes VRAM
- How QLoRA (Quantized Low-Rank Adaptation) cuts memory use without wrecking model quality
- How to set up a training run with the Hugging Face transformers, peft, and bitsandbytes libraries
- Which hyperparameters matter most on a tight VRAM budget
- Common pitfalls that silently inflate memory usage
Prerequisites
You'll need Python 3.10+, a CUDA-capable GPU with at least 8 GB of VRAM, and the following packages installed:
pip install transformers==4.40.0 peft bitsandbytes accelerate datasets trl

The examples below use a 7B-parameter model in the Mistral or Llama family, but the same approach works for any model in the 3B-13B range that Hugging Face hosts.
Why Full Fine-Tuning Runs You Out of Memory
When you train a model, GPU memory holds several things at once: the model weights themselves, optimizer states (Adam stores two momentum tensors per parameter), gradients, and activations from the forward pass. For a 7B-parameter model in 32-bit precision, the weights alone take roughly 28 GB. Add gradients, optimizer states, and activations, and you need well over 100 GB just to start.
Reducing precision to 16-bit halves weight storage, but optimizer states are still large, and the total is still well beyond a single consumer card. That is why you need a different approach entirely rather than just a lower precision setting.
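To make the arithmetic concrete, here is a back-of-the-envelope estimate for full fine-tuning with 32-bit Adam (a rough sketch; activations are left out because they depend on batch size and sequence length):

```python
# Rough VRAM estimate for full fine-tuning a 7B model with Adam (activations excluded)
params = 7e9

weights     = params * 4       # fp32 weights: 4 bytes per parameter
gradients   = params * 4       # one fp32 gradient per parameter
adam_states = params * 4 * 2   # two fp32 momentum tensors per parameter

total_gb = (weights + gradients + adam_states) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB
```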
The Core Idea Behind LoRA and QLoRA
Low-Rank Adaptation (LoRA) freezes the original model weights and injects small trainable matrices into the attention layers. Instead of updating 7 billion parameters, you update a few million, typically less than 1% of the total. The frozen base weights never accumulate gradients, so their memory footprint stays static.
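To see the mechanism concretely, here is a minimal sketch of a LoRA-adapted linear layer. This is an illustration only, not the peft implementation; the class name and initialization details are assumptions for the example:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen nn.Linear (not the peft implementation)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in base.parameters():
            p.requires_grad_(False)  # the original weights never receive gradients
        # Low-rank factors: A starts small and random, B starts at zero so training begins from the base model
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the trainable low-rank update, scaled by alpha / r
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```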
QLoRA takes this further by loading the base model in 4-bit NormalFloat (NF4) quantization. A 7B model in NF4 sits at roughly 4-5 GB. Your trainable LoRA adapters add another 100-300 MB depending on rank. Activations and optimizer states for only the adapter parameters are small enough that 8-12 GB of VRAM is genuinely sufficient.
Quantization compresses the base weights for storage; the forward and backward passes run in a 16-bit compute dtype for numerical stability. Double quantization is a separate QLoRA trick: the quantization constants themselves are quantized, saving roughly another 0.4 bits per parameter.
Loading Your Model in 4-Bit
The BitsAndBytesConfig object tells the loader how to quantize the base model on the way in. You then attach LoRA adapters on top with get_peft_model.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch
model_id = "mistralai/Mistral-7B-v0.1" # swap for any compatible model
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # saves ~0.4 bits per parameter extra
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token # required for batch padding
lora_config = LoraConfig(
r=16, # rank: higher = more capacity, more memory
lora_alpha=32, # scaling factor, often set to 2 * r
target_modules=["q_proj", "v_proj"], # which attention projections to adapt
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Typical output: trainable params: 4,194,304 || all params: 3,752,071,168 || trainable%: 0.1118

Setting device_map="auto" lets Accelerate spread layers across GPU and CPU RAM if needed. On an 8 GB card you usually fit the entire quantized 7B model on GPU without spilling.
Preparing Your Dataset
The format expected by causal language models is a single text string per example. For instruction-following tasks, the standard is to concatenate a prompt and a completion with a clear separator your tokenizer won't split awkwardly.
from datasets import Dataset
# Example: a list of dicts with "instruction" and "response" keys
raw_data = [
{"instruction": "Summarize the following contract clause:", "response": "The clause states..."},
# ... thousands more rows
]
def format_example(row):
return {"text": f"### Instruction:\n{row['instruction']}\n\n### Response:\n{row['response']}"}
dataset = Dataset.from_list(raw_data).map(format_example)
# Tokenize with a fixed max length to control activation memory
def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=512, # lower this first if you hit OOM
        padding="max_length",
    )
tokenized = dataset.map(tokenize, batched=True, remove_columns=["instruction", "response", "text"])

Keep max_length as low as your task allows. Attention score matrices grow with the square of the sequence length (most other activations grow linearly), so halving the sequence length can cut attention activation memory by roughly 4x.
Configuring the Training Run
The SFTTrainer from the trl library wraps Trainer with sensible defaults for supervised fine-tuning. A few arguments matter most for VRAM.
from transformers import TrainingArguments
from trl import SFTTrainer
training_args = TrainingArguments(
output_dir="./qlora-output",
per_device_train_batch_size=2, # start here; lower to 1 if OOM
gradient_accumulation_steps=8, # effective batch = 2 * 8 = 16
num_train_epochs=3,
learning_rate=2e-4,
fp16=False,
bf16=True, # more stable than fp16, but needs an Ampere-or-newer GPU (otherwise set fp16=True, bf16=False)
logging_steps=25,
save_strategy="epoch",
optim="paged_adamw_8bit", # 8-bit optimizer from bitsandbytes
gradient_checkpointing=True, # recompute activations to save memory
warmup_ratio=0.03,
lr_scheduler_type="cosine",
report_to="none",
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=tokenized,
dataset_text_field="text", # only needed if you skip pre-tokenization
max_seq_length=512,
)
trainer.train()

Three settings here do the heavy lifting for VRAM. Gradient checkpointing discards intermediate activations during the forward pass and recomputes them during the backward pass, trading compute time for memory. Paged AdamW 8-bit quantizes optimizer states and pages them to CPU when GPU pressure spikes. Gradient accumulation lets you simulate a larger batch without storing multiple batches in VRAM simultaneously.
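Gradient accumulation is worth spelling out, because it is what makes a real batch size of 1 or 2 viable. Conceptually the trainer runs a loop like the simplified sketch below (assumed names: model, optimizer, and dataloader are already set up; this is not the actual Trainer source):

```python
# Simplified sketch of gradient accumulation: many small backward passes, one optimizer step
accumulation_steps = 8
optimizer.zero_grad()

for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accumulation_steps  # scale so the summed gradients match one big batch
    loss.backward()                                  # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                             # one weight update per 8 micro-batches
        optimizer.zero_grad()
```

Only one micro-batch of activations lives in VRAM at a time, which is why the effective batch size can be large while memory stays low.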
Hyperparameters Worth Tuning First
LoRA Rank (r)
Rank 8 is often enough for domain adaptation: teaching the model your vocabulary and tone. Rank 16-32 is better if you need it to learn new reasoning patterns or a highly specialized output format. Higher rank means more adapter parameters and slightly more memory, but the difference between rank 8 and rank 32 is usually under 200 MB.
Target Modules
By default, most guides target only q_proj and v_proj. Including k_proj, o_proj, and the MLP projections (gate_proj, up_proj, down_proj) gives you more capacity and often better results, at the cost of more trainable parameters. Run model.named_modules() to see the exact names for your architecture.
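A quick way to list candidate module names is the small sketch below (the names in the comment are what Llama- and Mistral-style models typically expose; other architectures will differ):

```python
import torch.nn as nn

# Collect the leaf names of every Linear layer (bitsandbytes' Linear4bit subclasses nn.Linear)
linear_names = {name.split(".")[-1]
                for name, module in model.named_modules()
                if isinstance(module, nn.Linear)}
print(sorted(linear_names))
# Typically: ['down_proj', 'gate_proj', 'k_proj', 'lm_head', 'o_proj', 'q_proj', 'up_proj', 'v_proj']
```

Leave lm_head out of target_modules; it is the output head rather than a projection you would normally adapt.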
Sequence Length and Batch Size
These two levers have the biggest impact on VRAM after the base model size. Shrink sequence length first. Then reduce batch size to 1 and rely on gradient accumulation to maintain an effective batch size of 8-16. A real batch of 1 with 16 accumulation steps is almost identical in training dynamics to a real batch of 16.
Common Pitfalls
Forgetting to call prepare_model_for_kbit_training
When using 4-bit quantization, call prepare_model_for_kbit_training(model) before attaching LoRA. It freezes the base weights, casts the layer norms and a few other small modules to full precision, and sets the model up for gradient checkpointing; skipping it commonly shows up as NaN losses or outright crashes.
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

Labels not being set correctly
For causal LM training, the labels are simply a copy of the input IDs; the model shifts them internally by one position when computing the loss. SFTTrainer handles this automatically, but with a raw Trainer you must make sure the data collator sets labels, otherwise the model never returns a loss and training fails.
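If you do use a raw Trainer, the usual fix is a collator that copies input_ids into labels, for example DataCollatorForLanguageModeling with masked-language-modeling turned off (a sketch reusing the tokenized dataset and training_args from above):

```python
from transformers import DataCollatorForLanguageModeling, Trainer

# mlm=False means causal LM: the collator copies input_ids into labels and
# sets padding positions to -100 so they are ignored by the loss
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=collator,
)
```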
Tokenizer padding side
Padding side matters most for batched generation: decoder-only models continue from the last position, so padding must go on the left to keep the final token of every row a real token. Set tokenizer.padding_side = "left" before generating on padded batches. During training, padded positions are masked out of the loss, so the default right padding is fine there.
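For example, batched generation with left padding might look like this (a sketch; the prompts are placeholders):

```python
# Left padding keeps the final position of every row a real token,
# which is where a decoder-only model continues generating from
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "### Instruction:\nSummarize the following contract clause: ...\n\n### Response:\n",
    "### Instruction:\nSummarize this email thread: ...\n\n### Response:\n",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```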
Saving and loading only the adapter
When you call trainer.save_model() after a QLoRA run, only the adapter weights are saved, not the full model. To run inference, you load the base model in 4-bit again and then load the adapter on top:
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
model = PeftModel.from_pretrained(base, "./qlora-output")
model.eval()

Monitoring VRAM During Training
Before you start a long run, do a quick memory check after the first forward-backward step. The torch.cuda.memory_summary() call gives you a breakdown of what is allocated versus reserved. A simpler one-liner:
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated")
print(f"{torch.cuda.memory_reserved() / 1e9:.2f} GB reserved")Reserved memory includes PyTorch's caching allocator buffer. If reserved exceeds your card's total VRAM, you will hit OOM on the next allocation spike. Lower batch size or sequence length and retry.
Wrapping Up
You now have a working path from raw dataset to fine-tuned adapter that fits inside a normal developer GPU. Here are the concrete next steps:
- Start with a smaller model. A 3B model fine-tunes faster, uses less VRAM, and often performs surprisingly well on narrow tasks. Move up to 7B only if quality is insufficient.
- Validate your data format first. Print 3-5 tokenized examples and decode them before training (see the sketch after this list). Garbled prompts or wrong padding will waste hours of compute.
- Run one epoch on a 500-row subset to confirm loss decreases before committing to a full run.
- Merge and export for production. After training, merge the adapter weights back into the base model with model.merge_and_unload() and save the result as a standard Hugging Face model. This removes the PEFT dependency at inference time.
- Evaluate on a held-out set with task-specific metrics, not just loss. Perplexity tells you the model learned the distribution; accuracy or ROUGE tells you whether it learned your task.
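The data-format check mentioned in the list above can be as simple as decoding a few rows of the tokenized dataset (a sketch reusing tokenizer and tokenized from earlier):

```python
# Decode a few tokenized rows to confirm the prompt template and padding look right
for row in tokenized.select(range(3)):
    print(tokenizer.decode(row["input_ids"], skip_special_tokens=False))
    print("-" * 40)
```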