Fine-Tuning a Small LLM on Your Own Data Without Running Out of VRAM
You have a small language model, a dataset you care about, and a GPU that tops out at 12 GB of VRAM. The moment you kick off a full fine-tuning run, your process dies with a CUDA out-of-memory error before the first training step completes. Sound familiar?
The good news is that full fine-tuning is rarely what you actually need. A handful of memory-efficient techniques, used together, let you adapt a capable 7-billion-parameter model to your specific task on hardware you already own.
What you'll learn
- Why full fine-tuning is so memory-hungry and what actually consumes VRAM
- How QLoRA (Quantized Low-Rank Adaptation) cuts memory use without wrecking model quality
- How to set up a training run with the Hugging Face transformers, peft, and bitsandbytes libraries
- Which hyperparameters matter most on a tight VRAM budget
- Common pitfalls that silently inflate memory usage
Prerequisites
You'll need Python 3.10+, a CUDA-capable GPU with at least 8 GB of VRAM, and the following packages installed:
pip install transformers==4.40.0 peft bitsandbytes accelerate datasets trl

The examples below use a 7B-parameter model in the Mistral or Llama family, but the same approach works for any model in the 3B-13B range that Hugging Face hosts.
Why Full Fine-Tuning Runs You Out of Memory
When you train a model, GPU memory holds several things at once: the model weights themselves, optimizer states (Adam stores two momentum tensors per parameter), gradients, and activations from the forward pass. For a 7B-parameter model in 32-bit precision, the weights alone take roughly 28 GB. Add gradients, optimizer states, and activations, and you need well over 100 GB just to start.
Reducing precision to 16-bit halves weight storage, but optimizer states are still large, and the total is still well beyond a single consumer card. That is why you need a different approach entirely rather than just a lower precision setting.
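To make the arithmetic concrete, here is a back-of-the-envelope estimate for full fine-tuning with 32-bit Adam (a rough sketch; activations are left out because they depend on batch size and sequence length):

```python
# Rough VRAM estimate for full fine-tuning a 7B model with Adam (activations excluded)
params = 7e9

weights     = params * 4       # fp32 weights: 4 bytes per parameter
gradients   = params * 4       # one fp32 gradient per parameter
adam_states = params * 4 * 2   # two fp32 momentum tensors per parameter

total_gb = (weights + gradients + adam_states) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB
```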
The Core Idea Behind LoRA and QLoRA
Low-Rank Adaptation (LoRA) freezes the original model weights and injects small trainable matrices into the attention layers. Instead of updating 7 billion parameters, you update a few million, typically less than 1% of the total. The frozen base weights never accumulate gradients, so their memory footprint stays static.
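To see the mechanism concretely, here is a minimal sketch of a LoRA-adapted linear layer. This is an illustration only, not the peft implementation; the class name and initialization details are assumptions for the example:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen nn.Linear (not the peft implementation)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in base.parameters():
            p.requires_grad_(False)  # the original weights never receive gradients
        # Low-rank factors: A starts small and random, B starts at zero so training begins from the base model
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the trainable low-rank update, scaled by alpha / r
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```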
QLoRA takes this further by loading the base model in 4-bit NormalFloat (NF4) quantization. A 7B model in NF4 sits at roughly 4-5 GB. Your trainable LoRA adapters add another 100-300 MB depending on rank. Activations and optimizer states for only the adapter parameters are small enough that 8-12 GB of VRAM is genuinely sufficient.
Quantization compresses the base weights for storage; the forward and backward passes run in a 16-bit compute dtype for numerical stability. Double quantization is a separate QLoRA trick: the quantization constants themselves are quantized, saving roughly another 0.4 bits per parameter.
Loading Your Model in 4-Bit
The BitsAndBytesConfig object tells the loader how to quantize the base model on the way in. You then attach LoRA adapters on top with get_peft_model.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch
model_id = "mistralai/Mistral-7B-v0.1" # swap for any compatible model
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, # saves ~0.4 bits per parameter extra
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token # required for batch padding
lora_config = LoraConfig(
r=16, # rank: higher = more capacity, more memory
lora_alpha=32, # scaling factor, often set to 2 * r
target_modules=["q_proj", "v_proj"], # which attention projections to adapt
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Typical output: trainable params: 4,194,304 || all params: 3,752,071,168 || trainable%: 0.1118

Setting device_map="auto" lets Accelerate spread layers across GPU and CPU RAM if needed. On an 8 GB card you usually fit the entire quantized 7B model on GPU without spilling.
Preparing Your Dataset
The format expected by causal language models is a single text string per example. For instruction-following tasks, the standard is to concatenate a prompt and a completion with a clear separator your tokenizer won't split awkwardly.
from datasets import Dataset
# Example: a list of dicts with "instruction" and "response" keys
raw_data = [
{"instruction": "Summarize the following contract clause:", "response": "The clause states..."},
# ... thousands more rows
]
def format_example(row):
return {"text": f"### Instruction:\n{row['instruction']}\n\n### Response:\n{row['response']}"}
dataset = Dataset.from_list(raw_data).map(format_example)
# Tokenize with a fixed max length to control activation memory
def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=512, # lower this first if you hit OOM
        padding="max_length",
    )
tokenized = dataset.map(tokenize, batched=True, remove_columns=["instruction", "response", "text"])

Keep max_length as low as your task allows. Attention score matrices grow with the square of the sequence length (most other activations grow linearly), so halving the sequence length can cut attention activation memory by roughly 4x.
Configuring the Training Run
The SFTTrainer from the trl library wraps Trainer with sensible defaults for supervised fine-tuning. A few arguments matter most for VRAM.
from transformers import TrainingArguments
from trl import SFTTrainer
training_args = TrainingArguments(
output_dir="./qlora-output",
per_device_train_batch_size=2, # start here; lower to 1 if OOM
gradient_accumulation_steps=8, # effective batch = 2 * 8 = 16
num_train_epochs=3,
learning_rate=2e-4,
fp16=False,
bf16=True, # more stable than fp16, but needs an Ampere-or-newer GPU (otherwise set fp16=True, bf16=False)
logging_steps=25,
save_strategy="epoch",
optim="paged_adamw_8bit", # 8-bit optimizer from bitsandbytes
gradient_checkpointing=True, # recompute activations to save memory
warmup_ratio=0.03,
lr_scheduler_type="cosine",
report_to="none",
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=tokenized,
dataset_text_field="text", # only needed if you skip pre-tokenization
max_seq_length=512,
)
trainer.train()

Three settings here do the heavy lifting for VRAM. Gradient checkpointing discards intermediate activations during the forward pass and recomputes them during the backward pass, trading compute time for memory. Paged AdamW 8-bit quantizes optimizer states and pages them to CPU when GPU pressure spikes. Gradient accumulation lets you simulate a larger batch without storing multiple batches in VRAM simultaneously.
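Gradient accumulation is worth spelling out, because it is what makes a real batch size of 1 or 2 viable. Conceptually the trainer runs a loop like the simplified sketch below (assumed names: model, optimizer, and dataloader are already set up; this is not the actual Trainer source):

```python
# Simplified sketch of gradient accumulation: many small backward passes, one optimizer step
accumulation_steps = 8
optimizer.zero_grad()

for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accumulation_steps  # scale so the summed gradients match one big batch
    loss.backward()                                  # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                             # one weight update per 8 micro-batches
        optimizer.zero_grad()
```

Only one micro-batch of activations lives in VRAM at a time, which is why the effective batch size can be large while memory stays low.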
Hyperparameters Worth Tuning First
LoRA Rank (r)
Rank 8 is often enough for domain adaptation: teaching the model your vocabulary and tone. Rank 16-32 is better if you need it to learn new reasoning patterns or a highly specialized output format. Higher rank means more adapter parameters and slightly more memory, but the difference between rank 8 and rank 32 is usually under 200 MB.
Target Modules
By default, most guides target only q_proj and v_proj. Including k_proj, o_proj, and the MLP projections (gate_proj, up_proj, down_proj) gives you more capacity and often better results, at the cost of more trainable parameters. Run model.named_modules() to see the exact names for your architecture.
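A quick way to list candidate module names is the small sketch below (the names in the comment are what Llama- and Mistral-style models typically expose; other architectures will differ):

```python
import torch.nn as nn

# Collect the leaf names of every Linear layer (bitsandbytes' Linear4bit subclasses nn.Linear)
linear_names = {name.split(".")[-1]
                for name, module in model.named_modules()
                if isinstance(module, nn.Linear)}
print(sorted(linear_names))
# Typically: ['down_proj', 'gate_proj', 'k_proj', 'lm_head', 'o_proj', 'q_proj', 'up_proj', 'v_proj']
```

Leave lm_head out of target_modules; it is the output head rather than a projection you would normally adapt.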
Sequence Length and Batch Size
These two levers have the biggest impact on VRAM after the base model size. Shrink sequence length first. Then reduce batch size to 1 and rely on gradient accumulation to maintain an effective batch size of 8-16. A real batch of 1 with 16 accumulation steps is almost identical in training dynamics to a real batch of 16.
Common Pitfalls
Forgetting to call prepare_model_for_kbit_training
When using 4-bit quantization, call prepare_model_for_kbit_training(model) before attaching LoRA. It freezes the base weights, casts the layer norms and a few other small modules to full precision, and sets the model up for gradient checkpointing; skipping it commonly shows up as NaN losses or outright crashes.
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

Labels not being set correctly
For causal LM training, the labels are simply a copy of the input IDs; the model shifts them internally by one position when computing the loss. SFTTrainer handles this automatically, but with a raw Trainer you must make sure the data collator sets labels, otherwise the model never returns a loss and training fails.
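If you do use a raw Trainer, the usual fix is a collator that copies input_ids into labels, for example DataCollatorForLanguageModeling with masked-language-modeling turned off (a sketch reusing the tokenized dataset and training_args from above):

```python
from transformers import DataCollatorForLanguageModeling, Trainer

# mlm=False means causal LM: the collator copies input_ids into labels and
# sets padding positions to -100 so they are ignored by the loss
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=collator,
)
```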
Tokenizer padding side
Padding side matters most for batched generation: decoder-only models continue from the last position, so padding must go on the left to keep the final token of every row a real token. Set tokenizer.padding_side = "left" before generating on padded batches. During training, padded positions are masked out of the loss, so the default right padding is fine there.
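For example, batched generation with left padding might look like this (a sketch; the prompts are placeholders):

```python
# Left padding keeps the final position of every row a real token,
# which is where a decoder-only model continues generating from
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token

prompts = [
    "### Instruction:\nSummarize the following contract clause: ...\n\n### Response:\n",
    "### Instruction:\nSummarize this email thread: ...\n\n### Response:\n",
]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```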
Saving and loading only the adapter
When you call trainer.save_model() after a QLoRA run, only the adapter weights are saved, not the full model. To run inference, you load the base model in 4-bit again and then load the adapter on top:
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
model = PeftModel.from_pretrained(base, "./qlora-output")
model.eval()

Monitoring VRAM During Training
Before you start a long run, do a quick memory check after the first forward-backward step. The torch.cuda.memory_summary() call gives you a breakdown of what is allocated versus reserved. A simpler one-liner:
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated")
print(f"{torch.cuda.memory_reserved() / 1e9:.2f} GB reserved")Reserved memory includes PyTorch's caching allocator buffer. If reserved exceeds your card's total VRAM, you will hit OOM on the next allocation spike. Lower batch size or sequence length and retry.
Wrapping Up
You now have a working path from raw dataset to fine-tuned adapter that fits inside a normal developer GPU. Here are the concrete next steps:
- Start with a smaller model. A 3B model fine-tunes faster, uses less VRAM, and often performs surprisingly well on narrow tasks. Move up to 7B only if quality is insufficient.
- Validate your data format first. Print 3-5 tokenized examples and decode them before training (see the sketch after this list). Garbled prompts or wrong padding will waste hours of compute.
- Run one epoch on a 500-row subset to confirm loss decreases before committing to a full run.
- Merge and export for production. After training, merge the adapter weights back into the base model with model.merge_and_unload() and save the result as a standard Hugging Face model. This removes the PEFT dependency at inference time.
- Evaluate on a held-out set with task-specific metrics, not just loss. Perplexity tells you the model learned the distribution; accuracy or ROUGE tells you whether it learned your task.
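The data-format check mentioned in the list above can be as simple as decoding a few rows of the tokenized dataset (a sketch reusing tokenizer and tokenized from earlier):

```python
# Decode a few tokenized rows to confirm the prompt template and padding look right
for row in tokenized.select(range(3)):
    print(tokenizer.decode(row["input_ids"], skip_special_tokens=False))
    print("-" * 40)
```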