LoRA has become the de facto standard for efficient LLM fine-tuning. Yet many teams still see a stubborn accuracy gap compared to full fine-tuning, especially on reasoning-heavy or multimodal tasks. As of November 2025, DoRA (Weight-Decomposed Low-Rank Adaptation, ICML 2024) is emerging as a powerful upgrade: it keeps LoRA’s efficiency and zero-latency inference while closing much of that performance gap by separately updating weight magnitude and direction. In this guide, you’ll learn how DoRA works, why it can outperform LoRA, and how to enable DoRA in the Hugging Face PEFT library with only minimal code changes. We’ll focus on practical LLM fine-tuning workflows (e.g. LLaMA-family models) using PEFT and Transformers.
What is DoRA and why it beats LoRA in many settings
DoRA was introduced in the 2024 paper “DoRA: Weight-Decomposed Low-Rank Adaptation” (Liu et al., arXiv:2402.09353, ICML 2024 oral). The core idea is to decompose each pretrained weight matrix W into two parts:
- Magnitude (a scalar or diagonal-like component controlling the norm of each weight vector)
- Direction (the normalized weight vectors themselves)
Instead of only learning a low-rank delta on W like LoRA, DoRA:
- Updates the direction with a LoRA-style low-rank adapter
- Separately learns a small set of magnitude parameters
This lets DoRA mimic the learning behavior of full fine-tuning more closely. The NVIDIA blog post “Introducing DoRA, a High-Performing Alternative to LoRA for Fine-Tuning” (June 28, 2024) highlights several results:
- Commonsense reasoning (BoolQ, PIQA, SIQA, etc.): DoRA improves average scores by ~3–5 points over LoRA on LLaMA 1/2/3 models, with similar adapter parameter counts.
- Vision-language tasks: DoRA outperforms LoRA on image-text and video-text benchmarks using VL-BART and LLaVA.
- Compression-aware setups (QDoRA): Replacing LoRA with DoRA within QLoRA yields better math performance than both full fine-tuning and QLoRA on Orca-Math.
Critically, like LoRA, DoRA can be merged back into the base weights after training. That means no additional inference latency or memory versus a standard dense model.
LoRA vs DoRA: conceptual and practical differences
To understand why DoRA often outperforms LoRA, it helps to contrast the two at both the math and systems level.
| Aspect | LoRA | DoRA |
|---|---|---|
| Core idea | Learn low-rank delta ΔW = A B on full weight | Decompose W = M · D and learn low-rank delta on direction + separate magnitude |
| Capacity vs full FT | Often underfits in some layers; struggles to match FT | Learning patterns closer to FT (similar magnitude–direction tradeoff) |
| Trainable params | Low (e.g. <1% of base model) | Similar order of magnitude; can match or beat LoRA at equal or lower rank |
| Training stability | Can show correlated magnitude/direction updates that hurt convergence | Separates magnitude and direction, improving stability |
| Inference overhead | Zero after merging adapters | Zero after merging magnitude+direction back into W |
| Support in PEFT | Fully supported (core method) | Supported since PEFT v0.9.0 via the `use_dora` flag in `LoraConfig` |
Empirically (from the DoRA paper and NVIDIA blog):
- DoRA consistently beats LoRA on LLaMA-family commonsense reasoning benchmarks, with reported gains like +3.7 points on LLaMA-7B and +4.4 on LLaMA 3 8B.
- At lower rank (e.g. r=4 or r=8), DoRA maintains performance where LoRA degrades more, meaning you can save parameters and memory with similar or better accuracy.
When to prefer DoRA over LoRA
- You’re hitting a performance ceiling with LoRA on reasoning, instruction-following, or multimodal tasks.
- You want FT-like quality but are operating under parameter-efficient constraints (limited VRAM, many tasks/adapters to maintain).
- You already have a PEFT + LoRA pipeline and want a near drop-in upgrade with minimal code changes.
How DoRA works under the hood (intuitive view)
In standard LoRA, a weight matrix W (e.g. attention projection) is adapted as:
```
W' = W + A B    # A ∈ R^{d×r}, B ∈ R^{r×k}, rank r ≪ min(d, k)
```

This works well but constrains updates to a low-rank subspace of the full parameter space. The DoRA paper shows that full fine-tuning tends to adjust both the norms and directions of weight vectors in a way LoRA doesn't reproduce.
DoRA instead reparameterizes:
- `W = M · D`, where `M` encodes magnitudes (e.g. per-row or per-column scaling) and `D` is a direction matrix with normalized rows or columns.
- Train a LoRA adapter on `D` and a small set of additional parameters for `M`.
- At inference, `W'` is reconstructed and merged back: `W' = (M + ΔM) · (D + ΔD)`, so you only keep a dense matrix, no runtime adapter.
The key observation in the paper and NVIDIA’s figures is that DoRA’s updates in the (Δdirection, Δmagnitude) space align much more closely with full FT than LoRA’s, leading to more expressive yet still efficient adaptation.
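To make the decomposition concrete, here is a minimal PyTorch sketch of the idea, following the column-wise normalization used in the paper. This is an illustrative toy layer, not PEFT's actual implementation; names like `DoRALinearSketch`, `lora_A`, `lora_B`, and `magnitude` are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DoRALinearSketch(nn.Module):
    """Toy DoRA-style layer: frozen base weight, low-rank directional update,
    and a trainable per-column magnitude. Illustrative only, not PEFT's code."""

    def __init__(self, base_linear: nn.Linear, r: int = 8):
        super().__init__()
        # Frozen pretrained weight W0 with shape (out_features, in_features)
        self.register_buffer("base_weight", base_linear.weight.detach().clone())
        self.bias = base_linear.bias
        out_features, in_features = self.base_weight.shape
        # LoRA-style low-rank factors applied to the direction component
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        # Trainable magnitude vector, initialized to the column norms of W0
        self.magnitude = nn.Parameter(self.base_weight.norm(p=2, dim=0))

    def merged_weight(self) -> torch.Tensor:
        # Directional part: W0 + B A, then normalize each column to unit norm
        directional = self.base_weight + self.lora_B @ self.lora_A
        direction = directional / directional.norm(p=2, dim=0, keepdim=True)
        # Rescale each column by its learned magnitude
        return self.magnitude * direction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.merged_weight(), self.bias)


# Tiny smoke test on random data
layer = DoRALinearSketch(nn.Linear(16, 32), r=4)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 32])
```

Because `merged_weight()` produces a single dense matrix, the adapter can be folded back into the base weight after training, which is what makes the zero-latency inference claim possible.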

PEFT and DoRA: current status and versions
Before implementing DoRA, you need a recent PEFT version. According to the PEFT GitHub releases (accessed November 2025) and the DoRA project page:
- DoRA support was added in PEFT v0.9.0 (early 2024) as a new PEFT method “that can overcome the limit of low rank adaptations as seen in LoRA”.
- PEFT continues to evolve; releases throughout 2025 still ship LoRA, DoRA, and many other variants. Ensure you install PEFT ≥ 0.9.0 to use DoRA.
As of November 2025, a recommended setup is:
```bash
pip install -U "transformers" "peft" "accelerate" "datasets" "bitsandbytes"
```

PEFT integrates with Transformers (see the Transformers PEFT guide) to make attaching adapters to popular LLMs straightforward.
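A quick sanity check before training helps avoid silent failures on older environments. This is a minimal sketch; it only relies on `peft.__version__` and the fact that DoRA landed in v0.9.0:

```python
import peft
from packaging import version

print("PEFT version:", peft.__version__)
# DoRA support was added in PEFT v0.9.0; fail early if the installed version is older.
assert version.parse(peft.__version__) >= version.parse("0.9.0"), (
    "Upgrade PEFT to >= 0.9.0 to use use_dora=True"
)
```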
How to switch from LoRA to DoRA in Hugging Face PEFT
The good news: if you’re already using LoRA via PEFT, moving to DoRA typically requires only a single config flag change. The NVlabs/DoRA repo and community examples document that DoRA in PEFT is enabled through the LoRA configuration itself.
Baseline: standard LoRA fine-tuning with PEFT
This is the “before” setup, adapted from the PEFT configuration tutorial and Transformers PEFT docs.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "meta-llama/Meta-Llama-3-8B"  # example; use a model you have access to
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # optional, for QLoRA-style setups
    device_map="auto",
)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check % of trainable params
# ... set up dataset and TrainingArguments ...
trainer = Trainer(
    model=model,
    # train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
    # args=training_args,
)
trainer.train()
```

Enabling DoRA: minimal code changes
In PEFT ≥ 0.9.0, DoRA is controlled via a single extra parameter in `LoraConfig`: the `use_dora` flag, the same name used in the official DoRA examples:
```python
from peft import LoraConfig, TaskType, get_peft_model

dora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,  # <-- key change: enable DoRA instead of plain LoRA
)

model = get_peft_model(model, dora_config)
```

The rest of your training pipeline (datasets, Trainer, logging, etc.) remains unchanged. Internally, PEFT will:
- Reparameterize the targeted layers (e.g. attention projections) into magnitude + direction form.
- Attach a directional low-rank adapter plus magnitude parameters.
- Ensure that at save/merge time, the magnitude and direction are fused back into a standard dense weight.
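If you want to confirm that DoRA is actually active, you can inspect the trainable parameters of the wrapped model. The sketch below assumes the magnitude parameters carry names containing `lora_magnitude_vector`, which is how recent PEFT versions expose them internally; treat the exact name as an implementation detail of your PEFT version:

```python
# Count DoRA magnitude parameters vs. regular low-rank (A/B) parameters.
magnitude_params, low_rank_params = 0, 0
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if "lora_magnitude_vector" in name:
        magnitude_params += param.numel()
    elif "lora_A" in name or "lora_B" in name:
        low_rank_params += param.numel()

print(f"low-rank (direction) params: {low_rank_params:,}")
print(f"magnitude params:            {magnitude_params:,}")
# A non-zero magnitude count indicates DoRA is enabled on the targeted layers.
```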

Choosing ranks and targets for DoRA
Based on DoRA’s benchmarks and follow-up posts:
- Rank (`r`): DoRA often matches or beats LoRA at half the rank. If you were using `r=16` with LoRA, you can often try `r=8` with DoRA for similar or better accuracy and fewer parameters.
- Target modules: For LLMs, the standard choice is still attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and sometimes MLP projections (`gate_proj`, `up_proj`, `down_proj`) for more capacity; see the config sketch after this list.
- Dropout and alpha: Start with your LoRA-tuned values; DoRA is usually robust to the same hyperparameters.
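As a concrete starting point, here is an illustrative config reflecting these suggestions: a reduced rank with wider module coverage. The module names match LLaMA-family decoders; the hyperparameter values are assumptions to tune, not recommendations from the paper:

```python
from peft import LoraConfig, TaskType

# Illustrative DoRA config: lower rank than a typical LoRA baseline,
# but broader coverage across attention and MLP projections.
dora_wide_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                      # try 4-16; DoRA tends to degrade less at low rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # LLaMA-style MLP projections
    ],
    use_dora=True,
)
```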
End-to-end example: DoRA fine-tuning a causal LLM
This example outlines a full DoRA fine-tuning script compatible with current PEFT and Transformers. It assumes a causal language modeling task (instruction tuning, domain adaptation, etc.).
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, TaskType, get_peft_model
from datasets import load_dataset

model_name = "meta-llama/Meta-Llama-3-8B"  # replace with an accessible model
dataset_name = "tatsu-lab/alpaca"          # example; use your own dataset

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 1. Load base model (optionally quantized)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
# 2. Configure DoRA via PEFT LoRA config
dora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,                      # try 4–16; DoRA can often use lower ranks
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,            # <-- activate DoRA
)

model = get_peft_model(model, dora_config)
model.print_trainable_parameters()
# 3. Prepare dataset
raw_ds = load_dataset(dataset_name)

def format_example(example):
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    target = example["output"]
    text = f"<|user|>: {prompt}\n<|assistant|>: {target}"
    tokens = tokenizer(
        text,
        truncation=True,
        max_length=1024,
        padding="max_length",
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_ds = raw_ds["train"].map(format_example, remove_columns=raw_ds["train"].column_names)
# 4. Training configuration
training_args = TrainingArguments(
    output_dir="./dora-llama3-alpaca",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    logging_steps=50,
    save_steps=1000,
    learning_rate=2e-4,
    fp16=True,
    bf16=False,
    report_to="none",
)
# 5. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds,
)
trainer.train()

# 6. Save the DoRA adapter (merge into the base weights separately for deployment)
model.save_pretrained("./dora-llama3-alpaca")
```

PEFT provides utilities to merge the DoRA adapter into the base model weights for pure dense deployment, eliminating runtime adapter overhead. Check the PEFT model merging guide in the docs for up-to-date APIs and examples.
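As a hedged sketch of that merge step (assuming the adapter was saved to `./dora-llama3-alpaca` as above; `merge_and_unload` is the standard PEFT call for folding LoRA-style adapters, including DoRA, back into the base weights):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "meta-llama/Meta-Llama-3-8B"
adapter_path = "./dora-llama3-alpaca"

# Load the base model in full/half precision (merging into quantized weights is limited),
# attach the trained DoRA adapter, and fold it back into a plain dense model.
base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_path)
merged = model.merge_and_unload()

# The merged model is a standard Transformers model with no adapter overhead.
merged.save_pretrained("./dora-llama3-alpaca-merged")
AutoTokenizer.from_pretrained(base_model_name).save_pretrained("./dora-llama3-alpaca-merged")
```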

Practical tips for getting the most from DoRA
To make DoRA fine-tuning effective and stable in real projects:
- Start from your best LoRA baseline and only change `use_dora=True`. This isolates the impact of DoRA.
- Sweep rank values (e.g. 4, 8, 16). Expect DoRA to hold up better than LoRA as you go lower.
- Monitor training loss and downstream metrics; DoRA often converges faster or to a lower loss than LoRA on the same data.
- When combining with quantization (QLoRA-style), consider a QDoRA setup: low-bit base model + DoRA adapters, preferably with strong regularization and careful LR schedules (see the sketch after this list).
- For complex tasks (multi-turn chat, multimodal), expand target modules to relevant projections (e.g. cross-attention or vision encoder layers) to fully exploit DoRA’s extra capacity.
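Here is a minimal QDoRA-style sketch along those lines, combining a 4-bit quantized base model with a DoRA adapter. The hyperparameters and model name are illustrative assumptions, not tuned values:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # any causal LM you have access to
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # standard PEFT prep for k-bit training

qdora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,   # DoRA adapter on top of the 4-bit base = QDoRA-style setup
)
model = get_peft_model(model, qdora_config)
```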
In all these cases, inference remains as fast as the underlying dense model once you merge adapters, which is one of the key selling points of both LoRA and DoRA.
Conclusion: is DoRA “better than LoRA” for your LLM fine-tuning?
DoRA was designed to address a very specific weakness of classic LoRA: the remaining gap to full fine-tuning on challenging tasks, even when training is stable and efficient. By decomposing each weight into magnitude and direction, and learning both with a mix of scalar and low-rank updates, DoRA recovers much of full FT’s learning behavior while keeping PEFT’s small-parameter footprint and zero-latency inference.
As of late 2025, Hugging Face PEFT makes DoRA effectively a toggle on your existing LoRA setup, via a simple configuration flag in LoraConfig. If you’re already investing in LoRA-based LLM fine-tuning, you can:
- Clone your best LoRA experiments and re-run with DoRA enabled.
- Compare accuracy, convergence speed, and robustness at equal or reduced ranks.
- Deploy merged DoRA weights with no inference penalty over your current LoRA or FT baselines.
In many workloads, DoRA will be a drop-in upgrade: better performance for roughly the same cost. That makes it a strong candidate to become the new default in parameter-efficient LLM fine-tuning workflows built on Hugging Face PEFT.