LoRA has become the de facto standard for efficient LLM fine-tuning. Yet many teams still see a stubborn accuracy gap compared to full fine-tuning, especially on reasoning-heavy or multimodal tasks. As of November 2025, DoRA (Weight-Decomposed Low-Rank Adaptation, ICML 2024) is emerging as a powerful upgrade: it keeps LoRA’s efficiency and zero-latency inference while closing much of that performance gap by separately updating weight magnitude and direction. In this guide, you’ll learn how DoRA works, why it can outperform LoRA, and how to enable DoRA in the Hugging Face PEFT library with only minimal code changes. We’ll focus on practical LLM fine-tuning workflows (e.g. LLaMA-family models) using PEFT and Transformers.
What is DoRA and why it beats LoRA in many settings
DoRA was introduced in the 2024 paper “DoRA: Weight-Decomposed Low-Rank Adaptation” (Liu et al., arXiv:2402.09353, ICML 2024 oral). The core idea is to decompose each pretrained weight matrix W into two parts:
- Magnitude (a scalar or diagonal-like component controlling the norm of each weight vector)
- Direction (the normalized weight vectors themselves)
Instead of only learning a low-rank delta on W like LoRA, DoRA:
- Updates the direction with a LoRA-style low-rank adapter
- Separately learns a small set of magnitude parameters
This lets DoRA mimic the learning behavior of full fine-tuning more closely. The NVIDIA blog post “Introducing DoRA, a High-Performing Alternative to LoRA for Fine-Tuning” (June 28, 2024) highlights several results:
- Commonsense reasoning (BoolQ, PIQA, SIQA, etc.): DoRA improves average scores by ~3–5 points over LoRA on LLaMA 1/2/3 models, with similar adapter parameter counts.
- Vision-language tasks: DoRA outperforms LoRA on image-text and video-text benchmarks using VL-BART and LLaVA.
- Compression-aware setups (QDoRA): Replacing LoRA with DoRA within QLoRA yields better math performance than both full fine-tuning and QLoRA on Orca-Math.
Critically, like LoRA, DoRA can be merged back into the base weights after training. That means no additional inference latency or memory versus a standard dense model.
LoRA vs DoRA: conceptual and practical differences
To understand why DoRA often outperforms LoRA, it helps to contrast the two at both the math and systems level.
| Aspect | LoRA | DoRA |
|---|---|---|
| Core idea | Learn low-rank delta ΔW = A B on full weight | Decompose W = M · D and learn low-rank delta on direction + separate magnitude |
| Capacity vs full FT | Often underfits in some layers; struggles to match FT | Learning patterns closer to FT (similar magnitude–direction tradeoff) |
| Trainable params | Low (e.g. <1% of base model) | Similar order of magnitude; can match or beat LoRA at equal or lower rank |
| Training stability | Can show correlated magnitude/direction updates that hurt convergence | Separates magnitude and direction, improving stability |
| Inference overhead | Zero after merging adapters | Zero after merging magnitude+direction back into W |
| Support in PEFT | Fully supported (core method) | Supported since PEFT v0.9.0 via the `use_dora` flag in `LoraConfig` |
Empirically (from the DoRA paper and NVIDIA blog):
- DoRA consistently beats LoRA on LLaMA-family commonsense reasoning benchmarks, with reported gains like +3.7 points on LLaMA-7B and +4.4 on LLaMA 3 8B.
- At lower rank (e.g. r=4 or r=8), DoRA maintains performance where LoRA degrades more, meaning you can save parameters and memory with similar or better accuracy.
When to prefer DoRA over LoRA
- You’re hitting a performance ceiling with LoRA on reasoning, instruction-following, or multimodal tasks.
- You want FT-like quality but are operating under parameter-efficient constraints (limited VRAM, many tasks/adapters to maintain).
- You already have a PEFT + LoRA pipeline and want a near drop-in upgrade with minimal code changes.
How DoRA works under the hood (intuitive view)
In standard LoRA, a weight matrix W (e.g. attention projection) is adapted as:
```
W' = W + A B    # A ∈ R^{d×r}, B ∈ R^{r×k}, rank r ≪ min(d, k)
```

This works well but constrains updates to a low-rank subspace of the full parameter space. The DoRA paper shows that full fine-tuning tends to adjust both the norms and directions of weight vectors in a way LoRA doesn't reproduce.
DoRA instead reparameterizes:
- `W = M · D`, where `M` encodes magnitudes (e.g. per-row or per-column scaling) and `D` is a direction matrix with normalized rows or columns.
- Train a LoRA adapter on `D` and a small set of additional parameters for `M`.
- At inference, `W'` is reconstructed and merged back: `W' = (M + ΔM) · (D + ΔD)`, so you only keep a dense matrix, no runtime adapter.
The key observation in the paper and NVIDIA’s figures is that DoRA’s updates in the (Δdirection, Δmagnitude) space align much more closely with full FT than LoRA’s, leading to more expressive yet still efficient adaptation.
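To make the decomposition concrete, here is a minimal PyTorch sketch of the idea, following the column-wise normalization used in the paper. This is an illustrative toy layer, not PEFT's actual implementation; names like `DoRALinearSketch`, `lora_A`, `lora_B`, and `magnitude` are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DoRALinearSketch(nn.Module):
    """Toy DoRA-style layer: frozen base weight, low-rank directional update,
    and a trainable per-column magnitude. Illustrative only, not PEFT's code."""

    def __init__(self, base_linear: nn.Linear, r: int = 8):
        super().__init__()
        # Frozen pretrained weight W0 with shape (out_features, in_features)
        self.register_buffer("base_weight", base_linear.weight.detach().clone())
        self.bias = base_linear.bias
        out_features, in_features = self.base_weight.shape
        # LoRA-style low-rank factors applied to the direction component
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        # Trainable magnitude vector, initialized to the column norms of W0
        self.magnitude = nn.Parameter(self.base_weight.norm(p=2, dim=0))

    def merged_weight(self) -> torch.Tensor:
        # Directional part: W0 + B A, then normalize each column to unit norm
        directional = self.base_weight + self.lora_B @ self.lora_A
        direction = directional / directional.norm(p=2, dim=0, keepdim=True)
        # Rescale each column by its learned magnitude
        return self.magnitude * direction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.merged_weight(), self.bias)


# Tiny smoke test on random data
layer = DoRALinearSketch(nn.Linear(16, 32), r=4)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 32])
```

Because `merged_weight()` produces a single dense matrix, the adapter can be folded back into the base weight after training, which is what makes the zero-latency inference claim possible.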

PEFT and DoRA: current status and versions
Before implementing DoRA, you need a recent PEFT version. According to the PEFT GitHub releases (accessed November 2025) and the DoRA project page:
- DoRA support was added in PEFT v0.9.0 (early 2024) as a new PEFT method “that can overcome the limit of low rank adaptations as seen in LoRA”.
- PEFT continues to evolve; releases throughout 2025 still ship LoRA, DoRA, and many other variants. Ensure you install PEFT ≥ 0.9.0 to use DoRA.
As of November 2025, a recommended setup is:
```bash
pip install -U "transformers" "peft" "accelerate" "datasets" "bitsandbytes"
```

PEFT integrates with Transformers (see the Transformers PEFT guide) to make attaching adapters to popular LLMs straightforward.
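A quick sanity check before training helps avoid silent failures on older environments. This is a minimal sketch; it only relies on `peft.__version__` and the fact that DoRA landed in v0.9.0:

```python
import peft
from packaging import version

print("PEFT version:", peft.__version__)
# DoRA support was added in PEFT v0.9.0; fail early if the installed version is older.
assert version.parse(peft.__version__) >= version.parse("0.9.0"), (
    "Upgrade PEFT to >= 0.9.0 to use use_dora=True"
)
```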
How to switch from LoRA to DoRA in Hugging Face PEFT
The good news: if you’re already using LoRA via PEFT, moving to DoRA typically requires only a single config flag change. The NVlabs/DoRA repo and community examples document that DoRA in PEFT is enabled through the LoRA configuration itself.
Baseline: standard LoRA fine-tuning with PEFT
This is the “before” setup, adapted from the PEFT configuration tutorial and Transformers PEFT docs.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "meta-llama/Meta-Llama-3-8B"  # example; use a model you have access to
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # optional, for QLoRA-style setups
    device_map="auto",
)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check % of trainable params
# ... set up dataset and TrainingArguments ...
trainer = Trainer(
    model=model,
    # train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
    # args=training_args,
)
trainer.train()
```

Enabling DoRA: minimal code changes
In PEFT ≥ 0.9.0, DoRA is controlled via a single extra parameter in `LoraConfig`: the `use_dora` flag, the same name used in the official DoRA examples:
```python
from peft import LoraConfig, TaskType, get_peft_model

dora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,  # <-- key change: enable DoRA instead of plain LoRA
)

model = get_peft_model(model, dora_config)
```

The rest of your training pipeline (datasets, Trainer, logging, etc.) remains unchanged. Internally, PEFT will:
- Reparameterize the targeted layers (e.g. attention projections) into magnitude + direction form.
- Attach a directional low-rank adapter plus magnitude parameters.
- Ensure that at save/merge time, the magnitude and direction are fused back into a standard dense weight.
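If you want to confirm that DoRA is actually active, you can inspect the trainable parameters of the wrapped model. The sketch below assumes the magnitude parameters carry names containing `lora_magnitude_vector`, which is how recent PEFT versions expose them internally; treat the exact name as an implementation detail of your PEFT version:

```python
# Count DoRA magnitude parameters vs. regular low-rank (A/B) parameters.
magnitude_params, low_rank_params = 0, 0
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if "lora_magnitude_vector" in name:
        magnitude_params += param.numel()
    elif "lora_A" in name or "lora_B" in name:
        low_rank_params += param.numel()

print(f"low-rank (direction) params: {low_rank_params:,}")
print(f"magnitude params:            {magnitude_params:,}")
# A non-zero magnitude count indicates DoRA is enabled on the targeted layers.
```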

Choosing ranks and targets for DoRA
Based on DoRA’s benchmarks and follow-up posts:
- Rank (`r`): DoRA often matches or beats LoRA at half the rank. If you were using `r=16` with LoRA, you can often try `r=8` with DoRA for similar or better accuracy and fewer parameters.
- Target modules: For LLMs, the standard choice is still attention projections (`q_proj`, `k_proj`, `v_proj`, `o_proj`) and sometimes MLP projections (`gate_proj`, `up_proj`, `down_proj`) for more capacity; see the config sketch after this list.
- Dropout and alpha: Start with your LoRA-tuned values; DoRA is usually robust to the same hyperparameters.
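As a concrete starting point, here is an illustrative config reflecting these suggestions: a reduced rank with wider module coverage. The module names match LLaMA-family decoders; the hyperparameter values are assumptions to tune, not recommendations from the paper:

```python
from peft import LoraConfig, TaskType

# Illustrative DoRA config: lower rank than a typical LoRA baseline,
# but broader coverage across attention and MLP projections.
dora_wide_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                      # try 4-16; DoRA tends to degrade less at low rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # LLaMA-style MLP projections
    ],
    use_dora=True,
)
```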
End-to-end example: DoRA fine-tuning a causal LLM
This example outlines a full DoRA fine-tuning script compatible with current PEFT and Transformers. It assumes a causal language modeling task (instruction tuning, domain adaptation, etc.).
```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
)
from peft import LoraConfig, TaskType, get_peft_model
from datasets import load_dataset

model_name = "meta-llama/Meta-Llama-3-8B"  # replace with an accessible model
dataset_name = "tatsu-lab/alpaca"          # example; use your own dataset

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 1. Load base model (optionally quantized)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
# 2. Configure DoRA via PEFT LoRA config
dora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,                      # try 4–16; DoRA can often use lower ranks
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,            # <-- activate DoRA
)

model = get_peft_model(model, dora_config)
model.print_trainable_parameters()
# 3. Prepare dataset
raw_ds = load_dataset(dataset_name)

def format_example(example):
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    target = example["output"]
    text = f"<|user|>: {prompt}\n<|assistant|>: {target}"
    tokens = tokenizer(
        text,
        truncation=True,
        max_length=1024,
        padding="max_length",
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_ds = raw_ds["train"].map(format_example, remove_columns=raw_ds["train"].column_names)
# 4. Training configuration
training_args = TrainingArguments(
    output_dir="./dora-llama3-alpaca",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    logging_steps=50,
    save_steps=1000,
    learning_rate=2e-4,
    fp16=True,
    bf16=False,
    report_to="none",
)
# 5. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds,
)
trainer.train()

# 6. Save the DoRA adapter (merge into the base weights separately for deployment)
model.save_pretrained("./dora-llama3-alpaca")
```

PEFT provides utilities to merge the DoRA adapter into the base model weights for pure dense deployment, eliminating runtime adapter overhead. Check the PEFT model merging guide in the docs for up-to-date APIs and examples.
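As a hedged sketch of that merge step (assuming the adapter was saved to `./dora-llama3-alpaca` as above; `merge_and_unload` is the standard PEFT call for folding LoRA-style adapters, including DoRA, back into the base weights):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "meta-llama/Meta-Llama-3-8B"
adapter_path = "./dora-llama3-alpaca"

# Load the base model in full/half precision (merging into quantized weights is limited),
# attach the trained DoRA adapter, and fold it back into a plain dense model.
base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_path)
merged = model.merge_and_unload()

# The merged model is a standard Transformers model with no adapter overhead.
merged.save_pretrained("./dora-llama3-alpaca-merged")
AutoTokenizer.from_pretrained(base_model_name).save_pretrained("./dora-llama3-alpaca-merged")
```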

Practical tips for getting the most from DoRA
To make DoRA fine-tuning effective and stable in real projects:
- Start from your best LoRA baseline and only change `use_dora=True`. This isolates the impact of DoRA.
- Sweep rank values (e.g. 4, 8, 16). Expect DoRA to hold up better than LoRA as you go lower.
- Monitor training loss and downstream metrics; DoRA often converges faster or to a lower loss than LoRA on the same data.
- When combining with quantization (QLoRA-style), consider a QDoRA setup: low-bit base model + DoRA adapters, preferably with strong regularization and careful LR schedules (see the sketch after this list).
- For complex tasks (multi-turn chat, multimodal), expand target modules to relevant projections (e.g. cross-attention or vision encoder layers) to fully exploit DoRA’s extra capacity.
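Here is a minimal QDoRA-style sketch along those lines, combining a 4-bit quantized base model with a DoRA adapter. The hyperparameters and model name are illustrative assumptions, not tuned values:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # any causal LM you have access to
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # standard PEFT prep for k-bit training

qdora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_dora=True,   # DoRA adapter on top of the 4-bit base = QDoRA-style setup
)
model = get_peft_model(model, qdora_config)
```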
In all these cases, inference remains as fast as the underlying dense model once you merge adapters, which is one of the key selling points of both LoRA and DoRA.
Conclusion: is DoRA “better than LoRA” for your LLM fine-tuning?
DoRA was designed to address a very specific weakness of classic LoRA: the remaining gap to full fine-tuning on challenging tasks, even when training is stable and efficient. By decomposing each weight into magnitude and direction, and learning both with a mix of scalar and low-rank updates, DoRA recovers much of full FT’s learning behavior while keeping PEFT’s small-parameter footprint and zero-latency inference.
As of late 2025, Hugging Face PEFT makes DoRA effectively a toggle on your existing LoRA setup, via a simple configuration flag in LoraConfig. If you’re already investing in LoRA-based LLM fine-tuning, you can:
- Clone your best LoRA experiments and re-run with DoRA enabled.
- Compare accuracy, convergence speed, and robustness at equal or reduced ranks.
- Deploy merged DoRA weights with no inference penalty over your current LoRA or FT baselines.
In many workloads, DoRA will be a drop-in upgrade: better performance for roughly the same cost. That makes it a strong candidate to become the new default in parameter-efficient LLM fine-tuning workflows built on Hugging Face PEFT.