As of November 2025, many teams are hitting a hard ceiling with LLM costs. Long, repetitive system prompts, few-shot examples, and API docs routinely push requests into the tens of thousands of tokens. Even with prompt caching and cheaper models, input-side bloat can dominate your bill and latency. PromptIntern offers a different path: instead of endlessly sending the same instructions, you progressively “teach” the model those prompts during fine-tuning so that, at inference time, you can send just the user query. Experiments in the original PromptIntern paper (Findings of EMNLP 2024) show over 90% input token reduction, 4.2× faster inference, and around 88% cost savings, while preserving or even improving task accuracy. This guide explains how to apply PromptIntern’s Progressive Fine-tuning to your own specialized agents.
What PromptIntern is and why it cuts LLM costs
PromptIntern (“Prompt Intern”) is a prompt internalization method introduced by Zou et al. at EMNLP 2024. Instead of treating prompts as immutable text you always send to the model, PromptIntern:
- Splits each training prompt into three parts: template (instructions/docs), examples (few-shot demos), and query (user input).
- Runs a progressive fine-tuning schedule that gradually shortens/removes templates and examples during training.
- Internalizes that repeated context into the model’s weights (typically via PEFT/LoRA) so that, at inference, you can send query-only prompts.
In their NL2Code experiments (MBPP, NL2Formula, NL2Bash) using GPT‑3.5, GPT‑4, Mixtral‑8x7B, Llama 2‑7B/13B and others, PromptIntern achieved:
- >90% reduction in input tokens with query-only inference.
- 4.2× faster average inference latency.
- ~88.3% lower monetary inference costs compared to direct fine-tuning with full prompts.
- Equal or better task performance versus strong prompt compression baselines like LongLLMLingua/LLMLingua‑2.
Crucially, this is not just text compression. Standard prompt compression methods shorten the text but still require you to send compressed prompts at inference. PromptIntern instead moves the knowledge out of the prompt and into the model, so you stop paying for it per request.
How PromptIntern’s Progressive Fine-tuning works
The core of PromptIntern is a structured, multi-stage fine-tuning pipeline. You can implement the same pattern on modern models such as GPT‑4.1 mini / GPT‑4o‑mini (via OpenAI’s SFT/DPO APIs), Claude Sonnet 4.5 (via providers like Amazon Bedrock), or open-source models like Llama 3.1/3.3 70B.

1. Decompose your prompts: template, examples, query
Start by formalizing each training prompt x as a tuple:
x_tmp(template): fixed instructions, system message, API docs, formatting rules.x_egs(examples): few-shot demonstrations: input → output pairs.x_que(query): user’s actual question or task.
For an analytics agent answering spreadsheet questions, a full prompt might look like:
[Template]
You are an advanced data analyst...
Here is the Excel formulas API reference...
Formatting rules: respond as ```formula <code>``` only.
[Examples]
## Example 1
[NL]...[/NL]
[TABLE]...[/TABLE]
Output: ```formula ...```
...
## Example 10
...
[Query]
[NL] Who was the home team on February 3? [/NL]
[TABLE] ... [/TABLE]In PromptIntern terms:
x_tmp= everything in[Template](instructions + API docs).x_egs= allExample iblocks.x_que= the[Query]section.
2. Template compression with a decreasing schedule
Templates are long, mostly static across requests, and expensive. PromptIntern applies a template compressor C (e.g. LLMLingua‑2, summarization model, or hand-crafted pruning) parameterized by a compression rate τ_tmp:
\tilde{x_tmp} = C(x_tmp, τ_tmp)Then you define a schedule S_tmp(t) over training iterations/epochs t = 0..T:
- At
t = 0:τ_tmp = 1.0(full template, no compression). - Linearly decrease
τ_tmpacross stages: e.g. 1.0 → 0.6 → 0.3 → 0.0. - At final stage:
τ_tmp = 0→ no template at all in the input.
The EMNLP 2024 paper compared linear, exponential, and inverse-exponential schedules and found linear decay gave the most stable accuracy across benchmarks.
3. Example absorption with retrieval and removal
Few-shot examples are powerful but extremely token-heavy. PromptIntern introduces:
- Example retrieval: For each training instance, retrieve the top‑
kmost relevant examples from your dataset using a similarity function (e.g., BLEU on outputs, embedding similarity, or domain-specific metrics). - Example removal: Apply a schedule
S_egs(t)that decreaseskover training: e.g. 10 → 5 → 2 → 0.
Formally, for instance i with label y_i:
x_egs^i(t) = TopK_examples(D_train \ {i}, k(t), s(y_i, y_j))Early on, the model sees many high-quality demonstrations. Later, as it internalizes those patterns, you remove them. At the end, k = 0, so only the query is left.
4. Progressive fine-tuning loop
Putting it together, PromptIntern’s training loop (simplified) is:
for t in training_stages:
τ_tmp = S_tmp(t)
k = S_egs(t)
for (x_tmp, x_egs, x_que, y) in D_train:
# Preprocess prompt for this stage
x_tmp_t = C(x_tmp, τ_tmp)
x_egs_t = retrieve_examples(x_egs, y, k)
prompt_t = concat(x_tmp_t, x_egs_t, x_que)
loss = L(f_θ(prompt_t), y)
θ = θ - η * ∇_θ lossKey implementation details from the paper that you should mirror:
- Use PEFT, typically LoRA, so you only train low-rank adapters instead of all weights (LoRA remains well-supported in 2025 across Llama 3.x, Mistral, etc.).
- Keep the number of epochs similar to your baseline direct fine-tuning. You are changing inputs, not overtraining.
- Always keep the query format stable (same markers, slots, etc.), because that’s what you’ll use in production.
5. Query-only inference
After the final stage, the fine-tuned model f_{θ_T} has internalized both the template and the examples. At inference time you call:
ŷ = f_{θ_T}(x_que)No template. No docs. No examples. Just the current user query, ideally plus minimal routing metadata. In the paper, this compressed prompt regime achieved:
- 5.3×–9.3× average compression across NL2Code tasks.
- Strong accuracy versus “template with 5–10 shots” baselines while cutting tokens by 9.8×–27.4× in some settings.
Implementing PromptIntern for your stack (step-by-step)
This section gives a concrete implementation path you can adapt whether you are using OpenAI, Anthropic via Bedrock, or open-source models on your own GPUs.
Step 1: Choose your base model and deployment mode
- Cloud APIs (minimal infra)
- OpenAI:
gpt-4.1-mini-2025-04-14orgpt-4o-minifor supervised fine-tuning via the OpenAI fine-tuning API (see OpenAI’s 2025 fine-tuning docs). - Anthropic: Claude Sonnet 4.5 fine-tuning is supported via Amazon Bedrock and partner platforms.
- OpenAI:
- Self-hosted open-source
- Meta Llama 3.3 70B Instruct or Llama 3.1 8B/70B for high quality at lower cost (as of late 2025, 3.3‑70B matches or beats many proprietary mid‑tier models in benchmarks while being far cheaper per token).
- Other PEFT-friendly models like Mistral, Qwen, or Mixtral.
For large-scale cost savings, a common pattern is:
- Use Llama 3.3 70B (or similar) with PromptIntern as your task-specialist agent for high-volume workloads.
- Reserve frontier proprietary models (Claude Sonnet 4.5, GPT‑4.1) for low-volume, highest-criticality flows.
Step 2: Build a prompt schema and dataset
Formalize your production prompts into a schema that explicitly tags template, examples, and query. For each training sample:
- Extract
x_tmpfrom your current system prompt / instructions / docs. - Collect a pool of
x_egsfew-shot examples (you’ll use retrieval to pick subsets per instance). - Use real historical
x_quefrom logs wherever possible. - Store the ground truth output
y(human-validated answers, gold labels, etc.).
Format as JSONL suitable for your fine-tuning provider (e.g., OpenAI’s chat fine-tuning format or a simple input → output pair for open-source training).
Step 3: Implement template compression
You have three pragmatic options:
- Rule-based pruning: Strip redundant explanations, keep only constraints and key fields; effective and easy to control.
- LLM-based summarization: Run templates through an in-house model (e.g., Llama 3.1 8B) that compresses docs while preserving API semantics.
- Existing compressors: Integrate LLMLingua‑2 (used in PromptIntern’s baselines) to automatically select important tokens.
For each training stage t, apply the compressor with the appropriate τ_tmp from your linear schedule and cache the results so you don’t re-run compression on every epoch.
Step 4: Implement example retrieval and scheduling
Use an embedding-based index (e.g., FAISS, Elasticsearch with dense vectors, or a vector DB) over your labeled examples. For each training instance, retrieve top‑k(t) examples using cosine similarity on either:
- The natural language questions.
- The labeled outputs
y(PromptIntern used BLEU on code outputs).
Then define a simple linear schedule for k, such as:
- Stage 0–N1: k = 10 or 5.
- Stage N1–N2: k = 5 or 2.
- Stage N2–final: k = 0.
Experiments in the paper show that using a larger retrieval bank (up to 100% of the training set) improves performance, so avoid over-pruning your candidate pool.

Step 5: Fine-tune with PEFT / LoRA
For open-source models, use a framework like Hugging Face Transformers with PEFT/LoRA:
# Pseudocode: Llama 3.3 70B + LoRA + PromptIntern
for stage in stages:
τ_tmp = template_schedule[stage]
k = example_schedule[stage]
for batch in dataloader:
x_tmp, x_egs, x_que, y = batch
x_tmp_t = compress_template(x_tmp, τ_tmp)
x_egs_t = retrieve_examples(x_egs, y, k)
prompts = build_prompts(x_tmp_t, x_egs_t, x_que)
loss = model(prompts, labels=y).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()For OpenAI’s SFT on gpt-4o-mini or gpt-4.1-mini, preprocess each stage into a separate training file (JSONL) that already contains the compressed template and scheduled examples, and start a fine-tuning job via the fine-tuning API shown in OpenAI’s 2025 docs.
Step 6: Switch your agents to query-only prompts
Once validation shows the final stage (no template, no examples) matches your accuracy targets:
- Deploy the fine-tuned adapter/model into your serving stack (e.g., v2 of your “code-agent”).
- Update your routing/orchestration so that for that agent you now send only
x_que(the user query plus minimal task metadata). - Keep your old full-prompt baseline live behind a feature flag for A/B safety fallbacks.
Measure API token usage before/after; if your agent previously sent 3–5k tokens per call, expect input token counts to fall into the low hundreds, with corresponding cost and latency drops.
How PromptIntern compares to other cost-reduction methods
PromptIntern should sit alongside, not replace, other LLM cost strategies. Here’s how it compares.
| Technique | What it optimizes | Typical savings | Pros | Cons |
|---|---|---|---|---|
| PromptIntern (Progressive Fine-tuning) | Input tokens via internalizing prompts | >90% input reduction, ~4.2× speedup, ~88% cost (Zou et al., 2024) | Stable accuracy, no prompts at inference, works with PEFT | Requires fine-tuning pipeline, task-specific |
| Prompt compression (LLMLingua‑2, etc.) | Prompt length | 2–4× token cut | Easy to drop-in, no fine-tuning needed | Still pay per request; can hurt accuracy |
| Prompt caching (OpenAI, Claude, Gemini) | Repeated prefixes | Up to 50–90% on cached segments | Great for static system prompts and RAG contexts | Helps only when reuse is high; doesn’t solve long queries |
| Model routing / smaller models | Per-token price | 30–70% vs frontier models | Simple to implement, big wins for “easy” tasks | Doesn’t solve long prompts; quality may drop |
| RAG instead of fine-tuning | Knowledge freshness / storage | Varies; can be cheaper than full FT | No retrain for knowledge updates | Still sends long retrieved docs; latency and tokens grow |
The unique advantage of PromptIntern is that it targets the most stubborn cost component: long, repetitive templates and examples that appear on every call. Once internalized, they cost you nothing per request.
Best practices, pitfalls, and when to use PromptIntern
When PromptIntern is a good fit
- You have stable tasks and prompts: e.g., NL→Code, NL→SQL, spreadsheet/BI agents, support macros, report generation.
- Your prompts exceed ~1–2k tokens due to instructions, docs, or many examples.
- You’re already paying for fine-tuning or have GPU capacity for PEFT.
- Your workload is high-volume enough that per-request token savings dwarf one-time fine-tuning costs.
Common pitfalls
- Skipping the progressive schedule: Training with full prompts and then suddenly switching to query-only inference performs poorly; the paper’s ablations show large drops vs PromptIntern’s staged approach.
- Over-aggressive compression: Setting
τ_tmptoo low too early or too small a retrieval bank can hurt performance. Start with moderate compression and test. - Forgetting evaluation: Always keep a held-out test set and compare against your original prompt baseline so you don’t trade away accuracy for savings unknowingly.
Combining PromptIntern with other optimizations
- Use PromptIntern to strip templates/examples, then still apply prompt caching for user-level or RAG-level context that remains.
- Route simple queries to small, cheap models (Llama 3.2 3B, GPT‑4.1‑nano) and reserve the PromptIntern‑tuned specialist for complex tasks.
- Pair with RAG for knowledge that truly changes often, while internalizing stable decision rules and formats.
Conclusion: Turning your prompts into weights for 90%+ savings
LLM cost reduction in 2025 is no longer just about cheaper models and caching. If your agents still drag around kilobytes of static instructions and examples on every call, you’re paying the “prompt tax.” PromptIntern’s Progressive Fine-tuning offers a practical, research-backed way to eliminate that tax by internalizing your prompts into the model’s weights. Experiments show you can cut input tokens by more than 90%, reduce latency by around 4×, and trim billable inference costs by nearly 88%, while matching or beating baseline accuracy.
To apply PromptIntern in your own stack, start by decomposing prompts into template, examples, and query; design linear schedules for template compression and example removal; and fine-tune with PEFT on a modern model like Llama 3.3 or GPT‑4o‑mini. Once validated, flip your production agents to query-only prompts and measure the impact. Combined with existing strategies like prompt caching, model routing, and RAG, PromptIntern can be the lever that finally brings your LLM costs and latency under control without sacrificing capability.