How to Cut LLM Costs by 90% with PromptIntern: A Guide

2025-11-20515-promptintern-cost-saving-visual

As of November 2025, many teams are hitting a hard ceiling with LLM costs. Long, repetitive system prompts, few-shot examples, and API docs routinely push requests into the tens of thousands of tokens. Even with prompt caching and cheaper models, input-side bloat can dominate your bill and latency. PromptIntern offers a different path: instead of endlessly sending the same instructions, you progressively “teach” the model those prompts during fine-tuning so that, at inference time, you can send just the user query. Experiments in the original PromptIntern paper (Findings of EMNLP 2024) show over 90% input token reduction, 4.2× faster inference, and around 88% cost savings, while preserving or even improving task accuracy. This guide explains how to apply PromptIntern’s Progressive Fine-tuning to your own specialized agents.

What PromptIntern is and why it cuts LLM costs

PromptIntern (“Prompt Intern”) is a prompt internalization method introduced by Zou et al. at EMNLP 2024. Instead of treating prompts as immutable text you always send to the model, PromptIntern:

  • Splits each training prompt into three parts: template (instructions/docs), examples (few-shot demos), and query (user input).
  • Runs a progressive fine-tuning schedule that gradually shortens/removes templates and examples during training.
  • Internalizes that repeated context into the model’s weights (typically via PEFT/LoRA) so that, at inference, you can send query-only prompts.

In their NL2Code experiments (MBPP, NL2Formula, NL2Bash) using GPT‑3.5, GPT‑4, Mixtral‑8x7B, Llama 2‑7B/13B and others, PromptIntern achieved:

  • >90% reduction in input tokens with query-only inference.
  • 4.2× faster average inference latency.
  • ~88.3% lower monetary inference costs compared to direct fine-tuning with full prompts.
  • Equal or better task performance versus strong prompt compression baselines like LongLLMLingua/LLMLingua‑2.

Crucially, this is not just text compression. Standard prompt compression methods shorten the text but still require you to send compressed prompts at inference. PromptIntern instead moves the knowledge out of the prompt and into the model, so you stop paying for it per request.

How PromptIntern’s Progressive Fine-tuning works

The core of PromptIntern is a structured, multi-stage fine-tuning pipeline. You can implement the same pattern on modern models such as GPT‑4.1 mini / GPT‑4o‑mini (via OpenAI’s SFT/DPO APIs), Claude Sonnet 4.5 (via providers like Amazon Bedrock), or open-source models like Llama 3.1/3.3 70B.

Architecture diagram of the PromptIntern workflow showing template, examples, and query flowing through preprocess, progressive fine-tuning with schedules for template compression and example absorption, and final query-only inference on the fine-tuned LLM
High-level PromptIntern workflow: progressively internalize template and examples so inference only needs the query.

1. Decompose your prompts: template, examples, query

Start by formalizing each training prompt x as a tuple:

  • x_tmp (template): fixed instructions, system message, API docs, formatting rules.
  • x_egs (examples): few-shot demonstrations: input → output pairs.
  • x_que (query): user’s actual question or task.

For an analytics agent answering spreadsheet questions, a full prompt might look like:

[Template]
You are an advanced data analyst...
Here is the Excel formulas API reference...
Formatting rules: respond as ```formula <code>``` only.

[Examples]
## Example 1
[NL]...[/NL]
[TABLE]...[/TABLE]
Output: ```formula ...```
...
## Example 10
...

[Query]
[NL] Who was the home team on February 3? [/NL]
[TABLE] ... [/TABLE]

In PromptIntern terms:

  • x_tmp = everything in [Template] (instructions + API docs).
  • x_egs = all Example i blocks.
  • x_que = the [Query] section.

2. Template compression with a decreasing schedule

Templates are long, mostly static across requests, and expensive. PromptIntern applies a template compressor C (e.g. LLMLingua‑2, summarization model, or hand-crafted pruning) parameterized by a compression rate τ_tmp:

\tilde{x_tmp} = C(x_tmp, τ_tmp)

Then you define a schedule S_tmp(t) over training iterations/epochs t = 0..T:

  1. At t = 0: τ_tmp = 1.0 (full template, no compression).
  2. Linearly decrease τ_tmp across stages: e.g. 1.0 → 0.6 → 0.3 → 0.0.
  3. At final stage: τ_tmp = 0 → no template at all in the input.

The EMNLP 2024 paper compared linear, exponential, and inverse-exponential schedules and found linear decay gave the most stable accuracy across benchmarks.

3. Example absorption with retrieval and removal

Few-shot examples are powerful but extremely token-heavy. PromptIntern introduces:

  • Example retrieval: For each training instance, retrieve the top‑k most relevant examples from your dataset using a similarity function (e.g., BLEU on outputs, embedding similarity, or domain-specific metrics).
  • Example removal: Apply a schedule S_egs(t) that decreases k over training: e.g. 10 → 5 → 2 → 0.

Formally, for instance i with label y_i:

x_egs^i(t) = TopK_examples(D_train \ {i}, k(t), s(y_i, y_j))

Early on, the model sees many high-quality demonstrations. Later, as it internalizes those patterns, you remove them. At the end, k = 0, so only the query is left.

4. Progressive fine-tuning loop

Putting it together, PromptIntern’s training loop (simplified) is:

for t in training_stages:
    τ_tmp = S_tmp(t)
    k     = S_egs(t)

    for (x_tmp, x_egs, x_que, y) in D_train:
        # Preprocess prompt for this stage
        x_tmp_t = C(x_tmp, τ_tmp)
        x_egs_t = retrieve_examples(x_egs, y, k)

        prompt_t = concat(x_tmp_t, x_egs_t, x_que)
        loss = L(f_θ(prompt_t), y)
        θ = θ - η * ∇_θ loss

Key implementation details from the paper that you should mirror:

  • Use PEFT, typically LoRA, so you only train low-rank adapters instead of all weights (LoRA remains well-supported in 2025 across Llama 3.x, Mistral, etc.).
  • Keep the number of epochs similar to your baseline direct fine-tuning. You are changing inputs, not overtraining.
  • Always keep the query format stable (same markers, slots, etc.), because that’s what you’ll use in production.

5. Query-only inference

After the final stage, the fine-tuned model f_{θ_T} has internalized both the template and the examples. At inference time you call:

ŷ = f_{θ_T}(x_que)

No template. No docs. No examples. Just the current user query, ideally plus minimal routing metadata. In the paper, this compressed prompt regime achieved:

  • 5.3×–9.3× average compression across NL2Code tasks.
  • Strong accuracy versus “template with 5–10 shots” baselines while cutting tokens by 9.8×–27.4× in some settings.

Implementing PromptIntern for your stack (step-by-step)

This section gives a concrete implementation path you can adapt whether you are using OpenAI, Anthropic via Bedrock, or open-source models on your own GPUs.

Step 1: Choose your base model and deployment mode

  • Cloud APIs (minimal infra)
    • OpenAI: gpt-4.1-mini-2025-04-14 or gpt-4o-mini for supervised fine-tuning via the OpenAI fine-tuning API (see OpenAI’s 2025 fine-tuning docs).
    • Anthropic: Claude Sonnet 4.5 fine-tuning is supported via Amazon Bedrock and partner platforms.
  • Self-hosted open-source
    • Meta Llama 3.3 70B Instruct or Llama 3.1 8B/70B for high quality at lower cost (as of late 2025, 3.3‑70B matches or beats many proprietary mid‑tier models in benchmarks while being far cheaper per token).
    • Other PEFT-friendly models like Mistral, Qwen, or Mixtral.

For large-scale cost savings, a common pattern is:

  • Use Llama 3.3 70B (or similar) with PromptIntern as your task-specialist agent for high-volume workloads.
  • Reserve frontier proprietary models (Claude Sonnet 4.5, GPT‑4.1) for low-volume, highest-criticality flows.

Step 2: Build a prompt schema and dataset

Formalize your production prompts into a schema that explicitly tags template, examples, and query. For each training sample:

  1. Extract x_tmp from your current system prompt / instructions / docs.
  2. Collect a pool of x_egs few-shot examples (you’ll use retrieval to pick subsets per instance).
  3. Use real historical x_que from logs wherever possible.
  4. Store the ground truth output y (human-validated answers, gold labels, etc.).

Format as JSONL suitable for your fine-tuning provider (e.g., OpenAI’s chat fine-tuning format or a simple input → output pair for open-source training).

Step 3: Implement template compression

You have three pragmatic options:

  • Rule-based pruning: Strip redundant explanations, keep only constraints and key fields; effective and easy to control.
  • LLM-based summarization: Run templates through an in-house model (e.g., Llama 3.1 8B) that compresses docs while preserving API semantics.
  • Existing compressors: Integrate LLMLingua‑2 (used in PromptIntern’s baselines) to automatically select important tokens.

For each training stage t, apply the compressor with the appropriate τ_tmp from your linear schedule and cache the results so you don’t re-run compression on every epoch.

Step 4: Implement example retrieval and scheduling

Use an embedding-based index (e.g., FAISS, Elasticsearch with dense vectors, or a vector DB) over your labeled examples. For each training instance, retrieve top‑k(t) examples using cosine similarity on either:

  • The natural language questions.
  • The labeled outputs y (PromptIntern used BLEU on code outputs).

Then define a simple linear schedule for k, such as:

  • Stage 0–N1: k = 10 or 5.
  • Stage N1–N2: k = 5 or 2.
  • Stage N2–final: k = 0.

Experiments in the paper show that using a larger retrieval bank (up to 100% of the training set) improves performance, so avoid over-pruning your candidate pool.

Diagram showing two linear schedules over training stages: template compression ratio decreasing from 1.0 to 0.0 and number of examples decreasing from 10 to 0, with corresponding prompt structures at each stage from full template plus many examples to query-only
Progressive schedules: template gets compressed then removed; examples are reduced to zero.

Step 5: Fine-tune with PEFT / LoRA

For open-source models, use a framework like Hugging Face Transformers with PEFT/LoRA:

# Pseudocode: Llama 3.3 70B + LoRA + PromptIntern

for stage in stages:
    τ_tmp = template_schedule[stage]
    k     = example_schedule[stage]

    for batch in dataloader:
        x_tmp, x_egs, x_que, y = batch

        x_tmp_t = compress_template(x_tmp, τ_tmp)
        x_egs_t = retrieve_examples(x_egs, y, k)

        prompts = build_prompts(x_tmp_t, x_egs_t, x_que)
        loss = model(prompts, labels=y).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

For OpenAI’s SFT on gpt-4o-mini or gpt-4.1-mini, preprocess each stage into a separate training file (JSONL) that already contains the compressed template and scheduled examples, and start a fine-tuning job via the fine-tuning API shown in OpenAI’s 2025 docs.

Step 6: Switch your agents to query-only prompts

Once validation shows the final stage (no template, no examples) matches your accuracy targets:

  1. Deploy the fine-tuned adapter/model into your serving stack (e.g., v2 of your “code-agent”).
  2. Update your routing/orchestration so that for that agent you now send only x_que (the user query plus minimal task metadata).
  3. Keep your old full-prompt baseline live behind a feature flag for A/B safety fallbacks.

Measure API token usage before/after; if your agent previously sent 3–5k tokens per call, expect input token counts to fall into the low hundreds, with corresponding cost and latency drops.

How PromptIntern compares to other cost-reduction methods

PromptIntern should sit alongside, not replace, other LLM cost strategies. Here’s how it compares.

TechniqueWhat it optimizesTypical savingsProsCons
PromptIntern (Progressive Fine-tuning)Input tokens via internalizing prompts>90% input reduction, ~4.2× speedup, ~88% cost (Zou et al., 2024)Stable accuracy, no prompts at inference, works with PEFTRequires fine-tuning pipeline, task-specific
Prompt compression (LLMLingua‑2, etc.)Prompt length2–4× token cutEasy to drop-in, no fine-tuning neededStill pay per request; can hurt accuracy
Prompt caching (OpenAI, Claude, Gemini)Repeated prefixesUp to 50–90% on cached segmentsGreat for static system prompts and RAG contextsHelps only when reuse is high; doesn’t solve long queries
Model routing / smaller modelsPer-token price30–70% vs frontier modelsSimple to implement, big wins for “easy” tasksDoesn’t solve long prompts; quality may drop
RAG instead of fine-tuningKnowledge freshness / storageVaries; can be cheaper than full FTNo retrain for knowledge updatesStill sends long retrieved docs; latency and tokens grow

The unique advantage of PromptIntern is that it targets the most stubborn cost component: long, repetitive templates and examples that appear on every call. Once internalized, they cost you nothing per request.

Best practices, pitfalls, and when to use PromptIntern

When PromptIntern is a good fit

  • You have stable tasks and prompts: e.g., NL→Code, NL→SQL, spreadsheet/BI agents, support macros, report generation.
  • Your prompts exceed ~1–2k tokens due to instructions, docs, or many examples.
  • You’re already paying for fine-tuning or have GPU capacity for PEFT.
  • Your workload is high-volume enough that per-request token savings dwarf one-time fine-tuning costs.

Common pitfalls

  • Skipping the progressive schedule: Training with full prompts and then suddenly switching to query-only inference performs poorly; the paper’s ablations show large drops vs PromptIntern’s staged approach.
  • Over-aggressive compression: Setting τ_tmp too low too early or too small a retrieval bank can hurt performance. Start with moderate compression and test.
  • Forgetting evaluation: Always keep a held-out test set and compare against your original prompt baseline so you don’t trade away accuracy for savings unknowingly.

Combining PromptIntern with other optimizations

  • Use PromptIntern to strip templates/examples, then still apply prompt caching for user-level or RAG-level context that remains.
  • Route simple queries to small, cheap models (Llama 3.2 3B, GPT‑4.1‑nano) and reserve the PromptIntern‑tuned specialist for complex tasks.
  • Pair with RAG for knowledge that truly changes often, while internalizing stable decision rules and formats.

Conclusion: Turning your prompts into weights for 90%+ savings

LLM cost reduction in 2025 is no longer just about cheaper models and caching. If your agents still drag around kilobytes of static instructions and examples on every call, you’re paying the “prompt tax.” PromptIntern’s Progressive Fine-tuning offers a practical, research-backed way to eliminate that tax by internalizing your prompts into the model’s weights. Experiments show you can cut input tokens by more than 90%, reduce latency by around 4×, and trim billable inference costs by nearly 88%, while matching or beating baseline accuracy.

To apply PromptIntern in your own stack, start by decomposing prompts into template, examples, and query; design linear schedules for template compression and example removal; and fine-tune with PEFT on a modern model like Llama 3.3 or GPT‑4o‑mini. Once validated, flip your production agents to query-only prompts and measure the impact. Combined with existing strategies like prompt caching, model routing, and RAG, PromptIntern can be the lever that finally brings your LLM costs and latency under control without sacrificing capability.

Written by promasoud