As of November 2025, GPT-5.1-Codex-Max is OpenAI’s new frontier coding model, built specifically for long-running, project-scale work in Codex. Its standout capability is a new compaction feature that lets the model operate coherently over millions of tokens in a single task, unlocking multi-hour (even 24+ hour) refactors and deep debugging sessions without “context amnesia.” This guide is an evergreen how-to for developers who want to use GPT-5.1-Codex-Max’s compaction effectively for large-scale refactoring inside real codebases.
What GPT-5.1-Codex-Max and compaction actually do
GPT-5.1-Codex-Max (released November 19, 2025) is a variant of GPT-5.1 optimized for “agentic” coding tasks inside OpenAI’s Codex environment (CLI, IDE extensions, cloud workspaces, code review). Compared to GPT-5.1-Codex, it:
- Handles long-horizon coding with compaction so it can “coherently work over millions of tokens in a single task.”
- Uses ~30% fewer thinking tokens for similar or better performance on coding tasks (per OpenAI’s SWE-bench Verified numbers).
- Supports multi-hour, even 24+ hour continuous loops that persistently iterate, run tests, and refine implementations.
Compaction is the core innovation for large refactors. When a Codex session approaches its context window limit, GPT-5.1-Codex-Max automatically:
- Summarizes and prunes older parts of the conversation and tool outputs.
- Retains key design decisions, goals, and unresolved issues.
- Re-initializes a fresh context window seeded with that compacted summary plus recent steps.
This loop repeats until the task completes, giving you a single continuous agent run that spans many internal context windows without the usual “start a new chat, lose half the story” problem.
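To make the mechanics concrete, here is a toy sketch of such a summarize-and-reseed loop. This is an illustration, not OpenAI’s implementation; the rough 4-characters-per-token estimate and the `naiveSummarize` helper are assumptions for demonstration only.

```javascript
// Toy illustration of a compaction loop: when the transcript nears a token
// budget, older messages are folded into a single summary entry and the next
// window is seeded with that summary plus the most recent steps.

// Crude token estimate (~4 characters per token) for demonstration purposes.
const estimateTokens = (messages) =>
  messages.reduce((n, m) => n + Math.ceil(m.text.length / 4), 0);

function compact(messages, budget, keepRecent, summarize) {
  if (estimateTokens(messages) < budget) return messages; // still fits
  const old = messages.slice(0, -keepRecent);
  const recent = messages.slice(-keepRecent);
  // Seed the fresh window with the compacted summary plus recent steps.
  return [{ role: "system", text: summarize(old) }, ...recent];
}

// Hypothetical summarizer that preserves goal/invariant lines verbatim;
// in Codex, a model produces this summary instead.
const naiveSummarize = (msgs) =>
  "SUMMARY: " +
  msgs
    .filter((m) => /GOAL|INVARIANT/i.test(m.text))
    .map((m) => m.text)
    .join("; ");
```

The key property to notice: anything the summarizer fails to preserve is gone from the next window, which is why the rest of this guide stresses making goals and invariants easy to spot.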

When to use GPT-5.1-Codex-Max for large-scale refactors
OpenAI recommends GPT-5.1-Codex-Max specifically for agentic coding in Codex or Codex-like harnesses, not as a general-purpose chat model. It’s the right choice when:
- You’re doing a project-scale refactor: e.g., migrating a framework, changing a core abstraction, or applying a consistent pattern across hundreds of files.
- You need multi-hour continuity: long debugging sessions, deep test-fix cycles, or large feature implementations.
- You want the agent to work semi-autonomously: running shell commands, editing files, running tests, and iterating in a loop.
For quick edits or one-off functions, GPT-5.1 or GPT-5.1-Codex (non-Max) with lower reasoning effort may be cheaper and more responsive. Max shines when context management is the bottleneck rather than raw latency.
| Model | Best for | Key strengths |
|---|---|---|
| GPT-5.1 | General coding and agentic tasks | Fast, adaptive reasoning, wide tooling support |
| GPT-5.1-Codex | Standard agentic coding in Codex | Great coding benchmarks, good for medium tasks |
| GPT-5.1-Codex-Max | Large-scale refactors, multi-hour sessions | Compaction, multi-window context, 24+ hour loops |
How compaction works in practice inside Codex
You don’t call “compaction” directly in most workflows; Codex orchestrates it for you. The key is to work in a way that compaction can summarize accurately.
1. Use Codex surfaces that support long-running agents
- Codex CLI: Run tasks against a local or remote repo; GPT-5.1-Codex-Max can edit files, run tests, and iterate.
- IDE extension (VS Code, Cursor, Windsurf, etc.): Invoke larger refactors, but be aware that some IDEs still favor shorter loops; for truly massive work, the CLI or Codex Cloud tends to behave more predictably.
- Codex Cloud workspaces: For repository-scale automation where the agent can run many tool calls in sequence.
In all these, the Codex runtime tracks session length and triggers compaction when the token budget nears its limit.
2. Structure your instructions for compaction-aware workflows
Compaction works by distilling what matters. Help it by making your intent and constraints easy to compress:
- Start sessions with a clear mission brief: goals, non-goals, constraints (e.g., “no public API breakage”), and acceptance criteria.
- Use stable, reusable markers in your messages: sections like `PROJECT GOALS`, `INVARIANTS`, `OPEN ISSUES`, and `DONE` make it easier for the model to preserve them in summaries.
- Periodically ask the agent to update a top-level TODO / plan so compaction has an explicit task spine to carry forward.
```text
// Example: initial Codex CLI instruction for a large refactor
Refactor this repository to migrate from Redux Toolkit to Zustand.

PROJECT GOALS:
- Preserve all behavior and public APIs.
- Reduce boilerplate by leveraging Zustand's minimal store patterns.
- Keep test suite green.

INVARIANTS:
- Do not change any exported function signatures from /src/public-api.
- Maintain TypeScript strict mode with no new tsconfig relaxations.

CONSTRAINTS:
- Prefer incremental, feature-by-feature refactors.
- After each batch of changes, run the Jest test suite and fix regressions.

DELIVERABLES:
- Completed migration.
- A MIGRATION_NOTES.md summarizing key decisions and follow-up tasks.
```

These anchors give compaction something to intentionally preserve as history is compressed.
3. Let Codex compact automatically, but watch for drift
According to OpenAI’s product notes, GPT-5.1-Codex-Max will:
- Detect when the session approaches the context limit.
- Summarize conversation + tool calls, pruning low-signal content.
- Continue in a fresh window with that compacted “memory” plus the most recent operations.
As the human in the loop, your job is to periodically re-ground the agent:
- Every 30–60 minutes, ask: “Summarize what we’ve done, remaining risks, and next 5 concrete steps”.
- Ensure critical constraints and goals still appear in its own summaries. If they don’t, restate them.
- If you notice drift, explicitly correct: “We must not change the API of X. Confirm this is still enforced and adjust your plan.”
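That re-grounding check can be partially automated. A small helper (hypothetical; written in JavaScript to match the later pseudo-code) scans the agent’s latest self-summary for each critical invariant and builds a corrective prompt for anything that has drifted out:

```javascript
// Check that every critical invariant still appears in the agent's own
// summary; return a corrective prompt for any that are missing, or null
// when no drift is detected. A naive substring match, for illustration.

function checkDrift(summary, invariants) {
  const missing = invariants.filter(
    (inv) => !summary.toLowerCase().includes(inv.toLowerCase())
  );
  if (missing.length === 0) return null; // no drift detected
  return (
    "We must not lose these constraints. Confirm each is still enforced " +
    "and adjust your plan:\n" +
    missing.map((inv) => `- ${inv}`).join("\n")
  );
}
```

A substring match is obviously crude; the point is that drift detection can be a cheap, mechanical step you run at every summary, rather than something you remember to eyeball.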

Concrete workflow: project-scale refactor with compaction
Let’s walk through a realistic large-scale refactor using GPT-5.1-Codex-Max in Codex CLI.
Step 1: Prepare your repo for an AI-driven refactor
- Clean your main branch: merge or rebase outstanding PRs; avoid large concurrent changes.
- Ensure tests pass: the agent relies heavily on test feedback.
- Create a working branch, e.g. `feature/zustand-migration`.
- Add a PROJECT.md or MIGRATION_NOTES.md with:
- Architecture overview.
- Key invariants and contracts.
- High-level refactor goals.
Step 2: Kick off a focused Codex-Max session
```shell
# Pseudo-command; actual syntax depends on Codex CLI version
codex run \
  --model gpt-5.1-codex-max \
  --repo . \
  --task "Migrate from Redux Toolkit to Zustand as described in PROJECT.md.
Work incrementally, keep tests passing, and document decisions in MIGRATION_NOTES.md."
```

In the Codex interaction, reinforce compaction-friendly structure:
```text
Before you begin, restate:
- Your understanding of the project structure
- The migration plan (phased steps)
- The key invariants you must not break

As you work, maintain:
- A running high-level changelog
- An updated TODO list
- Notes on any risky or partial changes
```

Step 3: Let the agent run, but steer it at compaction boundaries
Over time, you’ll see the agent:
- Inspect files and directories.
- Modify code via patch operations or file rewrites.
- Run tests (e.g. Jest, pytest, Maven) and analyze failures.
- Propose and execute follow-up fixes.
Whenever Codex logs or implies a compaction event (e.g., “Compacting session to free up space”), respond with prompts that strengthen its distilled memory:
```text
We just compacted the session.
1. Summarize the overall migration progress in 10 bullet points.
2. List any areas you consider partially migrated or risky.
3. Restate the key invariants you're enforcing.
4. Propose the next 3 focused batches of work.
```

This ensures the new context window is seeded with a high-quality, human-validated summary, not just the model's automatic compression.
Step 4: Enforce safety and review before merge
OpenAI’s own guidance emphasizes that Codex, even with GPT-5.1-Codex-Max, should be treated as an additional reviewer, not a replacement for code review. For large refactors:
- Run your full test suite and static analysis tools (lint, type-checkers, SAST) yourself.
- Use GPT-5.1-Codex-Max again for code review on the final diff:
- “Review this PR for regressions, API changes, and performance risks.”
- Require human sign-off on high-impact packages or APIs.
Codex runs in a sandboxed environment by default (file writes limited to workspace, network disabled). Unless you have a very controlled need, leave external network access off to reduce prompt-injection risk.
Compaction-aware prompting patterns that work well
Because compaction is essentially high-fidelity summarization of your session, certain prompt patterns make it more reliable during long refactors.
Use stable “memory surfaces” inside the repo
Borrowing from Anthropic’s context engineering best practices, you can give GPT-5.1-Codex-Max explicit places to write memory that survive compaction, such as:
- `MIGRATION_NOTES.md` – high-level decisions, tradeoffs, and open questions.
- `TASKS_TODO.md` – remaining steps, prioritized.
- Module-level `README.md` files – for new architecture or patterns.
```text
As you work, keep MIGRATION_NOTES.md updated with:
- Architectural decisions
- Rationale for non-trivial changes
- Any known follow-up tasks

Treat this file as your long-term memory:
it must remain accurate even after many compaction cycles.
```

Because these files live in the repo, Codex can reload them via tools after compaction, even if earlier chat messages have been compressed away.
Prefer explicit, small batches over global rewrites
Compaction is easier and safer when the narrative is:
- “We completed batch A, then B, then C, with tests between each,” rather than “We changed everything at once.”
So ask GPT-5.1-Codex-Max to:
- Work feature by feature or module by module.
- Describe each batch before it starts:
- “Next, I will migrate the auth module: files X, Y, Z.”
- Run tests after each batch and record results in the notes file.
Regularly request compressed, structured status
Help the model practice good compaction on your behalf throughout the session with prompts like:
```text
Every 30 minutes (or after a major batch of changes), do this:
1. Update MIGRATION_NOTES.md with:
   - Completed steps
   - Key decisions
   - Any known regressions
2. Reply here with:
   - A concise bullet summary of progress
   - Current risks
   - Next 3 steps
```

These micro-summaries become the skeleton that compaction uses to maintain coherence across context windows.
Integrating GPT-5.1-Codex-Max into custom tooling
At launch, OpenAI described API access for GPT-5.1-Codex-Max as “coming soon,” but you can design your orchestration in advance, especially if you already use GPT-5.1 or GPT-5.1-Codex via the Responses API.
Design your own compaction layer (optional)
Even with built-in compaction in Codex, many teams will add a secondary, explicit compaction layer in their orchestration service:
- Track conversation + tool calls in your own store (e.g. database, vector store, or plain logs).
- Before each API call, slice down to:
- Latest N steps.
- One or more explicit summaries of history.
- Key notes loaded from repo files.
- Every M turns, call the model to refresh the summary that you send in future prompts.
```javascript
// Pseudo-code sketch of an external compaction-aware loop
while (!task_done) {
  const summary = getOrRefreshSummary(history); // compacted high-level state
  const recent = getRecentMessages(history, 8); // last few high-signal steps
  const messages = [
    systemPrompt,
    summary,
    ...recent,
    currentUserInstruction
  ];
  const response = await openai.responses.create({
    model: "gpt-5.1-codex-max",
    input: messages,
    tools: [apply_patch, shell /* , ... */],
    reasoning: { effort: "medium" } // or "xhigh" for critical steps
  });
  // Apply tool calls, run tests, etc.
  history.push(response);
}
```

This pattern combines Codex’s internal compaction with your own guardrails, giving you more predictable behavior across very long tasks.
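One way to fill in the `getOrRefreshSummary` step from the sketch above (with a slightly extended signature; `summarizeFn` is a hypothetical wrapper around a model call):

```javascript
// Cache the compacted summary and only re-summarize every M turns, so the
// expensive summarization call does not run on every loop iteration.
let cachedSummary = null;
let turnsSinceRefresh = 0;

async function getOrRefreshSummary(history, M, summarizeFn) {
  turnsSinceRefresh++;
  if (cachedSummary === null || turnsSinceRefresh >= M) {
    // summarizeFn would call the model with an instruction like:
    // "Compress this history, preserving goals, invariants, and open issues."
    cachedSummary = await summarizeFn(history);
    turnsSinceRefresh = 0;
  }
  return cachedSummary;
}
```

The refresh interval M trades freshness against cost: small M keeps the summary current but pays for frequent summarization calls; large M is cheaper but risks staler compacted state.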
Combine with GPT-5.1 tools for precise diffs
In non-Codex environments, GPT-5.1 offers an apply_patch tool and a shell tool in the API. For large refactors, design your system so that:
- GPT-5.1-Codex-Max plans and reasons about batch refactors and tests.
- The `apply_patch` tool applies small, reviewable diffs.
- The `shell` tool runs tests and diagnostics, feeding logs back to the model.
Your orchestration engine then decides when to checkpoint, summarize, or roll back based on CI results.

Limitations, gotchas, and best practices
Even with compaction, GPT-5.1-Codex-Max isn’t magic. To get reliable large-scale refactors, keep these constraints in mind:
- Context rot still exists: compaction reduces but doesn’t eliminate the risk that subtle early details get lost. Re-ground periodically and keep critical invariants in persistent files.
- Noisy logs waste attention: large, repetitive output (e.g., giant stack traces) consumes context without adding much value. Pre-filter or summarize logs before handing them to the model when possible.
- `reasoning_effort` vs. cost: `xhigh` unlocks the best performance on tough steps but is slower and more expensive. Use it only for critical migrations (e.g., core infrastructure modules) and keep `medium` as your default.
- Security: long-running agents with shell access can be a powerful attack surface. Keep Codex sandboxed, log tool calls, and monitor for suspicious behavior.
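The log pre-filtering point can be as simple as keeping the head and tail of oversized output before it reaches the model (a minimal sketch; the 60-line budget is an arbitrary assumption):

```javascript
// Truncate huge tool output before sending it to the model: keep the first
// and last maxLines/2 lines and mark how many were dropped in between.

function filterLog(log, maxLines = 60) {
  const lines = log.split("\n");
  if (lines.length <= maxLines) return log; // already small enough
  const head = lines.slice(0, maxLines / 2);
  const tail = lines.slice(-maxLines / 2);
  const omitted = lines.length - maxLines;
  return [...head, `... [${omitted} lines omitted] ...`, ...tail].join("\n");
}
```

Head-and-tail truncation works well for stack traces and test runners, where the failure cause tends to sit at one end of the output; for other log shapes, a smarter filter (e.g., grep for error lines) may be worth the effort.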
Most importantly, treat GPT-5.1-Codex-Max as a very capable collaborator, not an infallible automaton. Its compaction feature gives you continuity over huge tasks; your engineering practices give it direction, safety, and quality control.
Conclusion: making compaction your ally in multi-hour refactors
GPT-5.1-Codex-Max’s compaction feature solves one of the most painful limits of earlier AI coding models: losing context mid-project. By automatically summarizing and pruning history, it can operate coherently over millions of tokens and 24+ hour runs, making project-scale refactors, deep debugging, and complex feature work realistic for a single long-lived agent.
To get real value out of it, you should:
- Choose GPT-5.1-Codex-Max for large, long-running coding tasks where context continuity matters.
- Structure your instructions, notes, and repo files so they survive and guide compaction.
- Run multi-hour sessions in Codex with periodic summaries and explicit plans that the model can carry forward.
- Layer Codex’s built-in compaction with your own orchestration (summaries, checkpoints, CI) for maximum reliability.
If you adopt these practices, GPT-5.1-Codex-Max can evolve from “smart autocomplete” into a persistent engineering partner that stays on track from the first commit in your refactor branch to the final green build.