Cloudflare’s Outage: How a Config File Caused a Cascade Failure


On November 18, 2025, Cloudflare suffered a major outage: a bot-management feature configuration file, generated from a ClickHouse query, roughly doubled in size and tripped a hard limit on pre-allocated feature slots in Cloudflare’s Rust-based core proxy (FL2), causing cascading 5xx errors across core CDN and security services. What follows is a post-mortem-style account of the chain of events and why it matters for system design and resilience.

Cloudflare outage November 2025: how a config file crashed the core proxy

On November 18, 2025 at 11:20 UTC, Cloudflare began serving elevated HTTP 5xx errors across its global network, disrupting access for several hours to services such as X, Spotify, OpenAI, and many others. Cloudflare’s official post-mortem, published the same day, confirms the outage was not caused by an attack but by a subtle database-driven configuration change in its Bot Management pipeline that cascaded into a failure of its new Rust-based proxy engine, FL2.
Source: Cloudflare outage on November 18, 2025 (Cloudflare Blog, 2025‑11‑18).

What happened: from ClickHouse query to global 5xx

Cloudflare’s Enterprise Bot Management relies on a “feature” configuration file that feeds its ML-based bot scoring engine. This file is regenerated every few minutes from a ClickHouse cluster and rapidly propagated to all edge proxies so the bot model can adapt to new traffic patterns.

At 11:05 UTC, Cloudflare rolled out a ClickHouse permissions change so that users could see metadata not only for the default database but also for the underlying r0 tables. The query against system.columns that built the bot feature file assumed only the default database would be returned and did not filter on the database name. After the change, the query started returning duplicate column rows from both default and r0, more than doubling the number of “features” emitted into the JSON configuration file.
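As an illustration only (the struct, table, and column names below are assumptions, not Cloudflare’s actual pipeline code), a defensive version of the feature-file generator could filter metadata rows to the intended database and deduplicate column names before emitting features, so that a permissions change exposing extra databases cannot silently double the output:

```rust
// Illustrative sketch only: types and names are hypothetical, not Cloudflare's code.
// Each row mimics a ClickHouse system.columns entry: (database, table, column name).
struct ColumnRow {
    database: String,
    table: String,
    name: String,
}

/// Build the list of bot-feature names from column metadata rows,
/// keeping only the intended database so that newly visible databases
/// (e.g. underlying r0 tables) cannot inflate the feature count.
fn feature_names(rows: &[ColumnRow], wanted_db: &str, wanted_table: &str) -> Vec<String> {
    let mut names: Vec<String> = rows
        .iter()
        .filter(|r| r.database == wanted_db && r.table == wanted_table)
        .map(|r| r.name.clone())
        .collect();
    names.sort();
    names.dedup(); // extra guard against duplicate rows even if filtering regresses
    names
}

fn main() {
    // Two copies of the same column, as seen after the permissions change.
    let rows = vec![
        ColumnRow { database: "default".into(), table: "bot_features".into(), name: "f1".into() },
        ColumnRow { database: "r0".into(), table: "bot_features".into(), name: "f1".into() },
    ];
    // With the filter, only the `default` copy survives; without it, the list doubles.
    assert_eq!(feature_names(&rows, "default", "bot_features"), vec!["f1".to_string()]);
    println!("kept {} feature(s)", feature_names(&rows, "default", "bot_features").len());
}
```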

That oversized feature file was then pushed out via Cloudflare’s internal config distribution to every machine running the core proxy. Inside FL2’s Bot Management module, Rust code pre-allocates space for features with a hard limit of 200. The bloated file exceeded that limit: the limit check returned an error, and a subsequent unwrap() on that Result turned it into a panic:

thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

Any request path that needed the bots module either failed with HTTP 5xx (on FL2) or received a bot score of zero (on the legacy FL proxy), breaking bot-score-based firewall rules and downstream services such as Workers KV and Access.
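A minimal sketch of the failure pattern follows; the limit constant, types, and function names are assumptions rather than FL2’s real code. It shows how an unwrap() on a limit-check Result converts an oversized config into a worker panic, while matching on the error would let the proxy degrade gracefully:

```rust
// Illustrative sketch only: names and the limit are assumptions, not FL2's real code.
const MAX_FEATURES: usize = 200; // hard cap on pre-allocated feature slots

#[derive(Debug)]
struct TooManyFeatures {
    got: usize,
    max: usize,
}

/// Reject feature files that exceed the pre-allocated capacity.
fn load_features(names: Vec<String>) -> Result<Vec<String>, TooManyFeatures> {
    if names.len() > MAX_FEATURES {
        return Err(TooManyFeatures { got: names.len(), max: MAX_FEATURES });
    }
    Ok(names)
}

fn main() {
    let oversized: Vec<String> = (0..400).map(|i| format!("feature_{i}")).collect();

    // Brittle pattern (commented out because it would abort this example):
    // let features = load_features(oversized.clone()).unwrap();
    // This panics with "called `Result::unwrap()` on an `Err` value",
    // roughly the shape of the message seen during the outage.

    // Safer pattern: handle the error and degrade, e.g. keep serving requests
    // without bot scores instead of taking down the worker thread.
    match load_features(oversized) {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(e) => eprintln!(
            "feature file rejected ({} > {}); keeping last-known-good config",
            e.got, e.max
        ),
    }
}
```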

Why it was a cascading failure

The failure pattern was particularly confusing because the ClickHouse cluster was partially migrated: every five minutes, depending on which shard executed the query, either a “good” or “bad” feature file was generated and pushed. The network oscillated between healthy and failing states until all shards were migrated and the system settled into a persistent failure.

Core 5xx rates remained elevated until 14:30 UTC, when engineers halted feature-file generation, injected a last-known-good file into the distribution queue, and forced FL/FL2 restarts. All downstream impact was cleared by 17:06 UTC.

Why it matters: configuration pipelines as critical code

This incident highlights a modern resilience risk: database-driven configuration treated as “data” but behaving like executable code.

  • Weak schema assumptions: a system.columns query without an explicit database filter silently changed cardinality when the cluster’s permissions behavior was updated.
  • Brittle limits: a hard-coded feature cap and unwrap() in Rust turned a validation error into a full process panic.
  • Fast config fan-out: Cloudflare’s strength—global, rapid propagation of bot feature updates—also meant a bad file hit nearly every proxy before being stopped.

Cloudflare’s remediation and next steps

Cloudflare lists several corrective actions in its November 18 post-mortem:

  • Harden config ingestion for internally generated files to the same standard as user input: strict schema validation, size and cardinality checks, and safe failure modes (see the sketch after this list).
  • Add global kill switches for features like Bot Management so modules can be disabled network-wide without restarting the proxy.
  • Review failure modes across all FL2 modules to eliminate panics on resource-limit violations, replacing unwrap() patterns with explicit error handling.
  • Limit observability overhead where crash reporting and debug hooks were consuming high CPU during the outage.
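A hedged sketch of what the first two items could look like in practice is below; the constants, names, and kill-switch mechanism are illustrative assumptions, not Cloudflare’s implementation:

```rust
// Illustrative sketch only: all names and limits are assumptions.
use std::sync::atomic::{AtomicBool, Ordering};

/// Global kill switch: when cleared, the bot-management module is bypassed
/// instead of evaluated, without restarting the proxy process.
static BOT_MODULE_ENABLED: AtomicBool = AtomicBool::new(true);

const MAX_FEATURES: usize = 200;
const MAX_FILE_BYTES: usize = 1 << 20; // cap on the raw config file size

#[derive(Debug)]
enum ConfigError {
    TooLarge(usize),
    TooManyFeatures(usize),
}

/// Validate an internally generated feature file to the same standard as
/// untrusted input: size cap and cardinality cap, returning a typed error
/// instead of panicking so the caller can keep the last-known-good config.
fn validate_feature_file(raw: &str) -> Result<Vec<String>, ConfigError> {
    if raw.len() > MAX_FILE_BYTES {
        return Err(ConfigError::TooLarge(raw.len()));
    }
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    if features.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures(features.len()));
    }
    Ok(features)
}

fn main() {
    let oversized = (0..400)
        .map(|i| format!("feature_{i}"))
        .collect::<Vec<_>>()
        .join("\n");

    match validate_feature_file(&oversized) {
        Ok(features) => println!("accepted {} features", features.len()),
        Err(e) => {
            eprintln!("rejected new feature file ({e:?}); keeping last-known-good");
            // If the module itself misbehaves, operators can flip the kill
            // switch network-wide rather than restarting every proxy.
            BOT_MODULE_ENABLED.store(false, Ordering::Relaxed);
        }
    }

    if !BOT_MODULE_ENABLED.load(Ordering::Relaxed) {
        println!("bot module bypassed; requests proceed without bot scores");
    }
}
```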

For engineers designing high-scale systems, the outage is a textbook example of how “minor” metadata or permissions changes in analytics databases like ClickHouse can leak into configuration generation, bypass validation, and trigger latent bugs in critical request paths.

Implications for system designers

As of November 2025, Cloudflare has largely migrated to its Rust-based FL2 proxy (announced September 26, 2025), gaining performance and safety—but this incident shows that proxy robustness is only as strong as the validation and fail-safes around its configuration pipelines. For similar environments, key practices include:

  • Pinning schema queries (e.g., ClickHouse system.columns) to explicit databases and versioned contracts.
  • Introducing “canary” config rollouts, with automatic rollback on elevated error rates or proxy rejections (see the sketch below).
  • Treating config as code: type-check, fuzz, and load-test configuration in staging proxies before global distribution.
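To make the canary idea concrete, here is a rough sketch under stated assumptions: the Fleet trait, mock telemetry, stages, and thresholds are invented for illustration and do not correspond to any real Cloudflare or ClickHouse API:

```rust
// Illustrative sketch only: the trait, stages, and thresholds are assumptions.

/// Abstraction over "push this config to N% of the fleet and report the
/// resulting 5xx rate"; a real system would back this with deployment
/// tooling and telemetry.
trait Fleet {
    fn deploy(&mut self, config: &str, percent: u8);
    fn error_rate(&self) -> f64; // fraction of requests returning 5xx
}

/// Roll a config out in stages, rolling back to the last-known-good config
/// everywhere if the observed error rate exceeds the threshold at any stage.
fn canary_rollout<F: Fleet>(
    fleet: &mut F,
    new_config: &str,
    last_good: &str,
    stages: &[u8],
    max_error_rate: f64,
) -> bool {
    for &percent in stages {
        fleet.deploy(new_config, percent);
        if fleet.error_rate() > max_error_rate {
            fleet.deploy(last_good, 100); // automatic rollback
            return false;
        }
    }
    true
}

/// Mock fleet whose error rate spikes when a known-bad config is live.
struct MockFleet {
    bad_config_live: bool,
}

impl Fleet for MockFleet {
    fn deploy(&mut self, config: &str, _percent: u8) {
        self.bad_config_live = config == "oversized-feature-file";
    }
    fn error_rate(&self) -> f64 {
        if self.bad_config_live { 0.5 } else { 0.001 }
    }
}

fn main() {
    let mut fleet = MockFleet { bad_config_live: false };
    let accepted = canary_rollout(
        &mut fleet,
        "oversized-feature-file",
        "last-known-good",
        &[1, 10, 50, 100],
        0.01,
    );
    // The bad config is caught at the 1% stage and rolled back fleet-wide.
    println!("rollout accepted: {accepted}");
}
```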

Cloudflare’s CEO and engineering leadership have publicly committed to preventing a repeat, but for the wider industry, the November 2025 outage will be studied as a cautionary tale in the dangers of loosely validated, database-driven configuration in critical paths.

Written by promasoud