Anatomy of a Failure: Cloudflare's November 18, 2025 Outage

Nov 20, 2025
6 min read (27 min read total)
3 subposts

The Day the Internet Stumbled

On November 18, 2025, at 11:28 UTC, Cloudflare - the company that powers roughly 20% of the Internet - began serving HTTP 5xx errors instead of websites. For nearly six hours, millions of websites, APIs, and applications became inaccessible to users worldwide.

This wasn’t a DDoS attack. It wasn’t a router failure. It wasn’t even a human clicking the wrong button.

It was a missing database filter in a SQL query combined with a hard-coded memory limit and an unhandled Rust panic. Three small mistakes that cascaded into Cloudflare’s worst outage since 2019.

The Incident at a Glance

Duration: 11:28 UTC - 17:06 UTC (5 hours 38 minutes)
Main Impact Resolved: 14:30 UTC (3 hours 2 minutes)
Scope: Global CDN failure affecting core HTTP traffic

Affected Services:

  • Core CDN: HTTP 5xx errors for most traffic
  • Workers KV: Elevated 5xx error rate
  • Cloudflare Access: Widespread authentication failures
  • Dashboard: Login unavailable
  • Turnstile: Failed to load globally
  • Email Security: Temporary loss of IP reputation source
Warning

Impact Scale: This was Cloudflare’s worst outage in over 6 years, affecting the majority of core traffic flowing through their network - a network that handles trillions of requests per day.

The Root Cause: A Perfect Storm

Four independent factors combined to create the outage:

1. The ClickHouse Permission Change (11:05 UTC)

Cloudflare deployed a security improvement to their ClickHouse database cluster. The change made underlying database tables (r0 schema) visible to users, who previously could only see distributed tables (default schema).

Intention: Improve fine-grained access control and prevent one bad query from affecting others.

2. The Missing SQL Filter

Bot Management’s feature file generation used this query:

-- Problematic query (missing database filter)
SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;

The bug: No WHERE database = 'default' clause.

Result: After the permission change, the query returned duplicate rows - one set of column metadata for the default database and another for r0. The number of features in the generated file ballooned from ~60 to more than 200.

3. The Hard-Coded 200-Feature Limit

The Bot Management module had a hard limit of 200 features for performance (memory preallocation). With ~60 features normally, there was plenty of headroom.

When the bad query pushed the feature count past 200, this limit was breached.
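
What such a preallocation limit can look like in practice, as a minimal sketch - the constant, error type, and function below are illustrative stand-ins, not Cloudflare's actual FL2 code:

const MAX_FEATURES: usize = 200; // hard limit, sized so memory can be preallocated

#[derive(Debug)]
enum ConfigError {
    TooManyFeatures { got: usize, max: usize },
}

// Parse feature names into a buffer preallocated for MAX_FEATURES entries.
fn load_features(rows: Vec<String>) -> Result<Vec<String>, ConfigError> {
    if rows.len() > MAX_FEATURES {
        // Refuse oversized input instead of writing past the preallocated capacity.
        return Err(ConfigError::TooManyFeatures { got: rows.len(), max: MAX_FEATURES });
    }
    let mut features = Vec::with_capacity(MAX_FEATURES);
    features.extend(rows);
    Ok(features)
}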

4. The Unhandled Rust Panic

The FL2 proxy code called .unwrap() on a Result that now carried an error:

let features = load_bot_features().unwrap(); // Err value => thread panic

Result: Thread panic → HTTP 5xx errors for all traffic using Bot Management.
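
A hedged sketch of the alternative: treat the oversized file as a handled error and keep serving with the last-known-good configuration. The function and type names are illustrative, not the real proxy code:

struct FeatureConfig; // stand-in for the parsed Bot Management feature set

fn load_bot_features() -> Result<FeatureConfig, String> {
    // In the real system this would parse the propagated feature file;
    // here it always fails, like the oversized file did.
    Err("feature file exceeds the 200-feature limit".to_string())
}

// Refresh on a timer; on failure, keep serving with the last-known-good config.
fn refresh_features(current: &mut FeatureConfig) {
    match load_bot_features() {
        Ok(fresh) => *current = fresh,
        Err(reason) => {
            // Log and carry on: bot scores go stale, but traffic keeps flowing.
            eprintln!("rejected new bot features ({reason}); keeping previous config");
        }
    }
}

Failing open to a stale configuration trades scoring accuracy for availability - for an optional classification module, that is usually the better failure mode than returning 5xx for all traffic.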

Important

One Line of SQL: Adding WHERE database = 'default' would have prevented the entire outage.

The Timeline

Time (UTC) | Event | Technical Details
11:05 | Permission change deployed | ClickHouse r0 tables made visible to users
11:28 | First errors observed | Bad feature file reaches production
11:35 | Incident call created | Team mobilizes for major incident
11:32-13:05 | Initial investigation | Focused on Workers KV symptoms (red herring)
13:05 | Workers KV bypass | Impact reduced significantly
13:37 | Focus shift | Bot Management config file identified
14:24 | Bad file generation stopped | Root cause partially mitigated
14:30 | Main impact resolved | Correct configuration deployed globally
14:40-15:30 | Secondary dashboard impact | Login backlog overwhelmed control plane
17:06 | All services recovered | Full operations restored

Time to fix: 3 hours 2 minutes
Total duration: 5 hours 38 minutes

Why Diagnosis Was So Hard

The outage exhibited strange, intermittent behavior that confused the response team:

The 5-Minute Fluctuation: The feature file was regenerated every 5 minutes. Because the ClickHouse cluster was being gradually updated, sometimes the query ran on an updated node (bad file) and sometimes on a non-updated node (good file).

11:20 ─[Good]─ Network operational
11:25 ─[Bad]── HTTP 5xx errors surge
11:30 ─[Good]─ Network recovers
11:35 ─[Bad]── HTTP 5xx errors surge

This made it look like an external attack rather than a configuration issue.

Red Herrings:

  • Cloudflare’s status page went down at the same time (pure coincidence, unrelated issue)
  • Recent history of record-breaking DDoS attacks (including a 7.3 Tbps attack earlier in 2025) made the team suspect another attack
  • Workers KV showed symptoms first, pointing investigation in the wrong direction

The Cascading Effect

The Bot Management failure didn’t stay contained:

  1. Core Proxy → HTTP 5xx errors for all traffic using Bot Management
  2. Workers KV → Depends on core proxy, elevated errors
  3. Access → Authentication failures (uses core proxy)
  4. Dashboard → Login unavailable (uses Turnstile + Workers KV)
  5. Turnstile → Failed globally (uses core proxy)

The Workers KV bypass at 13:05 was critical - even though it didn’t fix the root cause, it reduced impact while diagnosis continued.
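
The postmortem doesn't detail how that bypass worked internally; as a rough sketch of the general pattern, an optional subsystem can sit behind a runtime kill switch so operators can route around it without a redeploy (the flag, names, and neutral default score below are hypothetical):

use std::sync::atomic::{AtomicBool, Ordering};

// Operator-controlled kill switch, e.g. flipped from an admin endpoint or config push.
static BOT_MANAGEMENT_ENABLED: AtomicBool = AtomicBool::new(true);

struct Request;

fn bot_score(_req: &Request) -> Result<u8, &'static str> {
    Err("bot management module unavailable") // stand-in for the failing subsystem
}

fn handle_request(req: &Request) -> u8 {
    if !BOT_MANAGEMENT_ENABLED.load(Ordering::Relaxed) {
        return 50; // module switched off: neutral score, traffic keeps flowing
    }
    // Even with the module on, a failure degrades to a neutral score instead of a 5xx.
    bot_score(req).unwrap_or(50)
}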

What This Teaches Us

This outage is a masterclass in how small changes cascade in complex distributed systems:

Defense Layers That Failed:

  • ❌ No validation on machine-generated configuration files (see the sketch after this list)
  • ❌ No graceful degradation when limits exceeded (panic instead)
  • ❌ No global kill switch for Bot Management module
  • ❌ Observability systems consumed too much CPU during mass errors
  • ❌ Query assumptions not tested against new permission model
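
For the first of those gaps, a minimal sketch of what a pre-flight check on machine-generated configuration might look like before the file is propagated globally - the limits, names, and checks are illustrative:

const MAX_FEATURES: usize = 200;

#[derive(Debug)]
enum ValidationError {
    Empty,
    TooMany(usize),
    Duplicate(String),
}

// Pre-flight check run by the generator; a file that fails never leaves the build host.
fn validate_feature_file(features: &[String]) -> Result<(), ValidationError> {
    if features.is_empty() {
        return Err(ValidationError::Empty);
    }
    if features.len() > MAX_FEATURES {
        return Err(ValidationError::TooMany(features.len()));
    }
    let mut seen = std::collections::HashSet::new();
    for name in features {
        // Duplicate feature names are exactly what the unfiltered query produced.
        if !seen.insert(name.as_str()) {
            return Err(ValidationError::Duplicate(name.clone()));
        }
    }
    Ok(())
}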

One Line Fix, Six Hours of Downtime: The technical fix was trivial - add one SQL filter. But the diagnosis took 3 hours because:

  • Intermittent failures looked like attacks
  • Multiple red herrings
  • Cascading symptoms across services
  • Complex distributed system interactions
Tip

The Real Lesson: In distributed systems, it’s not enough for each component to work correctly in isolation. You must test how they interact, how they fail, and how failures propagate.
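
In that spirit, one cheap place to start is testing the failure path itself: feed an oversized, duplicated feature list into the loader and assert that request handling degrades instead of panicking. A small hypothetical example (names echo the sketches above, not real Cloudflare code):

fn load_features(rows: &[String]) -> Result<Vec<String>, String> {
    if rows.len() > 200 {
        return Err(format!("{} features exceeds the 200-feature limit", rows.len()));
    }
    Ok(rows.to_vec())
}

fn response_status(features: &Result<Vec<String>, String>) -> u16 {
    match features {
        Ok(_) => 200,
        Err(_) => 200, // degraded (no bot scoring) but still serving - not a 5xx
    }
}

#[test]
fn oversized_feature_file_does_not_break_traffic() {
    // Simulate the duplicated metadata: every feature name appears more than once.
    let rows: Vec<String> = (0..300).map(|i| format!("feature_{}", i % 150)).collect();
    let loaded = load_features(&rows);
    assert!(loaded.is_err()); // the bad file is rejected rather than unwrapped
    assert_eq!(response_status(&loaded), 200); // requests still get an answer
}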

Deep Dive Series

Want to understand the technical details, diagnosis challenges, and lessons learned? This analysis continues in three focused posts:

Part 1: The Technical Root Cause →
Deep dive into ClickHouse architecture, the SQL query bug, and why the Rust panic happened.

Part 2: The Cascading Failures →
How diagnosis became a detective story, complete with red herrings and breakthrough moments.

Part 3: Lessons Learned →
Actionable defense-in-depth principles with code examples you can apply to your own systems.

Conclusion

Cloudflare’s 6-hour outage was a painful reminder that in distributed systems, small changes can have catastrophic consequences. A well-intentioned security improvement exposed a hidden assumption in a SQL query, which triggered a chain reaction that brought down one of the Internet’s critical infrastructure providers.

The technical fix was one line of SQL. The real work is in building systems where one missing line doesn’t cascade into a global outage.

Cloudflare deserves credit for their transparent post-mortem and specific commitments to prevent recurrence. These kinds of failures happen to the best engineering teams. What separates great teams from good ones is how they learn from failure.


Source: Cloudflare Official Postmortem