The Day the Internet Stumbled
On November 18, 2025, at 11:28 UTC, Cloudflare - the company that powers roughly 20% of the Internet - began serving HTTP 5xx errors instead of websites. For nearly six hours, millions of websites, APIs, and applications became inaccessible to users worldwide.
This wasn’t a DDoS attack. It wasn’t a router failure. It wasn’t even a human clicking the wrong button.
It was a missing database filter in a SQL query combined with a hard-coded memory limit and an unhandled Rust panic. Three small mistakes that cascaded into Cloudflare’s worst outage since 2019.
The Incident at a Glance
Duration: 11:28 UTC - 17:06 UTC (5 hours 38 minutes)
Main Impact Resolved: 14:30 UTC (3 hours 2 minutes)
Scope: Global CDN failure affecting core HTTP traffic
Affected Services:
- Core CDN: HTTP 5xx errors for most traffic
- Workers KV: Elevated 5xx error rate
- Cloudflare Access: Widespread authentication failures
- Dashboard: Login unavailable
- Turnstile: Failed to load globally
- Email Security: Temporary loss of IP reputation source
Warning
Impact Scale: This was Cloudflare’s worst outage in over 6 years, affecting the majority of core traffic flowing through their network - a network that handles trillions of requests per day.
The Root Cause: A Perfect Storm
Four independent factors combined to create the outage:
1. The ClickHouse Permission Change (11:05 UTC)
Cloudflare deployed a security improvement to their ClickHouse database cluster. The change made underlying database tables (r0 schema) visible to users, who previously could only see distributed tables (default schema).
Intention: Improve fine-grained access control and prevent one bad query from affecting others.
2. The Missing SQL Filter
Bot Management’s feature file generation used this query:
```sql
-- Problematic query (missing database filter)
SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
```

The bug: No `WHERE database = 'default'` clause.
Result: After the permission change, the query returned duplicate rows - one set for the default database and one for r0. The feature count more than doubled from its usual ~60.
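The mechanism can be sketched in a few lines of Rust (the table contents below are illustrative stand-ins; the real query runs against ClickHouse's `system.columns`):

```rust
// Hypothetical rows returned by system.columns after the permission change:
// each feature column now appears once per visible database.
fn visible_columns() -> Vec<(&'static str, &'static str)> {
    vec![
        ("default", "feature_a"),
        ("default", "feature_b"),
        ("r0", "feature_a"), // duplicate: underlying r0 table now visible
        ("r0", "feature_b"),
    ]
}

// Buggy behavior: no database filter, so every feature is counted twice.
fn feature_count_unfiltered() -> usize {
    visible_columns().len()
}

// Fixed behavior: the Rust analog of `WHERE database = 'default'`.
fn feature_count_filtered() -> usize {
    visible_columns()
        .iter()
        .filter(|(db, _)| *db == "default")
        .count()
}

fn main() {
    assert_eq!(feature_count_unfiltered(), 4); // doubled
    assert_eq!(feature_count_filtered(), 2);   // expected
}
```

The same rows, filtered by database, yield exactly the original feature set - which is why the one-clause fix was sufficient.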
3. The Hard-Coded 200-Feature Limit
The Bot Management module had a hard-coded limit of 200 features so memory could be preallocated for performance. With ~60 features in normal operation, there was ample headroom.

When the duplicated rows pushed the feature count past 200, this limit was breached.
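A minimal sketch of such a preallocated table (the names and error type are illustrative; only the 200-feature limit comes from the incident report):

```rust
// Illustrative fixed-capacity feature table with a hard limit.
const MAX_FEATURES: usize = 200;

#[derive(Debug)]
enum ConfigError {
    TooManyFeatures { got: usize, limit: usize },
}

fn load_features(names: Vec<String>) -> Result<Vec<String>, ConfigError> {
    if names.len() > MAX_FEATURES {
        // Exceeding the preallocated capacity is an error, not a resize:
        // the limit exists so memory can be allocated up front.
        return Err(ConfigError::TooManyFeatures {
            got: names.len(),
            limit: MAX_FEATURES,
        });
    }
    let mut table = Vec::with_capacity(MAX_FEATURES); // preallocation
    table.extend(names);
    Ok(table)
}

fn main() {
    let normal: Vec<String> = (0..60).map(|i| format!("f{i}")).collect();
    assert!(load_features(normal).is_ok()); // ~60 features: ample headroom

    let duplicated: Vec<String> = (0..250).map(|i| format!("f{i}")).collect();
    assert!(load_features(duplicated).is_err()); // duplicated rows blow the limit
}
```

Returning an `Err` here is correct; the failure came from how the caller handled it.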
4. The Unhandled Rust Panic
The FL2 proxy code called `.unwrap()` on the result of loading the feature file:

```rust
let features = load_bot_features().unwrap(); // panics if loading failed
```

Result: thread panic → HTTP 5xx errors for all traffic using Bot Management.
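A hedged sketch of the graceful alternative (function names and the fallback strategy are illustrative, not Cloudflare's actual code): match on the error and keep serving with the last known-good configuration instead of panicking.

```rust
#[derive(Debug)]
enum ConfigError {
    TooManyFeatures,
}

// Illustrative stand-in for the feature-file loader.
fn load_bot_features(feature_count: usize) -> Result<Vec<String>, ConfigError> {
    if feature_count > 200 {
        return Err(ConfigError::TooManyFeatures);
    }
    Ok((0..feature_count).map(|i| format!("f{i}")).collect())
}

// Graceful degradation: on a bad config, keep serving traffic with the
// last known-good feature set instead of crashing the thread.
fn load_with_fallback(feature_count: usize, last_good: Vec<String>) -> Vec<String> {
    match load_bot_features(feature_count) {
        Ok(features) => features,
        Err(e) => {
            eprintln!("bad feature file ({e:?}); keeping last good config");
            last_good
        }
    }
}

fn main() {
    let last_good: Vec<String> = (0..60).map(|i| format!("f{i}")).collect();
    // Bad file: serve with the previous config instead of crashing.
    assert_eq!(load_with_fallback(250, last_good.clone()).len(), 60);
    // Good file: use it.
    assert_eq!(load_with_fallback(60, last_good).len(), 60);
}
```

The design choice here is "fail open on configuration, fail loudly in logs": a degraded bot score is far cheaper than a fleet-wide 5xx.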
Important
One Line of SQL: Adding WHERE database = 'default' would have prevented the entire outage.
The Timeline
| Time (UTC) | Event | Technical Details |
|---|---|---|
| 11:05 | Permission change deployed | ClickHouse r0 tables made visible to users |
| 11:28 | First errors observed | Bad feature file reaches production |
| 11:32-13:05 | Initial investigation | Focused on Workers KV symptoms (red herring) |
| 11:35 | Incident call created | Team mobilizes for major incident |
| 13:05 | Workers KV bypass | Impact reduced significantly |
| 13:37 | Focus shift | Bot Management config file identified |
| 14:24 | Bad file generation stopped | Root cause partially mitigated |
| 14:30 | Main impact resolved | Correct configuration deployed globally |
| 14:40-15:30 | Secondary dashboard impact | Login backlog overwhelmed control plane |
| 17:06 | All services recovered | Full operations restored |
Time to fix: 3 hours 2 minutes
Total duration: 5 hours 38 minutes
Why Diagnosis Was So Hard
The outage exhibited strange, intermittent behavior that confused the response team:
The 5-Minute Fluctuation: The feature file was regenerated every 5 minutes. Because the ClickHouse cluster was being gradually updated, sometimes the query ran on an updated node (bad file) and sometimes on a non-updated node (good file).
```
11:20 ─[Good]─ Network operational
11:25 ─[Bad]── HTTP 5xx errors surge
11:30 ─[Good]─ Network recovers
11:35 ─[Bad]── HTTP 5xx errors surge
```

This made it look like an external attack rather than a configuration issue.
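The flapping falls out of the rollout mechanics, sketched below (the node list is illustrative): each 5-minute regeneration cycle queries one cluster node, and only already-updated nodes produce the oversized file.

```rust
// Whether a regeneration cycle produces a bad file depends solely on
// which node serves the query during the gradual ClickHouse rollout.
fn regenerate(node_is_updated: bool) -> &'static str {
    if node_is_updated { "bad" } else { "good" }
}

fn main() {
    // Mixed cluster mid-rollout: some nodes updated, some not yet.
    let nodes_hit = [false, true, false, true];
    let outcomes: Vec<&str> = nodes_hit.iter().map(|&u| regenerate(u)).collect();
    // Good and bad files alternate, so errors surge and recede every cycle.
    assert_eq!(outcomes, vec!["good", "bad", "good", "bad"]);
}
```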
Red Herrings:
- Cloudflare’s status page went down at the same time (pure coincidence, unrelated issue)
- Recent history of massive DDoS attacks (7.3 Tbps in September) made the team suspect another attack
- Workers KV showed symptoms first, pointing investigation in the wrong direction
The Cascading Effect
The Bot Management failure didn’t stay contained:
- Core Proxy → HTTP 5xx errors for all traffic using Bot Management
- Workers KV → Depends on core proxy, elevated errors
- Access → Authentication failures (uses core proxy)
- Dashboard → Login unavailable (uses Turnstile + Workers KV)
- Turnstile → Failed globally (uses core proxy)
The Workers KV bypass at 13:05 was critical - even though it didn’t fix the root cause, it reduced impact while diagnosis continued.
What This Teaches Us
This outage is a masterclass in how small changes cascade in complex distributed systems:
Defense Layers That Failed:
- ❌ No validation on machine-generated configuration files
- ❌ No graceful degradation when limits exceeded (panic instead)
- ❌ No global kill switch for Bot Management module
- ❌ Observability systems consumed too much CPU during mass errors
- ❌ Query assumptions not tested against new permission model
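The first of those layers - validating machine-generated config before it propagates - is cheap to build. A minimal sketch (limits and checks are illustrative, not Cloudflare's pipeline):

```rust
use std::collections::HashSet;

// Illustrative pre-deployment validation of a generated feature file.
const MAX_FEATURES: usize = 200;

fn validate_feature_file(features: &[String]) -> Result<(), String> {
    if features.len() > MAX_FEATURES {
        return Err(format!(
            "{} features exceeds limit {MAX_FEATURES}",
            features.len()
        ));
    }
    let mut seen = HashSet::new();
    for f in features {
        if !seen.insert(f) {
            // A duplicate check alone would have flagged the doubled r0 rows.
            return Err(format!("duplicate feature: {f}"));
        }
    }
    Ok(())
}

fn main() {
    let good: Vec<String> = (0..60).map(|i| format!("f{i}")).collect();
    assert!(validate_feature_file(&good).is_ok());

    // The bad file: every feature appears twice (default + r0 rows).
    let mut bad = good.clone();
    bad.extend(good.clone());
    assert!(validate_feature_file(&bad).is_err());
}
```

Rejecting the file at generation time keeps the last good configuration in place globally - the same property the graceful-degradation fix provides at the consumer end.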
One-Line Fix, Six Hours of Downtime: The technical fix was trivial - add one SQL filter. But the diagnosis took 3 hours because:
- Intermittent failures looked like attacks
- Multiple red herrings
- Cascading symptoms across services
- Complex distributed system interactions
Tip
The Real Lesson: In distributed systems, it’s not enough for each component to work correctly in isolation. You must test how they interact, how they fail, and how failures propagate.
Deep Dive Series
Want to understand the technical details, diagnosis challenges, and lessons learned? This analysis continues in three focused posts:
Part 1: The Technical Root Cause →
Deep dive into ClickHouse architecture, the SQL query bug, and why the Rust panic happened.
Part 2: The Cascading Failures →
How diagnosis became a detective story, complete with red herrings and breakthrough moments.
Part 3: Lessons Learned →
Actionable defense-in-depth principles with code examples you can apply to your own systems.
Conclusion
Cloudflare’s nearly six-hour outage was a painful reminder that in distributed systems, small changes can have catastrophic consequences. A well-intentioned security improvement exposed a hidden assumption in a SQL query, which triggered a chain reaction that brought down one of the Internet’s critical infrastructure providers.
The technical fix was one line of SQL. The real work is in building systems where one missing line doesn’t cascade into a global outage.
Cloudflare deserves credit for their transparent post-mortem and specific commitments to prevent recurrence. These kinds of failures happen to the best engineering teams. What separates great teams from good ones is how they learn from failure.
Source: Cloudflare Official Postmortem