The Day the Internet Stumbled
On November 18, 2025, at 11:28 UTC, Cloudflare - the company that powers roughly 20% of the Internet - began serving HTTP 5xx errors instead of websites. For nearly six hours, millions of websites, APIs, and applications became inaccessible to users worldwide.
This wasn’t a DDoS attack. It wasn’t a router failure. It wasn’t even a human clicking the wrong button.
It was a missing database filter in a SQL query combined with a hard-coded memory limit and an unhandled Rust panic. Three small mistakes that cascaded into Cloudflare’s worst outage since 2019.
The Incident at a Glance
Duration: 11:28 UTC - 17:06 UTC (5 hours 38 minutes)
Main Impact Resolved: 14:30 UTC (3 hours 2 minutes)
Scope: Global CDN failure affecting core HTTP traffic
Affected Services:
- Core CDN: HTTP 5xx errors for most traffic
- Workers KV: Elevated 5xx error rate
- Cloudflare Access: Widespread authentication failures
- Dashboard: Login unavailable
- Turnstile: Failed to load globally
- Email Security: Temporary loss of IP reputation source
Warning
Impact Scale: This was Cloudflare’s worst outage in over 6 years, affecting the majority of core traffic flowing through their network - a network that handles trillions of requests per day.
The Root Cause: A Perfect Storm
Four independent factors combined to create the outage:
1. The ClickHouse Permission Change (11:05 UTC)
Cloudflare deployed a security improvement to their ClickHouse database cluster. The change made underlying database tables (r0 schema) visible to users, who previously could only see distributed tables (default schema).
Intention: Improve fine-grained access control and prevent one bad query from affecting others.
2. The Missing SQL Filter
Bot Management’s feature file generation used this query:
```sql
-- Problematic query (missing database filter)
SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
```

The bug: No `WHERE database = 'default'` clause.
Result: After the permission change, the query returned duplicate rows - one set for the default database and one for r0. The feature count more than doubled from its usual ~60.
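The mechanism can be sketched in a few lines of Rust (the table contents below are illustrative stand-ins; the real query runs against ClickHouse's `system.columns`):

```rust
// Hypothetical rows returned by system.columns after the permission change:
// each feature column now appears once per visible database.
fn visible_columns() -> Vec<(&'static str, &'static str)> {
    vec![
        ("default", "feature_a"),
        ("default", "feature_b"),
        ("r0", "feature_a"), // duplicate: underlying r0 table now visible
        ("r0", "feature_b"),
    ]
}

// Buggy behavior: no database filter, so every feature is counted twice.
fn feature_count_unfiltered() -> usize {
    visible_columns().len()
}

// Fixed behavior: the Rust analog of `WHERE database = 'default'`.
fn feature_count_filtered() -> usize {
    visible_columns()
        .iter()
        .filter(|(db, _)| *db == "default")
        .count()
}

fn main() {
    assert_eq!(feature_count_unfiltered(), 4); // doubled
    assert_eq!(feature_count_filtered(), 2);   // expected
}
```

The same rows, filtered by database, yield exactly the original feature set - which is why the one-clause fix was sufficient.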
3. The Hard-Coded 200-Feature Limit
The Bot Management module had a hard-coded limit of 200 features so memory could be preallocated for performance. With ~60 features in normal operation, there was ample headroom.

When the duplicated rows pushed the feature count past 200, this limit was breached.
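A minimal sketch of such a preallocated table (the names and error type are illustrative; only the 200-feature limit comes from the incident report):

```rust
// Illustrative fixed-capacity feature table with a hard limit.
const MAX_FEATURES: usize = 200;

#[derive(Debug)]
enum ConfigError {
    TooManyFeatures { got: usize, limit: usize },
}

fn load_features(names: Vec<String>) -> Result<Vec<String>, ConfigError> {
    if names.len() > MAX_FEATURES {
        // Exceeding the preallocated capacity is an error, not a resize:
        // the limit exists so memory can be allocated up front.
        return Err(ConfigError::TooManyFeatures {
            got: names.len(),
            limit: MAX_FEATURES,
        });
    }
    let mut table = Vec::with_capacity(MAX_FEATURES); // preallocation
    table.extend(names);
    Ok(table)
}

fn main() {
    let normal: Vec<String> = (0..60).map(|i| format!("f{i}")).collect();
    assert!(load_features(normal).is_ok()); // ~60 features: ample headroom

    let duplicated: Vec<String> = (0..250).map(|i| format!("f{i}")).collect();
    assert!(load_features(duplicated).is_err()); // duplicated rows blow the limit
}
```

Returning an `Err` here is correct; the failure came from how the caller handled it.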
4. The Unhandled Rust Panic
The FL2 proxy code called `.unwrap()` on the result of loading the feature file:

```rust
let features = load_bot_features().unwrap(); // panics if loading failed
```

Result: thread panic → HTTP 5xx errors for all traffic using Bot Management.
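A hedged sketch of the graceful alternative (function names and the fallback strategy are illustrative, not Cloudflare's actual code): match on the error and keep serving with the last known-good configuration instead of panicking.

```rust
#[derive(Debug)]
enum ConfigError {
    TooManyFeatures,
}

// Illustrative stand-in for the feature-file loader.
fn load_bot_features(feature_count: usize) -> Result<Vec<String>, ConfigError> {
    if feature_count > 200 {
        return Err(ConfigError::TooManyFeatures);
    }
    Ok((0..feature_count).map(|i| format!("f{i}")).collect())
}

// Graceful degradation: on a bad config, keep serving traffic with the
// last known-good feature set instead of crashing the thread.
fn load_with_fallback(feature_count: usize, last_good: Vec<String>) -> Vec<String> {
    match load_bot_features(feature_count) {
        Ok(features) => features,
        Err(e) => {
            eprintln!("bad feature file ({e:?}); keeping last good config");
            last_good
        }
    }
}

fn main() {
    let last_good: Vec<String> = (0..60).map(|i| format!("f{i}")).collect();
    // Bad file: serve with the previous config instead of crashing.
    assert_eq!(load_with_fallback(250, last_good.clone()).len(), 60);
    // Good file: use it.
    assert_eq!(load_with_fallback(60, last_good).len(), 60);
}
```

The design choice here is "fail open on configuration, fail loudly in logs": a degraded bot score is far cheaper than a fleet-wide 5xx.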
Important
One Line of SQL: Adding WHERE database = 'default' would have prevented the entire outage.
The Timeline
| Time (UTC) | Event | Technical Details |
|---|---|---|
| 11:05 | Permission change deployed | ClickHouse r0 tables made visible to users |
| 11:28 | First errors observed | Bad feature file reaches production |
| 11:32-13:05 | Initial investigation | Focused on Workers KV symptoms (red herring) |
| 11:35 | Incident call created | Team mobilizes for major incident |
| 13:05 | Workers KV bypass | Impact reduced significantly |
| 13:37 | Focus shift | Bot Management config file identified |
| 14:24 | Bad file generation stopped | Root cause partially mitigated |
| 14:30 | Main impact resolved | Correct configuration deployed globally |
| 14:40-15:30 | Secondary dashboard impact | Login backlog overwhelmed control plane |
| 17:06 | All services recovered | Full operations restored |
Time to fix: 3 hours 2 minutes
Total duration: 5 hours 38 minutes
Why Diagnosis Was So Hard
The outage exhibited strange, intermittent behavior that confused the response team:
The 5-Minute Fluctuation: The feature file was regenerated every 5 minutes. Because the ClickHouse cluster was being gradually updated, sometimes the query ran on an updated node (bad file) and sometimes on a non-updated node (good file).
```
11:20 ─[Good]─ Network operational
11:25 ─[Bad]── HTTP 5xx errors surge
11:30 ─[Good]─ Network recovers
11:35 ─[Bad]── HTTP 5xx errors surge
```

This made it look like an external attack rather than a configuration issue.
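The flapping falls out of the rollout mechanics, sketched below (the node list is illustrative): each 5-minute regeneration cycle queries one cluster node, and only already-updated nodes produce the oversized file.

```rust
// Whether a regeneration cycle produces a bad file depends solely on
// which node serves the query during the gradual ClickHouse rollout.
fn regenerate(node_is_updated: bool) -> &'static str {
    if node_is_updated { "bad" } else { "good" }
}

fn main() {
    // Mixed cluster mid-rollout: some nodes updated, some not yet.
    let nodes_hit = [false, true, false, true];
    let outcomes: Vec<&str> = nodes_hit.iter().map(|&u| regenerate(u)).collect();
    // Good and bad files alternate, so errors surge and recede every cycle.
    assert_eq!(outcomes, vec!["good", "bad", "good", "bad"]);
}
```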
Red Herrings:
- Cloudflare’s status page went down at the same time (pure coincidence, unrelated issue)
- Recent history of massive DDoS attacks (7.3 Tbps in September) made the team suspect another attack
- Workers KV showed symptoms first, pointing investigation in the wrong direction
The Cascading Effect
The Bot Management failure didn’t stay contained:
- Core Proxy → HTTP 5xx errors for all traffic using Bot Management
- Workers KV → Depends on core proxy, elevated errors
- Access → Authentication failures (uses core proxy)
- Dashboard → Login unavailable (uses Turnstile + Workers KV)
- Turnstile → Failed globally (uses core proxy)
The Workers KV bypass at 13:05 was critical - even though it didn’t fix the root cause, it reduced impact while diagnosis continued.
What This Teaches Us
This outage is a masterclass in how small changes cascade in complex distributed systems:
Defense Layers That Failed:
- ❌ No validation on machine-generated configuration files
- ❌ No graceful degradation when limits exceeded (panic instead)
- ❌ No global kill switch for Bot Management module
- ❌ Observability systems consumed too much CPU during mass errors
- ❌ Query assumptions not tested against new permission model
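The first of those layers - validating machine-generated config before it propagates - is cheap to build. A minimal sketch (limits and checks are illustrative, not Cloudflare's pipeline):

```rust
use std::collections::HashSet;

// Illustrative pre-deployment validation of a generated feature file.
const MAX_FEATURES: usize = 200;

fn validate_feature_file(features: &[String]) -> Result<(), String> {
    if features.len() > MAX_FEATURES {
        return Err(format!(
            "{} features exceeds limit {MAX_FEATURES}",
            features.len()
        ));
    }
    let mut seen = HashSet::new();
    for f in features {
        if !seen.insert(f) {
            // A duplicate check alone would have flagged the doubled r0 rows.
            return Err(format!("duplicate feature: {f}"));
        }
    }
    Ok(())
}

fn main() {
    let good: Vec<String> = (0..60).map(|i| format!("f{i}")).collect();
    assert!(validate_feature_file(&good).is_ok());

    // The bad file: every feature appears twice (default + r0 rows).
    let mut bad = good.clone();
    bad.extend(good.clone());
    assert!(validate_feature_file(&bad).is_err());
}
```

Rejecting the file at generation time keeps the last good configuration in place globally - the same property the graceful-degradation fix provides at the consumer end.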
One-Line Fix, Six Hours of Downtime: The technical fix was trivial - add one SQL filter. But the diagnosis took 3 hours because:
- Intermittent failures looked like attacks
- Multiple red herrings
- Cascading symptoms across services
- Complex distributed system interactions
Tip
The Real Lesson: In distributed systems, it’s not enough for each component to work correctly in isolation. You must test how they interact, how they fail, and how failures propagate.
Deep Dive Series
Want to understand the technical details, diagnosis challenges, and lessons learned? This analysis continues in three focused posts:
Part 1: The Technical Root Cause →
Deep dive into ClickHouse architecture, the SQL query bug, and why the Rust panic happened.
Part 2: The Cascading Failures →
How diagnosis became a detective story, complete with red herrings and breakthrough moments.
Part 3: Lessons Learned →
Actionable defense-in-depth principles with code examples you can apply to your own systems.
Conclusion
Cloudflare’s nearly six-hour outage was a painful reminder that in distributed systems, small changes can have catastrophic consequences. A well-intentioned security improvement exposed a hidden assumption in a SQL query, which triggered a chain reaction that brought down one of the Internet’s critical infrastructure providers.
The technical fix was one line of SQL. The real work is in building systems where one missing line doesn’t cascade into a global outage.
Cloudflare deserves credit for their transparent post-mortem and specific commitments to prevent recurrence. These kinds of failures happen to the best engineering teams. What separates great teams from good ones is how they learn from failure.
Source: Cloudflare Official Postmortem