Tip
Part 1 of 3: This is part of a series analyzing Cloudflare’s November 18, 2025 outage. ← Back to Overview | Part 2: Cascading Failures →
How Cloudflare Processes Requests
Every request to Cloudflare follows this path:
```mermaid
graph TD
    R[User Request] --> TLS[HTTP/TLS Termination]
    TLS --> CP[Core Proxy FL/FL2]
    CP --> BM[Bot Management Module]
    BM --> WAF[Other Modules: WAF, DDoS, etc.]
    WAF --> P[Pingora: Cache/Origin Fetch]
    P --> O[Origin Server]
    style BM fill:#f87171,color:#fff,stroke:#dc2626
    style CP fill:#fbbf24,color:#000,stroke:#d97706
```
The core proxy runs various security and performance modules for each request. One of these modules, Bot Management, was the source of the outage.
Bot Management’s Machine Learning Model
Cloudflare’s Bot Management uses an ML model to generate “bot scores” for every request traversing their network:
- Bot Score: 1-99 (1 = likely bot, 99 = likely human)
- Usage: Customers use scores to block/challenge/allow bots
- Input: A “feature” configuration file
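As a sketch of the "Usage" bullet above, a customer rule might map scores to firewall actions like this. The thresholds here are purely illustrative, not Cloudflare defaults:

```python
def bot_action(score: int, block_below: int = 30, challenge_below: int = 60) -> str:
    """Map a bot score (1-99, where low means likely bot) to a firewall action.

    The threshold values are hypothetical; each customer tunes their own rules.
    """
    if not 1 <= score <= 99:
        raise ValueError(f"score out of range: {score}")
    if score < block_below:
        return "block"       # very likely a bot
    if score < challenge_below:
        return "challenge"   # uncertain: present a challenge
    return "allow"           # very likely human
```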
The Feature Configuration File
A “feature” is an individual trait used for prediction:
- Request header patterns
- Timing characteristics
- Behavioral signals
- IP reputation metrics
The feature file contains ~60 features normally. This file is:
- Generated every 5 minutes from a ClickHouse query
- Rapidly deployed globally to all edge servers
- Critical for bot defense (attackers evolve quickly)
Why this matters: Frequent, global deployment of machine-generated configs became the attack vector.
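The refresh cycle described above can be sketched roughly as follows. The function names and structure are my own stand-ins; the real pipeline is internal to Cloudflare:

```python
import time

REFRESH_INTERVAL_SECONDS = 5 * 60  # a new feature file every 5 minutes

def generate_and_deploy_features(run_query, deploy,
                                 interval=REFRESH_INTERVAL_SECONDS, cycles=None):
    """Periodically regenerate the feature file and push it to the edge.

    `run_query` and `deploy` are injected stand-ins for the real ClickHouse
    client and the global config-distribution system. `cycles=None` runs forever.
    """
    n = 0
    while cycles is None or n < cycles:
        features = run_query()   # e.g. query system.columns for feature metadata
        deploy(features)         # rapidly propagate to every edge server
        n += 1
        if cycles is None or n < cycles:
            time.sleep(interval)
```

Because every cycle redeploys globally, a single bad query result reaches the whole network within minutes.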
ClickHouse Architecture: The Foundation
Understanding ClickHouse’s distributed query system is critical to understanding the bug.
Distributed Tables vs. Underlying Tables
```mermaid
graph TD
    Q[SQL Query] --> DT["Distributed Tables (default DB)"]
    DT -->|Fan out to shards| UT1["Shard 1 (r0 DB)"]
    DT -->|Fan out to shards| UT2["Shard 2 (r0 DB)"]
    DT -->|Fan out to shards| UT3["Shard 3 (r0 DB)"]
```
- `default` database: Contains distributed tables (powered by the `Distributed` table engine)
- `r0` database: Contains the actual shard tables where the data lives
- Queries to distributed tables automatically fan out to all shards in `r0`
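Conceptually, a query against a distributed table behaves like the fan-out-and-merge below. This is a toy model of the behavior, not ClickHouse internals:

```python
def query_distributed(shards, predicate):
    """Fan a query out to every underlying r0 shard and merge the rows.

    `shards` is a list of row lists; ClickHouse does this in parallel.
    """
    results = []
    for shard in shards:
        results.extend(row for row in shard if predicate(row))
    return results

# Two hypothetical shards holding feature-table rows:
shards = [
    [{"name": "request_method", "db": "r0"}],  # shard 1
    [{"name": "client_ip", "db": "r0"}],       # shard 2
]
rows = query_distributed(shards, lambda r: True)
```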
The Permission Change at 11:05
Before:
Users querying metadata tables (like system.columns) only saw tables in the default database, even though they implicitly had access to r0 tables.
After the change:
Made r0 table access explicit - users could now see metadata for both default AND r0 tables.
Goal: Improve security by enabling fine-grained access grants and preventing one bad subquery from affecting others.
The SQL Query Bug
The Bot Management feature generation code used this query to discover available features:
```sql
-- Problematic query
SELECT name, type
FROM system.columns
WHERE table = 'http_requests_features'
ORDER BY name;
```

What Changed in Query Results
Before the permission change (returns ~60 rows):

| name              | type   | database |
|-------------------|--------|----------|
| request_header_ua | String | default  |
| request_method    | String | default  |
| request_path      | String | default  |
| client_ip         | String | default  |

... (~60 features total)

After the permission change (returns ~120 rows):

| name              | type   | database |              |
|-------------------|--------|----------|--------------|
| request_header_ua | String | default  |              |
| request_header_ua | String | r0       | ← Duplicate! |
| request_method    | String | default  |              |
| request_method    | String | r0       | ← Duplicate! |
| request_path      | String | default  |              |
| request_path      | String | r0       | ← Duplicate! |

... (~120+ rows, doubled!)

The query didn't filter by database name, so it returned features from both the default and r0 schemas.
The Correct Query
```sql
SELECT name, type
FROM system.columns
WHERE database = 'default'  -- ← The missing filter!
  AND table = 'http_requests_features'
ORDER BY name;
```

One line of SQL would have prevented the entire outage.
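The effect of the missing filter can be reproduced with a toy model of `system.columns`. The column names below are illustrative, not Cloudflare's actual feature list:

```python
# After the permission change, every feature column is visible
# in both the 'default' and 'r0' databases.
feature_names = ["request_header_ua", "request_method", "request_path", "client_ip"]
system_columns = [
    {"database": db, "table": "http_requests_features", "name": name}
    for name in feature_names
    for db in ("default", "r0")
]

def buggy_query(rows):
    # No database filter: returns each feature twice
    return [r["name"] for r in rows if r["table"] == "http_requests_features"]

def fixed_query(rows):
    # WHERE database = 'default' restores the expected row count
    return [r["name"] for r in rows
            if r["database"] == "default" and r["table"] == "http_requests_features"]

assert len(buggy_query(system_columns)) == 2 * len(fixed_query(system_columns))
```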
The 200-Feature Hard Limit
The Bot Management module has a hard limit: maximum 200 features.
Why the limit exists:
Performance optimization. The module preallocates memory for features at startup:
```python
# Conceptual implementation
class BotManagementModule:
    MAX_FEATURES = 200  # Hard limit

    def __init__(self):
        # Preallocate memory for performance
        self.feature_storage = [None] * self.MAX_FEATURES
        self.feature_metadata = [None] * self.MAX_FEATURES

    def load_features(self, feature_file):
        features = parse_feature_file(feature_file)

        if len(features) > self.MAX_FEATURES:
            # What happens here determines the failure mode
            raise TooManyFeaturesError(
                f"Feature count {len(features)} exceeds limit {self.MAX_FEATURES}"
            )
```

Normal operation: ~60 features, leaving a 140-feature buffer (233% headroom)
After bad query: 120+ features, only 80-feature buffer (67% headroom)
Eventually, as more features were added and the query continued returning duplicates, the limit was breached.
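The headroom figures above are simple arithmetic against the 200-slot limit; a quick check (the helper name is mine):

```python
MAX_FEATURES = 200

def headroom(feature_count: int) -> tuple[int, int]:
    """Return (remaining slots, remaining slots as a % of current usage)."""
    remaining = MAX_FEATURES - feature_count
    return remaining, round(remaining / feature_count * 100)

normal = headroom(60)      # ~60 features in normal operation
duplicated = headroom(120) # after the query started returning duplicates
```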
The Rust Panic
Here’s the actual FL2 code that failed:
```rust
// Simplified FL2 proxy code
fn process_request(request: HttpRequest) -> Result<HttpResponse, Error> {
    // Load bot management configuration
    let bot_config = load_bot_config()?;

    // Load features from the configuration file
    let features = load_bot_features(&bot_config)
        .unwrap(); // ← Called .unwrap() on an Err value!

    // Calculate bot score using features
    let bot_score = calculate_bot_score(&request, &features);

    // Continue processing...
    Ok(response)
}

fn load_bot_features(config: &BotConfig) -> Result<Option<Features>, Error> {
    let feature_file = read_feature_file(config.feature_file_path())?;
    let features = parse_features(&feature_file)?;

    // Validate feature count
    if features.len() > MAX_FEATURES {
        return Err(Error::TooManyFeatures {
            count: features.len(),
            max: MAX_FEATURES,
        });
    }

    Ok(Some(features))
}
```

The Panic:

```
thread 'fl2_worker_thread' panicked: called `Result::unwrap()` on an `Err` value:
Error::TooManyFeatures { count: 215, max: 200 }
```

Why .unwrap() is Dangerous
In Rust, .unwrap() essentially says: “This should never fail, and if it does, crash immediately.”
Better error handling:
```rust
let features = match load_bot_features(&bot_config) {
    Ok(Some(f)) => f,
    Ok(None) | Err(_) => {
        // Graceful degradation
        log_error!("Failed to load features, using cached version");
        return use_cached_features_or_disable_bot_scoring();
    }
};
```

A single module failure shouldn't bring down the entire proxy. The Bot Management module should:
- Log the error
- Use cached features
- Or disable bot scoring and allow traffic through
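That fail-open pattern might look like the following sketch (names and structure are hypothetical):

```python
_last_good_features = None  # cache of the most recent valid feature set

def load_features_safely(load, log=print):
    """Fail open: on a bad feature file, fall back to the cached set,
    or disable bot scoring entirely rather than crash the proxy."""
    global _last_good_features
    try:
        features = load()
        _last_good_features = features
        return features
    except Exception as exc:
        log(f"feature load failed: {exc}")
        if _last_good_features is not None:
            return _last_good_features  # serve with stale-but-valid features
        return None                     # None = bot scoring disabled, traffic allowed
```

The key property: a load failure degrades bot scoring instead of turning into a request-path panic.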
The Intermittent Failure Pattern
The outage didn’t fail cleanly - it alternated between working and broken every 5 minutes.
Why the Fluctuation?
- Feature file regenerated every 5 minutes from the ClickHouse query
- Gradual ClickHouse cluster rollout: Only some nodes had the new permissions
- Random query routing: Each query hit either an updated node (bad) or non-updated node (good)
Timeline:
```
11:20 ─[Good]─ Query hits non-updated node → 60 features   → Network OK
11:25 ─[Bad]── Query hits updated node     → 120+ features → Panic!
11:30 ─[Good]─ Query hits non-updated node → 60 features   → Network OK
11:35 ─[Bad]── Query hits updated node     → 120+ features → Panic!
...
12:30 ─[Bad]── All nodes updated           → Always bad    → Stable failure
```

This made diagnosis extremely difficult - the team initially thought it was an external attack because:
- Attacks often cause intermittent patterns
- Internal config issues usually fail consistently
- The 5-minute cycle suggested external timing
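The intermittency falls out of the rollout mechanics: while only some ClickHouse nodes had the new permissions, each 5-minute query landed on a random node and was bad only if that node was updated. A deterministic sketch of those cycles (the node picker is injected so the flip-flopping is reproducible):

```python
def cycle_outcomes(rollout_fractions, pick_node):
    """Simulate one feature-file regeneration per 5-minute cycle.

    `rollout_fractions` is the fraction of updated nodes at each cycle;
    `pick_node(p)` returns True if the randomly chosen node is updated.
    """
    outcomes = []
    for p in rollout_fractions:
        bad = pick_node(p)  # updated node → duplicated features → panic
        outcomes.append("Panic!" if bad else "OK")
    return outcomes

# A deterministic picker stands in for random routing:
picks = iter([False, True, False, True, True])
outcomes = cycle_outcomes([0.3, 0.3, 0.5, 0.5, 1.0], lambda p: next(picks))
# alternates while the rollout is partial, then fails every cycle
```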
Different Failure Modes: FL vs FL2
Cloudflare was migrating traffic from their old proxy (FL) to a new version (FL2). Both were affected, but differently:
FL2 (new proxy):
- Rust panic on feature count exceeded
- Returned HTTP 5xx errors
- Complete traffic failure for affected customers
FL (old proxy):
- Different error handling
- Set all bot scores to zero
- Customers with “block bots” rules saw false positives (blocked legitimate traffic)
- But no 5xx errors
This difference created confusing symptoms - some customers saw 5xx errors, others saw bot-blocking issues.
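The two failure modes can be contrasted in a small model. This is only an illustration of fail-closed versus fail-degraded behavior, not the actual FL/FL2 code:

```python
def fl2_handle(feature_count, max_features=200):
    """FL2-style: hard failure → HTTP 5xx for the request."""
    if feature_count > max_features:
        return {"status": 500, "bot_score": None}  # complete traffic failure
    return {"status": 200, "bot_score": 75}

def fl_handle(feature_count, max_features=200):
    """FL-style: degrade → every request scored 0 (looks like a bot)."""
    if feature_count > max_features:
        # No 5xx, but 'block bots' rules now fire on legitimate traffic
        return {"status": 200, "bot_score": 0}
    return {"status": 200, "bot_score": 75}
```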
Why Machine-Generated ≠ Trusted
A critical assumption was violated: “Our code generates it, so it must be valid.”
The assumption:
- We control the query
- We control the database
- Therefore, the output will always be correct
The reality:
- The environment changed (new permissions)
- The query had an implicit assumption (only default database visible)
- The assumption became invalid
The lesson: Treat machine-generated config files with the same validation as user input:
```python
def validate_feature_file(file_contents: bytes) -> Features:
    """Validate feature file before deployment."""
    features = parse_features(file_contents)

    # Validate count
    if len(features) > MAX_FEATURES:
        raise ValidationError(
            f"Too many features: {len(features)} > {MAX_FEATURES}"
        )

    # Check for duplicates
    feature_names = [f.name for f in features]
    if len(feature_names) != len(set(feature_names)):
        duplicates = find_duplicates(feature_names)
        raise ValidationError(f"Duplicate features: {duplicates}")

    # Validate file size (prevent memory issues)
    if len(file_contents) > MAX_FILE_SIZE:
        raise ValidationError(f"File too large: {len(file_contents)} bytes")

    # Validate feature schema
    for feature in features:
        if not feature.is_valid_schema():
            raise ValidationError(f"Invalid feature schema: {feature.name}")

    return features
```

This validation would have caught the issue immediately.
Conclusion
The technical root cause was deceptively simple:
- A ClickHouse permission change made `r0` tables visible
- The SQL query lacked a `WHERE database = 'default'` filter
- The query returned duplicate rows, exceeding the 200-feature limit
- The Rust code called `.unwrap()` on the error, causing a panic
- HTTP 5xx errors hit all traffic using Bot Management
The fix: one line of SQL.
The real lesson: assumptions must be explicit, validation must be comprehensive, and graceful degradation should be the default.
Tip
Continue Reading: Part 2: The Cascading Failures →
Learn how this technical failure cascaded through dependent services and why diagnosis took 3 hours.