Defense in Depth: Lessons from Cloudflare


Nov 20, 2025
8 min read
Tip

Part 3 of 3: This is part of a series analyzing Cloudflare’s November 18, 2025 outage. ← Part 2: Cascading Failures | Back to Overview

Defense in Depth Failed

The Cloudflare outage revealed multiple defense layers that should have caught the issue but didn’t:

  • ❌ No validation on machine-generated config files
  • ❌ No graceful degradation when limits exceeded
  • ❌ No global kill switch for Bot Management
  • ❌ Observability systems consumed too much CPU during failures
  • ❌ Query assumptions not tested against permission changes
  • ❌ No explicit documentation of query dependencies

One missing SQL filter shouldn’t cause a 6-hour global outage. Let’s examine what defense layers should have been in place.

Lesson 1: Validate Machine-Generated Configuration

The Problem: Bot Management assumed its own generated config files were always valid.

The Fix: Treat internally-generated configs with the same suspicion as user input.

Comprehensive Validation

class FeatureFileValidator:
    """Validate feature configuration files before deployment."""

    MAX_FEATURES = 200
    MAX_FILE_SIZE = 10 * 1024 * 1024  # 10MB

    def validate(self, file_contents: bytes, file_path: str) -> ValidatedFeatures:
        """
        Validate feature file with multiple checks.
        Returns validated features or raises ValidationError.
        """
        # Check 1: File size
        if len(file_contents) > self.MAX_FILE_SIZE:
            raise ValidationError(
                f"File too large: {len(file_contents)} bytes > {self.MAX_FILE_SIZE}",
                file=file_path
            )

        # Check 2: Parse the file
        try:
            features = parse_features(file_contents)
        except ParseError as e:
            raise ValidationError(f"Failed to parse file: {e}", file=file_path)

        # Check 3: Feature count
        if len(features) > self.MAX_FEATURES:
            raise ValidationError(
                f"Too many features: {len(features)} > {self.MAX_FEATURES}",
                file=file_path,
                feature_count=len(features)
            )
        if len(features) == 0:
            raise ValidationError("No features found", file=file_path)

        # Check 4: Duplicate detection
        feature_names = [f.name for f in features]
        duplicates = find_duplicates(feature_names)
        if duplicates:
            raise ValidationError(
                f"Duplicate features detected: {duplicates}",
                file=file_path
            )

        # Check 5: Schema validation
        for feature in features:
            if not self._validate_feature_schema(feature):
                raise ValidationError(
                    f"Invalid feature schema: {feature.name}",
                    file=file_path
                )

        # Check 6: Sudden size changes (anomaly detection)
        previous_count = self._get_previous_feature_count()
        if previous_count and len(features) > previous_count * 1.5:
            raise ValidationError(
                f"Suspicious feature count increase: {previous_count} -> {len(features)}",
                file=file_path,
                alert_ops_team=True
            )

        return ValidatedFeatures(features, file_path)

    def _validate_feature_schema(self, feature: Feature) -> bool:
        """Validate individual feature schema."""
        required_fields = ['name', 'type', 'metadata']
        return all(hasattr(feature, field) for field in required_fields)


# Usage in deployment pipeline
def deploy_feature_file(file_path: str) -> bool:
    """Deploy feature file with validation."""
    with open(file_path, 'rb') as f:
        contents = f.read()

    validator = FeatureFileValidator()
    try:
        validated = validator.validate(contents, file_path)
    except ValidationError as e:
        log_error(f"Feature file validation failed: {e}")
        alert_ops_team(e)
        # CRITICAL: Don't deploy invalid file!
        return False

    # Deploy only if validation passed
    deploy_to_network(validated.features)
    return True
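
The validator above leans on a find_duplicates helper that isn't shown (it reappears in Lesson 6). A minimal implementation, purely illustrative rather than Cloudflare's actual code, could be:

from collections import Counter

def find_duplicates(names: list) -> list:
    """Return the names that appear more than once."""
    counts = Counter(names)
    return [name for name, count in counts.items() if count > 1]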

Key principle: Catch errors before deployment, not after. Validation at generation time prevents bad configs from ever reaching production.

Lesson 2: Graceful Degradation Over Panics

The Problem: The Rust code called .unwrap() on a failing result, turning a recoverable error into a panic → HTTP 5xx.

The Fix: Every module should have a degraded mode.

Error Handling: Bad vs. Good

Bad Code (what Cloudflare had):

fn process_request(request: HttpRequest) -> Result<HttpResponse, Error> {
    let features = load_bot_features()?.unwrap(); // Panics when no features are present!
    let bot_score = calculate_bot_score(&request, &features);
    Ok(build_response(bot_score))
}

Good Code (with graceful degradation):

fn process_request(request: HttpRequest) -> Result<HttpResponse, Error> {
    let features = match load_bot_features() {
        Ok(Some(f)) => f,
        Ok(None) | Err(_) => {
            // Graceful degradation strategy
            log_error!("Bot features unavailable, using fallback");
            metrics::increment("bot_management.degraded_mode");

            // Option 1: Use cached features
            if let Some(cached) = get_cached_features() {
                cached
            } else {
                // Option 2: Disable bot scoring, allow traffic
                return Ok(build_response_without_bot_scoring());
            }
        }
    };

    let bot_score = calculate_bot_score(&request, &features);
    Ok(build_response(bot_score))
}

/// Return last known-good features from cache.
fn get_cached_features() -> Option<Features> {
    FEATURE_CACHE.read()
        .ok()
        .and_then(|cache| cache.get_last_valid())
}

Principle: Fail Open or Cached, Never Fail Catastrophically

class BotManagementModule:
    def __init__(self):
        self.feature_cache = FeatureCache(ttl=3600)  # 1 hour cache
        self.degraded_mode = False

    def process_request(self, request):
        """Process request with automatic fallback."""
        try:
            features = self.load_features()
        except TooManyFeaturesError as e:
            # Degraded mode: use cached features
            log_error(f"Feature load failed: {e}, entering degraded mode")
            self.degraded_mode = True
            features = self.feature_cache.get_last_valid()
            if features is None:
                # Ultimate fallback: disable bot scoring
                log_critical("No cached features, disabling bot module")
                return self.process_without_bot_scoring(request)

        bot_score = self.calculate_score(request, features)
        return self.build_response(request, bot_score)
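
The FeatureCache this module relies on isn't defined in the snippet. Here is a minimal last-known-good cache as a sketch, assuming a simple TTL in seconds (the interface is inferred from the usage above):

import time

class FeatureCache:
    """Hold the last known-good feature set for fallback use."""

    def __init__(self, ttl: int):
        self.ttl = ttl
        self._features = None
        self._stored_at = 0.0

    def store(self, features):
        """Record a validated feature set as the fallback copy."""
        self._features = features
        self._stored_at = time.time()

    def get_last_valid(self):
        """Return cached features, or None if missing or older than ttl."""
        if self._features is None:
            return None
        if time.time() - self._stored_at > self.ttl:
            return None
        return self._features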

Degradation hierarchy:

  1. Preferred: Use current features
  2. Fallback: Use cached features (slightly stale but working)
  3. Last resort: Disable module, allow traffic through

Never fail completely and return 5xx errors.

Lesson 3: Global Kill Switches

The Problem: No way to quickly disable Bot Management when it started failing.

The Fix: Every feature needs a global off switch.

Feature Flag Implementation

class FeatureFlagSystem:
    """Centralized feature flag management with kill switches."""

    def __init__(self):
        self.flags = self._load_flags_from_config()
        self.kill_switches = KillSwitchManager()

    def is_enabled(self, feature_name: str, request_context: dict = None) -> bool:
        """Check if feature is enabled, respecting kill switches."""
        # Check 1: Global kill switch (highest priority)
        if self.kill_switches.is_active(f"{feature_name}_global_disable"):
            log_info(f"Feature {feature_name} disabled by global kill switch")
            metrics.increment(f"{feature_name}.killed_globally")
            return False

        # Check 2: Datacenter-specific kill switch
        if request_context:
            dc = request_context.get('datacenter')
            if self.kill_switches.is_active(f"{feature_name}_disable_{dc}"):
                return False

        # Check 3: Normal feature flag
        flag = self.flags.get(feature_name)
        if not flag or not flag.enabled:
            return False

        # Check 4: Automatic circuit breaker
        if self._is_circuit_broken(feature_name):
            log_warning(f"Feature {feature_name} circuit broken due to errors")
            return False

        return True

    def _is_circuit_broken(self, feature_name: str) -> bool:
        """Automatic circuit breaker based on error rates."""
        error_count = metrics.get(f"{feature_name}.errors.count.5min")
        error_rate = metrics.get(f"{feature_name}.errors.rate.5min")

        # Break circuit if error rate > 50% or errors > 1000/min
        if error_rate and error_rate > 0.5:
            return True
        if error_count and error_count > 1000:
            return True
        return False


# Usage in Bot Management
def process_request(request):
    if not feature_flags.is_enabled('bot_management', request.context):
        # Feature disabled, skip bot management
        return process_without_bot_scoring(request)

    # Normal bot management processing
    return process_with_bot_management(request)
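
The KillSwitchManager referenced above is assumed rather than shown. A minimal in-memory sketch follows; in a real deployment the switch state would live in a replicated store so every datacenter converges on it quickly, and the method names and expiry behavior here are assumptions:

import time
from typing import Optional

class KillSwitchManager:
    """Track named kill switches with optional auto-expiry."""

    def __init__(self):
        # switch name -> expiry timestamp, or None for indefinite
        self._switches = {}

    def activate(self, name: str, duration_seconds: Optional[float] = None):
        """Turn a switch on, optionally expiring after duration_seconds."""
        expiry = time.time() + duration_seconds if duration_seconds else None
        self._switches[name] = expiry

    def deactivate(self, name: str):
        """Turn a switch off."""
        self._switches.pop(name, None)

    def is_active(self, name: str) -> bool:
        """Check whether a switch is on, lazily expiring stale ones."""
        if name not in self._switches:
            return False
        expiry = self._switches[name]
        if expiry is not None and time.time() > expiry:
            del self._switches[name]
            return False
        return True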

CLI for Operators

# At 11:30, an operator could have run:
cloudflare-ctl kill-switch enable bot_management_global_disable \
  --reason "5xx errors from feature file issue" \
  --duration 1h
# Output:
# ✅ Kill switch activated globally in 847ms
# ✅ Bot Management disabled across all datacenters
# ✅ Fallback mode: Allow all traffic (no bot scoring)
# ✅ Estimated impact reduction: 99% within 30 seconds

With a kill switch, the outage could have been mitigated in minutes instead of hours.

Lesson 4: Rate-Limit Observability

The Problem: Debug systems consumed massive CPU during mass failures, making the outage worse.

The Fix: Adaptive observability that reduces overhead during high-error scenarios.

Adaptive Error Sampling

import random
import time
import traceback


class AdaptiveObservability:
    """Observability system that adapts to error rates."""

    def __init__(self):
        self.sample_rate = 1.0               # 100% sampling normally
        self.error_rate_threshold = 100      # errors/sec
        self.high_volume_sample_rate = 0.01  # 1% during incidents

    def record_error(self, error: Exception, context: dict):
        """Record error with adaptive sampling."""
        current_error_rate = self._get_error_rate()

        # Adapt sample rate based on error volume
        if current_error_rate > self.error_rate_threshold:
            # High error rate: reduce sampling to save CPU
            self.sample_rate = self.high_volume_sample_rate
            metrics.increment("observability.high_error_mode")
        else:
            # Normal error rate: full sampling
            self.sample_rate = 1.0

        # Sample decision
        if not self._should_sample():
            # Still count the error, just don't enhance it
            metrics.increment(f"errors.{error.__class__.__name__}")
            return

        # Full error enhancement (expensive)
        enhanced_error = self._enhance_error(error, context)
        self._log_to_storage(enhanced_error)

    def _should_sample(self) -> bool:
        """Probabilistic sampling."""
        return random.random() < self.sample_rate

    def _enhance_error(self, error: Exception, context: dict) -> dict:
        """Enhance error with stack trace, context, etc. EXPENSIVE."""
        return {
            'error': str(error),
            'type': error.__class__.__name__,
            'stack_trace': traceback.format_exc(),  # Expensive!
            'context': context,
            'timestamp': time.time(),
            'request_id': context.get('request_id'),
            # ... more expensive debugging data
        }
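
The _get_error_rate call above is also assumed. A sliding-window counter is one straightforward way to implement it (the 10-second window and the class name are illustrative choices):

import time
from collections import deque

class ErrorRateWindow:
    """Estimate errors/sec over a sliding time window."""

    def __init__(self, window_seconds: float = 10.0):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self):
        """Record one error occurrence."""
        self._timestamps.append(time.time())

    def errors_per_second(self) -> float:
        """Drop timestamps outside the window, then compute the rate."""
        cutoff = time.time() - self.window_seconds
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) / self.window_seconds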

Key insight: During incidents with millions of errors, you don’t need to enhance every single one. Sample 1% for debugging, count 100% for metrics.

Lesson 5: Test Your Assumptions

The Problem: The query assumed only the default database was visible; that assumption was never tested against permission changes.

The Fix: Property-based testing and assumption documentation.

Property-Based Testing

from hypothesis import given, strategies as st
import pytest


class TestFeatureFileGeneration:
    @given(
        feature_count=st.integers(min_value=0, max_value=500),
        database_count=st.integers(min_value=1, max_value=5)
    )
    def test_feature_file_handles_any_count_gracefully(
        self, feature_count, database_count
    ):
        """
        Test that feature file generation handles any number of
        features and databases gracefully.
        """
        # Simulate varying database visibility
        mock_query_result = generate_mock_columns(
            feature_count, database_count
        )
        result = generate_feature_file(mock_query_result)

        if feature_count > MAX_FEATURES:
            # Should return error, not crash
            assert result.is_err()
            assert "too many features" in result.error_message().lower()
        else:
            assert result.is_ok()
            assert len(result.features) <= MAX_FEATURES

    def test_query_filters_database_explicitly(self):
        """
        CRITICAL: Ensure query explicitly filters by database.
        This test would have caught the Cloudflare bug!
        """
        query = get_feature_discovery_query()

        # Assertion: Query must contain database filter
        assert "database = 'default'" in query.lower(), \
            "Query must explicitly filter by database name!"

Document Assumptions

def get_feature_discovery_query() -> str:
    """
    Get SQL query to discover available features.

    ASSUMPTIONS (CRITICAL - TEST THESE!):
    1. Only returns columns from 'default' database
    2. Returns at most MAX_FEATURES results
    3. No duplicate feature names in result
    4. 'http_requests_features' table exists

    QUERY DEPENDENCIES:
    - ClickHouse permissions: assumes 'default' database is visible
    - Schema: assumes 'system.columns' table exists

    If permissions change, THIS QUERY MAY BREAK!
    """
    return """
        SELECT name, type
        FROM system.columns
        WHERE database = 'default'  -- CRITICAL: Explicit filter!
          AND table = 'http_requests_features'
        ORDER BY name
    """

Lesson 6: Make Assumptions Explicit

The Problem: An implicit assumption that the query would only ever return columns from one database.

The Fix: Document, test, and validate all assumptions.

Assumption Validation

from typing import List


class AssumptionValidator:
    """Validate runtime assumptions that code depends on."""

    @staticmethod
    def validate_feature_query_assumptions(query_result: List[dict]):
        """
        Validate assumptions about feature query results.
        Fail fast if assumptions are violated.
        """
        # Assumption 1: All rows are from 'default' database
        databases = {row['database'] for row in query_result}
        assert databases == {'default'}, \
            f"Expected only 'default' database, got: {databases}"

        # Assumption 2: No duplicate feature names
        names = [row['name'] for row in query_result]
        duplicates = find_duplicates(names)
        assert not duplicates, \
            f"Duplicate feature names: {duplicates}"

        # Assumption 3: Feature count within limits
        assert len(query_result) <= MAX_FEATURES, \
            f"Feature count {len(query_result)} exceeds max {MAX_FEATURES}"

Use in production:

def generate_feature_file():
    query = get_feature_discovery_query()
    results = execute_query(query)

    # Validate assumptions BEFORE processing
    AssumptionValidator.validate_feature_query_assumptions(results)

    return build_feature_file(results)

If assumptions are violated, fail fast with a clear error message rather than silently producing bad data.

Cloudflare’s Commitments

Cloudflare publicly committed to four specific actions:

  1. Harden config file ingestion: Validate all machine-generated files
  2. Enable global kill switches: Every module gets a disable switch
  3. Prevent resource exhaustion: Rate-limit observability during incidents
  4. Review failure modes: Audit all error paths for graceful degradation

Universal Lessons

These lessons apply to any distributed system:

1. Validate Everything

Even your own generated data. Environments change, assumptions break.

2. Fail Gracefully

Every component should have: working mode → degraded mode → disabled mode.
Never: working mode → crash.

3. Quick Mitigation Beats Perfect Diagnosis

Kill switches and circuit breakers let you stop the bleeding while you figure out the root cause.
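
The _is_circuit_broken check from Lesson 3 only opens the circuit; a full breaker also recovers on its own. A minimal three-state sketch (closed → open → half-open), with illustrative thresholds:

import time

class CircuitBreaker:
    """Classic breaker: closed (normal), open (blocked), half-open (probing)."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        """Allow traffic when closed, or probe traffic after the timeout."""
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.reset_timeout:
            return True  # half-open: let requests probe for recovery
        return False

    def record_success(self):
        """A success while probing closes the circuit again."""
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        """Count failures; open (or re-open) the circuit at the threshold."""
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.time()

A manual kill switch still overrides any breaker: automation handles the common case, operators handle the surprises.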

4. Observability Has Cost

During incidents, observability can become a resource burden. Plan for this.

5. Test Assumptions

Document what your code assumes about its environment. Test those assumptions.

6. Defense in Depth

One missing layer shouldn’t cause catastrophic failure. Build multiple safety nets.

Conclusion

The Cloudflare outage teaches us that resilient systems require multiple layers of defense:

  • Input validation (even for machine-generated data)
  • Graceful degradation (fail open, not catastrophically)
  • Kill switches (fast mitigation without root cause fix)
  • Adaptive observability (reduce overhead during incidents)
  • Assumption testing (validate what code depends on)
  • Explicit documentation (make implicit assumptions visible)

One line of SQL caused a 6-hour outage because all these layers were missing. Build your systems so that one missing line causes a log warning, not a global outage.


Important

The Meta-Lesson: Great engineering teams aren't defined by avoiding failures; they're defined by learning from them and systematically preventing recurrence.

Series Complete: ← Back to Overview