Cloudflare's November outage traces to a single-column database permission change
A routine security improvement to a ClickHouse cluster doubled a bot-management feature file and propagated failure across Cloudflare's edge, masquerading for hours as a hyperscale DDoS.
The post-mortem reads like a mechanical diagram of modern infrastructure: a permissions change meant to make table access explicit caused a metadata query in a ClickHouse cluster to return duplicate rows. The duplicate rows doubled the size of a feature file the Bot Management system distributes to every edge machine in Cloudflare's network. The too-large file propagated. The edge started failing. From 11:20 UTC on 18 November 2025 until roughly 14:30, core traffic for a substantial share of the public internet was impaired.
The initial diagnosis was worse than the actual cause. Because the update was rolling out in stages, the system flipped between a good state and a bad state on a cadence of minutes, and the on-call teams spent the first hour of the incident convinced they were looking at a hyperscale DDoS. That misdiagnosis — detailed in Cloudflare's public post-mortem with unusual forthrightness — is arguably the most important detail in the report. An internal, correlated failure can masquerade as an external, adversarial failure for exactly as long as the observability stack cannot distinguish the two.
The specific technical lesson Cloudflare extracts is narrow: feature files consumed by production paths must be sanity-checked for size and shape at ingestion, not just at publication. The broader lesson is about blast radius. A modification to one metadata query in one database cluster should not, under any sensible architecture, be able to take down a large fraction of the company's edge. That it did is the kind of coupling that accumulates in a mature system over many years of shipped features and retired abstractions, and is extraordinarily difficult to retire without intentional work.
The winners are the enterprise buyers now using the post-mortem as an evidence exhibit in their next multi-CDN architecture review — a category of purchase that the industry has discussed for a decade and largely declined to fund. The losers are Cloudflare's competitive pitch around single-vendor simplicity, at least for the tier of customers whose risk tolerance moved during the four-hour window, and the broader assumption that edge providers had engineered past the class of incident that used to plague origin infrastructure.
What the outage forecloses is the comfortable idea that the edge is a solved problem. It is not; it is a system whose complexity is still growing faster than its defences against its own complexity. What the public post-mortem opens is something more useful: an industry-readable worked example of how a small permissions change becomes a global outage, documented with enough specificity for the next operator to avoid exactly this failure mode, even as they introduce the next one.
