Cloudflare's Latest Outage: A Cautionary Tale of Global Configuration Changes

Introduction

The recent Cloudflare outage, which came just two weeks after a previous major outage, is a stark reminder of the dangers of global configuration changes. This article examines the cause of the latest outage, looks at the broader pattern of global configuration errors, and explains why staged configuration rollouts matter.

The Latest Outage

On December 5th, Cloudflare suffered a 25-minute global outage that affected approximately 28% of its HTTP traffic. The trigger was a seemingly innocuous global configuration change made while rolling out a fix for a React security vulnerability. The fix caused an error in an internal testing tool, and the global killswitch used to disable that tool triggered a bug that resulted in HTTP 500 errors across Cloudflare's network.

What Went Wrong

The sequence of events leading to the outage was as follows:

  • Cloudflare rolled out a fix for a React security vulnerability
  • The fix caused an error in an internal testing tool
  • The Cloudflare team disabled the testing tool with a global killswitch
  • That killswitch, itself a global configuration change, unexpectedly triggered a bug, resulting in HTTP 500 errors
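
To make the contrast with a global change concrete, here is a minimal sketch of a staged rollout. It is not Cloudflare's actual tooling: the stage fractions, the `staged_rollout` function, and the `apply_config`, `revert_config`, and `error_rate` callbacks are all hypothetical names chosen for illustration. The idea is only that a change reaches a small slice of servers first and is reverted if errors rise, rather than reaching the whole fleet at once.

```python
from typing import Callable, List

# Hypothetical stages: cumulative fraction of the fleet that has the change.
STAGES = [0.01, 0.10, 0.50, 1.00]


def staged_rollout(
    servers: List[str],
    apply_config: Callable[[str], None],
    revert_config: Callable[[str], None],
    error_rate: Callable[[List[str]], float],
    max_error_rate: float = 0.01,
) -> bool:
    """Apply a config change in growing stages; revert if errors spike.

    Returns True if the change reached every server, False if it was reverted.
    """
    updated: List[str] = []
    for fraction in STAGES:
        # Number of servers that should have the change after this stage.
        target = max(1, int(len(servers) * fraction))
        for server in servers[len(updated):target]:
            apply_config(server)
            updated.append(server)
        # Health check (assumed to be supplied by the caller) gates each stage.
        if error_rate(updated) > max_error_rate:
            for server in updated:
                revert_config(server)
            return False
    return True
```

With 100 servers and a change that breaks every server it touches, this sketch applies the change to one server, sees the error rate exceed the threshold, and reverts. A global killswitch, by contrast, has no first stage in which the problem can be caught.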

A Pattern of Global Configuration Errors

This latest outage is not an isolated incident. Several high-profile outages in recent years have been caused by global configuration errors.