Google Cloud suffered a major outage last week, affecting customers for at least three hours. The incident was caused by the company’s failure to follow its usual code quality protections.
According to Google's incident report, a new feature added to Service Control, a core component of Google's API management and control planes, caused the binary to crash when it hit an unhandled null value. The code change was rolled out region by region, but it was not gated behind the feature flags and error handling that would normally have caught the failure before it spread.
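The failure mode described above, where a new code path assumes a field is always populated and takes down the whole process when it is not, can be illustrated with a minimal sketch. This is hypothetical code, not Google's: the function names, the `policy` dictionary shape, and the `feature_enabled` flag are all invented for illustration.

```python
# Hypothetical sketch (not Google's actual code): an unguarded code path
# versus one protected by error handling and a feature flag.

def check_quota_unsafe(policy: dict) -> bool:
    # Assumes "quota" is always present and non-None; malformed policy
    # data raises an exception here and crashes the serving binary.
    return policy["quota"].get("limit", 0) > 0

def check_quota_safe(policy: dict, feature_enabled: bool = False) -> bool:
    # Feature flag: the new path stays off by default during rollout,
    # so a bad code path cannot fire everywhere at once.
    if not feature_enabled:
        return True
    quota = policy.get("quota")
    if quota is None:
        # Fail open (and log) instead of crashing on missing data.
        return True
    return quota.get("limit", 0) > 0
```

The point of the sketch is that defensive handling turns malformed input into a degraded-but-alive service, while the flag limits the blast radius of a new feature to wherever it has been deliberately enabled.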
The outage rippled beyond Google's own customers to third-party services that depend on Google Cloud, including Cloudflare. Google's Site Reliability Engineering team identified the root cause within 10 minutes and began recovery within 40 minutes. However, a "herd effect", in which restarting tasks overwhelmed the underlying infrastructure, stretched full recovery to almost three hours.
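The "herd effect" mentioned in the report is the classic thundering-herd problem: many tasks or clients restart at the same moment and hammer a shared backend in lockstep. The standard mitigation is randomized exponential backoff, sketched below; the function name and parameters are illustrative, not anything from Google's systems.

```python
import random

# Illustrative sketch: "full jitter" exponential backoff, the common
# defense against a herd effect where synchronized retries overload a
# recovering backend.
def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Return a randomized delay in seconds for the given retry attempt.

    The exponential term spaces retries out over time; the random jitter
    de-synchronizes clients so they do not all retry at the same instant.
    """
    exp = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, exp)
```

Each retry waits somewhere between zero and an exponentially growing (but capped) ceiling, spreading load on the backend instead of concentrating it into spikes.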
Google has promised to improve external communication with customers during outages, so that they receive information quickly enough to manage their own systems and support their own customers. The commitment is an acknowledgment that the company did not provide sufficient information during this incident.
Moreover, Google recognizes that it cannot afford outages of this magnitude and plans changes to prevent similar incidents in the future. Among them, the company will harden its monitoring and communication infrastructure so that it remains operational, and business continuity is preserved, even when Google Cloud and its primary monitoring products are down.
Source: https://www.theregister.com/2025/06/16/google_cloud_outage_incident_report