Amazon’s recent 15-hour outage, which affected thousands of companies and millions of people, was caused by a DNS issue in which two automated systems updated the same data simultaneously. The root cause was a “latent defect” in DynamoDB’s DNS management system, which resulted in an empty DNS record for the service’s regional endpoint.
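The report only says that two automated systems updated the same data at the same time and left the endpoint’s record empty. The sketch below is a minimal, hypothetical model of that failure mode, not AWS’ actual DNS automation: two workers do an unsynchronized read-modify-write on the same record, and the second, stale write wipes out the first. All names and addresses here are invented for illustration.

```python
# Illustrative sketch only: NOT AWS's real DNS system, just a minimal model of
# how two unsynchronized automation workers can leave a shared record empty.

dns_table = {"dynamodb.us-east-1.example": ["10.0.0.1", "10.0.0.2"]}


def read_record(endpoint):
    """Simulate reading the current address set for an endpoint."""
    return list(dns_table.get(endpoint, []))


def write_record(endpoint, addresses):
    """Simulate a blind overwrite: last writer wins, no version check."""
    dns_table[endpoint] = addresses


endpoint = "dynamodb.us-east-1.example"

# Worker A (applies a new address plan) and worker B (retires old addresses)
# both read the record before either one writes it back.
snapshot_a = read_record(endpoint)   # ["10.0.0.1", "10.0.0.2"]
snapshot_b = read_record(endpoint)   # same, now-stale view

# Worker A applies its plan: replace the old addresses with a new one.
write_record(endpoint, ["10.0.0.3"])

# Worker B, still acting on its stale snapshot, retires every address it saw.
# Its blind write-back clobbers A's update and leaves the record empty.
retired = {"10.0.0.1", "10.0.0.2"}
cleaned = [ip for ip in snapshot_b if ip not in retired]  # -> []
write_record(endpoint, cleaned)

print(dns_table[endpoint])  # [] -- the endpoint now resolves to nothing
```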
The outage significantly disrupted many AWS services, including EC2 and Network Load Balancer. Cybersecurity risk analytics firm CyberCube has released a preliminary insured loss estimate of up to $581 million, though actual losses may be limited by AWS’ reimbursement policy for affected companies.
AWS is making changes to its systems, including fixing the “race condition scenario” that caused the two automated systems to overwrite each other’s work. The company will also build an additional test suite to detect similar bugs and will improve its throttling mechanisms.
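The article does not describe how AWS will fix the race condition, only that it will. A common, generic mitigation for this class of lost-update bug is a versioned compare-and-set write, sketched below under that assumption: a writer holding a stale snapshot is rejected instead of silently emptying the record. The class and method names are hypothetical.

```python
# Generic mitigation sketch (not AWS's stated fix): guard each write with a
# version check so a writer acting on a stale snapshot must retry rather than
# overwrite newer data.

class VersionConflict(Exception):
    """Raised when a writer's snapshot is older than the stored record."""


class DnsStore:
    def __init__(self):
        # endpoint -> (version, list of addresses)
        self._records = {}

    def read(self, endpoint):
        version, addresses = self._records.get(endpoint, (0, []))
        return version, list(addresses)

    def compare_and_set(self, endpoint, expected_version, addresses):
        current_version, _ = self._records.get(endpoint, (0, []))
        if current_version != expected_version:
            # The record changed since this writer read it; reject the write
            # so the caller re-reads and recomputes instead of losing data.
            raise VersionConflict(endpoint)
        self._records[endpoint] = (current_version + 1, list(addresses))


store = DnsStore()
store.compare_and_set("dynamodb.us-east-1.example", 0, ["10.0.0.1", "10.0.0.2"])

# Workers A and B both read version 1 of the record.
ver_a, _ = store.read("dynamodb.us-east-1.example")
ver_b, _ = store.read("dynamodb.us-east-1.example")

# A's new plan lands first and bumps the version to 2.
store.compare_and_set("dynamodb.us-east-1.example", ver_a, ["10.0.0.3"])

# B's stale cleanup is now rejected rather than emptying the record.
try:
    store.compare_and_set("dynamodb.us-east-1.example", ver_b, [])
except VersionConflict:
    print("stale write rejected; record still has addresses:",
          store.read("dynamodb.us-east-1.example")[1])
```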
A former AWS executive, Debanjan Saha, stated that the outage was “inevitable” due to the massive scale and complexity of AWS’ distributed systems. However, Saha emphasized the importance of having a clear strategy for resiliency, including thinking beyond a single provider and building for multi-region and multi-cloud or hybrid environments.
Overall, the AWS outage highlights the need for companies to prioritize availability and resilience in their cloud infrastructure strategies.
Source: https://www.crn.com/news/cloud/2025/amazon-s-outage-root-cause-581m-loss-potential-and-apology-5-aws-outage-takeaways