2025-10-23 Summary of the Amazon DynamoDB Service Disruption in Northern Virginia (US-EAST-1) Region

https://aws.amazon.com/message/101925/

Amazon Web Services experienced a major service disruption in the Northern Virginia (us-east-1) Region from October 19, 2025, 11:48 PM PDT to October 20, 2025, 2:20 PM PDT. The outage impacted DynamoDB, EC2, Network Load Balancers (NLB), and multiple dependent AWS services.

Key Points

Primary Cause
- A latent race condition in DynamoDB’s DNS management system caused an empty DNS record for the regional endpoint, preventing new connections.
- Manual operator intervention was required because the system entered an inconsistent state that DNS automation could not resolve.
Service Impact Timeline
- DynamoDB (11:48 PM – 2:40 AM)
  - DNS resolution failures caused API errors and prevented new connections.
  - Global table replication continued but with delays to/from us-east-1.
  - Full recovery at 2:40 AM.
- EC2 (11:48 PM – 1:50 PM)
  - Instance launches failed due to DWFM (DropletWorkflow Manager) lease expirations caused by failed DynamoDB-dependent state checks.
  - Network configuration propagation was delayed, leading to connectivity issues. Full recovery at 1:50 PM.
- Network Load Balancer (5:30 AM – 2:09 PM)
  - Health check failures occurred because new EC2 instances lacked network configuration.
  - Triggered failovers, DNS removals, and connection errors. Stabilized after disabling auto failover.
- Other Services
  - Lambda, ECS/EKS, Fargate, STS, Redshift, and Amazon Connect experienced errors due to dependencies on DynamoDB, EC2 launches, and NLB health checks.
  - Most services recovered by early afternoon, with Redshift cluster availability fully restored by October 21, 4:05 AM PDT.
Remediation and Future Improvements
- DynamoDB DNS automation globally disabled until the race condition is fixed and safeguards are added.
- EC2 will implement better throttling and DWFM recovery tests.
- NLB will add velocity controls for AZ failover events.
- AWS is enhancing scale testing and cross-service dependency resilience to shorten recovery times.

Mermaid Sequence Diagram of the Incident

This incident demonstrates the cascading impact of interdependent cloud services when a single DNS automation failure affects core APIs, compute instances, and network load balancers across multiple services.

Previous2025-10-20 Inside the breach that broke the internet - The untold story of Log4Shell Next2025-11-01 Why one of the world’s most brilliant AI scientists left the US for China

Last updated 3 months ago

Was this helpful?

Good evening

hashtagKey Points

hashtagMermaid Sequence Diagram of the Incident

Key Points

Mermaid Sequence Diagram of the Incident