What went wrong

Details are still coming out, but as we understand it, Amazon's EBS file storage system in its US East region failed spectacularly. The Amazon status page puts it this way:
"A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances."

In other words, problems with some EBS volumes cascaded and grew until all EBS volumes were taken down. The capacity issues compounded in such a way that it was extremely time-consuming to restore the system to health. In fact, now, 40 hours after the first problem, Amazon is still in the recovery process, with no immediate end in sight.

Zencoder uses Amazon in two ways. First, we encode most of our video using EC2. We're big proponents of EC2, and believe that EC2 is a great system for large-scale video encoding. It's important to note that the Amazon outage didn't affect our ability to run encoding servers; if it weren't for the next point, we would have had no problems yesterday. Second, we run our dashboard and API on Amazon EC2, and depend on EBS. This was the cause of the problem yesterday. The EBS outage took down our database, and rebuilding took significantly longer than anyone expected. We worked to get this back online, but Amazon provided very little information, and eventually it became clear we had no idea if, when, or how service would be restored.

EBS was a single point of failure for us. Not in the sense that we couldn't tolerate a single EBS failure; the failure of a running EBS volume wouldn't have caused problems like this. Rather, our point of failure was reliance on the EBS system as a whole. We anticipated "what if our EBS volume has trouble?", but not "what if the entire EBS platform becomes unavailable?".
Beyond that, we didn't have the right plan in place to deal with a catastrophic outage. We do now, and more on that in a bit.