Our services had reduced functionality for several hours on Tuesday, February 28, 2017. For example, plan sheets were not available on www.bidx.com, document contents were unavailable throughout the Doc Express service, and delivery of email notices was delayed for several hours. We apologize for the outage and are investigating ways to avoid or reduce similar outages in the future.
We would also like to explain the causes of the outage. We run our services on the Amazon Web Services (AWS) cloud. One of their offerings, Simple Storage Service (S3), had a complete outage in their Northern Virginia region for over two hours, followed by a partial outage for about an hour. Many other AWS services rely on S3 and experienced partial or total outages in Northern Virginia as a result. Those outages caused the problems in our own services. As AWS recovered, so did we.
Amazon has described the cause of the outage and the steps they are taking to prevent a recurrence at https://aws.amazon.com/message/41926/. The write-up is fairly technical, but it boils down to human error: a command used to maintain their service was entered with a typo. Because of the typo, a large portion of the infrastructure used for S3 was shut down instead of the small portion that was intended. The S3 service couldn’t function with so much of its infrastructure shut down, so AWS had to restart it, which took much longer than expected.
We continue to believe that using AWS is the best option for making our services secure and reliable. They have been operating since 2006 with fewer interruptions than occur in a typical data center, and they respond to the problems that do happen by engineering their systems to guard against a repeat. For example, they have modified their tools to refuse commands that would have too large an impact on running services, which should keep this particular problem from recurring.