The Netflix New Year’s Eve outage is the second major streaming outage for the popular service this month, and the timing was unfortunate: the new disruption occurred just as Amazon posted its apology for the first incident.
The Netflix New Year’s Eve outage was reported heavily by frustrated subscribers on Twitter, many of whom had settled in with booze and pizza only to discover that Breaking Bad was not going to be part of the plan due to a service blip.
While Amazon was apologizing for the previous holiday outage, a second, perhaps smaller one was underway. The cause is unconfirmed, but demand may have outstripped the service’s capabilities on days of high traffic.
According to Amazon, the Christmas Eve outage that affected Netflix began at 12:24 PM PST on December 24. The apology begins:
“The service disruption began at 12:24 PM PST on December 24th when a portion of the ELB state data was logically deleted. This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example tracking all the backend hosts to which traffic should be routed by each load balancer). The data was deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment.”
I’ve learned a lot about myself this break–I mean, Netflix. I’ve learned a lot about Netflix this break.
— Greg Baumann (@glbaumann) December 31, 2012
“Unfortunately, the developer did not realize the mistake at the time. After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers. In this initial part of the service disruption, there was no impact to the request handling functionality of running ELB load balancers because the missing ELB state data was not integral to the basic operation of running load balancers.”
“We have made a number of changes to protect the ELB service from this sort of disruption in the future. First, we have modified the access controls on our production ELB state data to prevent inadvertent modification without specific Change Management (CM) approval.”
“Normally, we protect our production service data with non-permissive access control policies that prevent all access to production data. The ELB service had authorized additional access for a small number of developers to allow them to execute operational processes that are currently being automated. This access was incorrectly set to be persistent rather than requiring a per access approval.”
The statement concludes:
“Last, but certainly not least, we want to apologize. We know how critical our services are to our customers’ businesses, and we know this disruption came at an inopportune time for some of our customers. We will do everything we can to learn from this event and use it to drive further improvement in the ELB service.”
As of now, neither Netflix nor Amazon has addressed the reported New Year’s Eve outage, and it isn’t clear what may have precipitated the issue.
why do people feel that the only way to have fun is partying and doing drugs. have you even tried pizza and netflix? DAMN
— Monica Hernandez (@Monicahh4) December 31, 2012