AWS Apologizes For Cloud Outage, Blames Typo
Amazon Web Services said Thursday that an employee trying to debug its billing system entered a command incorrectly, causing the four-hour outage that disrupted service to Amazon S3 customers around the world earlier this week.
In a postmortem detailing the events that led to Tuesday's high-profile disruption, AWS apologized for the impact on customers and outlined the changes it is making to prevent the problem from occurring again.
The crash started at 9:37 a.m. PT with a bad command entered during a debugging process on the S3 billing system in the provider's Virginia data center, which serves the eastern United States.
[Related: Partners: AWS Outage Puts Spotlight On Private Cloud Advantages]
An employee, using "an established playbook," intended to pull down a small number of servers that hosted subsystems for the billing process. Instead, the mistyped command took a far broader swath of servers offline, including those supporting a subsystem needed to serve requests for stored data and another that allocates new storage, AWS said in the postmortem.
"Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests," the postmortem said.
The restart process then took longer than expected, largely due to the cloud's growth in recent years, AWS said.
"While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected," AWS said.
Even after restarts are completed, problems typically persist for a while longer because of backlogs created during the downtime, said Mat Ellis, CEO of Cloudability, a cloud management software vendor.
Think of a bank of phone lines, Ellis said.
"Normally the phone lines receive five calls a minute. The phone system went down, and callers get an engaged tone, so they start calling back. After one minute there are 10 callers, after five minutes 25 callers, all trying to call the five lines."
That kind of scenario, with automated clients instead of human callers, is what happened at AWS. After the restart, automated systems across Amazon's massive customer base all tried at once to resume stalled processes, such as uploading files to or downloading them from S3. It takes time for a provider to work through that backlog and resume normal operations.
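To make the arithmetic behind that backlog concrete, here is a minimal sketch in Python using only the illustrative numbers from Ellis' phone-line analogy, not actual AWS figures: the queue grows while the service is down, and afterward it drains only as fast as there is spare capacity to absorb it.

# A minimal sketch of the backlog math in Ellis' analogy, using the made-up
# numbers from the quote (not AWS figures): 5 new calls arrive per minute and
# every unanswered caller keeps retrying while the lines are down.
def backlog_after_outage(arrival_rate=5, service_rate=5, outage_minutes=5):
    """Return the queue size when service resumes and how long it takes to drain."""
    # While the system is down, nothing is served, so the queue grows linearly.
    backlog = arrival_rate * outage_minutes          # 25 callers after 5 minutes

    # After recovery, the queue shrinks only by the spare capacity each minute.
    spare_capacity = service_rate - arrival_rate
    if spare_capacity <= 0:
        return backlog, float("inf")                 # never catches up without extra capacity
    return backlog, backlog / spare_capacity

print(backlog_after_outage())                        # (25, inf): capacity exactly matches new arrivals
print(backlog_after_outage(service_rate=10))         # (25, 5.0): with headroom, about 5 minutes to catch up

The point of the toy numbers is simply that recovery is not instant: a provider needs capacity beyond its normal arrival rate, or extra time, to clear the requests that piled up while it was down.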
Amazon has already modified the tool used to pull down the intended servers, which allowed "too much capacity to be removed too quickly," the statement said.
The tool has been updated to remove capacity more slowly, and safeguards have been added to prevent servers from being removed when doing so would take any subsystem below its minimum required level of capacity.
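AWS has not published the tool itself, but the safeguard it describes amounts to a pre-flight check before any capacity is removed. The sketch below is a hypothetical Python illustration of that idea; the subsystem names, fleet sizes, and minimum-capacity figures are invented for the example and are not AWS's.

# Hypothetical illustration of a "minimum capacity" safeguard on a server-removal
# tool; none of these names or numbers come from AWS.
class CapacityGuardError(Exception):
    pass

# Invented example fleet: subsystem name -> servers currently in service and the floor it must keep.
FLEET = {
    "index-subsystem": {"in_service": 200, "minimum": 180},
    "placement-subsystem": {"in_service": 80, "minimum": 70},
}

def remove_servers(subsystem: str, count: int, batch_size: int = 2) -> None:
    """Remove servers in small batches, refusing any request that would drop below the floor."""
    pool = FLEET[subsystem]
    if pool["in_service"] - count < pool["minimum"]:
        raise CapacityGuardError(
            f"refusing to remove {count} servers from {subsystem}: "
            f"{pool['in_service'] - count} would remain, minimum is {pool['minimum']}"
        )
    remaining = count
    while remaining > 0:
        step = min(batch_size, remaining)
        pool["in_service"] -= step                   # remove capacity a little at a time
        remaining -= step
        # A real tool would run health checks between batches before continuing.

remove_servers("placement-subsystem", 5)             # fine: 75 servers remain, above the 70 floor
try:
    remove_servers("index-subsystem", 50)            # 150 would remain, below the 180 floor
except CapacityGuardError as err:
    print(err)                                       # the guard rejects the request outright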
"We are making several changes as a result of this operational event," said AWS.
Tuesday's outage took down many popular websites and enterprise services like Slack and Trello that rely on the public cloud giant. That not only embarrassed the cloud leader, but delivered a marketing windfall to rivals.
"This is going to force a Google conversation. They just handed Google thousands of customers," Jamie Shepard, senior vice president at Google partner Lumenate, told CRN earlier this week.