A Single User Error Causes Cloud-wide Failure at Joyent
Cloud provider Joyent suffered an outage that caused severe inconvenience to its customers. Whatever precautions are in place, data centers and clouds remain susceptible to accidents from within, as happened recently when an operator error brought down a data center run by Joyent. Joyent provides public and private cloud infrastructure services for companies that need more computing horsepower than the mainstream Infrastructure-as-a-Service providers, such as Amazon Web Services, can offer.
InfoWorld declared that "an error of this magnitude shouldn't be allowed to happen."
Data Center Knowledge reported: "While human error was at fault, Joyent's system ideally would have been built to withstand such errors."
The Register stated that a "Fat-fingered admin downs entire Joyent data center."
In my exp. a lot of distributed sys failures are like this Joyent outage: people making mistakes w/ complex tools http://t.co/OV33OdioYy
— Jay Kreps (@jaykreps) May 29, 2014
So What Caused the Outage?
According to the postmortem published by Joyent, "due to an operator error, all us-east-1 API systems and customer instances were simultaneously rebooted at 2014-05-27T20:13Z (13:13PDT). Rounded to minutes, the minimum downtime for customer instances was 20 minutes, and the maximum was 149 minutes (2.5 hours). 80 percent of customer instances were back within 32 minutes, and over 90 percent were back within 59 minutes. The instances that took longer than others were due to a few independent isolated problems which are described below."
Going deeper into the cause, the postmortem concluded: "Root cause of this incident was the result of an operator performing upgrades of some new capacity in our fleet, and they were using the tooling that allows for remote updates of software. The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter. Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay."
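To make the missing safeguard concrete, here is a minimal sketch of the kind of input validation the postmortem says the tool lacked. It is not Joyent's tooling: the function `plan_reboot`, the `max_fraction` threshold, and the `ConfirmationRequired` exception are hypothetical names chosen for illustration. The idea is simply that a command touching a large share of the fleet should be refused unless the operator confirms it explicitly.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


class ConfirmationRequired(Exception):
    """Raised when a command targets a large share of the fleet without confirmation."""


@dataclass(frozen=True)
class RebootPlan:
    targets: Tuple[str, ...]
    fraction_of_fleet: float


def plan_reboot(
    requested: Iterable[str],
    fleet: Iterable[str],
    max_fraction: float = 0.10,  # hypothetical threshold, not from the postmortem
    confirmed: bool = False,
) -> RebootPlan:
    """Validate a reboot request before anything is sent to a server."""
    fleet_set = set(fleet)
    requested_set = set(requested)

    if not requested_set:
        raise ValueError("no servers selected")

    # Reject names that are not part of the known fleet (likely typos).
    unknown = sorted(requested_set - fleet_set)
    if unknown:
        raise ValueError(f"unknown servers: {unknown}")

    fraction = len(requested_set) / len(fleet_set)

    # A wide blast radius needs an explicit, separate confirmation step.
    if fraction > max_fraction and not confirmed:
        raise ConfirmationRequired(
            f"request covers {fraction:.0%} of the fleet "
            f"({len(requested_set)} of {len(fleet_set)} servers); "
            "re-run with confirmed=True to proceed"
        )

    return RebootPlan(tuple(sorted(requested_set)), fraction)
```

Under these assumptions, a request for two new servers out of a hundred passes straight through, while a mistyped selection that expands to the whole fleet stops with `ConfirmationRequired` instead of issuing reboots without delay.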
Operator error takes down joyent http://t.co/3vioajGWLb #cloud #outage #amazon #ceo #startup
— Sean Hull (@hullsean) June 3, 2014
Today's IT Operations
Just like any IT system, cloud-based services and servers can suffer outages, but because of the large number of users, the consequences are usually larger. This is especially evident in the data center, where change is a constant. With the recent headline-grabbing outages, there is concern and hesitation about relying on public clouds. In the age of the cloud, IT environments are growing increasingly complex, and simply trying to understand what is happening within the environment at any given time is a major challenge. The problem, of course, is that not only does each update change the parameters of applications, but the application workloads also increasingly move from one virtual machine to another.
IT Operations Analytics
This is not unique to Joyent: today's IT operations organizations face new levels of challenges that can no longer be handled with traditional approaches. This means applying some more serious brain power to help deal with the complexity and dynamics of today's IT environments. IT Operations Analytics delivers the intelligence that IT operations organizations are craving, allowing them to turn piles of IT operations data into actionable information.
In our upcoming webinar, we explore how "Gartner recognized IT Operations Analytics as an area on the rise with high impact on IT operations."