20 Plus Years of Chronic Configuration and Change Challenges
Looking back over the years, many big names across industries have not only experienced painful outages but have also struggled to restore confidence in their services and reputations. A few patterns stand out from this look back:
- Complexity continues to increase.
- Human error continues to perplex organizations.
- Misconfigurations and unauthorized changes continue to occur.
- And even authorized changes are no guarantee that things are in order and systems are healthy.
Despite advances in infrastructure robustness, downtime still occurs. Moreover, despite large investments of personnel time and budget dollars, configuration errors resulting in application downtime have cost companies thousands and even millions of dollars. Infrastructure vulnerabilities, authorized and unauthorized changes, and even simple mistakes add to the risk of costly downtime events.
And despite years of examples of what not to do, advance knowledge of what to do, and technological advancements to aid the situation, we are still struggling to gain the upper hand over these vulnerabilities. The disturbing truth is that downtime still haunts most organizations, and outages are becoming longer and more expensive per hour.
In 2013, when this blog was originally written, the most common cause of downtime was "systems failures, followed by human error and then natural disasters." Today, the order is not much different; however, natural disasters take a back seat to technology complexity and software bugs, now introduced in our pursuit of agile development and the perfect customer experience.
IT teams face many issues that they must stay on top of to maintain top performance and availability. We, however, see that news headlines have remained consistent over the years. Change and configuration management challenges are chronic, and while the names of the companies impacted have changed (somewhat), the bottom line remains the same: these challenges lead to critical operational issues and have not gone away. One of the most immediate costs of system downtime is damage to corporate image. While the impact varies greatly by business, for some companies the damage goes beyond monetary valuation.
Take a look at these representative examples, compiled from the last 20-plus years, of failures stemming from infrastructure or application issues that spiraled out of control due to configuration changes.
2023. Configuration Glitch Caused Microsoft's Jan. 25th Exchange Online Disruption
The five-hour service incident affected Exchange Online as well as other Microsoft 365 services, and was apparently unrelated to a three-hour service interruption the day before.
2022. Microsoft Confirms Misconfiguration of Azure Blob Storage Resulted in Leak
Microsoft confirms it inadvertently exposed information related to thousands of customers following a security lapse that left an endpoint accessible over the internet due to a misconfiguration.
2021. Facebook's Giant Outage: Says a Configuration Issue Knocked its Social Media Apps Offline
The six-hour outage in October that knocked out Facebook, Instagram, and WhatsApp for billions of users was blamed on a faulty configuration change on the backbone routers responsible for transmitting traffic among its data centers. The outage likely cost the company over $60 million.
2020. Google Experiences Major, Albeit Short Outage.
Google had a major service outage – but this time, it only lasted about 45 minutes. It still impacted users worldwide, as Google Search, Gmail, YouTube, Google Calendar, and third-party applications and authentication were all affected. The issue was blamed on a storage capacity setting for the company's authentication services.
2019. Capital One Suffers Breach Due to Misconfigured Firewall
The Capital One breach was directly attributed to a misconfigured firewall that left one of their cloud servers vulnerable and allowed a hacker to access sensitive data. Over 100 million customers had their information compromised in this breach.
2018. SolarWinds “Sunburst” Attack Due to Misconfiguration
In a 2020 SEC filing, SolarWinds stated that it had been made aware of a cyberattack that inserted a backdoor into its monitoring software, which is used extensively by US federal agencies. The attack was ongoing in January 2019, so it can be assumed to have begun in 2018 or earlier.
2017. Amazon S3 Outage
Bottom line: The system went down because of a typo. A small mistake that only took 4 hours to track down, but one that had a huge impact on customer checkout processes. The damage was significant, impacting both profits and reputation.
2016. Dyn Suffers One of the Biggest DDoS Attacks in History
The attack happened in three waves and was without question one of the biggest DDoS attacks in history. The major DNS provider was overwhelmed, leaving internet users unable to access customer platforms such as Twitter and Spotify.
2015. United Airlines Grounded for 2 Hours
A router misconfiguration caused a major business impact for airline giant United Airlines in July. After grounding 90 aircraft at US airports for over two hours, United experienced flight and customer disruption as well as negative reputational impact.
2014. Time Warner Cable Internet Outage
During an overnight network maintenance activity, an erroneous configuration was propagated throughout the company's national backbone, resulting in a network outage.
2013. Misconfiguration Strikes Again, Setting Off Google Apps Outage.
The problem, which lasted for about three hours on Wednesday morning, occurred when the main user authentication system for Google applications was misconfigured.
2012. Critical Change Leaves Facebook Out of Reach.
Facebook went down due to a change made to the infrastructure. In complex dynamic ecosystems, such as Facebook's IT infrastructure, change happens a lot. On any given day, infrastructure is being upgraded, patches are being installed, automated processes are running that alter files, and system environments and configurations are also manually being changed. Sometimes these activities are performed correctly and ... sometimes they're not.
2012. Merging United and Continental Computer Systems Grounds Passengers.
In one of the final steps involved in merging the two airline companies, United reported technical issues after its Apollo reservations system was switched over to Continental's Shares program. United struggled through at least three days of higher call volumes after the meshing of the systems and websites caused problems with some check-in kiosks and frequent-flier mileage balances. The glitch was another in a long string of technology problems that began in March.
2012. GMail Crashes Following Configuration Change.
According to Google, the outage occurred when a component that the Google Sync Server relies on to enforce quotas on per-datatype sync traffic failed. The quota service "experienced traffic problems today due to a faulty load balancing configuration change."
2011. Amazon outage sends prominent Websites offline, including Quora, Foursquare, and Reddit.
Amazon released a detailed postmortem and mea culpa about the partial outage of its cloud services platform last week and identified the culprit: A configuration error made during a network upgrade.
2010. Massive failure knocks Singapore's DBS Bank off the banking grid for seven hours.
A faulty component within the disk storage subsystem serving the bank's mainframe had been generating periodic alert messages, prompting a job scheduled for 3 a.m. that fateful day to replace it. The situation spiraled out of control as a direct result of human error during the routine operation.
2009. Widespread trouble with Google Apps service.
Google Search and Google News performance slowed to a crawl, while an outage seemed to spread from Gmail to Google Maps and Google Reader. Comments about the failure were flying on Twitter, with "googlefail" quickly becoming one of the most searched terms on the popular micro-blogging site.
2008. Gmail outage lasts about 30 hours.
The first problem reports started appearing in the official Google Apps discussion forum around mid-afternoon Wednesday. At around 5 p.m. that day, Google acknowledged that the company was aware of a problem preventing Gmail users from logging into their accounts and that it expected a solution by 9 p.m. on Thursday.
2007. Skype is down.
Skype advised that its engineering team had determined that the downtime was due to a software issue, with the problem expected to be solved "within 12 to 24 hours."
2007. Some Amazon EC2 customer instances were terminated and unrecoverable.
A software deployment caused management software to erroneously terminate a small number of user instances. When monitoring detected this issue, the EC2 management software and APIs were disabled to prevent further terminations.
2005. Faulty database derails Salesforce.com.
A Salesforce.com outage lasting nearly a day cut off access to critical business data for many of the company's customers in what appeared to be Salesforce's most severe service disruption to date.
2004. Unscheduled software upgrade grinds UK's Department for Work and Pensions to a halt.
Some 40,000 computers in the UK's Department for Work and Pensions (DWP) were unable to access their network last month when an IT technician erroneously installed a software upgrade.
2003. Glitch in upgrade of AT&T Wireless CRM system causes a break.
AT&T Wireless Services Inc. this week faced the software nightmare every IT administrator fears: An application upgrade last weekend went awry, taking down one of the company's key account management systems.
2002. New high-bandwidth application triggers outage.
The network grew very quickly due to business changes and was never redesigned to cope with the much larger scale and new application requirements.
2001. For 12 hours, TD Canada Trust's 13 million customers couldn't touch their money
TD Canada Trust ran full-page apologies in newspapers across the country Monday, saying it was sorry for a weekend computer crash that left millions of its customers unable to access their accounts. The bank said the outage was caused by "a rare and isolated hardware problem".
2000. Software-upgrade glitch leaves flights on the tarmac.
The Federal Aviation Administration (FAA) had called a halt on all flights scheduled to land at or depart from Los Angeles International Airport for four hours that morning. Technicians loading an upgrade to radar software at the Los Angeles air-traffic control center caused a mainframe host computer to crash.
Don't let this be your organization. Contact Evolven today to see how we can help you get in front of misconfigurations, unauthorized changes, and more.