
20 Plus Years of Chronic Configuration and Change Challenges


About

This content is brought to you by Evolven. Evolven Change Analytics is a unique AIOps solution that tracks and analyzes all actual changes carried out in the enterprise cloud environment. Evolven helps leading enterprises cut the number of incidents, slash troubleshooting time, and eliminate unauthorized changes. Learn more

Looking back over the years, many big names across industries have not only experienced painful outages but have also had to struggle to restore confidence in their services and reputations. A few things this blog makes evident:

  • Complexity continues to increase.
  • Human error continues to perplex organizations.
  • Misconfigurations and unauthorized changes continue to occur.
  • And even authorized changes do not guarantee that systems are healthy.

Despite advances in infrastructure robustness, downtime still occurs. Moreover, despite large investments of personnel time and budget dollars, configuration errors resulting in application downtime have cost companies thousands, and even millions, of dollars. Infrastructure vulnerabilities, authorized and unauthorized changes, and even simple mistakes add to the risk of costly downtime events.

And despite years of examples of what not to do, accumulated knowledge of what to do, and technological advances to help, we are still struggling to gain the upper hand over these vulnerabilities. The disturbing truth is that downtime still haunts most organizations, and outages are becoming longer and more expensive per hour.

In 2013, when this blog was originally written, the most common cause of downtime was "systems failures, followed by human error and then natural disasters." Today the order is not much different; however, natural disasters now take a back seat to technology complexity and software bugs, introduced in our pursuit of agile development and the perfect customer experience.

IT teams face many issues that they must stay on top of to maintain top performance and availability. Yet the news headlines have remained consistent over the years. Change and configuration management challenges are chronic: while the names of the companies impacted have changed (somewhat), the bottom line remains the same. These challenges lead to critical operational issues, and they have not gone away. One of the most immediate costs of system downtime is damage to corporate image, and while that damage varies greatly by business, for some companies it goes beyond monetary valuation.

Take a look at these representative examples we have compiled from the last 20-plus years, each a case in which an infrastructure or application issue spiraled out of control because of configuration changes.

2023. Configuration Glitch Caused Microsoft's Jan. 25th Exchange Online Disruption

The five-hour service incident affected Exchange Online as well as other Microsoft 365 services, and was apparently distinct from a three-hour service interruption the day before.

Microsoft Configuration Glitch

2022. Microsoft Confirms Misconfiguration of Azure Blob Storage Resulted in Leak

Microsoft confirms it inadvertently exposed information related to thousands of customers following a security lapse that left an endpoint accessible over the internet due to a misconfiguration.

Microsoft Confirms Server Misconfiguration Led to 65,000+ Companies’ Data Leak 

2021. Facebook's Giant Outage: A Configuration Issue Knocked Its Social Media Apps Offline

The six-hour outage in October that knocked out Facebook, Instagram, and WhatsApp for billions of users was blamed on a faulty configuration change to the backbone routers that coordinate traffic between its data centers. This outage likely cost the company over $60 million.

Facebook’s Giant Outage: This Change Caused All the Problems
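What would it take to catch a change like this before it propagates? Below is a minimal, hypothetical sketch in Python of a pre-change audit that refuses to apply a backbone change that would leave no data center reachable. The function and data structures are invented for illustration; this is not Facebook's actual tooling.

```python
# Hypothetical pre-change audit: reject any backbone change that would
# disconnect every data center. Names and structures are illustrative only.

def audit_backbone_change(current_links, links_to_disable):
    """current_links / links_to_disable: sets of (datacenter, uplink) pairs."""
    remaining = current_links - links_to_disable
    reachable_dcs = {dc for dc, _ in remaining}
    if not reachable_dcs:
        raise RuntimeError("Change rejected: it would disconnect all data centers.")
    return remaining

# Simulate a 'routine' maintenance command that disables every uplink.
links = {("dc1", "uplink-a"), ("dc1", "uplink-b"), ("dc2", "uplink-a")}
try:
    audit_backbone_change(links, links)
except RuntimeError as err:
    print(err)  # the audit blocks the change before it can propagate
```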

2020. Google Experiences Major, Albeit Short, Outage.

Google had a major service outage, but this time it lasted only about 45 minutes. It still impacted users worldwide, as Google Search, Gmail, YouTube, Google Calendar, and third-party applications relying on Google authentication were all affected. The issue was blamed on a storage capacity setting for the company's authentication services.

Google Suffers Global Outage

2019. Capital One Suffers Breach Due to Misconfigured Firewall

The Capital One breach was directly attributed to a misconfigured firewall that left one of the company's cloud servers vulnerable and allowed a hacker to access sensitive data. Over 100 million customers had their information compromised in this breach.

What We Can Learn from the Capital One Hack
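Misconfigurations of this class are mechanically detectable. As a rough illustration (in Python, using an invented rule format rather than any real firewall's configuration), a scan for rules that expose sensitive ports to the entire internet might look like this:

```python
# Illustrative misconfiguration scan over a hypothetical rule format.
SENSITIVE_PORTS = {22, 3389, 9000}  # e.g. SSH, RDP, an internal admin service

def find_overly_permissive(rules):
    """Return rules that allow the whole internet (0.0.0.0/0) on a sensitive port."""
    return [
        r for r in rules
        if r["source"] == "0.0.0.0/0" and r["port"] in SENSITIVE_PORTS
    ]

rules = [
    {"name": "web", "source": "0.0.0.0/0", "port": 443},     # fine: public HTTPS
    {"name": "admin", "source": "0.0.0.0/0", "port": 9000},  # flagged
]
for bad in find_overly_permissive(rules):
    print(f"ALERT: rule '{bad['name']}' exposes port {bad['port']} to the internet")
```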

2018. SolarWinds “Sunburst” Attack Due to Misconfiguration

In a 2020 SEC filing, SolarWinds stated that it had been made aware of a cyberattack that inserted a backdoor into its Orion monitoring software, which is used extensively by US federal agencies. The attack was already ongoing in January 2019, so it can be assumed to have begun in 2018 or earlier.

2017. Amazon S3 Outage

Bottom line: the system went down because of a typo. It was a small mistake that took only four hours to track down, but one that had a huge impact on customer checkout processes. The damage was significant, hitting both profits and reputation.

After the Retrospective: Amazon S3 2017 Outage
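The lesson AWS itself drew was that operational tools should refuse inputs that remove too much capacity at once, even from an authorized operator. Here is a minimal guardrail sketch in Python; the function and thresholds are illustrative, not Amazon's actual tooling.

```python
# Hypothetical guardrail: cap how much capacity one command can remove.
MAX_REMOVAL_FRACTION = 0.05  # never take out more than 5% in a single command

def remove_servers(fleet_size, count_to_remove):
    """Validate a capacity-removal request before executing it."""
    if count_to_remove > fleet_size * MAX_REMOVAL_FRACTION:
        raise ValueError(
            f"Refusing to remove {count_to_remove} of {fleet_size} servers: "
            f"exceeds the {MAX_REMOVAL_FRACTION:.0%} per-command safety limit."
        )
    print(f"Removing {count_to_remove} servers...")

remove_servers(1000, 20)       # a routine removal succeeds
try:
    remove_servers(1000, 200)  # a typo'd input is caught, not executed
except ValueError as err:
    print(err)
```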

2016. Dyn Suffers the Biggest DDoS Attack in History

The attack happened in three waves and was without doubt one of the biggest DDoS attacks in history. The major DNS provider was overwhelmed, leaving internet users unable to reach platforms such as Twitter and Spotify.

DDoS Attacks on DYN  

2015. United Airlines Grounded for 2 Hours

A router misconfiguration caused a major business impact to airline giant United Airlines in July. With 90 aircraft grounded at US airports for over two hours, United experienced flight and customer disruption as well as negative reputational impact.

To Err Is Human; To Automate, Divine

2014. Time Warner Cable Internet Outage

During an overnight network maintenance activity, an erroneous configuration was propagated throughout the company's national backbone, resulting in a network outage.

Here's What Caused Time Warner Cable's Massive Internet Outage

2013. Misconfiguration Strikes Again, Setting Off Google Apps Outage.

The problem, which lasted for about three hours on Wednesday morning, occurred when the main user authentication system for Google applications was misconfigured. 

Google outages this week blamed on sign-in system 

2012. Critical Change Leaves Facebook Out of Reach.

Facebook went down due to a change made to the infrastructure. In complex dynamic ecosystems, such as Facebook's IT infrastructure, change happens a lot. On any given day, infrastructure is being upgraded, patches are being installed, automated processes are running that alter files, and system environments and configurations are also manually being changed. Sometimes these activities are performed correctly and ... sometimes they're not. 

Facebook Down Following Infrastructure Change


2012. Merging United and Continental Computer Systems Grounds Passengers.

In one of the final steps involved in merging the two airline companies, United reported technical issues after its Apollo reservations system was switched over to Continental's Shares program. United struggled through at least three days of higher call volumes after the meshing of the systems and websites caused problems with some check-in kiosks and frequent-flier mileage balances. The glitch was another in a long string of technology problems that began in March.

The United/Continental Merger: System-wide Outages, Handwritten Boarding Passes 

2012. GMail Crashes Following Configuration Change.

According to Google, the outage occurred when Google's Sync Server, which relies on a component to enforce quotas on per-datatype sync traffic, failed. The quota service "experienced traffic problems today due to a faulty load balancing configuration change."

Worldwide Gmail, Chrome crash caused by sync server error 
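This failure raises a recurring design question: what should a service do when the component it depends on for quota decisions becomes unreachable? Here is a rough sketch of the fail-open alternative in Python, with classes and names invented for illustration; this is not Google's actual architecture.

```python
# Illustrative fail-open quota check: if the quota backend is unreachable,
# allow requests under a conservative local cap instead of rejecting (or
# crashing on) every sync. Names are invented for illustration.

class QuotaServiceDown(Exception):
    pass

def check_quota(user, datatype):
    # Stand-in for a remote call that fails while the quota service sits
    # behind a faulty load-balancer configuration.
    raise QuotaServiceDown("quota backend unreachable")

def allow_sync(user, datatype, local_fallback_limit=100):
    try:
        return check_quota(user, datatype)
    except QuotaServiceDown:
        print(f"quota service down; allowing {user}/{datatype} "
              f"under a local cap of {local_fallback_limit} ops")
        return True

allow_sync("alice", "bookmarks")
```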

2011. Amazon outage sends prominent websites offline, including Quora, Foursquare, and Reddit.

Amazon released a detailed postmortem and mea culpa about the partial outage of its cloud services platform last week and identified the culprit: A configuration error made during a network upgrade. 

Amazon cloud outage was triggered by configuration error 


2010. Massive failure knocks Singapore's DBS Bank off the banking grid for seven hours.

A faulty component within the disk storage subsystem serving the bank's mainframe was generating periodic alert messages, which led to a job being scheduled to replace it at 3 a.m. that fateful day. The situation spiraled out of control as a direct result of human error during the routine operation.

Massive Bank Failure Due to Human Error, IBM Blamed 

2009. Widespread trouble with Google Apps service.

Google Search and Google News performance slowed to a crawl, while an outage seemed to spread from Gmail to Google Maps and Google Reader. Comments about the failure were flying on Twitter, with "googlefail" quickly becoming one of the most searched terms on the popular micro-blogging site.

Google Blames Outage on System Error


2008. Gmail outage lasts about 30 hours.

The first problem reports started appearing in the official Google Apps discussion forum around mid-afternoon Wednesday. At around 5 p.m. that day, Google acknowledged that the company was aware of a problem preventing Gmail users from logging into their accounts and that it expected a solution by 9 p.m. on Thursday. 

Gmail Back After 30 Hours Down 

2007. Skype is down.

Skype advised that their engineering team had determined that the downtime was due to a software issue, with the problem expected to be solved "within 12 to 24 hours." 

Skype Suffers Major Outage 

2007. Some Amazon EC2 customer instances were terminated and unrecoverable.

A software deployment caused management software to erroneously terminate a small number of user instances. When monitoring detected this issue, the EC2 management software and APIs were disabled to prevent further terminations. 

Amazon Web Services Outage Takes Out Popular Websites Again 

2005. Faulty database derails Salesforce.com.

A Salesforce.com outage lasting nearly a day cut off access to critical business data for many of the company's customers in what appeared to be Salesforce's most severe service disruption to date. 

Salesforce outage angers customers 

2004. Unscheduled software upgrade grinds UK's Department for Work and Pensions to a halt.

Some 40,000 computers in the UK's Department for Work and Pensions (DWP) were unable to access their network last month when an IT technician erroneously installed a software upgrade. 

IT upgrade caused software glitch at UK agency


2003. Glitch in upgrade of AT&T Wireless CRM system causes a break.

AT&T Wireless Services Inc. this week faced the software nightmare every IT administrator fears: An application upgrade last weekend went awry, taking down one of the company's key account management systems. 

Upgrade glitch downs AT&T Wireless' CRM system 

2002. New high-bandwidth application triggers outage.

The network grew very quickly due to business changes and was never redesigned to cope with the much larger scale and new application requirements. 

Hospital Brought Down by Networking Glitch 

2001. For 12 hours, TD Canada Trust's 13 million customers couldn't touch their money

TD Canada Trust ran full-page apologies in newspapers across the country Monday, saying it was sorry for a weekend computer crash that left millions of its customers unable to access their accounts. The bank said the outage was caused by "a rare and isolated hardware problem". 

TD Canada Trust apologizes for system outage


2000. Software-upgrade glitch leaves flights on the tarmac.

The Federal Aviation Administration (FAA) halted all flights scheduled to land at or depart from Los Angeles International Airport for four hours that morning. Technicians loading an upgrade to radar software at the Los Angeles air-traffic control center had caused a mainframe host computer to crash.

Don't let this be your organization. Contact Evolven today to see how we can help you get in front of misconfigurations, unauthorized changes, and more.

About the Author
Kristi Perdue
Vice President of Marketing