
Incident Management in Hybrid Cloud Environments


About

This content is brought to you by Evolven. Evolven Change Analytics is a unique AIOps solution that tracks and analyzes all actual changes carried out in the enterprise cloud environment. Evolven helps leading enterprises cut the number of incidents, slash troubleshooting time, and eliminate unauthorized changes.

Balancing Risk, Cost, and Resolution Efficiency

Jason Bloomberg, Intellyx

Problem resolution efficiency (PRE, or sometimes simply resolution efficiency) is an important key performance indicator (KPI) for operations teams.

Equal to the ratio of the number of resolved problems to the number of raised problems in a given time interval, resolution efficiency gives managers a way of ensuring the proper allocation of ops resources while also providing sufficient attention to continuous improvement efforts.

To raise resolution efficiency, the ops team must either increase the number of resolved problems or decrease the number of problems it raises in the first place. As with other KPIs, however, it’s easy to monkey with the numbers by intentionally avoiding raising a problem.

Making judgment calls about whether to raise a problem can impact the resulting resolution efficiency – without actually improving the resolution of problems.
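As a minimal sketch of the ratio described above, the following Python snippet computes resolution efficiency and shows how leaving problems unraised inflates it. The function name and the monthly figures are hypothetical, chosen only for illustration.

```python
def resolution_efficiency(resolved: int, raised: int) -> float:
    """Ratio of resolved problems to raised problems in one time interval."""
    if raised <= 0:
        raise ValueError("raised must be positive")
    return resolved / raised


# Hypothetical month: 100 problems raised, 60 of them resolved.
honest = resolution_efficiency(resolved=60, raised=100)

# Same month and same work done, but 20 unresolved problems were never raised.
gamed = resolution_efficiency(resolved=60, raised=80)

print(f"honest: {honest:.2f}, gamed: {gamed:.2f}")  # honest: 0.60, gamed: 0.75
```

The KPI rises from 0.60 to 0.75 even though the team resolved exactly the same 60 problems in both cases.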

Ignorance is no Solution

Historically, operators have ignored alert storms – excessive or redundant indicators of issues that swamp operators’ ability to address them. Making resolution efficiency a priority exacerbates this problem, because the fewer problems operators raise, the better the KPI looks.

Modern observability tooling addresses many alert storms – but even today, too many alerts lead to a ‘crying wolf’ situation, where operators simply tune out the noise. Furthermore, the more potential problems in the operational environment, the greater the risk that operators will simply not pay attention to some of them to keep their numbers up.

Hybrid cloud environments – especially when they include cloud-native deployments – are at particular risk of such shenanigans. These environments are so dynamic and fast-moving that potential problems abound.

Organizations can no longer afford to prioritize ratios that fail to address this risk-laden behavior. The solution to this challenge is proactive incident management that drives sufficient consideration of each problem, prioritizing potential problems based on the risk they present to the business.

The Challenges of Issue Prioritization

Even with modern observability tooling, it’s impossible to address every issue such tools might identify.

In today’s complex operational environments, it fundamentally doesn’t make economic sense to fix every problem. Some incidents aren’t that serious, and the cost (in both money and time) to address incidents can vary widely.

And yet, operators cannot afford to simply ignore some issues. They need to make the right judgment call when the risk priority and the cost of resolution conflict. They must answer questions including:

Are problems that aren’t on the critical path a waste of resources? Issues on the critical path can potentially derail an entire application. If a problem isn’t severe, resolving it might not be worth the trouble.

Does it always make sense to fix small problems before they become big? One of the benefits of proactive issue resolution is the ability to catch problems early. But how serious will a small problem today become if operators don’t address it promptly? If a small problem remains small, then it should be a lower priority than other small problems that present higher long-term risks.

What data are most valuable to operators for making such judgments? Of all the telemetry data at their fingertips, how do operators determine which information they should pay attention to?

How can operators go from firefighting to proactive incident resolution while addressing risk issues? If the number of reported incidents is too high, the resolution efficiency will suffer, indicating a continual firefighting crisis mode. Any effort to lower that number, however, must respect the increased risks that come with the under-reporting of incidents.

Proactive incident management is the key to answering these questions.
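One way to picture the trade-off between risk and cost of resolution is to rank issues by the expected loss they would avoid per unit of resolution cost. The sketch below is an illustrative assumption, not Evolven's method: the `Issue` fields, the scoring formula, and the sample figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    name: str
    likelihood: float       # assumed probability the issue causes an incident (0..1)
    business_impact: float  # assumed cost to the business if it does
    fix_cost: float         # assumed cost (time and money) to resolve it now


def priority(issue: Issue) -> float:
    """Expected loss avoided per unit of resolution cost (hypothetical score)."""
    if issue.fix_cost <= 0:
        raise ValueError("fix_cost must be positive")
    return issue.likelihood * issue.business_impact / issue.fix_cost


def prioritize(issues: list[Issue]) -> list[Issue]:
    """Return issues ordered from highest to lowest priority score."""
    return sorted(issues, key=priority, reverse=True)


issues = [
    Issue("verbose logging on batch job", 0.05, 5_000, 500),
    Issue("stale TLS config on checkout", 0.30, 200_000, 2_000),
    Issue("drifted JVM flag on reporting node", 0.20, 40_000, 4_000),
]

for issue in prioritize(issues):
    print(f"{issue.name}: {priority(issue):.1f}")
```

With these made-up numbers, the checkout issue on the critical path ranks first, while the cheap but low-risk logging issue ranks last. A real scoring model would need estimates grounded in the organization's own incident history.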

How Proactive Incident Management Should Improve Resolution Efficiency

In the first article in this series, I explained how proactive incident management based upon configuration risk intelligence lowers the risk inherent in configuration changes, which are the primary root causes of many operational incidents.

Identifying configuration changes before incidents occur can lead to early remediation of potential issues, often before they become problems that impact users – or at least, the remediation of minor issues before they become major ones.

The challenge, therefore, is accurately evaluating the potential risk that configuration changes present – even though the problems that might result are difficult to measure proactively, especially in cloud-native hybrid environments subject to ongoing, rapid change.

Evolven addresses this challenge in two ways. The first approach is contextual insights.

Contextual insights leverage proactive incident management to identify root causes of potential issues – in other words, which configuration changes are more likely to lead to problems in the future.

Evolven builds correlated contexts for each change, prioritizing the change according to the particular use case – incident prevention, lowering compliance risk, or identifying root causes of issues, for example.

The second approach that Evolven brings to bear is Generative AI with Evo. This GPT-based chatbot enhances teams’ real-time interactions with Evolven data via a seamless, conversational interface.

Evo with Evolven data facilitates more efficient workflows, allowing immediate access to information and insights.

As a result, Evolven streamlines the process of identifying potential issues, thus reducing the mean time to resolution (MTTR) – and increasing the number of resolved incidents in each period.
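For readers who want to track MTTR themselves, here is a minimal sketch that averages the time between opening and resolving each incident. The function name and the sample timestamps are hypothetical, and the sketch assumes only resolved incidents are passed in.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (resolved - opened) over a list of resolved incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents as (opened, resolved) pairs: one took 2 hours, one took 4.
incidents = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 11, 0)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 18, 0)),
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 3:00:00
```

A shorter MTTR means more incidents can be closed in the same period, which raises the numerator of the resolution efficiency ratio.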

Remember that resolution efficiency is the ratio of resolved incidents to identified ones. Reducing the number of identified incidents is one way to improve this number, and increasing the number of resolved incidents is the other.

The Intellyx Take

Operations teams must mount a two-pronged attack to improve their resolution efficiency: raising the number of resolved incidents while lowering the number of reported ones.

MTTR is an important KPI in its own right. By leveraging proactive incident management and the power of generative AI, Evolven can improve resolution times, even in dynamic, diverse hybrid environments.

Evolven also addresses the second prong. By evaluating the risks inherent in each configuration change within the appropriate business context, Evolven enables operators to target the right problems to manage risks and costs – with no shenanigans.

Copyright © Intellyx BV. Evolven is an Intellyx customer. Intellyx retains final editorial control of this article. No AI was used to write this article.

About the Author
Jason Bloomberg

Jason Bloomberg is founder and managing partner of enterprise IT industry analysis firm Intellyx. He is a leading IT industry analyst, author, keynote speaker, and globally recognized expert on multiple disruptive trends in enterprise technology and digital transformation. He is #13 on the Top 50 Global Thought Leaders on Cloud Computing 2023 and #10 on the Top 50 Global Thought Leaders on Mobility 2023, both by Thinkers 360. He is a leading social amplifier in Onalytica’s Who’s Who in Cloud? for 2022 and a Top 50 Agile Leaders of 2022 by Team leadersHum. Mr. Bloomberg is the author or coauthor of five books, including Low-Code for Dummies, published in October 2019.