The Key to Proactive Incident Management: Understanding the Risks Behind the Data
Managing incidents in any operational environment, no matter the complexity, always boils down to this simple, three-step process:
Identify a problem → Figure out what caused the problem → Fix the problem
This reactive process is so ingrained in the way operators think about incidents that entire product categories have grown around it, from traditional IT incident management to AIOps to observability.
Nevertheless, this ‘see a problem, fix the problem’ approach is fraught with challenges. The good news is that there’s a better way: start with the causes and predict the effects – in other words, take a proactive approach to incident management.
Problems with Reactive Incident Management
To identify problems, operators first look to observability tooling. Observability provides telemetry in the form of logs, traces, and metrics – vital information about the behavior of various systems and applications.
If there’s a problem, it should turn up in these observability data. In other words, observability provides insight into the effects or symptoms of a problem, not its causes.
Managing and mitigating incidents thus begins with these effects in the hope of uncovering and fixing the causes of a problem. Such reactive approaches suffer from the following limitations:
- They are firefighting, not fire prevention – incident analysis can only begin after the incident takes place. Observability data don’t provide information about issues that haven’t taken place yet.
- They take time – the incident management team can only work during the period that the problem is occurring. The longer they take, the worse the problem becomes.
- It’s difficult to uncover causes when there is more than one – root cause analysis techniques that follow traces back in time to their source are reasonably good at finding single causes of problems. When more than one cause is at fault, however, traditional approaches often identify only one of them.
- The purported cause may not be the actual root cause – cause and effect occur in chains, with one cause leading to an effect that in turn is the cause of another effect, and so on. Following this causal chain backward often hits a dead end at an intermediate cause without going all the way to the root cause.
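The last two limitations can be illustrated with a toy causal graph. The sketch below is purely illustrative: the event names and the `CAUSES` mapping are hypothetical, and real incidents rarely come with such a graph ready-made. It contrasts a trace that follows only one cause per step with a full backward walk that collects every root cause.

```python
# Hypothetical causal graph: each effect maps to the causes that produced it.
CAUSES = {
    "checkout errors": ["db timeouts", "cache misses"],
    "db timeouts": ["connection pool shrunk"],
    "cache misses": ["cache TTL lowered"],
}


def single_path_trace(symptom, max_steps):
    """Follow only the first listed cause at each step, for at most max_steps hops."""
    current = symptom
    for _ in range(max_steps):
        causes = CAUSES.get(current)
        if not causes:
            break
        current = causes[0]
    return current


def all_root_causes(symptom):
    """Walk the whole graph backward; return every cause that has no cause of its own."""
    roots, stack, seen = set(), [symptom], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        causes = CAUSES.get(node, [])
        if not causes and node != symptom:
            roots.add(node)
        stack.extend(causes)
    return roots


# A trace that is cut short stops at an intermediate cause.
print(single_path_trace("checkout errors", max_steps=1))
# A single-path trace reaches one root cause and misses the other.
print(single_path_trace("checkout errors", max_steps=5))
# The full backward walk finds both root causes.
print(sorted(all_root_causes("checkout errors")))
```

In this toy example, the single-path trace never visits the "cache misses" branch, so the lowered cache TTL goes unnoticed even though it contributes to the symptom.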
Reversing the Incident Management Process
The better approach is to start with potential causes of problems to identify issues that may not have occurred yet.
Instead of starting with observability tooling, this proactive approach begins with configuration risk intelligence.
The fundamental principle at work here is that if there’s a problem today that wasn’t there yesterday – look for what’s changed. Better yet, look for what’s changed ahead of time.
In other words, keep track of all the configuration changes in the entire IT estate as a basis for establishing the causes of any problem – even before such problems occur.
We can represent this proactive process as:
Identify all configuration changes → Analyze which ones will likely lead to problems → Fix relevant misconfigurations before such problems occur
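As a rough illustration of the first two steps, the sketch below compares two configuration snapshots and ranks the changes by an assumed risk weight. Everything here is hypothetical: the key names, the `RISK_WEIGHTS` table, and the prefix-matching rule are stand-ins chosen for the example, not how Evolven or any other product actually scores risk.

```python
# Hypothetical risk weights per configuration key prefix. Real weights would
# come from a risk model, not a hand-written table.
RISK_WEIGHTS = {"tls.": 0.9, "db.": 0.7, "log.": 0.1}
DEFAULT_WEIGHT = 0.3


def diff_config(before, after):
    """Return {key: (old, new)} for every key added, removed, or modified.

    A missing key is represented as None, so a key explicitly set to None
    is treated the same as an absent one.
    """
    changes = {}
    for key in before.keys() | after.keys():
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


def risk_of(key):
    """Return the weight of the first matching prefix, else the default."""
    for prefix, weight in RISK_WEIGHTS.items():
        if key.startswith(prefix):
            return weight
    return DEFAULT_WEIGHT


def rank_changes(before, after):
    """List changed keys from highest to lowest assumed risk (ties by name)."""
    changes = diff_config(before, after)
    return sorted(changes, key=lambda k: (-risk_of(k), k))


yesterday = {"tls.min_version": "1.2", "db.pool_size": 50, "log.level": "info"}
today = {"tls.min_version": "1.0", "db.pool_size": 50,
         "log.level": "debug", "ui.theme": "dark"}

changes = diff_config(yesterday, today)
for key in rank_changes(yesterday, today):
    old, new = changes[key]
    print(f"{risk_of(key):.1f}  {key}: {old!r} -> {new!r}")
```

The TLS downgrade surfaces at the top of the list even though nothing has failed yet, which is the point of starting from changes rather than symptoms.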
Following this process addresses the limitations of the reactive approach above because it is:
- Proactive – the incident management team works to resolve issues before they impact users of the systems and applications.
- Fast – instead of waiting for hidden problems to grow into serious issues, the team can work on those problems when they are still relatively minor.
- Able to identify potential problems with more than one cause – because this approach starts with the causes to predict effects, it’s simpler to correlate different causes that might work together to bring about a particular effect.
This proactive, risk-driven approach to incident management hinges on understanding the potential risks of various configuration changes and analyzing how these changes can evolve into incidents.
Operators must recognize that certain configuration changes in the IT landscape pose greater risks than others. It is essential, therefore, that they consider the relevant configuration risk intelligence to assess such changes’ impact on stability, security and compliance.
This assessment should consider how these configurations are reflected within the system’s architecture, how they connect various components, and whether they adhere to established IT processes. Additionally, it’s important to evaluate such changes for anomalies, drifts, or unexpected deviations from normal parameters.
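One simple way to picture a drift check is to compare each host’s settings against the value most hosts share. The sketch below assumes that the fleet majority is a reasonable stand-in for “normal parameters”, which will not hold in every environment; the host names, keys, and the `find_drift` helper are hypothetical.

```python
from collections import Counter


def find_drift(fleet):
    """Return {host: {key: (actual, expected)}} for values that differ
    from the most common value across the fleet."""
    keys = {key for config in fleet.values() for key in config}
    drift = {}
    for key in sorted(keys):
        values = [config.get(key) for config in fleet.values()]
        # Assumption: the most common value is the intended baseline.
        expected, _ = Counter(values).most_common(1)[0]
        for host, config in fleet.items():
            if config.get(key) != expected:
                drift.setdefault(host, {})[key] = (config.get(key), expected)
    return drift


fleet = {
    "web-1": {"max_conns": 200, "tls": "1.3"},
    "web-2": {"max_conns": 200, "tls": "1.3"},
    "web-3": {"max_conns": 500, "tls": "1.3"},
}
print(find_drift(fleet))
```

Here only `web-3` is flagged, for a `max_conns` value that differs from its peers. A production check would more likely compare against an approved baseline than a simple majority vote.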
Configuration risk intelligence, therefore, is the key to the entire proactive approach to incident management.
With a configuration risk intelligence solution like Evolven, organizations can track and analyze configuration changes across environments. They can then identify the changes most likely to cause issues and address them before they affect users or the business.
The Intellyx Take
Evolven leverages several automatic, patented change and risk-based algorithms, as well as the power of AI, to calculate the risks inherent in configuration changes.
Such risks fall into three broad categories: reliability risk (the risk of downtime or other adverse effect that impacts users), cybersecurity risk (the risk of a compromise), and compliance risk (the risk of failing to comply with regulations or other policies).
Proactively addressing all types of risks is important. For the risk of compromise, in particular, getting ahead of attackers is absolutely essential.
Misconfigurations, after all, aren’t always accidental. Sometimes they’re malicious. Unless your organization is evaluating the risks of all configuration changes across the board, you’re likely to miss the malicious ones – until it’s too late.
Copyright © Intellyx BV. Evolven is an Intellyx customer. Intellyx retains final editorial control of this article. No AI was used to write this article.