Scoring A Risky Business
Configuration is a Risky Business
A young Tom Cruise led the cast of the 1983 movie “Risky Business.” Playing a teen looking for fun at home while his parents were away, he quickly let the situation get out of hand, resulting in exactly what you would expect: absolute mayhem. Outdone in risk perhaps only by Ferris Bueller, Tom made unauthorized changes to the rules set by his parents and created a very risky misadventure for himself, his friends, and his parents. This being movie land, the impacts of the misconfigured weekend were resolved, and the stories were reconciled, all before two hours were up. If only…
But real life doesn’t work that way. Unanticipated, unmeasured risks in the world of IT (Information Technology) Infrastructure and Operations (I&O), misadventures if you will, can have disastrous consequences for the stability, compliance, and security of your enterprise. These risks are especially damaging when there is no single shared, end-to-end view of configurations and the changes made to them. Lacking that view, you will be unable to anticipate risk and take action to prevent impact.
Limit your Risk
You might attempt to avoid all risks by staying indoors, but that would be no fun. Instead, avoid the misadventures and consequences of risky configurations by employing a process to score the risk they produce. A Configuration Risk Intelligence approach gives the converging development, operations, and security teams a single shared, end-to-end view of configurations and change. This risk-based view enables DevSecOps teams to mitigate the undesired and unexpected impact of configuration setup and modification. The practice identifies all digital configuration assets, baselines them, assesses their risks as they change, and uses automation to initiate action that addresses risk before there is an impact.
Examples of risky changes include the following situations:
- An enterprise learned that deployments through its Jenkins-based CI/CD pipeline kept failing. As part of the update, the pipeline was pushing a change to Amazon ECS that raised the autoscaler instance count from one to two, and two was also the maximum it could be configured for. When a rolling update was attempted, there wasn’t sufficient “headroom” available for a new instance to spin up, and thus the update failed.
- A different enterprise had tickets open for Web application errors that affected some but not all users. The initial analysis reported (incorrectly, it turned out) that no changes had been made. Further analysis uncovered that the NGINX configuration had been updated and the “max_conns” parameter changed from 0 to 50. 0 is the default and means “no limit” on the number of connections. With this misconfiguration, once 50 users were connected, no further connections were accepted.
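Both incidents trace back to a parameter value that changed without anyone noticing. The detection step can be sketched as a comparison between a baseline snapshot and the current configuration. This is a minimal illustration only: the flat dictionary snapshot format and the `diff_config` helper are assumptions made for the example, not part of any particular tool.

```python
def diff_config(baseline: dict, current: dict) -> dict:
    """Return {parameter: (old_value, new_value)} for every parameter that differs.

    Parameters present in only one snapshot show up with None on the other side.
    """
    keys = baseline.keys() | current.keys()
    return {
        key: (baseline.get(key), current.get(key))
        for key in sorted(keys)
        if baseline.get(key) != current.get(key)
    }


# Hypothetical snapshots modeled on the NGINX incident described above.
baseline = {"max_conns": 0, "worker_processes": 4}
current = {"max_conns": 50, "worker_processes": 4}

changes = diff_config(baseline, current)
print(changes)  # {'max_conns': (0, 50)}
```

A diff like this would have contradicted the initial report that "no changes had been made" as soon as the two snapshots were compared.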
To respond to risk and prioritize, you need a scoring system that automates the analysis of risky changes and configurations. Configuration risk analysis investigates both unauthorized configuration changes, meaning those not approved by a change request submitted to tools implementing Service Asset and Configuration Management (SACM), and changes that arrive through an approved, automated deployment. Two data sets are analyzed as part of this process: configuration risk and change risk. Configuration and changes are two sides of the same coin and must be reviewed and analyzed together, as shown below:
- Configuration Risk – The probability that a negative outcome will result from a configuration that is non-compliant with policy, is inaccurate, inconsistent, or introduces exposures.
- Change Risk – The likelihood that an update to a configuration parameter will lead to an issue.
Keep It Relevant
When analyzing these risks, you should consider a third category, called a “relevant change”, where an update of a configuration parameter matches explicit user interest or is returned from a search. This would flag the change as “of significance” and limit the investigation scope to a more specific entity such as an individual application or service.
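One way to picture a "relevant change" filter is as a match between changed parameter paths and patterns a user has registered interest in. The sketch below assumes a hypothetical `service/component/parameter` path convention and uses shell-style wildcards from Python's standard `fnmatch` module; both are choices made for this example.

```python
from fnmatch import fnmatch


def relevant_changes(changes, watch_patterns):
    """Keep only the changes whose path matches at least one watched pattern."""
    return [
        change
        for change in changes
        if any(fnmatch(change["path"], pattern) for pattern in watch_patterns)
    ]


# Hypothetical detected changes across two services.
changes = [
    {"path": "checkout/nginx/max_conns", "old": 0, "new": 50},
    {"path": "billing/jvm/heap_size", "old": "2g", "new": "4g"},
]

# The user is only investigating the checkout service.
watch = ["checkout/*"]
print([c["path"] for c in relevant_changes(changes, watch)])
# ['checkout/nginx/max_conns']
```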
Risk can also be analyzed and then categorized as “possible risk,” meaning that there are signs of anomalous values and actions, but it is not certain that the risk will result in a negative outcome. The process for scoring is shown in Figure 1.
Figure 1: Scoring Process
Risk categories include the following types:
- Operational: anomalies and drift patterns in configuration parameter values
- Compliance: evaluating adherence to standards such as GDPR, HIPAA, NIST, and others
- Vulnerability: checking against vulnerability databases both external and internally compiled
- Release: investigation of artifact and developer anomalies, new code
- Availability: analyzing whether a change will cause an impact on system stability
- Performance: historical analysis to predict if a change will impact performance
- Security: anomalous, unusual changes or configuration state
Use Artificial Intelligence to Automate Risk Scoring
Artificial Intelligence (AI) is used to determine the impact of uncovered risks. AI relates topology, telemetry, and configuration change to establish the perimeter of the area expected to suffer impact. To augment the automatic analysis, user-defined policies can be applied to items in the configuration asset inventory to catch, from a site-specific perspective, “what doesn’t appear right”.
These capabilities allow the configuration risk intelligence methodology to make you more proactive, so you can act to circumvent the impact of problems resulting from risky changes and configurations.
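User-defined policies can be thought of as named checks evaluated against each asset in the inventory. The following is a minimal sketch under stated assumptions: the `Policy` type, the inventory layout, and the example rule (a connection limit must be either unlimited or at least 500) are all hypothetical and stand in for whatever a site considers "not right".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Policy:
    name: str
    asset_type: str                        # which kind of asset the policy covers
    is_compliant: Callable[[dict], bool]   # returns True when the config passes


def find_violations(inventory, policies):
    """Return (asset_id, policy_name) pairs for every failed policy check."""
    return [
        (asset["id"], policy.name)
        for asset in inventory
        for policy in policies
        if asset["type"] == policy.asset_type
        and not policy.is_compliant(asset["config"])
    ]


# Hypothetical site policy: max_conns is either unlimited (0) or at least 500.
policies = [
    Policy(
        name="conn-limit-sane",
        asset_type="web",
        is_compliant=lambda cfg: cfg.get("max_conns", 0) == 0 or cfg["max_conns"] >= 500,
    ),
]

inventory = [
    {"id": "web-01", "type": "web", "config": {"max_conns": 0}},
    {"id": "web-02", "type": "web", "config": {"max_conns": 50}},
    {"id": "db-01", "type": "db", "config": {}},
]

print(find_violations(inventory, policies))  # [('web-02', 'conn-limit-sane')]
```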
To complete the lifecycle of risk analysis and deliver continuous improvement, a feedback loop tracks what “actually” happened: the outcomes of risky changes. From this feed, a historical database is created, continuously updated, and used to further train and improve the effectiveness of the relevant AI algorithms. The historical database, as part of a Digital Twin environment based on infrastructure as code (IaC), is used to predict future outcomes.
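The feedback loop can be illustrated with a very small outcome history. The sketch below is not an AI model: it uses a plain frequency count as a stand-in for training, and the `OutcomeHistory` class and its method names are hypothetical.

```python
from collections import defaultdict


class OutcomeHistory:
    """Records whether past changes to a parameter led to an incident."""

    def __init__(self):
        # parameter -> [incident_count, total_change_count]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, parameter, caused_incident):
        counts = self._counts[parameter]
        counts[0] += int(caused_incident)
        counts[1] += 1

    def incident_rate(self, parameter):
        """Fraction of recorded changes that caused an incident, or None if unseen."""
        incidents, total = self._counts.get(parameter, (0, 0))
        return incidents / total if total else None


history = OutcomeHistory()
history.record("max_conns", caused_incident=True)
history.record("max_conns", caused_incident=False)
history.record("worker_processes", caused_incident=False)

print(history.incident_rate("max_conns"))         # 0.5
print(history.incident_rate("worker_processes"))  # 0.0
print(history.incident_rate("keepalive_timeout")) # None
```

In practice the recorded outcomes would feed whatever model is in use; the point here is only that every scored change eventually gets an observed outcome attached to it.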
Risk Analysis Must Be Defensible
The product of risk analysis should be 100% defensible, meaning that any subject matter expert (SME) working through manual processes would come to an identical conclusion, albeit much more slowly and perhaps too late. Thus, the analysis of which changes or configurations are risky is certain and can be acted upon via automation. Examples of a defensible risk include the determination of an unauthorized change or a broken policy. Using automation to remediate the risk speeds up the process, reduces toil, and is likely to ensure higher-quality customer experiences.
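An unauthorized change is a good example of a defensible finding because it reduces to a deterministic lookup: either a matching approved change request exists or it does not. A minimal sketch, assuming hypothetical record fields (`asset`, `parameter`, `new_value`) for both detected changes and approved requests:

```python
def unauthorized_changes(detected, approved):
    """Return detected changes that have no matching approved change request.

    A change is considered authorized only if an approved request names the
    same asset, the same parameter, and the same target value.
    """
    approved_keys = {
        (request["asset"], request["parameter"], request["new_value"])
        for request in approved
    }
    return [
        change
        for change in detected
        if (change["asset"], change["parameter"], change["new_value"]) not in approved_keys
    ]


# Hypothetical approved change requests and detected changes.
approved = [
    {"asset": "web-01", "parameter": "worker_processes", "new_value": 8},
]
detected = [
    {"asset": "web-01", "parameter": "worker_processes", "new_value": 8},
    {"asset": "web-01", "parameter": "max_conns", "new_value": 50},
]

for change in unauthorized_changes(detected, approved):
    print(change["asset"], change["parameter"], change["new_value"])
# web-01 max_conns 50
```

Any SME given the same two lists would reach the same result by hand, which is what makes the finding safe to act on automatically.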
Analyze Risk as Part of the SDLC
Integrate a configuration risk gateway into the DevOps CI/CD pipeline and manage risk as early as possible in the software development-deployment lifecycle. The Risk Gateway will assess risk in the configuration and changes going through the pipeline. Alleviating risks prior to deployment reduces toil by preventing production impact.
The Risk Gateway is used, for example, to reveal risks such as mismatched configuration parameters between environments, a new developer contributing to code, a change in commit frequency (commit storm), commits with frequent test failures, incident rate change, and more. The gateway acts as a verification step whereby an automated deployment proceeds only if the analyzed risks are low. The Risk Gateway also enables what-if analysis as a method to provide collaborative guidance.
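The verification step can be sketched as a function a pipeline stage calls before deploying. The `RiskLevel` scale, the `gate` function, and the finding records below are illustrative assumptions, not a defined interface.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def gate(findings, threshold=RiskLevel.LOW):
    """Return (proceed, blockers).

    The deployment proceeds only when no finding is above the threshold.
    """
    blockers = [finding for finding in findings if finding["level"] > threshold]
    return (not blockers, blockers)


# Hypothetical findings produced earlier in the pipeline.
findings = [
    {"risk": "new contributor on service", "level": RiskLevel.LOW},
    {"risk": "staging/production parameter mismatch", "level": RiskLevel.HIGH},
]

proceed, blockers = gate(findings)
print(proceed)                         # False
print([b["risk"] for b in blockers])   # ['staging/production parameter mismatch']
```

Returning the blockers alongside the decision gives reviewers something concrete to discuss, and re-running the gate on a proposed fix is one simple form of what-if analysis.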
Risk level should be calculated based on known risks associated with configurations from a knowledge base, user-defined risk, value anomalies, frequency time, concurrent risks, and IT context. An automatic risk analysis should highlight only those potential misconfigurations and drift that can hurt the environment’s stability, compliance, and security.
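One simple way to combine these inputs is a weighted sum. The sketch below assumes each input has already been normalized to a signal between 0 and 1; the signal names and the weights are invented for illustration and would need tuning for any real environment.

```python
# Hypothetical weights for each input signal; they sum to 1.0.
WEIGHTS = {
    "known_risk": 0.30,        # match against a knowledge base of known risks
    "user_defined": 0.20,      # user-defined risk rules
    "value_anomaly": 0.20,     # how unusual the new value is
    "frequency": 0.10,         # how often and when the parameter changes
    "concurrent_risks": 0.10,  # other risks present at the same time
    "it_context": 0.10,        # criticality of the affected environment
}


def risk_score(signals):
    """Combine signals (each clamped to the range 0..1) into a 0-100 score."""
    total = sum(
        weight * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
    return round(100 * total)


# A known-risky parameter set to an anomalous value on a moderately critical system.
print(risk_score({"known_risk": 1.0, "value_anomaly": 1.0, "it_context": 0.5}))  # 55
```

A linear combination is the simplest possible choice; it is easy to explain and audit, which fits the requirement that the analysis be defensible.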
Based on the risk score, automated actions can be initiated to mitigate the risk, for example, change remediation, reconciliation with other tools, repaving of environments, additional review, or halting a deployment before it is pushed to production.
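The mapping from score to action can be as simple as an ordered list of thresholds. The cut-off values and action names in this sketch are hypothetical; the only idea it illustrates is that higher scores trigger stronger responses.

```python
# Hypothetical thresholds, checked from highest to lowest.
ACTIONS = [
    (80, "halt_deployment"),
    (50, "require_additional_review"),
    (20, "remediate_change"),
]


def choose_action(score):
    """Return the action for the first threshold the score meets, else 'proceed'."""
    for threshold, action in ACTIONS:
        if score >= threshold:
            return action
    return "proceed"


for score in (5, 35, 55, 90):
    print(score, choose_action(score))
# 5 proceed
# 35 remediate_change
# 55 require_additional_review
# 90 halt_deployment
```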
Configuration is a risky business, but the risk can be managed. The Configuration Risk Intelligence methodology of scoring risk makes it possible to automatically determine the current and future impact of changes and configurations. This process will uncover misconfiguration, vulnerabilities, blind spots, outdated policies, expired certificates, drift, unauthorized changes, issues in file integrity, and more. Scoring these risks enables prioritization and automation, which are essential to rapidly delivering applications and services and to evolving and improving products so they serve customer requirements better.
In the Risky Business movie timeline, Tom Cruise is able to manage his risk in several days; however, it took a whole lot of “heavy lifting” and magic that can only happen in Hollywood. Configuration Risk Intelligence enables you to manage the risk in your hybrid, multi-cloud environment using AI and automation, without requiring the magic of Hollywood, and in doing so to continuously improve the value you deliver to your customers.