The Kung Fu of Change Risk Intelligence
“I seek not to know the answers but to understand the questions.” - Kwai Chang Caine, a.k.a. “Grasshopper” in the Kung Fu TV series
“Those who are unaware they are walking in darkness will never seek the light.” - Bruce Lee, Martial Arts Master, and Creator of Jeet Kune Do
Kung Fu is the name commonly used for Chinese martial arts, although its original meaning refers to any discipline or skill achieved through hard work and practice. I&O (infrastructure and operations, inclusive of DevOps) monitoring and observability takes hard work and practice (and, truth be told, may involve some martial-arts-lite punching and kicking…although only after hours and only to the furniture). Monitoring and observability of the hybrid cloud is especially challenging because the elements and configurations that comprise it are ephemeral, constantly changing, and evolving. Bruce Lee said, “Those who are unaware they are walking in darkness will never seek the light.” Unfortunately, in many cases that is exactly what is happening. Deep configurations are often invisible, “in darkness”, and those in I&O (and DevOps) charged with maintaining production reliability are caught unaware when changes to these configurations cause customer-impacting incidents. Furthermore, moving to the ultimate practice of incident prevention requires that monitoring and observability teams do as Kwai Chang Caine said and “understand the questions”. In this case, the questions are “what are the deep configurations”, “how have they changed”, and “what risk is incurred by these changes”.
Walking in Darkness
The migration to the cloud has become more of a residency, a realization of the migration plans enterprises have been working on for years. According to the 2022 DevOps Pulse Survey, over 44% of respondents are “fully migrated” to the cloud, with an additional 34% “partially migrated”. In addition, “Gartner analysts said that more than 85% of organizations will embrace a cloud-first principle by 2025 and will not be able to fully execute on their digital strategies without the use of cloud-native architectures and technologies.” The problem is no longer “how do I get there” but “how do I manage once I am there”.
The rationale driving cloud migration has always been one of reducing costs while at the same time improving agility. The good news is “you have migrated to the cloud”, but the bad news is “everything is much harder” due to the complexity, ephemerality, and volatility inherent in cloud-native architecture, hybrid cloud architecture, and tooling. This situation is driven by microservice architecture and, perhaps not surprisingly, compounded by the intricacy of the new observability tools.
As a result, the pursuit of cost reduction and agility has paradoxically produced less visibility and greater difficulty in managing monitoring cost, most specifically in modern application architectures using microservices and Kubernetes. In the DevOps Pulse survey, 64% of respondents in 2021 reported that their MTTR for production incidents was over an hour, a significant increase over the prior year. This is an unfortunate statistic, as reducing MTTR is one of the foremost goals of DevOps, especially in enterprises that employ a “you build it, you run it” approach. It is suspected that the cases where MTTR is increasing are the more complex scenarios in which major incidents are occurring.
Observability tools designed to address these architectural complexities abound. However, many depend on distributed tracing and logging, which in the case of application observability requires development-level instrumentation effort and is not yet widely deployed (the nascent, still-in-progress OpenTelemetry standard notwithstanding). It is perplexing that observability, by its very nature, isn’t pervasive. Awareness of its importance is there; however, implementing it is very challenging due to its complexity and the shortage of individuals with observability tool expertise, not to mention confusion in the market over what observability is really meant to achieve.
Today, the problem of tool sprawl in the cloud persists. Many enterprises use multiple observability and monitoring tools and are thus unable to achieve the “single point of truth” requisite for “stepping out of the darkness” and delivering cost-effective, deep visibility across their cloud deployment. This plethora of tools also means the tooling ingests an overabundance of data, which is expensive to maintain. These conditions contribute significantly to why both MTTR and the Total Cost of Ownership (TCO) for monitoring are increasing instead of decreasing. And while the adoption of DevOps has greatly increased, its goal of reducing toil has not been achieved; toil is in fact increasing.
Seeking the Light
As martial arts master Bruce Lee recommended, “…seek the light”. Light engenders visibility. Here that means greater visibility of the risks in changes to hybrid and cloud-native configurations, including containerization, Kubernetes, and serverless applications, and of the implications these changes have for security.
Handling unknown unknowns is often posed as the “raison d’être”, or justification, for observability tools. However, hiding in plain sight is an unknown that should be known: “configuration”. Visibility of configuration, and its inclusion as input to decisions in deployment, problem avoidance, problem management, and other key DevOps activities, is either considered at a surface level or not at all. The nominal “single source of truth” for guidance in these tasks, the CMDB, simply does not have all the information necessary.
The CMDB was never designed to store the enormous volumes of deep configuration data that tooling generates from dynamic cloud-native frameworks such as Kubernetes, nor to shine the light on what needs doing to avoid risk. For example, configuration in a Kubernetes cluster can cause issues in containers, pods, controllers, control plane components, and more. Since Kubernetes (or K8s, as it is abbreviated) is often used to run microservice applications, the impact of a change in one of its components’ configurations on behavior, performance, and security can be enormous, and the challenges in troubleshooting it will be daunting.
Understand the Questions – Change Risk Intelligence
As Kwai Chang Caine said, “I seek not to know the answers but to understand the questions”, and we need to do the same to ensure stability, reliability, and security. Understanding these questions is what delivers “change risk intelligence”: they should be asked preemptively, through automation, before deployments are made and before service interruptions occur. Consideration of (or intelligence about) configuration changes should inform all decisions impacting the customer experience. This intelligence is acquired by consulting a repository of configuration and change data that includes the dynamic information in the cloud. This requires a cloud-specific repository outside the CMDB yet working in coordination with it. Configuration as code can help, but this approach still requires a repository to store this data so that it is accessible to queries, analysis, and automation. Today, there are specialist tools that do capture deep, cloud-native configuration data; however, they do so in a siloed manner specific to an individual technology stack and do not enable a comprehensive, proactive determination of the effects of configuration and change.
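To make the question “how have they changed” concrete, here is a minimal Python sketch that compares two configuration snapshots and lists every setting that differs. It is an illustration only, not any particular product's method: the function name `diff_config`, the snapshot layout, and the sample keys and values are all hypothetical.

```python
from typing import Any, Dict, Iterator, Tuple

# (dotted path, old value, new value); None means "not present in that snapshot"
Change = Tuple[str, Any, Any]

_MISSING = object()  # sentinel so that an explicit None value is not confused with "absent"


def diff_config(old: Dict[str, Any], new: Dict[str, Any], prefix: str = "") -> Iterator[Change]:
    """Yield a Change for every leaf setting that differs between two nested dicts."""
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}.{key}" if prefix else key
        a = old.get(key, _MISSING)
        b = new.get(key, _MISSING)
        if isinstance(a, dict) and isinstance(b, dict):
            # Both sides are sections: descend and compare their contents.
            yield from diff_config(a, b, path)
        elif a != b:
            yield (path, None if a is _MISSING else a, None if b is _MISSING else b)


# Hypothetical snapshots of one workload's configuration, before and after a release.
before = {"deployment": {"replicas": 3, "image": "shop:1.4.0",
                         "resources": {"memory_limit": "512Mi"}}}
after = {"deployment": {"replicas": 3, "image": "shop:1.5.0",
                        "resources": {"memory_limit": "256Mi"},
                        "host_network": True}}

changes = list(diff_config(before, after))
for path, old_value, new_value in changes:
    print(f"{path}: {old_value!r} -> {new_value!r}")
```

In this made-up example the diff surfaces three changes: a new image version, a halved memory limit, and a newly enabled host network setting. Only the first is likely to appear in a release note, which is the sense in which deep configuration tends to stay “in darkness”.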
This pervasive repository of configuration and change should at a minimum support ITIL 4 change verification, but it needs to do this without the time-consuming manual step of waiting for a CAB (Change Advisory Board) meeting and instead leverage automation to move at the speed of the cloud or forever risk being too little, too late or just plain wrong.
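As a rough illustration of what automated change verification could look like in place of waiting for a CAB meeting, the Python sketch below scores a set of configuration changes against a small rule table and either approves them or escalates them for human review. Everything in it is an assumption made for illustration: the `ConfigChange` type, the rule patterns, the point values, and the threshold are hypothetical and would need to come from an organization's own risk policy.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ConfigChange:
    path: str      # dotted path of the changed setting
    old: object
    new: object


# Hypothetical policy: (glob pattern on the changed path, risk points).
RISK_RULES: List[Tuple[str, int]] = [
    ("*.host_network", 50),   # networking/security exposure
    ("*.resources.*", 30),    # capacity limits affect stability
    ("*.image", 10),          # routine release
]


def score(change: ConfigChange) -> int:
    """Return the highest matching rule's points, or 1 for an unclassified change."""
    return max((points for pattern, points in RISK_RULES
                if fnmatchcase(change.path, pattern)), default=1)


def verify(changes: Iterable[ConfigChange], threshold: int = 40) -> Tuple[str, int]:
    """Approve automatically below the threshold, otherwise escalate to a human."""
    total = sum(score(c) for c in changes)
    return ("approve" if total < threshold else "escalate", total)


image_bump = ConfigChange("deployment.image", "shop:1.4.0", "shop:1.5.0")
memory_cut = ConfigChange("deployment.resources.memory_limit", "512Mi", "256Mi")

print(verify([image_bump]))              # a routine release on its own
print(verify([image_bump, memory_cut]))  # the same release plus a resource change
```

With these assumed numbers, the image bump alone scores 10 and is approved, while the same release combined with the memory limit change reaches the threshold of 40 and is escalated. The design point is that only the risky subset of changes waits for people; the rest moves at pipeline speed.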
The need for speed (and accuracy) calls for AI-driven automation that can process large volumes of data in time. This is a much more useful application of so-called AIOps than today’s usage of it for event correlation.
Creating a feedback loop between this analysis and the DevOps CI/CD process will help address the vexing problem of why automation adoption has increased and yet reliability has not.
The great shift left of responsibility for production reliability from IT Ops to DevOps has been promising. However, it has yet to deliver everything the business needs for digital transformation. This “sea change” can still deliver on its promise through the intelligence acquired by considering the risk of change and configuration as part of its automated delivery processes, as well as through proactive involvement in problem management.
DevOps can better leverage AI to proactively recognize trends and anomalies that will impact customer experience before they become business-impacting incidents. While the adoption of automation is on the rise, feedback loops between production stability and the “configuration” changes DevOps is making must be included in this automation for continuous improvement that results in greater reliability.
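One simple form such a feedback loop could take is to look back from each production incident and count which configuration settings were changed shortly beforehand. The Python sketch below does only that; it is a plain counting heuristic, not AI, and the function name `suspect_paths`, the six-hour window, and the sample timestamps are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

ChangeEvent = Tuple[datetime, str]  # (when the change was applied, config path)


def suspect_paths(changes: List[ChangeEvent],
                  incidents: Iterable[datetime],
                  window: timedelta = timedelta(hours=6)) -> Counter:
    """Count, per config path, how many incidents it was changed shortly before."""
    counts: Counter = Counter()
    for started in incidents:
        # A path counts once per incident, however often it changed in the window.
        seen = {path for when, path in changes
                if started - window <= when <= started}
        counts.update(seen)
    return counts


# Hypothetical change log and incident start times.
changes = [
    (datetime(2023, 1, 10, 9, 0), "deployment.resources.memory_limit"),
    (datetime(2023, 1, 10, 9, 30), "deployment.image"),
    (datetime(2023, 1, 17, 14, 0), "deployment.resources.memory_limit"),
    (datetime(2023, 1, 20, 8, 0), "deployment.image"),
]
incidents = [datetime(2023, 1, 10, 11, 0), datetime(2023, 1, 17, 15, 30)]

ranking = suspect_paths(changes, incidents)
print(ranking.most_common())
```

In this invented data the memory limit setting precedes both incidents and the image setting precedes one, so the memory limit would be the first candidate for a stricter rule in the delivery pipeline. Co-occurrence is not proof of cause, which is why the result is treated as a list of suspects rather than a verdict.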
While the phrase “a single source of truth” has been given lip service for years, there are still multiple, siloed stores of operational data, often technology-specific, used by development and operations. These finally need to be consolidated, and this must be done in a manner unhampered by manual processes, using AI to keep up with the need for speed the cloud demands.
Observability tools have enhanced what traditional monitoring tools can provide for microservice applications. Nevertheless, they must move from being technology specific to becoming more technology agnostic and simplify their usage to address the shortage of expertise. They also need to evolve from just collecting logs, metrics, and traces and link their analysis to business requirements.
Configuration risk intelligence must be the goal of DevOps teams as they seek to provide the stability, reliability, and security that the business needs to complete its digital transformation.
-Charley Rich (former Brown Belt in Kempo Karate)