Notes From The Trenches: Obstacles and Challenges To IT Environment Stability
Recently I met with Avi Diller, a test consultant for Test Environment Management, and we discussed his experiences and the challenges he has encountered in running a pre-production environment for a large financial institution.
Martin: What is your experience in IT environment management?
My experience is in managing work environments, particularly test environments for large financial institutions. The main environment I am referring to here is pre-production, which provides a work environment for the release testers and release developers, and needs to function with a minimum of downtime. The environment consists of hundreds of servers and workstations, and focuses on testing and development. I have seen that investment in the management and coordination of a test environment infrastructure that resembles a production environment can reduce the number of release failures in production by as much as a third.
Martin: Does your environment deal with a wide variety of technologies?
Yes, I am familiar with and manage multiple environments that consist of a wide range of different technologies. Just about everything, you name it. And there are systems that span all of these platforms.
Martin: What is your release schedule like?
Since we are talking about hundreds of systems, that means releases for all of these systems. To keep this going, we require maximum reliability, stability, and flexibility in the environment.
- Reliability: we have to maintain uptime, and keep the environment accessible.
- Stability: keep failures to a minimum, and keep this a solid working environment.
- Flexibility: Unlike the production environment, the pre-production environment needs to handle release after release, until we have a final working release that we will then push out to production. Many releases are checked and then go back to the developer rather than on to production. So for the environment to be useful, it has to be versatile.
Martin: What types of incidents do you encounter?
In my experience, we mainly encounter misconfigurations, like a parameter that points to a place that is not relevant to the production environment. An interesting figure: according to Forrester, up to 60% of application downtime is caused by misconfigurations and application service errors. There are also some integration issues, where the systems need to be checked end to end, not just in the current environment. They need to be tested in larger environments, to see how they perform in conjunction with other systems and other databases. You need to reduce risk. Simply reduce it.
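To make the misconfiguration example concrete, here is a minimal Python sketch of a pre-release check that flags parameters pointing somewhere other than production. It is only an illustration under stated assumptions: the host suffix `.prod.example.com`, the parameter names, and the function `find_suspect_parameters` are all hypothetical, and nothing in the interview says this is how the check was actually done.

```python
from urllib.parse import urlparse

# Hypothetical allow-list: host suffixes assumed to belong to production.
PRODUCTION_SUFFIXES = (".prod.example.com",)


def find_suspect_parameters(config, allowed_suffixes=PRODUCTION_SUFFIXES):
    """Return the names of URL-valued parameters whose host is not on the allow-list."""
    suspects = []
    for name, value in sorted(config.items()):
        host = urlparse(str(value)).hostname
        if host is None:
            continue  # value is not URL-like (e.g. a number), so skip it
        if not host.endswith(tuple(allowed_suffixes)):
            suspects.append(name)
    return suspects


# Illustrative release configuration (all names and hosts are made up).
release_config = {
    "ledger_db_url": "postgresql://ledger.prod.example.com:5432/ledger",
    "rates_service_url": "https://rates.test.example.com/api",
    "batch_size": 500,
}

print(find_suspect_parameters(release_config))
```

In this made-up configuration, `rates_service_url` still points at a test host, so it is the one parameter reported. A real check would need an allow-list agreed with whoever owns the production environment.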
Martin: And what were the results of incidents?
The main consequence of these kinds of incidents is usually that the release doesn't work. That could produce a full systems crash in the production environment, where the release stops functioning either immediately or when it gets to a certain point of operation. If we translate this scenario into real-world terms, let’s say for a bank, then this could result in a bank branch opening in the morning and discovering that their computers can’t access critical bank information. So basically, a problematic release can really screw up the functioning of a bank and its branches, resulting in major financial losses.
Martin: What is the scope of incidents?
The impact of incidents is downtime that lasts at least several hours, and can run up to a day or two.
Martin: What accounts for the downtime?
Most of the time is spent on incident investigation, hands down. When you have a problem in your system, you know that it isn't working, but you have no idea why it isn't working. That can be very frustrating.
Martin: What is the investigation process?
You start the investigation by checking with users and checking other things, but you really don't have the full picture you need to know why it’s not working, or simply to put your finger on 'what changed'. The difference between releases may show up while still at the pre-production stage, or only once the release is in the production environment. During the investigation, the developer sits down with an infrastructure guy and they start to define tracers, read logs, and track the system to see that everything is functioning as defined. Then you keep trying to figure out what changed, when you are really not certain about what changed.
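The 'what changed' question can be partly mechanized by comparing configuration snapshots taken before and after a release. The Python sketch below is a minimal illustration, assuming each snapshot is available as a flat dictionary of parameter names to values. The function `diff_config` and the sample parameters are hypothetical and are not taken from the environment described in the interview.

```python
def diff_config(before, after):
    """Compare two flat configuration snapshots and report what changed."""
    changes = {"added": {}, "removed": {}, "changed": {}}
    for key in after.keys() - before.keys():
        changes["added"][key] = after[key]
    for key in before.keys() - after.keys():
        changes["removed"][key] = before[key]
    for key in before.keys() & after.keys():
        if before[key] != after[key]:
            # Record the old and new value side by side.
            changes["changed"][key] = (before[key], after[key])
    return changes


# Illustrative snapshots (all parameter names and values are made up).
snapshot_before = {"pool_size": 20, "queue_host": "mq01", "timeout_s": 30}
snapshot_after = {"pool_size": 20, "queue_host": "mq02", "retry_limit": 3}

report = diff_config(snapshot_before, snapshot_after)
for kind in ("added", "removed", "changed"):
    print(kind, report[kind])
```

A report like this does not explain why a release fails, but it narrows the search to the parameters that actually differ between the two snapshots.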
Martin: So you have a lot of ‘try and fail’ during the investigation process?
Absolutely, investigation is one big testing game. First you test what you know for sure: you run the tests, and you get a negative. You run the test again, and again a negative response. You keep on trying and failing until, well, until something comes up. The investigation is what takes up most of the downtime.
About Avi Diller
Avi Diller is a consultant for Test Environment Management. He offers comprehensive solutions for test environment management focused on the particular needs of IT in the financial sector, combining best practices, processes, and best-of-breed technologies. Avi Diller brings extensive experience in the domain of IT environment management, with many years working at leading financial institutions.