Improving IT: Visible Ops
August 1, 2010
An IT organization can be made up of the best people around, supported by a state-of-the-art infrastructure with blazingly fast networks and top-of-the-line servers and workstations, and still have problems with servers crashing and systems running slowly. The best way to alleviate this type of situation and regain control of IT is to implement the Information Technology Infrastructure Library (ITIL). ITIL has a reputation for being complex, intrusive, poorly tailored to individual environments, and slow to implement, but its only real shortcoming is that it does not react quickly enough to change. The Visible Ops approach addresses this and other common complaints about ITIL in four steps.
The first step in the Visible Ops process is to stop making changes to your environment. A doctor would not think of performing major surgery on a patient without first stopping all of the peripheral bleeding. The same approach should be taken in an IT department. People will often push for changes or one last "hotfix" to resolve the surface problem, but you cannot figure out which systems are causing outages until you freeze the changes occurring on all systems. Stopping all modifications establishes a stable platform, which makes it possible to find the system or systems at fault. Once the environment is stabilized, it is possible to proceed.
The second step is to take inventory. Like a naturalist who goes into the field to identify, tag, and categorize healthy animals and to quarantine or isolate injured ones, an IT expert must identify each device, server, and system in the datacenter. During this inventory, all fragile artifacts (any system, computer, or device that is dangerously weak and inexplicably crashes) must be identified and tagged. Fragile artifacts are the problem children of an infrastructure; when they fail they take hours or days to repair, and they consume the majority of an IT department's time.
The change freeze from step one should still be in effect, but take an extra precaution: grab a pad of neon-colored Post-it notes (the big size that takes up half a normal sheet of paper) and a roll of duct tape, write "DO NOT TOUCH!" on a note, and tape it to each fragile artifact. Use as many notes as necessary to avoid any misunderstanding between the Visible Ops team and those strolling through the datacenter looking for something to do.
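The inventory from this step can also be kept in machine-readable form alongside the physical tags. The following is a minimal Python sketch, not something prescribed by the handbook; the `Asset` fields, the host names, and the helper functions are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One inventoried device, server, or system (fields are illustrative)."""
    name: str
    kind: str                                       # e.g. "server", "switch"
    roles: list[str] = field(default_factory=list)  # what the asset is used for
    fragile: bool = False                           # True means "DO NOT TOUCH!"


def fragile_artifacts(inventory):
    """Return the assets tagged as fragile, sorted by name."""
    return sorted((a for a in inventory if a.fragile), key=lambda a: a.name)


def count_by_kind(inventory):
    """Summarize how many assets of each kind were found."""
    return dict(Counter(a.kind for a in inventory))


# Hypothetical sample inventory.
inventory = [
    Asset("web01", "server", ["dev-web", "prod-web", "print"], fragile=True),
    Asset("dns01", "server", ["dns"]),
    Asset("core-sw1", "switch"),
]

for asset in fragile_artifacts(inventory):
    print(f"DO NOT TOUCH: {asset.name} (roles: {', '.join(asset.roles)})")
```

Recording the roles each asset serves is a design choice made here because it shows at a glance which machines have accumulated too many responsibilities.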
After the inventory is complete, it will be easier to understand all the different types of systems in the infrastructure. The third step of Visible Ops is to create a build library for each type of device, which can include automated server builds for DNS or application servers, configuration files for switches and routers, and scripts that automatically install higher-level applications. Once the repeatable build library has been constructed, any system in the library that suffers an outage can be rebuilt quickly and efficiently, and fragile artifacts can be replaced with fresh builds.
Fragile artifacts usually become unstable because too many functions were combined onto one piece of hardware (for example, a server that started out as a development web server, was later placed into production, and then had shared printers installed on it). A build library allows three separate servers to quickly replace the one fragile artifact. Because these servers are built in a standard way with only one role assigned to each, they are inherently more stable and easier to troubleshoot.
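As a rough illustration of replacing one overloaded machine with several single-role builds, the sketch below models a build library as a mapping from role to a list of build steps. The role names, step descriptions, and the `plan_replacements` helper are assumptions made for this example; a real build library would hold automated installers and configuration files rather than strings.

```python
# Hypothetical build library: one repeatable recipe (list of steps) per role.
BUILD_LIBRARY = {
    "dev-web": ["install standard OS image", "install web server", "apply dev config"],
    "prod-web": ["install standard OS image", "install web server", "apply prod config"],
    "print": ["install standard OS image", "install print services", "add shared printers"],
}


def plan_replacements(artifact_name, roles):
    """Plan one single-role server for each role the fragile artifact carried.

    Raises KeyError if a role has no recipe in the build library yet.
    """
    plans = []
    for role in roles:
        if role not in BUILD_LIBRARY:
            raise KeyError(f"no build recipe for role {role!r}")
        plans.append({
            "name": f"{artifact_name}-{role}",
            "role": role,
            "steps": list(BUILD_LIBRARY[role]),  # copy so plans stay independent
        })
    return plans


# The overloaded server from the example above becomes three standard builds.
plans = plan_replacements("web01", ["dev-web", "prod-web", "print"])
for plan in plans:
    print(plan["name"], "->", len(plan["steps"]), "steps")
```

Failing loudly on a missing recipe is deliberate in this sketch: a role without a repeatable build is a gap in the library that should be filled before the fragile artifact is retired.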
The fourth phase of Visible Ops brings the previous steps together in a way that improves performance in each step and enhances delivery. Choosing the right metrics for what needs improvement is critical to continuous improvement. The metrics fall into three areas:
- Release - how quickly and effectively is infrastructure provisioned?
  - Time to provision a known good build
  - Percent of systems that match known good builds
  - Ratio of release engineers to system engineers (higher is better)
- Controls - how effectively are good decisions made that keep production infrastructure available, predictable, and secure?
  - Number of changes authorized per week
  - Number of actual changes per week
  - Number of unauthorized changes
- Resolutions - when things go wrong, how effectively are issues diagnosed and resolved?
  - MTTR (Mean Time To Repair)
  - MTBF (Mean Time Between Failures)
These sample metrics help identify the improvement points in each step of the process and target changes where they are needed.
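Several of these metrics reduce to simple arithmetic once incidents and changes are recorded. The sketch below is one possible way to compute them in Python; the record formats, change IDs, and build names are hypothetical, and MTBF is taken here as the average uptime between one restoration and the next failure, which is only one of the definitions in common use.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """MTTR: average of (restored - failed) over all incidents."""
    durations = [restored - failed for failed, restored in incidents]
    return sum(durations, timedelta()) / len(durations)


def mean_time_between_failures(incidents):
    """MTBF: average uptime from one restoration to the next failure.

    Needs at least two incidents; this definition is an assumption.
    """
    ordered = sorted(incidents)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


def unauthorized_changes(authorized_ids, actual_ids):
    """Changes that were made but never authorized."""
    return sorted(set(actual_ids) - set(authorized_ids))


def percent_matching(system_builds, known_good):
    """Percent of systems whose build is in the set of known good builds."""
    matching = sum(1 for build in system_builds.values() if build in known_good)
    return 100.0 * matching / len(system_builds)


# Hypothetical records: (failed_at, restored_at) pairs.
incidents = [
    (datetime(2010, 7, 1, 9, 0), datetime(2010, 7, 1, 11, 0)),
    (datetime(2010, 7, 11, 9, 0), datetime(2010, 7, 11, 13, 0)),
]
builds = {"web01": "web-v3", "web02": "web-v3", "dns01": "hand-built", "db01": "db-v1"}

print("MTTR:", mean_time_to_repair(incidents))
print("MTBF:", mean_time_between_failures(incidents))
print("Unauthorized:", unauthorized_changes(["CHG-1", "CHG-2"], ["CHG-1", "CHG-2", "CHG-9"]))
print("Percent matching:", percent_matching(builds, {"web-v3", "db-v1"}))
```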
The main goal of each of these steps is to provide four tangible benefits:
- To move the most experienced staff into pre-production engineering roles, which allows the best people to identify problems with systems before they reach production and to automate those systems so they are quickly provisioned once in production.
- To increase the amount of time spent proactively fixing problems instead of reactively putting out fires.
- To boost productivity by increasing change rates, change success rates, and the business value of changes.
- To keep closing the loop by using detective controls to carefully reduce variance (including configuration variance, variance between planned work and actual work, and variance between builds).
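One simple form of detective control for configuration variance is to compare fingerprints of current configuration files against a known good baseline. The following Python sketch assumes the configurations are available as text; the file paths, contents, and the `detect_variance` helper are hypothetical and stand in for whatever change-detection tooling is actually in use.

```python
import hashlib


def fingerprint(config_text):
    """Return a SHA-256 hex digest of a configuration's contents."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def detect_variance(baseline, current):
    """Compare current fingerprints against the known good baseline.

    Both arguments map a file path to a fingerprint. The result lists
    paths that changed, went missing, or appeared unexpectedly.
    """
    return {
        "changed": sorted(p for p in baseline if p in current and current[p] != baseline[p]),
        "missing": sorted(p for p in baseline if p not in current),
        "unexpected": sorted(p for p in current if p not in baseline),
    }


# Hypothetical baseline taken from a known good build.
baseline = {
    "/etc/ntp.conf": fingerprint("server ntp-a.example.com\n"),
    "/etc/resolv.conf": fingerprint("nameserver 10.0.0.1\n"),
}
# Hypothetical state observed on a production system.
current = {
    "/etc/ntp.conf": fingerprint("server ntp-b.example.com\n"),
    "/etc/hosts": fingerprint("127.0.0.1 localhost\n"),
}

report = detect_variance(baseline, current)
for category, paths in report.items():
    print(category, paths)
```

Any path reported here that does not correspond to an authorized change is a candidate for the "number of unauthorized changes" metric.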
Once implementation of these changes is complete, a more formalized ITIL methodology emerges. This solid foundation helps ensure the success of the IT department and of any ITIL initiatives that are undertaken. This article is based on The Visible Ops Handbook by Behr, Kim, and Spafford (ISBN 0-9755686-1-2), which offers knowledgeable advice and is worth consulting to better understand the implementation of ITIL.
