This was a terrific read, outlining the playbook and principles behind managing your IT Operations to deliver increased value for your business and your customers.
The following are my favourite passages from the book The Phoenix Project by Gene Kim, George Spafford, and Kevin Behr, along with my own framework for laying out the concepts.
I. “The Three Ways”
The book’s main ideas are presented by the fictional character Dr. Erik Reid, who describes the Three Ways of designing and managing Development and Operations (DevOps). The Three Ways focus on delivering a continuous flow of value to the customer, incorporating fast feedback loops, and building a culture of continuous experimentation and improvement.
“The First Way helps us understand how to create fast flow of work as it moves from Development into IT Operations, because that’s what’s between the business and the customer.
The Second Way shows us how to shorten and amplify feedback loops, so we can fix quality at the source and avoid rework.
And the Third Way shows us how to create a culture that simultaneously fosters experimentation, learning from failure, and understanding that repetition and practice are the prerequisites to mastery.”
II. The First Way
“You must gain a true understanding of the business system that IT operates in. Go talk to the business process owners for the objectives. Find out what their exact roles are, what business processes underpin their goals, and then get from them the top list of things that jeopardize those goals.”
A few of the “First Way” concepts explored throughout the book include:
- Kanban boards
- Batch Size
- Idle Time
- Capacity and Resources
- Production Control
Kanban Boards, Batch Size, and Bottlenecks
- “A kanban board, among many other things, is one of the primary ways our manufacturing plants schedule and pull work through the system. It makes demand and WIP visible, and is used to signal upstream and downstream stations.”
- “In any system of work, the theoretical ideal is single-piece flow, which maximizes throughput and minimizes variance. You get there by continually reducing batch sizes.”
- “Now everyone knows that you don’t release work based on the availability of the first station. Instead, it should be based on the tempo of how quickly the bottleneck resource can consume the work.”
- “Eliyahu M. Goldratt, who created the Theory of Constraints, showed us how any improvements made anywhere besides the bottleneck are an illusion. Astonishing, but true! Any improvement made after the bottleneck is useless, because it will always remain starved, waiting for work from the bottleneck. And any improvements made before the bottleneck merely results in more inventory piling up at the bottleneck.”
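To convince myself of the bottleneck argument, I found it helpful to sketch it as a toy simulation. This is only an illustration under my own assumptions: the station names, the daily rates, and the `simulate` function are all made up, and work is released at the pace of the first station, which is exactly the habit the quotes warn against.

```python
def simulate(rates, days):
    """Push work through serial stations; return (finished items, queue sizes).

    `rates` maps station name to items it can process per day (hypothetical).
    """
    names = list(rates)
    queues = {name: 0 for name in names}
    finished = 0
    for _ in range(days):
        # Release new work at the pace of the first station (the anti-pattern).
        queues[names[0]] += rates[names[0]]
        for i, name in enumerate(names):
            done = min(queues[name], rates[name])
            queues[name] -= done
            if i + 1 < len(names):
                queues[names[i + 1]] += done  # hand off downstream
            else:
                finished += done  # left the last station
    return finished, queues


# Hypothetical line where "ops" is the bottleneck at 3 items per day.
rates = {"dev": 10, "qa": 8, "ops": 3, "release": 6}

base_finished, base_queues = simulate(rates, days=10)
print(base_finished, base_queues)
# 30 {'dev': 0, 'qa': 20, 'ops': 50, 'release': 0}

# Speeding up a station before the bottleneck only grows the queues.
print(simulate({**rates, "dev": 20}, days=10)[0])      # 30

# Speeding up a station after the bottleneck changes nothing; it stays starved.
print(simulate({**rates, "release": 12}, days=10)[0])  # 30

# Only improving the bottleneck itself raises throughput.
print(simulate({**rates, "ops": 5}, days=10)[0])       # 50
```

In this toy model the output over ten days is always ten times the rate of the slowest station, no matter how fast the others are.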
Slack and Idle Time
- “Everyone needs idle time, or slack time. If no one has slack time, WIP gets stuck in the system. Or more specifically, stuck in queues, just waiting.”
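Elsewhere in the book, Erik gives a rule of thumb for this: wait time is the percentage of time busy divided by the percentage of time idle. The sketch below just tabulates that ratio. The function name and the sample utilization values are my own, and the result is a relative figure, not a measurement in hours.

```python
def relative_wait_time(utilization):
    """Rule-of-thumb wait time: fraction of time busy / fraction of time idle."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be at least 0 and below 1")
    return utilization / (1 - utilization)


# Sample utilization levels chosen for illustration only.
for busy in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"{busy:.0%} busy -> relative wait {relative_wait_time(busy):.1f}")
```

The point of the exercise is the shape of the curve: going from 50% to 90% busy multiplies the relative wait roughly ninefold, which is why a resource with no slack turns into a queue.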
Capacity and the Bill of Resources
- “What you’re building is the bill of materials for all the work that you do in IT Operations. But instead of a list of parts and subassemblies, like moldings, screws, and casters, you’re cataloging all the prerequisites of what you need before you can complete the work—like laptop model numbers, specifications of user information, the software and licenses needed, their configurations, version information, the security and capacity and continuity requirements, yada yada…” He interrupts himself, saying, “Well, to be more accurate, you’re actually building a bill of resources. That’s the bill of materials along with the list of the required work centers and the routing. Once you have that, along with the work orders and your resources, you’ll finally be able to get a handle on what your capacity and demand is. This is what will enable you to finally know whether you can accept new work and then actually be able to schedule the work.”
Production Scheduling and Control
- “We’re doing what Manufacturing Production Control Departments do. They’re the people that schedule and oversee all of production to ensure they can meet customer demand. When they accept an order, they confirm there’s enough capacity and necessary inputs at each required work center, expediting work when necessary. They work with the sales manager and plant manager to build a production schedule so they can deliver on all their commitments.”
- “You must figure out how to control the release of work into IT Operations and, more importantly, ensure that your most constrained resources are doing only the work that serves the goal of the entire system, not just one silo.”
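One way I picture the bill of resources and the controlled release of work is as a small capacity check: each work order lists the hours it needs at each work center, and new work is accepted only if every work center on its routing still has room. This is a minimal sketch under my own assumptions. The class names, work centers, and hour figures are hypothetical and do not come from the book.

```python
from dataclasses import dataclass, field


@dataclass
class WorkOrder:
    name: str
    routing: dict  # work center -> hours required (hypothetical figures)


@dataclass
class Schedule:
    capacity: dict  # work center -> hours available in the planning period
    committed: dict = field(default_factory=dict)

    def can_accept(self, order):
        """True if every work center on the routing has enough hours left."""
        return all(
            self.committed.get(center, 0) + hours <= self.capacity.get(center, 0)
            for center, hours in order.routing.items()
        )

    def accept(self, order):
        """Commit the hours if the order fits; otherwise leave the schedule as is."""
        if not self.can_accept(order):
            return False
        for center, hours in order.routing.items():
            self.committed[center] = self.committed.get(center, 0) + hours
        return True


# "brent" stands in for the most constrained resource.
schedule = Schedule(capacity={"brent": 10, "dba": 20, "network": 15})

laptop_refresh = WorkOrder("laptop refresh", {"dba": 4, "network": 2})
db_migration = WorkOrder("db migration", {"brent": 8, "dba": 10})
audit_fix = WorkOrder("audit fix", {"brent": 5})

print(schedule.accept(laptop_refresh))  # True
print(schedule.accept(db_migration))    # True
print(schedule.accept(audit_fix))       # False: "brent" would need 13 of 10 hours
```

The rejected order is the interesting case: the schedule as a whole has spare hours, but the constrained resource does not, so the work has to wait or something else has to be dropped.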
III. The Second Way
“Now you must prove that you can master the Second Way, creating constant feedback loops from IT Operations back into Development, designing quality into the product at the earliest stages. To do that, you can’t have nine-month-long releases. You need much faster feedback.”
A few of the “Second Way” concepts explored throughout the book include:
- Amplifying Feedback Loops through the Value Stream
- Creating Quality at the Source
- Enabling Fast Detection and Recovery
- Creating or Embedding Institutional Knowledge
Minimize Waste (Feedback Loops)
- “Whenever we see work go backward, that’s rework. When that happens, you can bet that the amount of documentation and information flow is going to be pretty poor, which means nothing is reproducible and that it’s going to get worse over time as we try to go faster. They call this ‘non-value-add’ activity or ‘waste.’”
Integrating Information Security (Creating Quality at the Source)
- “‘How do you propose we do it?’ We quickly agree to pair up people in Wes’ and Chris’ group with John’s team, so that we can increase the bench of security expertise. By doing this, we will start integrating security into all of our daily work, no longer securing things after they’re deployed.”
Situational Awareness (Fast Detection and Recovery)
- “We need to establish an accurate timeline of relevant events. And so far, we’re basing everything on hearsay. That doesn’t work for solving crimes, and it definitely doesn’t work for solving outages.”
Knowledge Capture (Creating Institutional Knowledge)
- “Every time that we let Brent fix something that none of us can replicate, Brent gets a little smarter, and the entire system gets dumber. We’ve got to put an end to that. Maybe we create a resource pool of level 3 engineers to handle the escalations, but keep Brent out of that pool. The level 3s would be responsible for resolving all incidents to closure, and would be the only people who can get access to Brent. They’d be responsible for documenting what they learned, and Brent would never be allowed to work on the same problem twice. At the end of each incident, we’ll have one more article in our knowledge base of how to fix a hairy problem and a growing pool of people who can execute the fix.”
IV. The Third Way
“We need to create a culture that reinforces the value of taking risks and learning from failure and the need for repetition and practice to create mastery.”
A few of the “Third Way” concepts explored throughout the book include:
- Continuous Improvement
- High-Trust Teams and Vulnerability
- Resilience Engineering
- Crisis Management and Pre-Mortems
Continuous Improvement (Repetition and Practice)
- “Mike Rother says that it almost doesn’t matter what you improve, as long as you’re improving something. Why? Because if you are not improving, entropy guarantees that you are actually getting worse, which ensures that there is no path to zero errors, zero work-related accidents, and zero loss.”
Five Dysfunctions of a Team (High-Trust Teams)
- “Solving any complex business problem requires teamwork, and teamwork requires trust. Lencioni teaches that showing vulnerability helps create a foundation for that.”
- “In his model, the five dysfunctions are described as:
- Absence of trust: unwilling to be vulnerable within the group
- Fear of conflict: seeking artificial harmony over constructive passionate debate
- Lack of commitment: feigning buy-in for group decisions creates ambiguity throughout the organization
- Avoidance of accountability: ducking the responsibility to call peers on counterproductive behavior, which sets low standards
- Inattention to results: focusing on personal success, status, and ego before team success”
Resilience and Chaos Engineering
- “Resilience engineering tells us that we should routinely inject faults into the system, doing them frequently, to make them less painful.”
- “John loved this, and started a new project called “Evil Chaos Monkey.” Instead of generating operational faults in production, it would constantly try to exploit security holes, fuzz our applications with storms of malformed packets, try to install backdoors, gain access to confidential data, and all sorts of other nefarious attacks.”
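The fault-injection idea in the first quote can be tried at a very small scale before going anywhere near production. The sketch below is my own illustration, not how the book's teams or any real chaos tool does it: `with_faults`, `call_with_retry`, `InjectedFault`, and the `fetch_order_status` service call are all hypothetical names, and the 30% failure rate is arbitrary.

```python
import random


class InjectedFault(Exception):
    """Raised on purpose to simulate a failure; not a real outage."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that roughly `failure_rate` of calls fail deliberately."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """Call func, retrying on injected faults; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except InjectedFault:
            if attempt == attempts:
                raise


def fetch_order_status():
    # Hypothetical service call standing in for a real dependency.
    return "shipped"


rng = random.Random(42)  # seeded so a drill can be repeated
flaky_fetch = with_faults(fetch_order_status, failure_rate=0.3, rng=rng)

survived = 0
for _ in range(100):
    try:
        call_with_retry(flaky_fetch, attempts=3)
        survived += 1
    except InjectedFault:
        pass  # the caller ran out of retries

print(f"{survived}/100 calls survived with retries")
```

Running a drill like this regularly shows whether the calling code actually copes with failure, which is the habit the quote is arguing for.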
Crisis Management and Pre-Mortems
- “I want you to host practice incident calls and fire drills every two weeks. We need to get everyone used to solving problems in a methodical way and to have the timeline available before we go into that meeting. If we can’t do this during a prearranged drill, how can we expect people to do it during an emergency?”
V. Authors’ Closing Thoughts
“Imagine living in a DevOps world, where product owners, Development, QA, IT Operations, and InfoSec work together relentlessly to help each other and the overall organization win.
They are enabling fast flow of planned work into production (e.g., performing tens, hundreds, or even thousands of code deploys per day), while preserving world-class stability, reliability, availability, and security.”
I greatly enjoyed the read. The authors provide an insightful perspective on how multiple teams can come together to deliver fast flow of value and help the business win.
All content credit goes to the authors. I’ve simply shared the bits I’ve enjoyed the most and found most useful.
Cheers ’til next time!