Compile Flight Rules for Your Software Engineering Team

by Edmond Lau

Photo credit: Matthew Simantov

In his book, An Astronaut’s Guide to Life on Earth, Chris Hadfield shares the lessons he learned on his seemingly impossible journey to become the first Canadian to walk in space. He tells stories about the realistic simulations he worked through to prepare for space, about his daily life on his 6-month mission aboard the International Space Station, and even about the mechanics of how astronauts brush their teeth in space.

What left a particularly strong impression on me was an amazing tome of knowledge that Hadfield describes, one that took humanity decades to produce. “NASA has been capturing our missteps, disasters and solutions since the early 1960s,” he writes, “when Mercury-era ground teams first started gathering ‘lessons learned’ into a compendium that now lists thousands of problematic situations, from engine failure to busted hatch handles to computer glitches, and their solutions.”

The compendium is known as Flight Rules. Amazingly, you can find an online copy of all 2200+ pages of Volume A of those rules. Given a particular set of circumstances, the manuals enumerate, step by step, what to do and why. Have a cooling system failure? Flight Rules tell you step by step how to fix it and the rationale for each step. Fuel cell issue? Flight Rules tell you whether the launch needs to be postponed. “They are extremely detailed, scenario-specific standard operating procedures,” Hadfield continues, containing all the lessons ever learned and distilled from past missions.

That so much valuable knowledge could be crammed into a manual made me wonder: What would a similar compendium at a software engineering company look like, and what would having one let us do? Certainly, it would have step-by-step operational rules for our systems. MySQL database failure? Flight Rules would tell you how to fail over from the master to the slave. Servers overloaded by a traffic spike? The rules would tell you which scripts to run to bring up extra capacity.
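To make the idea concrete, here is a minimal sketch of how such operational rules might be stored and looked up. It is only an illustration: the names `Step`, `FlightRule`, `find_rules`, and `RULES`, the tagging scheme, and the sample steps are all hypothetical and not taken from NASA's Flight Rules or from any real runbook. The one thing it borrows from the original is the pairing of each step with its rationale.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    action: str     # what to do
    rationale: str  # why the step exists, as Flight Rules record for each step


@dataclass
class FlightRule:
    scenario: str         # the situation this rule covers
    tags: frozenset[str]  # lowercase keywords used for lookup (hypothetical scheme)
    steps: list[Step] = field(default_factory=list)


def find_rules(rules, *keywords):
    """Return the rules whose tags include every given keyword (case-insensitive)."""
    wanted = {k.lower() for k in keywords}
    return [rule for rule in rules if wanted <= rule.tags]


# Hypothetical sample entries; the steps are placeholders, not real procedures.
RULES = [
    FlightRule(
        scenario="MySQL master is unreachable",
        tags=frozenset({"mysql", "failover"}),
        steps=[
            Step("Confirm the outage from a second host",
                 "Rules out a network problem local to one machine"),
            Step("Fail over from the master to the slave",
                 "Restores writes while the master is investigated"),
        ],
    ),
    FlightRule(
        scenario="Web servers overloaded by a traffic spike",
        tags=frozenset({"capacity", "web"}),
        steps=[
            Step("Run the capacity scripts to bring up extra servers",
                 "Adds headroom before requests start timing out"),
        ],
    ),
]

if __name__ == "__main__":
    for rule in find_rules(RULES, "mysql"):
        print(rule.scenario)
        for number, step in enumerate(rule.steps, start=1):
            print(f"  {number}. {step.action} (why: {step.rationale})")
```

Keeping the rationale next to each action is the design choice that matters here: a new team member can follow the steps during an incident and still learn why each one is there.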

But it could also have more general patterns of how to run projects. Project falling behind schedule? Flight Rules could tell you what happened when different project teams in the past worked overtime, what those teams thought were the main contributors to their eventual success or failure, and whether team members burned out. Have an idea for a new ranking algorithm? You’d find in the Flight Rules a compilation of all the A/B tests run in the past, what the hypotheses were, and whether the experiments confirmed or rejected those hypotheses.

Such a resource seems like it would be immensely valuable, especially for new people joining the team or the company. So what would it take to get there?

The “cultural staple” at NASA is the mission debrief. Debriefs after a mission provide the opportunity to retrospect on the mission, to figure out what went wrong and what could have been done better. They’re also extremely intense. Experts fire barrages of questions, and every action is flagged and dissected for what went wrong. A 4-hour simulation might be followed by a 1-hour debrief. A space flight would be followed by a month or more of all-day debriefs. You have to steady yourself for feedback and remember that the goal is not to shame you for your mistakes but to maximize collective wisdom.

Given that it costs $450 million per mission to launch a space shuttle [1], it’s not hard to understand why NASA found the full-scale debriefs necessary – each mistake is costly both in terms of taxpayer dollars and astronauts’ lives.

Most engineering companies don’t work on something as mission-critical as space flight. The demands of nimbleness and the lower cost of failure at many companies make NASA’s level of rigor infeasible, just like using The Lean Startup methodology isn’t the best approach for NASA.

But the focus on building collective wisdom into a set of Flight Rules is still a pattern that we could adopt more widely in software engineering.

Conducting engineering post-mortems – the analog to NASA’s mission debrief – is a fairly well-known practice. After a site outage, high-priority bug, or infrastructure issue, engineering teams follow up with a detailed writeup explaining what happened, how and why it happened, and what actions we can take to prevent it from happening again. If the situation is not preventable, building a tool to make recovery easier or writing a step-by-step document to explain how to handle the situation next time is a reasonable alternative. Many teams already do this, with varying degrees of rigor. Some companies like Asana and Amazon adopt methodologies like Toyota’s “Five Whys” to understand the root cause of operational issues [2][3].

What tends to be much less common is introducing the same level of retrospection to projects and launches, and compiling the lessons learned to be shared across the entire organization for future product missions. A feature launches with a TechCrunch writeup, champagne glasses get clinked, and everyone celebrates a job well done. But how effective was that effort at achieving the team’s goals? Or a team rewrites the infrastructure code and makes it 5% faster after a few months of work. Was that actually the best use of the team’s time? Without pausing to debrief and review the data, it’s hard to know.

Of course, there’s some friction to doing this better. Teams might not have defined a clear goal or metric for a launch, making it difficult to assess whether it was successful. They might hold closed debriefs and hesitate to share the lessons with the rest of the organization because they don’t want to declare their months of work a failure. Or they might just feel like there’s so much to do that they don’t have time to reflect – a frequent situation at startups. I know that I’ve been guilty of being in each of these groups myself.

The result is that opportunities for building collective wisdom get lost. Lessons learned, rather than getting compiled in a compendium like Flight Rules, get isolated in a few people’s heads. Costly mistakes get repeated. And when people leave, the collective wisdom decreases.

Fixing that loss requires work. You have to adopt a mindset where you’re open and receptive to feedback. You have to focus on increasing collective wisdom and not on assigning blame. If people around you do the same, it can become part of the engineering culture and lower the friction of what can otherwise be a frightening exercise. And you have to develop a habit of retrospecting on your own missions.

“A comprehensive tour of our industry's collective wisdom written with clarity.”

— Jack Heart, Engineering Manager at Asana

“Edmond managed to distill his decade of engineering experience into crystal-clear best practices.”

— Daniel Peng, Senior Staff Engineer at Google
