And He Sticks the Landing


Landing and reusing a rocket has become a seemingly everyday activity that no longer inspires the sense of awe that it used to, despite the absurd amount of technological coordination involved. Call me an old man pining for the days of yesteryear, but launching a flying bomb into space, deploying a payload, and then landing that rocket back on Earth, only to launch it again a month later will always get me excited. 🚀

And while reusable rockets are an incredible accomplishment of human brainpower thanks to literal rocket scientists, often overlooked are the safeguards that had to be designed and implemented in case things don’t go according to plan. You no longer have a rocket on a one-way ticket up and out over the ocean, safely away from humanity if it explodes (sorry, fish). Now these rockets are coming back to land either on the coast or on drone ships, near civilization and expensive infrastructure. And thanks to the transparency of SpaceX, we have plenty of blooper reels to show how things can go spectacularly wrong.

Space and software, about equally hard

Just like in space, when something goes wrong in our enterprise software application, it isn’t as easy as quickly deploying a fix that is immediately pushed to every user (alright, maybe space is a LITTLE more complicated, but close enough 😉). Enterprise-scale companies almost always demand tighter control over their applications, and upgrades are not as simple as they are in many B2C products. Consequently, we need to exercise even more caution when designing our products and considering what happens when things break down. Bugs are going to happen, so let's be realistic and make sure they degrade gracefully and don't result in a catastrophic fireball!

SpaceX takes extra precautions in the safeguards it builds into its hardware engineering process to ensure that threats to human life (and expensive infrastructure) are limited, using a “guilty until proven innocent” model that assumes something will fail unless proven otherwise. Pessimism is rarely rewarded, but in this case it can often save the day; while people deserve the benefit of the doubt, software deserves the presumption of guilt.

A great example of this principle in effect can be seen in the landing sequence for the Falcon 9, SpaceX’s workhorse reusable rocket platform, as depicted by my scientifically accurate representation below.

Very scientific depiction of actual rocket science


After the rocket has launched, deployed its payload in space, and is returning to Earth, the guidance computer assumes that something will go wrong during the landing sequence and intentionally aims the rocket to miss its intended landing point, either a landing pad on the coast or a drone ship in the ocean. It continues on this "miss trajectory" until the very last minute when, if and only if everything checks out, the rocket corrects its trajectory to land in the intended area. By following this seemingly counterintuitive procedure, SpaceX minimizes the risk to personnel and infrastructure posed by a failure early in the landing sequence. Even if something as severe as a total engine failure occurs, the rocket is aimed for a (relatively) harmless splashdown in the ocean until the last minute.

Incorporating fail-safes into software development

This sequencing demonstrates a fail-safe in the most literal sense: assume the system will fail and design around that condition. Fail-safe designs are all around us and are used extensively in systems engineering, but we don’t regularly apply the same principles to software development. After all, a software product is just another system with many pieces depending on each other. In an enterprise world where the cost to fix a failure is often both high and painful, fail-safes can save the day if software is assumed guilty until proven innocent.

The more common approach currently taken in software development is exception-based. We assume the system will follow the happy path and try to catch all the exceptions that could occur, individually and explicitly. Why doesn't this work well? It relies on our own ingenuity to determine every way the system could break. There’s no reason to put that burden on us when we can instead assume the system will break and design for graceful degradation.
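To make the contrast concrete, here’s a minimal Python sketch. Everything in it is hypothetical (the FlakyStore stand-in, the record names, the pending list); it just illustrates the shape of the two approaches. The exception-based version only handles the failures we thought to enumerate, while the fail-safe version parks the record somewhere safe first and only clears that safety net once the write provably succeeded.

```python
class FlakyStore:
    """Stand-in for any dependency that can fail (hypothetical)."""

    def write(self, record: str) -> None:
        raise ConnectionError("store unavailable")


# Exception-based: follow the happy path and individually catch the
# failures we managed to imagine ahead of time.
def save_exception_style(record: str, store: FlakyStore) -> str:
    try:
        store.write(record)
        return "saved"
    except ConnectionError:
        return "retry later"
    # Any failure mode we did not think of escapes unhandled.


# Guilty until proven innocent: assume the write will fail, design the
# degraded path first, and treat success as the case that cleans up.
def save_failsafe_style(record: str, store: FlakyStore, pending: list) -> str:
    pending.append(record)   # park the record somewhere safe up front
    try:
        store.write(record)
    except Exception:
        return "degraded: record parked for later reconciliation"
    pending.remove(record)   # only a proven success clears the safety net
    return "saved"


pending: list = []
print(save_exception_style("audit-123", FlakyStore()))          # retry later
print(save_failsafe_style("audit-123", FlakyStore(), pending))  # degraded: record parked...
print(pending)                                                  # ['audit-123']
```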

As an example of this principle applied in the real world, a team I worked with built a feature to audit user activity in a software product. Knowing exactly what a user did in the system was critical for complying with federal regulations, so our audit records had to be pristine and completely accurate. Perhaps not the most glamorous of features for you UX folks, but it certainly posed a fun technical challenge!

Since our audit records were stored separately from the main application database, using the traditional exception-based approach, we had two options for how we could build the feature:

Option #1:

  • User takes action

  • Update the application database

  • Record the update in the activity audit database

What happens if recording the update in our activity audit database fails? Now we have an update that occurred in the system that we didn’t audit. 👎

Option #2:

  • User takes action

  • Record the update in the activity audit database

  • Update the application database

What happens if recording the update in the application database fails? Now we have an audit record of something that never actually happened. 👎

By applying the guilty until proven innocent principle, we ended up with a solution that incorporates fail-safes and gives us an opportunity to identify anomalies by reconciling the matching attempted and succeeded records.

More importantly, we didn’t have to sit in a conference room for half a day trying to imagine where the feature could break down; we just assumed every step would fail and decided how to handle things when they did. Here’s where we landed (with a rough code sketch after the list):

  • User takes an action in the application 

  • Record that a user attempted an action in the activity audit database

    • When this fails, full stop since we haven’t made any changes yet

  • Update the application database with whatever the user changed

    • When this fails, record that the action failed in the activity audit database

      • When this fails, we will catch this by reconciling orphaned “attempted” records

  • Record that a user successfully took an action in the activity audit database, paired with the attempted record

    • When this fails, we will catch this by reconciling orphaned “attempted” records
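For those who like to see it in code, here’s a rough Python sketch of that flow. The audit_db and app_db objects and their method names (record_attempt, apply_update, record_failure, record_success, find_attempts_without_outcome) are hypothetical stand-ins for whatever your audit store and application database actually expose; the point is the shape of the fail-safes, not the specific API.

```python
import uuid


def handle_user_action(user_id, change, audit_db, app_db):
    """Apply a user's change using the 'guilty until proven innocent' flow.

    audit_db and app_db are hypothetical clients for the activity audit
    store and the main application database.
    """
    attempt_id = str(uuid.uuid4())

    # 1. Record the attempt first. If this write fails, we stop right here:
    #    nothing has changed yet, so there is nothing to clean up.
    audit_db.record_attempt(attempt_id, user_id, change)

    # 2. Update the application database with whatever the user changed.
    try:
        app_db.apply_update(user_id, change)
    except Exception:
        # 2a. Record that the action failed. If even this write fails, the
        #     orphaned "attempted" record is caught by reconciliation below.
        try:
            audit_db.record_failure(attempt_id)
        except Exception:
            pass
        raise  # surface the application-database failure to the caller

    # 3. Pair a "succeeded" record with the original attempt. If this write
    #    fails, the change itself already happened, so we don't roll back;
    #    reconciliation flags the orphaned "attempted" record instead.
    try:
        audit_db.record_success(attempt_id)
    except Exception:
        pass


def reconcile_orphaned_attempts(audit_db):
    """Periodic job: any 'attempted' record without a matching 'succeeded'
    or 'failed' partner is an anomaly worth investigating."""
    return audit_db.find_attempts_without_outcome()
```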

Alternative applications

One drawback of this approach (after all, nobody’s perfect!) is that we can end up with a more complex implementation with these fail-safes built into it. That’s not always a bad thing, depending on the importance of the feature or product we are building, but like most things in life, the rule for when to apply this idea is “it depends.”

We also don’t have to implement every fail-safe. Sometimes, just using this approach as a thought experiment is good enough, and the team can consciously decide that there are a few failure cases that aren’t worth worrying about. But without thinking this way, you may never know that those failures could occur (until your highest-paying customer finds them for you), and I would rather decide explicitly to ignore them than decide by ignorance.

You can also use this approach to pressure-test designs with the UX team. In my experience, initial designs of a feature usually address the happy path, and as a product manager, you are then responsible for vetting the design with the technical team to tease out the details and edge cases. If you assume the design is guilty until proven innocent, you can proactively address these issues before development even starts, hopefully reducing bugs and found work down the road.

This may feel a little technical, but at the end of the day, as product managers we are responsible for making sure our product doesn’t completely implode when a customer tries to use it. A great way to help avoid that is instilling this principle in your teams with a simple shift in language that starts with you, maybe with a few examples of rockets blowing up to really land the idea. 😉

Instead of specifying in our acceptance criteria "if any of these things are not present, then show this error," which assumes the happy path, we can instead frame the requirements as "if all of these things are present, then proceed to the next step.” This means we don’t need to specify every single possible permutation of what could go wrong; we just identify the criteria under which the step will succeed and help development teams learn to make those same assumptions.
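As a quick, hypothetical illustration (the field names and steps are made up, not from any particular product), framing acceptance criteria this way collapses every failure permutation into a single graceful "did not meet the success criteria" path:

```python
# Hypothetical success criteria: the step may proceed only when all of these are present.
REQUIRED_FIELDS = ("customer_id", "amount", "currency")


def can_proceed(form_data: dict) -> bool:
    # "If all of these things are present, then proceed to the next step."
    return all(form_data.get(field) is not None for field in REQUIRED_FIELDS)


def submit(form_data: dict) -> str:
    if can_proceed(form_data):
        return "proceeding to the next step"
    # Every permutation of missing input lands in this one graceful path;
    # we never had to enumerate each failure case individually.
    return "cannot proceed: success criteria not met"


print(submit({"customer_id": 42, "amount": 10, "currency": "USD"}))  # proceeding to the next step
print(submit({"customer_id": 42}))                                   # cannot proceed: ...
```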

So what's the difference if you use just slightly different words? Changing the way our product teams approach problems. Instead of assuming our products will succeed, we should instead assume they will come crashing to the Earth at over 300 MPH, ending in a fiery inferno and millions of dollars in damage. And then when our products do succeed, they land gracefully and the world is none the wiser that they were aimed to miss the landing pad until the very last minute. 🚀
