User centred disaster recovery for digital platforms

Oct 20, 2020

DISCLAIMER: This is not entirely my doing. I learned it from my time with the gov.uk PaaS team, and more specifically Graham Bleach. So I apologise in advance for any errors in what follows, which is a simple approach to planning disaster recovery for a platform.

Disaster Recovery as a team sport

This is incredibly important to understand: disaster recovery is not the preserve of the technical team alone. To make the right decisions, and to avoid investing time and effort in disaster recovery for things that aren’t really that important, the following approach needs to involve the whole team, with product/service representatives taking part.

With everyone involved, the product team members can understand the technical complexity and implied cost of any disaster recovery approach, and the impact ‘things breaking’ will have on the end user experience. Equally, the technical team can get a handle on what actually matters to the end users and to the product team, focussing on user outcomes, not technical ones.

Disaster Recovery is not the same as High Availability

Let me start with this up front. Designing a system to be highly available, with components that can survive failure without any impact to the end user, is not the same thing as disaster recovery. I know this may be obvious to some people, but it is worth saying. I’ll write about that another time.

Disaster recovery means taking the position that something will definitely fail, in a big way, and that the failure will introduce downtime. Based on that position, you plan and test how you recover from that failure.

I’ll say the last bit again. TEST. IT. Having a plan is pointless if you don’t actually test it.

There are plenty of technical guides to disaster recovery, and certainly some key concepts to understand, like recovery time objective (RTO) and recovery point objective (RPO). However, these do not actually give you an approach to mapping and prioritising what to deal with.
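To make those two objectives concrete, here is a minimal sketch that checks a single incident against them. The timestamps and targets are invented for illustration, and the function name is my own, not something from a library.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, failure, restored, rto, rpo):
    """Return (rto_met, rpo_met) for one incident."""
    downtime = restored - failure      # how long users were without the service
    data_loss = failure - last_backup  # how much recent data could not be recovered
    return downtime <= rto, data_loss <= rpo


# Hypothetical incident: last backup 45 minutes before the failure,
# service restored 3 hours after it.
failure = datetime(2020, 10, 20, 12, 0)
result = meets_objectives(
    last_backup=failure - timedelta(minutes=45),
    failure=failure,
    restored=failure + timedelta(hours=3),
    rto=timedelta(hours=4),     # assumed target: back within 4 hours
    rpo=timedelta(minutes=30),  # assumed target: lose at most 30 minutes of data
)
print(result)  # (True, False)
```

In this made-up case the recovery was fast enough, but the backups were too infrequent to meet the recovery point objective.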

Outline of the approach

There are a number of steps to take, which will help you map out what you can promise vs what the reality is. This turns into a clear set of principles around how you make decisions during build.

Start with User Needs
Map your platform components
Model failure scenarios
Evaluate, implement and test recovery

I’ll run through the above shortly, but first, we need a fake example to bring it to life a bit!

Dogs as a Service

Let’s invent a fake digital service: DogWalkPro. It’s a cool service that lets people share dog walking spots. Dog Strava.

Here are the core journeys a user can take:

Upload dog walk pictures (free)
Recommend dog walk locations (free)
Sign Up / Log in and add favourite dog walk locations to your profile (paid for)
Sign Up / Log in and connect to other dog walkers in your area (paid for)

Start with User Needs

No other way of putting it. As an engineer, the temptation is to look at all your platform components and then wonder about all the excellent failover options that they might provide.

“Oh wow”, you might say, “I can do active-active replication across regions, let me just YAML that into existence and all will be well.”

That’s fine. You should be thinking about resilience and availability; make that part of your over-arching design principles.

Don’t start with the technology capabilities. Start with what your users need.

I’d also suggest remembering that your users include people who are not the end customer. What about people in the invoicing department, or logistics, or the analytics team we send data to, to help identify awesome new dog walking routes for a particular breed based on leg size?

Define your platform’s SLAs. What promises are you making to your users, both in terms of normal service and, when something goes wrong, the time to recovery and the point in time you can recover to? Something will go wrong.

The best way to do this (and this should line up with how you are building the thing) is to look at the journeys a user can complete on your platform, and decide how critical those journeys are.

Let’s take our four journeys:

Upload dog walk pictures (free)
Recommend dog walk locations (free)
Sign Up / Log in and add favourite dog walk locations to your profile (paid for)
Sign Up / Log in and connect to other dog walkers in your area (paid for)

What’s important with the above journeys in terms of capability/functionality to let users complete them?

Uploading photos of walks (Journey 1&2)
Pinning locations on a map (Journey 1&2)
Sign Up (Journey 3/4)
Login (Journey 3/4)
Favourite Dog Walk Locations (Journey 3)
Connect to other users (Journey 4)

At this point we have a decent idea of what users should be able to do, and what allows them to do it.
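If it helps to see that mapping as data, here is a minimal sketch using the journeys and capabilities listed above. The dictionary layout and the helper name are just one way of writing it down.

```python
# The four journeys, numbered as in the list above.
JOURNEYS = {
    1: "Upload dog walk pictures (free)",
    2: "Recommend dog walk locations (free)",
    3: "Add favourite dog walk locations to your profile (paid for)",
    4: "Connect to other dog walkers in your area (paid for)",
}

# Capability -> journeys that depend on it.
CAPABILITIES = {
    "Uploading photos of walks": [1, 2],
    "Pinning locations on a map": [1, 2],
    "Sign Up": [3, 4],
    "Login": [3, 4],
    "Favourite Dog Walk Locations": [3],
    "Connect to other users": [4],
}


def capabilities_for(journey):
    """List every capability a journey needs in order to be completed."""
    return [name for name, journeys in CAPABILITIES.items() if journey in journeys]


for number, title in JOURNEYS.items():
    print(number, title, "->", capabilities_for(number))
```

Reading it the other way round is just as useful: if Login breaks, journeys 3 and 4 are both gone.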

We now need to decide which are most important (yes, all of them, but…come on…you have to choose). We decide that having photos is super important: our users love this, and it’s what brings people into the paid part of the platform.

Login and Sign Up give us money, so we need them to be working for sure, and they also need to be high on our list. They are not as crucial as upload, though, because right now we are really trying to get lots of users to create our dog walk locations.

Uptake on our connecting to other users feature has been poor though, so we care less about that.

So, we have a good idea about what we really care about.

Hopefully you are starting to see here that you are focussing time and effort (and money) on what matters most to users.

Map out your platform containers and components

This is where useful architecture diagrams (ones that reflect reality) come into play. Not XML cathedrals that are a monument to silos, but an actual ‘this is what production looks like’ diagram.

For our example I’m going to use the language of the C4 Model (read more here).

I’m also going to break the service into a few different systems.

Log-in / Sign Up system (Journey 3/4)
Dog Walk System (Journey 1/2)
User Favourites System (Journey 3)
User Matching System (Journey 4)

Within each of those systems we will have a number of containers, so, for example, the dog walk system might have:

Dog Walk Front End
Dog Walk API
Dog Walk Metadata Store
Dog Walk Picture Store
Dog Walk Location Store (3rd Party Location as a Service)
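The same system-and-container breakdown can be captured in a few lines of code, which makes it easy to ask questions of it later. This is only a sketch: the class names and the third_party flag are my own shorthand, not part of the C4 Model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    name: str
    third_party: bool = False  # True when someone else runs it for us


@dataclass
class System:
    name: str
    containers: List[Container] = field(default_factory=list)


dog_walk_system = System(
    "Dog Walk System",
    [
        Container("Dog Walk Front End"),
        Container("Dog Walk API"),
        Container("Dog Walk Metadata Store"),
        Container("Dog Walk Picture Store"),
        Container("Dog Walk Location Store", third_party=True),
    ],
)


def third_party_dependencies(system):
    """Containers we cannot recover ourselves because a supplier runs them."""
    return [c.name for c in system.containers if c.third_party]


print(third_party_dependencies(dog_walk_system))  # ['Dog Walk Location Store']
```

Flagging third party containers early is handy, because your recovery options for those are limited to whatever the supplier promises.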

Model your failure scenarios

Our next step is to build a plain old spreadsheet listing journeys, the containers those journeys rely on, our Recovery Time Objective (how long recovery may take) and Recovery Point Objective (how far back in time we may have to recover from) for each, and, importantly, whether we have implemented and tested these. This view will be informed by the second map, which covers the individual components within your containers.

Base the recovery time and recovery point conversations on the business/user value of recovering those components. Some journeys might not matter as much.
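If you would rather start from a file than a blank sheet, here is a minimal sketch that writes such a map as CSV. The column names follow the description above; every value in the rows is invented for illustration and is not a recommendation.

```python
import csv
import io

COLUMNS = [
    "User Need",
    "Container",
    "Recovery Time Objective",
    "Recovery Point Objective",
    "Implemented",
    "Tested",
]

# Hypothetical rows: the objectives and yes/no values are made up.
ROWS = [
    ("Upload dog walk pictures", "Dog Walk Picture Store", "1 hour", "15 minutes", "yes", "no"),
    ("Upload dog walk pictures", "Dog Walk API", "1 hour", "n/a", "yes", "yes"),
    ("Recommend dog walk locations", "Dog Walk Location Store", "4 hours", "24 hours", "no", "no"),
]


def build_dr_map(rows):
    """Render the DR map as CSV text, ready to paste into a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


def untested(rows):
    """Rows where recovery has not been tested yet, whatever the plan says."""
    return [(need, container) for need, container, _rto, _rpo, _impl, tested in rows if tested != "yes"]


dr_map_csv = build_dr_map(ROWS)
print(dr_map_csv)
print(untested(ROWS))
```

The untested list is the interesting output: it is the gap between what you have promised and what you have actually proven.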

Here’s an example:

Once you have the headline information, you need to get deeper into the detail. Break your containers out into components and, as a team, have a conversation about what could go wrong, then assume it will. This is where you can start to identify follow-up actions to take.

Here is an example of the Container to Component Map, and some possible failure modes identified as well as some mitigations (high availability NOT disaster recovery remember!)

In completing this exercise you, as a team, will start to identify not only specific component failure modes, but also potential cross-cutting concerns…like who pays the bill!
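Here is a rough sketch of how the output of that conversation could be recorded. The components, failure modes and mitigations below are all hypothetical examples I made up for DogWalkPro; yours will come out of your own team session.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureMode:
    container: str
    component: str
    what_goes_wrong: str
    mitigation: Optional[str] = None  # None means nobody has an answer yet


# Hypothetical entries, for illustration only.
FAILURE_MODES = [
    FailureMode("Dog Walk Picture Store", "storage bucket",
                "bucket deleted by mistake", "versioning plus a restore runbook"),
    FailureMode("Dog Walk Picture Store", "storage bucket",
                "provider account suspended over an unpaid bill"),
    FailureMode("Dog Walk Metadata Store", "database",
                "data corrupted by a bad migration", "point-in-time restore"),
    FailureMode("Dog Walk Location Store", "3rd party API",
                "provider outage"),
]


def follow_up_actions(modes):
    """Failure modes with no mitigation yet: these become follow-up actions."""
    return [
        f"{m.container} / {m.component}: {m.what_goes_wrong}"
        for m in modes
        if m.mitigation is None
    ]


for action in follow_up_actions(FAILURE_MODES):
    print(action)
```

Anything that comes out of follow_up_actions is a conversation the team still needs to have.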

Here is a link to the example spreadsheet:

DR Map (docs.google.com). Columns: User Need, Container, Recovery Time Objective, Recovery Point Objective.

Evaluate, Implement and Test

Now you have a map, build a plan based on prioritised user journeys. Make informed decisions. Be honest about what really matters for your platform.

I recommend using Open Design Proposals and Tech Spikes to evaluate and decide on how to implement your defined approach.

Your approach to disaster recovery is only good if you test that it works, and test it often.
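One way to keep yourself honest about "often" is to record when each container last passed a recovery drill and flag the ones that have gone stale. This is a minimal sketch: the 90 day threshold and the dates are assumptions, so pick whatever suits your platform.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=90)  # assumed threshold, not a standard

# Hypothetical record of the last successful recovery drill per container.
# None means recovery has never been tested.
LAST_SUCCESSFUL_DRILL = {
    "Dog Walk Picture Store": date(2020, 9, 1),
    "Dog Walk Metadata Store": date(2020, 3, 15),
    "Dog Walk Location Store": None,
}


def overdue(drills, today, max_age=MAX_AGE):
    """Containers whose recovery was never tested, or not tested recently enough."""
    return sorted(
        name
        for name, last in drills.items()
        if last is None or today - last > max_age
    )


print(overdue(LAST_SUCCESSFUL_DRILL, today=date(2020, 10, 20)))
# ['Dog Walk Location Store', 'Dog Walk Metadata Store']
```

Running something like this on a schedule turns "we should test more" into a list somebody has to act on.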

Thanks to Kyle Thompson, Joe McGrath and Will Hamill for the proof reads and feedback