AWS DevOps Pro Certification Blog Post Series: High Availability, Fault Tolerance and Disaster Recovery

This is part of the blog post series: AWS DevOps Pro Certification

What does the exam guide say?

To pass this domain, you'll need to know the following:

This domain is 16% of the overall mark for the exam.

What whitepapers are relevant?

According to the AWS Whitepapers page we should look at the following documents:

What services and products are covered in this domain?

What about other types of documentation?

If you have the time, by all means, read the User Guides, but they are usually a couple of hundred pages.

Alternatively, get familiar with the services using the FAQs:

You're also expected to know the APIs

Before you panic, you'll start to spot a pattern with the API verbs.

And the CLI commands

As with the API, there are patterns to the commands.
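To make the pattern concrete, here is a minimal sketch that groups a handful of Auto Scaling API actions by their leading verb and derives the matching CLI command name. The action list is a small hand-picked sample rather than the full API, and the helper names (`split_words`, `to_cli`, `group_by_verb`) are my own. The conversion only handles simple PascalCase names; actions containing acronyms (for example `DBInstances`) would need extra handling.

```python
import re
from collections import defaultdict

# A hand-picked sample of Auto Scaling API actions (not the full list).
ACTIONS = [
    "CreateAutoScalingGroup",
    "DescribeAutoScalingGroups",
    "UpdateAutoScalingGroup",
    "DeleteAutoScalingGroup",
    "CreateLaunchConfiguration",
    "DescribeLaunchConfigurations",
    "DeleteLaunchConfiguration",
]


def split_words(action):
    """Split a simple PascalCase action name into its words."""
    return re.findall(r"[A-Z][a-z]+", action)


def to_cli(action):
    """Derive the CLI command name: lower-case words joined by hyphens."""
    return "-".join(word.lower() for word in split_words(action))


def group_by_verb(actions):
    """Group CLI command names under the API verb they start with."""
    groups = defaultdict(list)
    for action in actions:
        verb = split_words(action)[0]
        groups[verb].append(to_cli(action))
    return dict(groups)
```

Calling `group_by_verb(ACTIONS)` puts the commands under `Create`, `Describe`, `Update` and `Delete`, which is the pattern worth memorising: once you know the resource name, most of the verbs follow.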

High Availability, Fault Tolerance and Disaster Recovery, oh my!

Let's get the basics out of the way and discuss the core concepts around this domain.

I'm going to use an excellent example provided by Patrick Benson in his blog post: The Difference Between Fault Tolerance, High Availability, & Disaster Recovery

An airplane has multiple engines and can operate with the loss of one or more engines. The design of the airplane has made it resilient to falling out of the sky because of engine failure. This design is fault tolerant.

In terms of infrastructure, this is likely to be a managed service like RDS, where under the hood the database engine has multiple disks and CPUs to cope with catastrophic failure.

A spare tire in a car, on the other hand, isn't fault tolerant: you have to stop and change the tire. Having the spare tire in the first place, though, keeps the car highly available. In terms of infrastructure, this is a technology such as an Auto Scaling group.
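As a rough illustration of the spare tire idea, the sketch below is a toy model of the reconciliation an Auto Scaling group performs: unhealthy instances are dropped and replacements are launched until the desired capacity is met. This is not the AWS implementation and it calls no AWS API; the instance dictionaries and the `launch_instance` and `reconcile` helpers are made up for the example.

```python
from itertools import count

# Hypothetical instance ID generator, for illustration only.
_ids = count(1)


def launch_instance():
    """Pretend to launch a fresh, healthy instance."""
    return {"id": f"i-{next(_ids):04d}", "healthy": True}


def reconcile(instances, desired_capacity):
    """Drop unhealthy instances and launch replacements up to desired capacity."""
    survivors = [inst for inst in instances if inst["healthy"]]
    missing = max(0, desired_capacity - len(survivors))
    replacements = [launch_instance() for _ in range(missing)]
    return survivors + replacements
```

The point of the model is the gap between failure and replacement: capacity is restored, but only after the failed instance has been detected and a new one launched. That brief interruption is the difference between highly available and fault tolerant.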

It's very common for a solution to implement a system that is both fault tolerant (resilient) and highly available (scalable).

Finally, an ejector seat in a fighter aircraft is a disaster recovery (DR) measure. The goal is to preserve the pilot, or in our case the service, after all other measures (fault tolerance and HA) have failed.

Often in terms of infrastructure, this might be a standby infrastructure or database replica in a different AWS region, with Route 53 used to point to the standby infrastructure. Whilst it's still common for DR strategies to be manual, for this domain we'll be expected to provide an automated solution.
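One way to automate the Route 53 part is with failover routing records. The sketch below only builds the change batch, so it runs without AWS credentials. The record names, targets and health check ID are placeholders I've invented, and the helper functions are my own; the dictionary layout follows what the Route 53 `ChangeResourceRecordSets` action expects, as far as I'm aware.

```python
def failover_record(name, target, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": target}],
    }
    # The health check is what lets Route 53 decide when to fail over.
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_target, standby_target, primary_health_check_id):
    """Build a change batch with a health-checked primary and a standby record."""
    return {
        "Comment": "Failover routing between primary and standby regions",
        "Changes": [
            failover_record(name, primary_target, "PRIMARY", "primary",
                            health_check_id=primary_health_check_id),
            failover_record(name, standby_target, "SECONDARY", "standby"),
        ],
    }

# With boto3 installed and credentials configured, the batch could be applied
# with something like (hosted zone ID is a placeholder):
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="ZEXAMPLE", ChangeBatch=batch)
```

With records like these in place, Route 53 answers with the primary target while its health check passes and switches to the standby when it fails, which removes the manual DNS change from the DR runbook.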
