AWS DevOps Pro Certification Blog Post Series: High Availability, Fault Tolerance and Disaster Recovery

Jun 08, 2019 - Reading time: 3 minutes.

This is part of the blog post series: AWS DevOps Pro Certification

What does the exam guide say?

To pass this domain, you'll need to know the following:

Determine appropriate use of multi-AZ versus multi-region architectures
Determine how to implement high availability, scalability, and fault tolerance
Determine the right services based on business needs (e.g., RTO/RPO, cost)
Determine how to design and automate disaster recovery strategies
Evaluate a deployment for points of failure

This domain is 16% of the overall mark for the exam.

What whitepapers are relevant?

According to the AWS Whitepapers page we should look at the following documents:

What services and products are covered in this domain?

AWS Single Sign-On is Amazon's managed SSO service, which allows your users to sign in to AWS and other connected services using your existing Microsoft Active Directory (AD).
Amazon CloudFront is a managed Content Delivery Network (CDN) service.
Auto scaling resources - Amazon has two offerings: AWS Auto Scaling and Amazon EC2 Auto Scaling.
Amazon Route 53 is a managed Domain Name System (DNS) service.
Databases
- Amazon RDS is a managed relational database service with a large choice of engines: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database and SQL Server.
  - Amazon Aurora is part of the RDS offering but is unique in that it is compatible with the MySQL and PostgreSQL engines whilst, according to AWS, delivering up to five times the throughput of MySQL and up to three times that of PostgreSQL.
- Amazon DynamoDB is a managed NoSQL (non-relational) database service that can be used for storing key-value pairs or document-based records.

What about other types of documentation?

If you have the time, by all means, read the User Guides, but they are usually a couple of hundred pages.

AWS Single Sign-On
Amazon CloudFront
AWS Auto Scaling and Amazon EC2 Auto Scaling
Amazon Route 53
Databases
- Amazon RDS
  - Amazon Aurora
- Amazon DynamoDB

Alternatively, get familiar with the services using the FAQs:

AWS Single Sign-On
Amazon CloudFront
AWS Auto Scaling and Amazon EC2 Auto Scaling
Amazon Route 53
Databases
- Amazon RDS
  - Amazon Aurora
- Amazon DynamoDB

You're also expected to know the APIs

Amazon CloudFront
AWS Auto Scaling and Amazon EC2 Auto Scaling
Amazon Route 53
Databases
- Amazon RDS
  - Amazon Aurora uses the same API as RDS
- Amazon DynamoDB

Before you panic, you'll start to spot a pattern with the API verbs.
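To make that pattern concrete, here is a minimal sketch that groups a hand-picked sample of operation names by their leading verb. The sample is only a small subset of each API, chosen for illustration, and the helper names (`leading_verb`, `group_by_verb`) are my own rather than anything from an AWS SDK.

```python
import re
from collections import defaultdict

# A small, hand-picked sample of operation names from the service APIs
# covered in this domain (illustrative, not an exhaustive list).
OPERATIONS = {
    "rds": ["CreateDBInstance", "DescribeDBInstances",
            "ModifyDBInstance", "DeleteDBInstance"],
    "autoscaling": ["CreateAutoScalingGroup", "DescribeAutoScalingGroups",
                    "UpdateAutoScalingGroup", "DeleteAutoScalingGroup"],
    "dynamodb": ["CreateTable", "DescribeTable", "UpdateTable",
                 "DeleteTable", "ListTables"],
    "cloudfront": ["CreateDistribution", "GetDistribution",
                   "UpdateDistribution", "DeleteDistribution",
                   "ListDistributions"],
    "route53": ["CreateHostedZone", "GetHostedZone",
                "DeleteHostedZone", "ListHostedZones"],
}


def leading_verb(operation):
    """Return the first CamelCase word of an operation name, e.g. 'Describe'."""
    return re.match(r"[A-Z][a-z]+", operation).group(0)


def group_by_verb(operations):
    """Map each leading verb to the sorted list of services that use it."""
    services_by_verb = defaultdict(set)
    for service, names in operations.items():
        for name in names:
            services_by_verb[leading_verb(name)].add(service)
    return {verb: sorted(services) for verb, services in services_by_verb.items()}


if __name__ == "__main__":
    for verb, services in sorted(group_by_verb(OPERATIONS).items()):
        print(f"{verb:<10} {', '.join(services)}")
```

Even in this small sample the same handful of verbs (Create, Describe or Get, Update or Modify, Delete, List) turns up across every service, which is what makes the APIs easier to learn than their size suggests.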

And the CLI commands

Amazon CloudFront
AWS Auto Scaling and Amazon EC2 Auto Scaling
Amazon Route 53 has three subcommands: route53 (DNS and health checking), servicediscovery (service discovery) and route53domains (domain registration)
Databases
- Amazon RDS
  - Amazon Aurora uses the same CLI as RDS
- Amazon DynamoDB has two subcommands: dynamodb and dynamodbstreams

As with the API, there are patterns to the commands.
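One pattern worth knowing is that CLI command names are the API operation names rewritten in lowercase with hyphens. The sketch below approximates that mapping with a regular expression. It is my own simplification (the function name `api_to_cli` is hypothetical) and the real AWS CLI handles more special cases than this, so treat it as a memory aid rather than the actual implementation.

```python
import re

# Insert a hyphen at lower-to-upper boundaries ("Auto|Scaling") and before the
# last capital of an acronym that is followed by a word ("DB|Instances").
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def api_to_cli(operation):
    """Approximate the CLI command name for an API operation name."""
    return _BOUNDARY.sub("-", operation).lower()


if __name__ == "__main__":
    samples = [
        ("rds", "DescribeDBInstances"),
        ("autoscaling", "CreateAutoScalingGroup"),
        ("dynamodb", "ListTables"),
        ("route53", "ChangeResourceRecordSets"),
    ]
    for service, operation in samples:
        print(f"aws {service} {api_to_cli(operation)}")
```

So if you know the API operation is DescribeDBInstances, you can usually guess that the CLI command is `aws rds describe-db-instances`, and vice versa.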

High Availability, Fault Tolerance and Disaster Recovery, oh my!

Let's get the basics out of the way and discuss the core concepts around this domain.

I'm going to use an excellent example provided by Patrick Benson in his blog post: The Difference Between Fault Tolerance, High Availability, & Disaster Recovery

An airplane has multiple engines and can operate with the loss of one or more engines. The design of the airplane has made it resilient to falling out of the sky because of engine failure. This design is fault tolerant.

In terms of infrastructure, this is likely to be a managed service like RDS, where under the hood the database engine has multiple disks and CPUs to cope with catastrophic failure.

A spare tire in a car, on the other hand, isn't fault tolerant, i.e. you have to stop and change the tire, but having the spare tire in the first place keeps the car highly available. In terms of infrastructure, this is any type of technology like an Auto Scaling group.

It's very common for a solution to implement a system that is both fault tolerant (resilient) and highly available (scalable).

Finally, ejector seats in fighter aircraft are a disaster recovery (DR) measure. The goal is to preserve the pilot, or in our case, the service, after all other measures (fault tolerance and HA) have failed.

Often in terms of infrastructure, this might be standby infrastructure or a database replica in a different AWS region, with Route 53 used to point to the standby infrastructure. Whilst it's still common for DR strategies to be manual, for this domain we'll be expected to provide an automated solution.
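As a rough illustration of the Route 53 part, the sketch below builds a change batch containing a pair of failover records: a PRIMARY record tied to a health check and a SECONDARY record for the standby. The domain name, IP addresses and health check ID are placeholders I made up, and the helper functions are hypothetical. Only the shape of the change batch follows the Route 53 ChangeResourceRecordSets API.

```python
import json


def failover_record(name, ip_address, role, set_identifier,
                    health_check_id=None, ttl=60):
    """Build one failover A record change for a Route 53 change batch."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_identifier,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
    }
    # Route 53 uses the health check to decide when to stop answering
    # with this record and fall back to the other one.
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_ip, standby_ip, primary_health_check_id):
    """Build a change batch with a health-checked primary and a standby."""
    return {
        "Comment": "Fail over to the standby region when the primary is unhealthy",
        "Changes": [
            failover_record(name, primary_ip, "PRIMARY", "primary",
                            primary_health_check_id),
            failover_record(name, standby_ip, "SECONDARY", "standby"),
        ],
    }


if __name__ == "__main__":
    # Placeholder values: documentation IP ranges and a made-up health check ID.
    batch = failover_change_batch(
        "app.example.com.", "192.0.2.10", "198.51.100.10",
        "hypothetical-health-check-id",
    )
    print(json.dumps(batch, indent=2))
```

The printed JSON could be saved to a file and passed to `aws route53 change-resource-record-sets` using the `--hosted-zone-id` and `--change-batch` options. Once the records exist, Route 53 does the failover itself when the health check fails, which is the kind of automation this domain is looking for.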
