AWS DevOps Pro Certification Blog Post Series: High Availability, Fault Tolerance and Disaster Recovery
This is part of the blog post series: AWS DevOps Pro Certification
What does the exam guide say?
To pass this domain, you'll need to know the following:
- Determine appropriate use of multi-AZ versus multi-region architectures
- Determine how to implement high availability, scalability, and fault tolerance
- Determine the right services based on business needs (e.g., RTO/RPO, cost)
- Determine how to design and automate disaster recovery strategies
- Evaluate a deployment for points of failure
This domain is 16% of the overall mark for the exam.
What whitepapers are relevant?
According to the AWS Whitepapers page we should look at the following documents:
- Backup and Recovery Approaches Using AWS (June 2016)
- Building Fault-Tolerant Applications on AWS (October 2011)
What services and products are covered in this domain?
- AWS Single Sign-On is Amazon's managed SSO service. It lets your users sign in to AWS and other connected services using their existing Microsoft Active Directory (AD) credentials.
- Amazon CloudFront is a managed Content Delivery Network (CDN) service.
- Auto Scaling resources: Amazon has two offerings, AWS Auto Scaling and Amazon EC2 Auto Scaling.
- Amazon Route 53 is a managed Domain Name System (DNS) service.
- Databases
- Amazon RDS is a managed relational database service with a large choice of engines: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database and SQL Server.
- Amazon Aurora is part of the RDS offering but is unique in that it is compatible with the MySQL and PostgreSQL engines whilst, according to AWS, delivering up to 5x the throughput of standard MySQL and up to 3x that of standard PostgreSQL.
- Amazon DynamoDB is a managed NoSQL (non-relational) database service that can be used for storing key-value pairs or document-based records.
What about other types of documentation?
If you have the time, by all means, read the User Guides, but they are usually a couple of hundred pages.
- AWS Single Sign-On
- Amazon CloudFront
- AWS Auto Scaling and Amazon EC2 Auto Scaling
- Amazon Route 53
- Databases
Alternatively, get familiar with the services using the FAQs:
- AWS Single Sign-On
- Amazon CloudFront
- AWS Auto Scaling and Amazon EC2 Auto Scaling
- Amazon Route 53
- Databases
You're also expected to know the APIs
- Amazon CloudFront
- AWS Auto Scaling and Amazon EC2 Auto Scaling
- Amazon Route 53
- Databases
- Amazon RDS
- Amazon Aurora uses the same API as RDS
- Amazon DynamoDB
Before you panic, you'll start to spot a pattern with the API verbs.
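To make that pattern concrete, here is a minimal sketch that groups a small sample of action names from these services by their leading verb. The sample list is hand-picked for illustration rather than exhaustive, and the helper names `leading_verb` and `group_by_verb` are my own.

```python
import re
from collections import defaultdict

# A hand-picked sample of API action names from the services in this domain.
ACTIONS = [
    # Amazon CloudFront
    "CreateDistribution", "GetDistribution", "ListDistributions",
    "UpdateDistribution", "DeleteDistribution",
    # Amazon EC2 Auto Scaling
    "CreateAutoScalingGroup", "DescribeAutoScalingGroups",
    "UpdateAutoScalingGroup", "DeleteAutoScalingGroup",
    # Amazon Route 53
    "CreateHostedZone", "GetHostedZone", "ListHostedZones", "DeleteHostedZone",
    # Amazon RDS (Aurora uses the same API)
    "CreateDBInstance", "DescribeDBInstances", "ModifyDBInstance",
    "DeleteDBInstance",
    # Amazon DynamoDB
    "CreateTable", "DescribeTable", "ListTables", "UpdateTable", "DeleteTable",
]


def leading_verb(action):
    """Return the first PascalCase word of an action name, e.g. 'Describe'."""
    return re.match(r"[A-Z][a-z]+", action).group(0)


def group_by_verb(actions):
    """Map each leading verb to the list of actions that start with it."""
    groups = defaultdict(list)
    for action in actions:
        groups[leading_verb(action)].append(action)
    return dict(groups)


if __name__ == "__main__":
    for verb, names in sorted(group_by_verb(ACTIONS).items()):
        print(f"{verb:<9} {', '.join(names)}")
```

Most of the sample collapses into create, read (Get, Describe, List), update (Update, Modify) and delete, which is why learning the verbs once carries across services.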
And the CLI commands
- Amazon CloudFront
- AWS Auto Scaling and Amazon EC2 Auto Scaling
- Amazon Route 53 has three subcommands: DNS and health checking, service discovery and domain registration
- Databases
- Amazon RDS
- Amazon Aurora uses the same CLI as RDS
- Amazon DynamoDB has two subcommands: dynamodb and dynamodbstreams
As with the API, there are patterns to the commands.
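One pattern worth knowing is that CLI subcommands are generally the API action name rewritten in lowercase with hyphens. The sketch below shows that convention under the assumption that it holds for the actions you care about; `to_cli_command` is a hypothetical helper, not part of the AWS CLI, so check the CLI reference for any specific command.

```python
import re

# Zero-width word boundaries inside a PascalCase action name:
#   a lowercase letter or digit followed by an uppercase letter, or
#   the end of an acronym followed by a capitalised word (the "B|I" in DBInstances).
_WORD_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def to_cli_command(service, action):
    """Guess the AWS CLI invocation for an API action (illustrative only)."""
    subcommand = _WORD_BOUNDARY.sub("-", action).lower()
    return f"aws {service} {subcommand}"


if __name__ == "__main__":
    samples = [
        ("cloudfront", "ListDistributions"),
        ("autoscaling", "DescribeAutoScalingGroups"),
        ("route53", "ListHostedZones"),
        ("rds", "DescribeDBInstances"),
        ("dynamodb", "ListTables"),
    ]
    for service, action in samples:
        print(to_cli_command(service, action))
```

Going the other way works too: if you remember the CLI command, you can usually reconstruct the API action name from it.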
High Availability, Fault Tolerance and Disaster Recovery, oh my!
Let's get the basics out of the way and discuss the core concepts around this domain.
I'm going to use an excellent example provided by Patrick Benson in his blog post: The Difference Between Fault Tolerance, High Availability, & Disaster Recovery
An airplane has multiple engines and can operate with the loss of one or more of them. The airplane has been designed so that an engine failure doesn't make it fall out of the sky. This design is fault tolerant.
In terms of infrastructure, this is likely to be a managed service like RDS, where under the hood the database engine has multiple disks and CPUs to cope with catastrophic failure.
A spare tire in a car, by contrast, isn't fault tolerant: you have to stop and change the tire. But having the spare in the first place keeps the car highly available. In terms of infrastructure, this is any technology like an Auto Scaling group.
It's very common for a solution to be both fault tolerant (resilient) and highly available (scalable).
Finally, ejector seats in fighter aircraft are a disaster recovery (DR) measure. The goal is to preserve the pilot, or in our case the service, after all other measures (fault tolerance and HA) have failed.
In terms of infrastructure, this is often a standby environment or database replica in a different AWS region, with Route 53 used to point traffic at the standby. Whilst it's still common for DR strategies to be manual, for this domain we'll be expected to provide an automated solution.
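As a rough illustration of the automated Route 53 approach, the sketch below builds the request body for a pair of failover records: a health-checked primary and a standby in another region. The domain names, set identifiers and health check ID are placeholders I made up, and the helper functions are hypothetical. Only the dictionary keys follow the Route 53 ChangeResourceRecordSets request format.

```python
def failover_record(name, target, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,   # distinguishes records sharing a name
        "Failover": role,
        "TTL": ttl,                # a short TTL lets clients pick up a failover quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_target, standby_target, health_check_id):
    """Build a change batch with a health-checked primary and a standby."""
    return {
        "Comment": "Primary/standby failover records (illustrative)",
        "Changes": [
            failover_record(name, primary_target, "PRIMARY", "primary",
                            health_check_id=health_check_id),
            failover_record(name, standby_target, "SECONDARY", "standby"),
        ],
    }


if __name__ == "__main__":
    import json

    # All values below are placeholders, not real resources.
    batch = failover_change_batch(
        "app.example.com",
        "primary.eu-west-1.example.com",
        "standby.us-east-1.example.com",
        "00000000-0000-0000-0000-000000000000",
    )
    print(json.dumps(batch, indent=2))
```

With boto3, a batch like this could be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, along with your `HostedZoneId`. Once the records exist, Route 53 answers with the standby whenever the primary's health check fails, so no one has to flip DNS by hand.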
AWS DevOps Pro Certification Blog Post Series
- Intro
- Domain 1: SDLC automation
- Domain 2: Configuration Management and Infrastructure as Code
- Domain 3: Monitoring and Logging
- Domain 4: Policies and Standards Automation
- Domain 5: Incident and Event Response
- Domain 6: High Availability, Fault Tolerance, and Disaster Recovery
- AWS Single Sign-On
- Amazon CloudFront
- Auto Scaling
- Amazon Route 53
- Databases
- Amazon RDS
- Amazon Aurora
- Amazon DynamoDB