Ensuring Business Continuity with the AWS DR Strategies – Part 1: Backup and Restore

by Phil Sautter

Executive Summary

Today we are diving deep into AWS’s disaster recovery (DR) strategies, specifically focusing on the “Backup and Restore” approach. It’s not just about saving your data, but also ensuring your entire cloud environment can rebound quickly after a disaster.

Understanding AWS Disaster Recovery Strategies

When it comes to DR in AWS, we have a variety of strategies at our disposal. They range from simple, low-cost backups to sophisticated multi-region strategies, each designed for different needs and levels of risk. Today, we’re discussing the Backup and Restore strategy, which is a practical approach for mitigating data loss or corruption. It can also be extended to cover regional disasters by replicating backups to a second AWS region.

When to Use the Backup and Restore Strategy

Before we dig into how the Backup and Restore strategy works, it’s important to understand when this strategy is most suitable. The Backup and Restore strategy suits applications that can accept a longer RTO (Recovery Time Objective) and RPO (Recovery Point Objective), typically measured in hours, and can therefore tolerate some downtime and some data loss.

It’s typically used for non-critical workloads, or when data doesn’t change frequently, and backups are not required to be up-to-the-minute. For instance, archived data, historical data, and infrequently used applications are perfect candidates for this approach.

Remember, selecting the right disaster recovery strategy is highly dependent on your business needs, budget constraints, and the criticality of the application. Understanding the trade-offs of each strategy can help you make the best decision for your specific circumstances. 

Now that we know when the Backup and Restore strategy is a good fit, let’s delve into how it works.

How Backup and Restore Works

The Backup and Restore strategy is based on creating backups of your data, safeguarding them, and restoring them when necessary. This strategy doesn’t stop at data. It also involves redeploying the infrastructure, configuration, and application code in the recovery region. AWS promotes the use of Infrastructure as Code (IaC) for this purpose, with tools such as AWS CloudFormation or HashiCorp Terraform, because a codified environment can be redeployed quickly and repeatably.
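As a minimal, standalone sketch of what "redeployable in the recovery region" can look like in Terraform, the fragment below takes the region as a variable and looks up the machine image per region instead of hard-coding an ID. The variable name, the instance type, and the AMI name pattern are illustrative assumptions, not part of any particular workload.

```hcl
# Illustrative only: the same configuration can target the primary or the DR region.
variable "region" {
  type    = string
  default = "us-east-1" # assumed primary region
}

provider "aws" {
  region = var.region
}

# AMI IDs differ between regions, so resolve one by name instead of hard-coding it.
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"] # assumed naming pattern for Amazon Linux 2023
  }
}

# Hypothetical application server; relies on the default VPC of the chosen region.
resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"

  tags = {
    Name = "app-server"
  }
}
```

During a recovery, the same files could be applied with the region overridden, for example terraform apply -var="region=us-west-2", ideally from a separate workspace or state so that the record of the primary deployment is left untouched.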

AWS Services Supporting Backup and Restore

AWS provides several services that offer point-in-time backups, which let you restore your data to the state it was in when the backup was taken. These include Amazon Elastic Block Store (Amazon EBS), Amazon DynamoDB, Amazon RDS, Amazon Aurora, Amazon EFS, Amazon Redshift, Amazon Neptune, Amazon DocumentDB, and the Amazon FSx family of file systems.

For Amazon S3, you can use Amazon S3 Cross-Region Replication (CRR) to continuously and asynchronously copy objects to an S3 bucket in the DR region. Replication requires versioning to be enabled on both the source and destination buckets, and that versioning is what lets you choose your restoration point.
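A rough Terraform sketch of such a replication setup follows. It assumes the two buckets and an IAM role that S3 can assume for replication already exist, and that versioning is already enabled on the destination bucket. The three variables are hypothetical names introduced only for this illustration.

```hcl
# Hypothetical inputs: both buckets and the replication role are assumed to exist.
variable "source_bucket_name" {
  type = string
}

variable "destination_bucket_arn" {
  type = string
}

variable "replication_role_arn" {
  type = string
}

# Versioning must be enabled on the source bucket before replication is configured.
resource "aws_s3_bucket_versioning" "source" {
  bucket = var.source_bucket_name

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_replication_configuration" "crr" {
  depends_on = [aws_s3_bucket_versioning.source]

  bucket = var.source_bucket_name
  role   = var.replication_role_arn

  rule {
    id     = "replicate-all-objects"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket.
    filter {}

    # Required when a filter is used; delete markers are not copied in this sketch.
    delete_marker_replication {
      status = "Disabled"
    }

    destination {
      bucket        = var.destination_bucket_arn
      storage_class = "STANDARD_IA" # assumption: cheaper storage class for the DR copy
    }
  }
}
```

Leaving delete marker replication disabled is a deliberate choice here: an accidental delete in the primary region is then not mirrored into the DR bucket.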

Centralized Backups with AWS Backup

AWS Backup is a centralized tool for configuring, scheduling, and monitoring AWS backup activities across various services and resources. It also facilitates copying backups across regions, a key facet of an effective disaster recovery strategy.
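To make the cross-region copy concrete, here is a hedged Terraform sketch of an AWS Backup plan that takes a daily backup and copies each recovery point to a vault in a second region. The regions, vault names, schedule, retention periods, the Backup = "true" tag, and the backup_role_arn variable are all assumptions chosen for illustration.

```hcl
provider "aws" {
  region = "us-east-1" # assumed primary region
}

provider "aws" {
  alias  = "dr"
  region = "us-west-2" # assumed DR region
}

# Hypothetical input: an existing IAM role that AWS Backup is allowed to assume.
variable "backup_role_arn" {
  type = string
}

resource "aws_backup_vault" "primary" {
  name = "primary-vault"
}

resource "aws_backup_vault" "dr" {
  provider = aws.dr
  name     = "dr-vault"
}

resource "aws_backup_plan" "daily" {
  name = "daily-with-dr-copy"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 5 * * ? *)" # every day at 05:00 UTC

    # Keep recovery points in the primary region for 14 days.
    lifecycle {
      delete_after = 14
    }

    # Copy every recovery point to the DR vault and keep the copy for 30 days.
    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn

      lifecycle {
        delete_after = 30
      }
    }
  }
}

# Back up every supported resource that carries the tag Backup = "true".
resource "aws_backup_selection" "tagged" {
  name         = "tagged-resources"
  plan_id      = aws_backup_plan.daily.id
  iam_role_arn = var.backup_role_arn

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "true"
  }
}
```

Selecting resources by tag keeps the plan generic: new resources are picked up as soon as they are tagged, without touching the plan itself.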

Backup Strategy Tips

The right backup strategy can significantly enhance your disaster recovery capabilities. Here are some tips to help you make the most of AWS Backup and other AWS resources:

  1. Back up your data, configuration, and infrastructure. Infrastructure as Code (IaC) can assist you in this task, enabling you to define all of the AWS resources in your workload for reliable deployment and redeployment across multiple AWS accounts and regions.
  2. Make backups frequently. The frequency of your backups should align with your recovery point objective (RPO). For databases or other frequently updated data sources, consider using services like Amazon RDS, whose automated backups take a daily snapshot and also capture transaction logs, so you can restore to a point in time within the retention period.
  3. Automate your backups. Manual backups can be error-prone and inconsistent. AWS offers services such as AWS Backup that help automate backups across various AWS services.
  4. Validate your backups regularly. By restoring data from your backup to a test environment, you can verify its integrity and ensure it works as expected.
  5. Enforce retention policies for your backups. Define how long you need to keep backups and implement a lifecycle policy to delete older backups that are no longer needed. This can also help you save on storage costs.
  6. Ensure the security and compliance of your backup data. Your backup data should be encrypted at rest and in transit. You may also need to demonstrate that your backups are secure and that you can recover data from any point in time for compliance purposes.
  7. Replicate your backups across regions. To mitigate risks associated with regional disruptions, ensure that you can recover your data even if an entire AWS region is down.
  8. Monitor your backup process. Use services like Amazon CloudWatch to track backup events and send alerts in case of failures.
  9. Document your backup procedures and train your team. Regular training sessions can help ensure that your team is prepared to handle a disaster recovery scenario.
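As one possible illustration of tip 8, the Terraform fragment below raises a CloudWatch alarm when AWS Backup reports a failed backup job. It assumes the jobs are run by AWS Backup, that failures are published as the NumberOfBackupJobsFailed metric in the AWS/Backup namespace (worth verifying against the current AWS Backup monitoring documentation), and that an SNS topic for alerts already exists. alert_topic_arn is a hypothetical variable.

```hcl
# Hypothetical input: an existing SNS topic that notifies the on-call team.
variable "alert_topic_arn" {
  type = string
}

resource "aws_cloudwatch_metric_alarm" "backup_jobs_failed" {
  alarm_name        = "backup-jobs-failed"
  alarm_description = "At least one AWS Backup job failed in the last hour"

  namespace   = "AWS/Backup"
  metric_name = "NumberOfBackupJobsFailed" # assumed metric name, see lead-in
  statistic   = "Sum"

  period              = 3600 # evaluate one-hour windows
  evaluation_periods  = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1

  # No data points simply means no failures were reported in that window.
  treat_missing_data = "notBreaching"

  alarm_actions = [var.alert_topic_arn]
}
```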

Restoration Process

In the event of a failover, the data stored in the DR region as backups needs to be restored. While AWS Backup provides manual restoration capabilities, it does not currently support scheduled or automatic restoration. However, this gap can be addressed by using the AWS SDK to call the AWS Backup APIs. Such a restore job can run on a recurring schedule, or be triggered whenever a backup completes by using Amazon EventBridge and AWS Lambda.
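The trigger half of that setup could be sketched in Terraform roughly as follows. The Lambda function itself, which would call the AWS Backup StartRestoreJob API through the SDK, is assumed to exist already and is not shown. restore_lambda_arn and restore_lambda_name are hypothetical variables, and the event pattern values reflect our understanding of the events AWS Backup emits, so check them against the AWS Backup EventBridge documentation before relying on them.

```hcl
# Hypothetical inputs: an existing Lambda function that starts the restore job.
variable "restore_lambda_arn" {
  type = string
}

variable "restore_lambda_name" {
  type = string
}

# Match AWS Backup events that report a completed backup job (assumed pattern).
resource "aws_cloudwatch_event_rule" "backup_completed" {
  name        = "backup-job-completed"
  description = "Fires when an AWS Backup job completes"

  event_pattern = jsonencode({
    source        = ["aws.backup"]
    "detail-type" = ["Backup Job State Change"]
    detail = {
      state = ["COMPLETED"]
    }
  })
}

# Send matching events to the restore function.
resource "aws_cloudwatch_event_target" "invoke_restore" {
  rule = aws_cloudwatch_event_rule.backup_completed.name
  arn  = var.restore_lambda_arn
}

# Allow EventBridge to invoke the function, but only through this rule.
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = var.restore_lambda_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.backup_completed.arn
}
```

For a recurring restore test instead of an event-driven one, the event_pattern could be swapped for a schedule_expression such as "rate(7 days)".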

The Importance of Testing Your Backups

Finally, a robust backup strategy includes testing your backups regularly. Regular testing validates your backups’ effectiveness, ensuring you are ready to handle a disaster situation efficiently. But how do we practically implement these strategies and principles into our existing systems?

Example: Deploying a Backup, Restore, and Cross-Region Replication Strategy on an Existing EC2 with Terraform

To move from theory to practice, we’ll now deploy a Backup and Restore strategy, complete with cross-region replication, on existing infrastructure using Terraform.

Step 1: Setting Up Terraform

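First, install Terraform and create a working directory for the configuration. Once the configuration file from Step 2 is in place, run terraform init in that directory so Terraform can download the AWS provider.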

Step 2: Creating the Terraform Configuration File

Next, create a Terraform configuration file (e.g., main.tf) that describes the resources required for the backup and replication process. Your configuration should look something like this:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.67.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  region = "us-west-2"
  alias  = "dr"
}

resource "aws_iam_role" "dlm_lifecycle_role" {
  name = "dlm_lifecycle_role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "dlm.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}

resource "aws_iam_role_policy" "dlm_lifecycle_role_policy" {
  name = "dlm_lifecycle_role_policy"
  role = aws_iam_role.dlm_lifecycle_role.id

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot",
        "ec2:CopySnapshot",
        "ec2:CreateTags",
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "key" {
  statement {
    sid    = "Enable IAM User Permissions"
    effect = "Allow"

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"]
    }

    actions   = ["kms:*"]
    resources = ["*"]
  }
}

resource "aws_kms_key" "dlm_cross_region_copy_cmk" {
  provider    = aws.dr
  description = "Example DR Region KMS Key"
  policy      = data.aws_iam_policy_document.key.json
}

resource "aws_dlm_lifecycle_policy" "example" {
  description        = "example DLM lifecycle policy"
  execution_role_arn = aws_iam_role.dlm_lifecycle_role.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "2 weeks of daily snapshots"

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["23:45"]
      }

      retain_rule {
        count = 14
      }

      tags_to_add = {
        SnapshotCreator = "DLM"
      }

      copy_tags = false

      cross_region_copy_rule {
        target    = "us-west-2"
        encrypted = true
        cmk_arn   = aws_kms_key.dlm_cross_region_copy_cmk.arn
        copy_tags = true
        retain_rule {
          interval      = 30
          interval_unit = "DAYS"
        }
      }
    }

    target_tags = {
      Snapshot = "true"
    }
  }
}

This Terraform setup provides an automated, daily backup and cross-region replication of your EBS volumes that have the tag “Snapshot = true”. It retains two weeks of snapshots in the primary region and one month of snapshots in the DR region.

Let’s break it down a bit more:

– The required_providers block specifies the required AWS provider version.

– Two AWS provider blocks are defined: one for the “us-east-1” region and an alias “dr” for the “us-west-2” region. This allows you to manage resources across different regions.

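– The aws_iam_role and aws_iam_role_policy resources create an IAM role that Amazon Data Lifecycle Manager (DLM) can assume, and attach a policy granting the EC2 snapshot permissions it needs. For encrypted cross-region copies, the role may also need permission to use the KMS key in the DR region.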

– The aws_caller_identity data source is used to retrieve the details (Account ID, User ID) of the entity making the API call.

– The aws_iam_policy_document data source generates the KMS key policy. It allows “kms:*” for the account principal (the ARN ending in :root), which delegates access control for the key to IAM policies in the account.

– The aws_kms_key resource creates a Key Management Service (KMS) key in the disaster recovery (DR) region (“us-west-2”) to encrypt the replicated snapshots.

– The aws_dlm_lifecycle_policy resource is the heart of this configuration. It describes a DLM lifecycle policy which automates the EBS snapshot management:
   – The create_rule specifies that a snapshot is created every 24 hours, starting at 23:45 UTC.
   – The retain_rule ensures that the 14 most recent snapshots are retained in the “us-east-1” region.
   – The cross_region_copy_rule copies each snapshot to the DR region (“us-west-2”). It specifies that the copies are encrypted with the previously created KMS key and that each copy is retained in the DR region for 30 days.

Important note: Be sure to replace the example values in this configuration (regions, tags, schedule, and retention) with values from your own AWS environment, and keep sensitive information, such as AWS access and secret keys, out of your Terraform files.

Step 3: Applying the Terraform Configuration

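Run terraform plan to preview the changes and confirm they are what you expect. Then apply them with terraform apply. Terraform creates the IAM role, the KMS key, and the DLM lifecycle policy. DLM then takes the snapshots and copies them to the DR region on the schedule you defined.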

Step 4: Restoring From the Snapshot

In the event of a disaster, it is advised to manually restore your data using either the AWS Console or AWS CLI. This is because, as explained by Martin Atkins, a core open source maintainer of Terraform, the restoration process is a “one-time imperative operation”. It doesn’t align with Terraform’s declarative model, which is why we’ve only automated the backup and replication processes using Terraform. The restoration process involves creating a new EBS volume from the snapshot, which can be done in either the original region or the disaster recovery region, depending on where the disaster has occurred.

The AWS CLI command for creating a volume from a snapshot in a disaster recovery scenario is:

aws ec2 create-volume --snapshot-id snap-01234567890abcdef --availability-zone us-west-2a --region us-west-2

Replace snap-01234567890abcdef with the ID of the snapshot copy in the DR region, and us-west-2a with the Availability Zone of the instance you will attach the volume to.

With Terraform in your toolkit, implementing a Backup, Restore, and Cross-Region Replication strategy is within your reach. This is a significant stride towards fortifying your AWS applications against disasters.

This concludes our exploration of AWS’s Backup and Restore disaster recovery strategy. Understanding and implementing this strategy is a critical step towards maintaining the resilience of your AWS workloads. Stay tuned for the next part of this series where we will delve into the Pilot Light DR strategy on AWS.
