Windows Cold DR using PD Async Replication

Optimize costs for Windows workloads using Persistent Disk Async Replication

Before launching your critical workloads in the cloud, you need to plan for disaster recovery. A robust disaster recovery (DR) strategy can minimize business disruption by enabling rapid recovery in another region. For most traditional Windows Server environments, DR also requires careful consideration of Microsoft licensing compliance, costs, and how quickly you can recover — including the time it takes to manually remediate any domain issues on your member servers.

Google Cloud offers various options for robust protection. In this blog, we focus on Persistent Disk Asynchronous Replication (PD Async Replication), which has been generally available since June 2023.

PD Async Replication delivers quick recovery from unforeseen disasters. It replicates storage blocks across regions, achieving a Recovery Point Objective (RPO) of under one minute, and helps reduce Recovery Time Objective (RTO). In the unlikely event of a regional compute outage in the workload’s primary region, PD Async Replication helps ensure workload data is available in the DR region by replicating both boot and data disks. These replicated workloads can then be spun up quickly and programmatically using tools like Terraform or the gcloud SDK to minimize the business impact.

Customers running Windows Server on Google Compute Engine can see particular benefit from this capability, as it minimizes licensing costs, speeds up recovery, and reduces the amount of manual intervention that might be required in a traditional DR solution. The on-demand licensing model associated with Windows Server instances incurs charges only for running virtual machines, not disks. Notably, if PD Async Replication is used with disks that are not attached to running VMs, there are no licensing costs. Therefore, limiting VM activation within the DR region solely to actual disaster scenarios (including testing) presents an opportunity to minimize licensing costs.
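
To illustrate how a replication pair is established for a single disk, here is a minimal gcloud sketch. The disk names, zones, and project IDs are placeholders, and flag support can vary by gcloud SDK version, so treat it as an outline rather than a definitive runbook:

# Create a secondary (replica) disk in the DR region that references the primary disk
gcloud compute disks create my-boot-disk-replica \
--project=DR_PROJECT_ID \
--zone=us-central1-a \
--size=100GB \
--primary-disk=my-boot-disk \
--primary-disk-zone=us-east4-a \
--primary-disk-project=PROD_PROJECT_ID

# Start asynchronous replication from the primary disk to the secondary disk
gcloud compute disks start-async-replication my-boot-disk \
--project=PROD_PROJECT_ID \
--zone=us-east4-a \
--secondary-disk=my-boot-disk-replica \
--secondary-disk-zone=us-central1-a \
--secondary-disk-project=DR_PROJECT_ID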

In the following example, we examine a small Windows Server environment with Active Directory. Assume there is an on-premises component running Active Directory domain controllers, but there are also domain controllers in Google Cloud. This example has a single production region, us-east4, with us-central1 designated as the DR region. Cloud DNS is configured according to best practices for DNS forwarding and DNS peering in Active Directory environments. Let's take a look at a sample architecture diagram and dissect it further:

Architecture diagram: https://storage.googleapis.com/gweb-cloudblog-publish/images/image1_8ILFVkA.max-1500x1500.png

PD Async Replication plays a crucial role in safeguarding data continuity within this DR environment. It ensures the contents of all boot and data disks, including Active Directory information, from the production region (us-east4), are mirrored onto disks in the designated DR region (us-central1). This guarantees data and Active Directory availability even if a catastrophic event renders us-east4 inaccessible.

While storage replication is paramount, quick recovery hinges on additional configuration. In typical DR scenarios, different Classless Inter-Domain Routing (CIDR) blocks are used for the two environments. This example differs by using the same IP subnet range for both production and DR. This strategic choice enables rapid Windows Server recovery by eliminating the need to reconfigure network interfaces, adjust on-premises routing, or change Active Directory DNS records. Notably, the peering link between the DR VPC and the shared services VPC is left unestablished until a DR event, which avoids IP conflicts with the identically addressed production network.

During a DR event, the following processes occur in quick succession:

  • Production VMs are powered off (if possible — depending on the disaster scenario) 

  • PD Async Replication is stopped

  • VPC Peering from the Production environment is severed

  • Disaster Recovery VMs are built using the same configuration as the Production VMs

    • Instance configuration including Network IP will be the same

    • Replicated Boot and Data disks from Production VMs are used during instance creation

  • VPC Peering to the DR environment is established

By preserving the IP address along with the low RPO/RTO replication of the disks, the DR VMs boot up with the existing Production configuration without the need to rejoin the Domain. This also means that you can follow this same process and mirror disks in DR back to production for easy failback once your disaster scenario is over, providing a full end-to-end solution.
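
To make the failover steps concrete, here is a hedged gcloud sketch for a single server. The instance, disk, and subnet names are placeholders; the internal IP is set to the server's original production address so it comes up without needing to rejoin the domain:

# Stop replication so the secondary disks become standalone, attachable disks
gcloud compute disks stop-async-replication my-boot-disk-replica \
--project=DR_PROJECT_ID \
--zone=us-central1-a

# Recreate the VM in the DR region from the replicated disks, preserving the production internal IP
gcloud compute instances create my-windows-server \
--project=DR_PROJECT_ID \
--zone=us-central1-a \
--machine-type=n2-standard-4 \
--subnet=projects/HOST_PROJECT_ID/regions/us-central1/subnetworks/dr-app-us-central1 \
--private-network-ip=10.1.0.10 \
--disk=name=my-boot-disk-replica,boot=yes,mode=rw \
--disk=name=my-data-disk-replica,mode=rw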

You can use PD Async Replication to protect a variety of Windows workloads including SQL Server; it is especially useful in cases where the use of native replication technology (such as SQL Server AlwaysOn) might be cost-prohibitive due to the additional licensing required. 

You can find the Terraform code for this example deployment, along with detailed instructions on how to use it, in this repository.

This blog post introduces PD Async Replication and its key features. However, for in-depth understanding and to confirm it aligns with your needs, please review the product page. You can find pricing for PD Asynchronous Replication here.

Windows Cold DR using PD Async Replication

This code will facilitate the creation of 10 VMs that will auto-join a domain, a DR failover, and a failback. The domain controller was built manually to support this solution, and should be created first if you wish to test with one. This repo contains the code to build the secondary boot disk and establish the asynchronous replication for the manually created domain controller. This solution uses Persistent Disk Asynchronous Replication, and requires the use of an instance template to facilitate the creation of Windows servers. A sample gcloud command is included in /setup/templatefiles; an illustrative version also appears below.
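
For illustration only (the command shipped in the repo is authoritative), an instance template along these lines could be used; the machine type, image family, and subnet path here are assumptions to adapt to your environment:

gcloud compute instance-templates create windows-app-template \
--project=PROD_SERVICE_PROJECT_ID \
--machine-type=n2-standard-4 \
--image-family=windows-2022 \
--image-project=windows-cloud \
--subnet=projects/HOST_PROJECT_ID/regions/us-east4/subnetworks/prod-app-us-east4 \
--no-address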

VPC networks, subnets, peerings, and Cloud DNS configurations are not provided with this code at this time, but can be referenced in the architecture diagram below.

Locals are used in each folder for ease of manipulation. These can be extracted into CSV files for customization and use within your own organization. For this demo, the locals contain the server names and some other information needed for DR and failback. Please review the contents and modify, if desired.

Assumptions for this repo

  • This is designed for region to region failover -- if a single zone is having issues, there might be better options.
  • As of 01/2023, managed services like Cloud SQL and networking products like Private Service Connect have not been tested.
  • During a DR event, it is your responsibility to ensure no connectivity is flowing into the production environment. This solution does not cover any move of external IP addresses or load balancing, nor egress from the DR VPC.
    • Public IPs may or may not change depending on the presence and specifics of web-facing apps, internet access specifics, security services, etc.
  • This was designed with Shared VPC use in mind. It is your responsibility to ensure all permissions and subnets are assigned pre-DR.
  • Each production region will require its own Shared VPC due to the VPC Peering requirement. You may opt for a single DR Shared VPC, but could also have multiple DR Shared VPCs to mirror production.
  • The Cloud Routers and Cloud NAT are there for outbound internet access and are optional components.
  • There are service projects for production, service projects for DR, and separate Shared VPCs to accommodate each environment using the same IP range. This solution requires an architecture similar to this (on-premises components fully optional):

Architecture diagram: Windows Cold DR with PD Async Replication

Setup Folder

Contains code to spin up 10 Windows Server VMs and join them to a domain, create DR boot disks in the DR region for all 10 (plus the domain controller, if used), and create the asynchronous replication pairs for all disks. The code is written to preserve IP addresses.

DR Folder

Contains code to spin up DR servers using the replicated disks and IP addresses from production (including the domain controller), create failback boot disks in the production region, and create the failback async replication pairs for all disks. The IP addresses are preserved for failback.

Failback Folder

Contains code to spin up failback/production servers using the replicated disks and IP addresses from DR, and includes code to recreate DR boot disks and async replication pairs to prepare for the next DR event.

How to Setup the Environment

As of 01/2024, this repo does not contain the code necessary to build out an entire environment. Some general steps and guidelines are provided here in order to help with this demo.

Note

These instructions assume that you are building out the same environment as shown in the architecture diagram.

Organization Requirements

This demo uses a Shared VPC architecture which requires the use of a Google Cloud Organization.

IAM Requirements

The following IAM Roles are required for this demo

  1. Project Creator
  2. Project Deleter
  3. Billing Account User for the Billing Account in your Organization
  4. Compute Admin
  5. Compute Shared VPC Admin
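
One hedged way to grant these from the command line is shown below; the organization ID, billing account ID, and member are placeholders, and you may prefer to scope Compute Admin to the individual projects rather than the organization:

export org_id="REPLACE_WITH_ORGANIZATION_ID"
export member="user:REPLACE_WITH_YOUR_EMAIL"

gcloud organizations add-iam-policy-binding $org_id --member=$member --role=roles/resourcemanager.projectCreator
gcloud organizations add-iam-policy-binding $org_id --member=$member --role=roles/resourcemanager.projectDeleter
gcloud organizations add-iam-policy-binding $org_id --member=$member --role=roles/compute.admin
gcloud organizations add-iam-policy-binding $org_id --member=$member --role=roles/compute.xpnAdmin
gcloud billing accounts add-iam-policy-binding REPLACE_WITH_BILLING_ACCOUNT_ID --member=$member --role=roles/billing.user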

Building The Environment

  1. Create three (3) projects in Google Cloud
    • Project #1: Shared VPC Host Project
    • Project #2: Service Project for Production
    • Project #3: Service Project for DR
    • Enable Compute Engine, Cloud DNS, and Cloud IAP APIs in all three projects (you will have to switch between projects in gcloud)

      gcloud services enable compute.googleapis.com dns.googleapis.com iap.googleapis.com
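
Step 1 can also be scripted. A hedged sketch for creating the projects and linking billing before enabling the APIs (the project IDs below are placeholders and must be globally unique):

export org_id="REPLACE_WITH_ORGANIZATION_ID"
export billing_account="REPLACE_WITH_BILLING_ACCOUNT_ID"

for project in my-shared-vpc-host my-app-prod-svc my-app-dr-svc
do
	gcloud projects create $project --organization=$org_id
	gcloud billing projects link $project --billing-account=$billing_account
done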

  2. Create three (3) VPCs in the Shared VPC Host Project
    • VPC #1: Shared Services VPC as shared-svcs with global routing
      • Shared Services Subnet #1: sn-shrdsvcs-us-east4 in us-east4 with IP range 10.0.0.0/21 and Private Google Access enabled
      • Shared Services Subnet #2: sn-shrdsvcs-us-central1 in us-central1 with IP range 10.20.0.0/21 and Private Google Access enabled
    • VPC #2: Production VPC as app-prod with global routing
      • Production Subnet #1: prod-app-us-east4 in us-east4 with IP range 10.1.0.0/21 and Private Google Access enabled
    • VPC #3: DR VPC as app-dr with global routing
      • DR Subnet #1: dr-app-us-central1 in us-central1 with IP range 10.1.0.0/21 and Private Google Access enabled. Note: the IP range in DR is the same as Production
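
Because the VPC and subnet definitions are not included in the repo, here is a hedged gcloud sketch of step 2 using the names and ranges listed above; adjust them to your own environment:

export shared_vpc_host_project="REPLACE_WITH_SHARED_VPC_HOST_PROJECT_PROJECT_ID"

# Shared Services VPC and subnets
gcloud compute networks create shared-svcs --project=$shared_vpc_host_project \
--subnet-mode=custom --bgp-routing-mode=global
gcloud compute networks subnets create sn-shrdsvcs-us-east4 --project=$shared_vpc_host_project \
--network=shared-svcs --region=us-east4 --range=10.0.0.0/21 --enable-private-ip-google-access
gcloud compute networks subnets create sn-shrdsvcs-us-central1 --project=$shared_vpc_host_project \
--network=shared-svcs --region=us-central1 --range=10.20.0.0/21 --enable-private-ip-google-access

# Production VPC and subnet
gcloud compute networks create app-prod --project=$shared_vpc_host_project \
--subnet-mode=custom --bgp-routing-mode=global
gcloud compute networks subnets create prod-app-us-east4 --project=$shared_vpc_host_project \
--network=app-prod --region=us-east4 --range=10.1.0.0/21 --enable-private-ip-google-access

# DR VPC and subnet (intentionally the same range as production)
gcloud compute networks create app-dr --project=$shared_vpc_host_project \
--subnet-mode=custom --bgp-routing-mode=global
gcloud compute networks subnets create dr-app-us-central1 --project=$shared_vpc_host_project \
--network=app-dr --region=us-central1 --range=10.1.0.0/21 --enable-private-ip-google-access
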
  3. Create a VPC Peering configuration between the shared-svcs VPC and app-prod VPC
    • You will also need to create a VPC Peering from the app-prod VPC to the shared-svcs VPC
    • Optional You can pre-stage the peering from the app-dr VPC to the shared-svcs VPC to save time in a DR event, but it is not required at this time.
export shared_vpc_host_project="REPLACE_WITH_SHARED_VPC_HOST_PROJECT_PROJECT_ID"

# Create VPC Peering between shared-svcs and prod-vpc
gcloud compute networks peerings create shared-svcs-vpc-to-prod-vpc \
--project=$shared_vpc_host_project \
--network=shared-svcs \
--peer-network=app-prod

# Create VPC Peering between prod-vpc and shared-svcs
gcloud compute networks peerings create prod-vpc-to-shared-svcs-vpc \
--project=$shared_vpc_host_project \
--network=app-prod \
--peer-network=shared-svcs

# Run this command to verify that the new Peerings are showing as ACTIVE
gcloud compute networks peerings list \
--project=$shared_vpc_host_project \
--flatten="peerings[]" \
--format="table(peerings.name,peerings.state)"
  4. Enable the Shared VPC Host Project
  5. Attach the Production and DR Service Projects
    • Ensure that you share prod-app-us-east4 with the Production Project only
    • Ensure that you share dr-app-us-central1 with the DR Project only
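
A hedged gcloud sketch of steps 4 and 5 follows. The service-account members are placeholders; grant roles/compute.networkUser on each subnet to whichever principals will create VMs in that service project:

export shared_vpc_host_project="REPLACE_WITH_SHARED_VPC_HOST_PROJECT_PROJECT_ID"
export app_prod_project="REPLACE_WITH_SERVICE_PROJECT_FOR_PRODUCTION_PROJECT_ID"
export app_dr_project="REPLACE_WITH_SERVICE_PROJECT_FOR_DR_PROJECT_ID"

# Enable the host project and attach both service projects
gcloud compute shared-vpc enable $shared_vpc_host_project
gcloud compute shared-vpc associated-projects add $app_prod_project --host-project=$shared_vpc_host_project
gcloud compute shared-vpc associated-projects add $app_dr_project --host-project=$shared_vpc_host_project

# Share each subnet with its matching service project only
gcloud compute networks subnets add-iam-policy-binding prod-app-us-east4 \
--project=$shared_vpc_host_project --region=us-east4 \
--member="serviceAccount:PROD_PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
--role="roles/compute.networkUser"

gcloud compute networks subnets add-iam-policy-binding dr-app-us-central1 \
--project=$shared_vpc_host_project --region=us-central1 \
--member="serviceAccount:DR_PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
--role="roles/compute.networkUser"
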
  6. In the Shared VPC Host Project, configure Cloud DNS per best practices to support your domain and Active Directory. You will need a forwarding zone for your domain associated with Shared Services, and DNS Peering from Shared Services to the other VPCs to support domain resolution.
    • More info on Cloud DNS can be found here.
  7. Optional If you wish to test with an Active Directory Domain, you can set up a Domain Controller in the Production Project using the app-prod VPC
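
A hedged sketch of the Cloud DNS configuration in step 6 is shown below. The domain name, zone names, and forwarding target are placeholders, and the exact zone topology (which networks get forwarding zones versus peering zones) should follow the Cloud DNS best practices referenced above:

export shared_vpc_host_project="REPLACE_WITH_SHARED_VPC_HOST_PROJECT_PROJECT_ID"

# Private forwarding zone that sends queries for the AD domain to a domain controller
gcloud dns managed-zones create ad-domain-forwarding \
--project=$shared_vpc_host_project \
--description="Forward AD domain queries to the domain controllers" \
--dns-name="example.corp." \
--visibility=private \
--networks=shared-svcs \
--forwarding-targets=REPLACE_WITH_DOMAIN_CONTROLLER_IP

# DNS peering zone so the app-prod VPC resolves the AD domain through shared-svcs
gcloud dns managed-zones create ad-domain-peering-prod \
--project=$shared_vpc_host_project \
--description="DNS peering for the AD domain" \
--dns-name="example.corp." \
--visibility=private \
--networks=app-prod \
--target-network=shared-svcs \
--target-project=$shared_vpc_host_project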

Building The Test Servers

Important

If you are not using a Domain Controller to test, please ensure that the use-domain-controller variable in terraform.tfvars is set to false

  1. Create an Instance template in the Service Project for Production

    • A sample gcloud command has been provided in the /setup/templatefiles folder for your convenience
  2. Navigate to the /setup folder and update the terraform.tfvars file with the appropriate variables for your environment

    • If you are using a Domain Controller, navigate to the /setup/templatefiles folder and update ad-join.tpl with your values.
  3. While in the /setup directory run the terraform commands

    • terraform init
    • terraform plan -out tf.out (there should be 42 resources to add)
    • terraform apply tf.out

    The default configuration will deploy

    • Ten (10) Windows Servers (Domain joined if configured)
    • Secondary boot disks in the DR Project
    • Async replication to DR

Note

Please allow 15-20 minutes for initial replication to complete. If using your own systems with larger disks, initial replication time may be longer. The initial replication is complete when the compute.googleapis.com/disk/async_replication/time_since_last_replication metric is available in Cloud Monitoring.

# --- From the Service Project for Production ---
# Open Cloud Monitoring > Metrics explorer > Click on "< > MQL" on the top right > Paste the following MQL
# If nothing loads it means that replication has not taken place yet.
# You can enable auto-refresh by clicking the button right next to "SAVE CHART"
fetch gce_disk
| metric
    'compute.googleapis.com/disk/async_replication/time_since_last_replication'
| group_by 1m,
    [value_time_since_last_replication_mean:
      mean(value.time_since_last_replication)]
| every 1m
| group_by [],
    [value_time_since_last_replication_mean_aggregate:
      aggregate(value_time_since_last_replication_mean)]
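
If you prefer a command-line spot check in addition to the Cloud Monitoring chart, you can also describe one of the primary disks and review the asynchronous replication details in the output (the disk name, zone, and project below are placeholders):

gcloud compute disks describe REPLACE_WITH_DISK_NAME \
--project=REPLACE_WITH_SERVICE_PROJECT_FOR_PRODUCTION_PROJECT_ID \
--zone=us-east4-a \
--format=yaml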

DR Failover

Important

If you are not using a Domain Controller to test, please ensure that the use-domain-controller variable in terraform.tfvars is set to false

  1. Simulate a DR event (e.g. shut down the production VMs)
export app_prod_project="REPLACE_WITH_SERVICE_PROJECT_FOR_PRODUCTION_PROJECT_ID"
# Note: this assumes all production VMs are in the same zone (the zone of the first instance listed)
export zone=$(gcloud compute instances list --project=$app_prod_project --format="value(zone.basename())" | head -n 1)
for gce_instance in $(gcloud compute instances list --project=$app_prod_project --format="value(selfLink.basename())")
do
	gcloud compute instances stop $gce_instance --zone $zone --project=$app_prod_project
done
  2. Navigate to the /setup folder and rename prod-async-rep.tf to prod-async-rep.tf.dr. While in the /setup directory run the terraform commands to stop the asynchronous replication.

    • terraform plan -out tf.out (there should be 10 or 11 resources to destroy)
    • terraform apply tf.out
  3. Sever the Peering from the shared-svcs VPC to the app-prod VPC and establish a VPC Peering from the shared-svcs VPC to the app-dr VPC

export shared_vpc_host_project="REPLACE_WITH_SHARED_VPC_HOST_PROJECT_PROJECT_ID"

# Sever the Peering to prod-vpc
gcloud compute networks peerings delete "shared-svcs-vpc-to-prod-vpc" \
--project=$shared_vpc_host_project \
--network=shared-svcs

# Create VPC Peering between shared-svcs and dr-vpc
gcloud compute networks peerings create shared-svcs-vpc-to-dr-vpc \
--project=$shared_vpc_host_project \
--network=shared-svcs \
--peer-network=app-dr

# Create VPC Peering between dr-vpc and shared-svcs
gcloud compute networks peerings create dr-vpc-to-shared-svcs-vpc \
--project=$shared_vpc_host_project \
--network=app-dr \
--peer-network=shared-svcs

# Run this command to verify that the new Peerings are showing as ACTIVE
gcloud compute networks peerings list \
--project=$shared_vpc_host_project \
--flatten="peerings[]" \
--format="table(peerings.name,peerings.state)"
  4. Navigate to the /dr folder and update the terraform.tfvars file with the appropriate variables for your environment

  5. While in the /dr folder, run the terraform commands to create the DR VMs using the replicated disks from Production

    • terraform init
    • terraform plan -out tf.out (there should be 10 or 11 resources to create)
    • terraform apply tf.out
  6. Validate all servers and applications are back online and connected to the domain

  7. Delete the old production VMs and their disks

export app_prod_project="REPLACE_WITH_SERVICE_PROJECT_FOR_PRODUCTION_PROJECT_ID"
export zone=$(gcloud compute instances list --project=$app_prod_project --format="value(zone.basename())" | head -n 1)
for gce_instance in $(gcloud compute instances list --project=$app_prod_project --format="value(selfLink.basename())")
do
	gcloud compute instances delete $gce_instance --zone $zone --project=$app_prod_project --quiet
done
  8. Rename stage-failback-async-boot-disks.tf.dr to stage-failback-async-boot-disks.tf and stage-failback-async-rep.tf.dr to stage-failback-async-rep.tf

  9. While in the /dr folder, run the terraform commands to create new boot disks in the Production region for failback, and the associated async replication pairs from DR

    • terraform plan -out tf.out (should see 22 resources to add)
    • terraform apply tf.out

Note

Please allow 15-20 minutes for initial replication to complete. If using your own systems with larger disks, initial replication time may be longer. The initial replication is complete when the compute.googleapis.com/disk/async_replication/time_since_last_replication metric is available in Cloud Monitoring.

# --- From the Service Project for DR ---
# Open Cloud Monitoring > Metrics explorer > Click on "< > MQL" on the top right > Paste the following MQL
# If nothing loads it means that replication has not taken place yet.
# You can enable auto-refresh by clicking the button right next to "SAVE CHART"
fetch gce_disk
| metric
    'compute.googleapis.com/disk/async_replication/time_since_last_replication'
| group_by 1m,
    [value_time_since_last_replication_mean:
      mean(value.time_since_last_replication)]
| every 1m
| group_by [],
    [value_time_since_last_replication_mean_aggregate:
      aggregate(value_time_since_last_replication_mean)]
  10. Navigate to the /failback folder and update the terraform.tfvars file with the appropriate variables for your environment to prepare for production failback

  11. Optional While in the /failback folder, run the terraform commands to prepare for failback

    • terraform init
    • terraform plan -out tf.out (there should be 10 or 11 resources to create)

Production Failback

Important

If you are not using a Domain Controller to test, please ensure that the use-domain-controller variable in terraform.tfvars is set to false

  1. Shut down DR VMs
export app_dr_project="REPLACE_WITH_SERVICE_PROJECT_FOR_DR_PROJECT_ID"
export zone=$(gcloud compute instances list --project=$app_dr_project --format="value(zone.basename())" | head -n 1)
for gce_instance in $(gcloud compute instances list --project=$app_dr_project --format="value(selfLink.basename())")
do
	gcloud compute instances stop $gce_instance --zone $zone --project=$app_dr_project
done
  2. Navigate to the /dr folder

  3. Rename stage-failback-async-rep.tf to stage-failback-async-rep.tf.dr

  4. While in the /dr folder, run the terraform commands to stop replication

    • terraform plan -out tf.out (should see 11 resources to destroy)
    • terraform apply tf.out
  5. In the console, sever VPC peering from shared-svcs to app-dr, and establish VPC peering from shared-svcs to app-prod

export shared_vpc_host_project="REPLACE_WITH_SHARED_VPC_HOST_PROJECT_PROJECT_ID"

# Sever the Peering to dr-vpc
gcloud compute networks peerings delete "shared-svcs-vpc-to-dr-vpc" \
--project=$shared_vpc_host_project \
--network=shared-svcs

# Create VPC Peering between shared-svcs and prod-vpc
gcloud compute networks peerings create shared-svcs-vpc-to-prod-vpc \
--project=$shared_vpc_host_project \
--network=shared-svcs \
--peer-network=app-prod

# Run this command to verify that the new Peerings are showing as ACTIVE
gcloud compute networks peerings list \
--project=$shared_vpc_host_project \
--flatten="peerings[]" \
--format="table(peerings.name,peerings.state)"
  6. Navigate to the /failback folder

  7. While in the /failback folder, run the terraform commands to recover your VMs in the original production region using the replicated disks from DR

    • terraform plan -out tf.out (should see 10 or 11 resources to add)
    • terraform apply tf.out
  8. Validate all servers and applications are back online and connected to the domain

  9. Delete the old DR VMs and their disks

export app_dr_project="REPLACE_WITH_SERVICE_PROJECT_FOR_DR_PROJECT_ID"
export zone=$(gcloud compute instances list --project=$app_dr_project --format="value(zone.basename())" | head -n 1)
for gce_instance in $(gcloud compute instances list --project=$app_dr_project --format="value(selfLink.basename())")
do
	gcloud compute instances delete $gce_instance --zone $zone --project=$app_dr_project --quiet
done
  10. Rename restage-dr-async-boot-disks.tf.failback to restage-dr-async-boot-disks.tf and restage-dr-async-rep.tf.failback to restage-dr-async-rep.tf

  11. While in the /failback folder, run the terraform commands to re-create new boot disks in the DR region, and the associated async replication pairs to prepare for the next DR event

    • terraform plan -out tf.out (should see 20 or 22 resources to add)
    • terraform apply tf.out

Note

Please allow 15-20 minutes for initial replication to complete. If using your own systems with larger disks, initial replication time may be longer. The initial replication is complete when the compute.googleapis.com/disk/async_replication/time_since_last_replication metric is available in Cloud Monitoring.

# --- From the Service Project for Production ---
# Open Cloud Monitoring > Metrics explorer > Click on "< > MQL" on the top right > Paste the following MQL
# If nothing loads it means that replication has not taken place yet.
# You can enable auto-refresh by clicking the button right next to "SAVE CHART"
fetch gce_disk
| metric
    'compute.googleapis.com/disk/async_replication/time_since_last_replication'
| group_by 1m,
    [value_time_since_last_replication_mean:
      mean(value.time_since_last_replication)]
| every 1m
| group_by [],
    [value_time_since_last_replication_mean_aggregate:
      aggregate(value_time_since_last_replication_mean)]

Future DR and Failback Events

In the case of future DR events, you would follow the steps in the DR Failover section, with the following exceptions:

  1. Navigate to the /failback folder (instead of the /setup folder) and rename restage-dr-async-rep.tf to restage-dr-async-rep.tf.failback (instead of renaming prod-async-rep.tf to prod-async-rep.tf.dr)

  2. Rename stage-failback-async-boot-disks.tf.dr to stage-failback-async-boot-disks.tf and stage-failback-async-rep.tf.dr to stage-failback-async-rep.tf

And similarly, to fail back, you would follow the steps in the Production Failback section, with the following exceptions:

  1. Rename restage-dr-async-boot-disks.tf.failback to restage-dr-async-boot-disks.tf and restage-dr-async-rep.tf.failback to restage-dr-async-rep.tf

Cleanup

  1. Navigate to the /failback folder

    • terraform destroy
  2. Navigate to the /dr folder

    • terraform destroy
  3. Navigate to the /setup folder

    • terraform destroy