Before launching your critical workloads in the cloud, you need to plan for disaster recovery. A robust disaster recovery (DR) strategy can minimize business disruption by enabling rapid recovery in another region. For most traditional Windows Server environments, DR also requires careful consideration of Microsoft licensing compliance, costs, and how quickly you can recover — including the time it takes to manually remediate any domain issues on your member servers.
Google Cloud offers various options for robust protection. In this blog, we focus on Persistent Disk Asynchronous Replication (PD Async Replication), which has been generally available since June 2023.
PD Async Replication delivers quick recovery from unforeseen disasters. It replicates storage blocks across regions, achieving a Recovery Point Objective (RPO) of under one minute and helping reduce Recovery Time Objective (RTO). In the unlikely event of a regional compute outage in the workload's primary region, PD Async Replication helps ensure workload data is available in the DR region by replicating both boot and data disks. These replicated workloads can then be spun up quickly and programmatically using tools like Terraform or the gcloud SDK to minimize the business impact.

Customers running Windows Server on Google Compute Engine can see particular benefit from this capability, as it minimizes licensing costs, speeds up recovery, and reduces the amount of manual intervention that might be required in a traditional DR solution. The on-demand licensing model for Windows Server instances incurs charges only for running virtual machines, not disks. Notably, if PD Async Replication is used with disks that are not attached to running VMs, there are no licensing costs. Limiting VM activation in the DR region to actual disaster scenarios (including testing) therefore presents an opportunity to minimize licensing costs.
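Mechanically, protecting a disk with PD Async Replication is a two-step operation: create a secondary disk in the DR region that references the primary disk, then start replication on the primary. The sketch below illustrates the pattern with hypothetical disk names, zones, and project variables (not the values used in the example that follows); check the product documentation for the authoritative flags:

# Create a secondary disk in the DR region that references the primary disk
# (disk names, zones, and project variables here are placeholders)
gcloud compute disks create my-server-boot-dr \
    --project=$dr_project \
    --zone=us-central1-a \
    --type=pd-balanced \
    --size=50GB \
    --primary-disk=my-server-boot \
    --primary-disk-zone=us-east4-a \
    --primary-disk-project=$prod_project

# Start replicating blocks from the primary disk to the new secondary disk
gcloud compute disks start-async-replication my-server-boot \
    --project=$prod_project \
    --zone=us-east4-a \
    --secondary-disk=my-server-boot-dr \
    --secondary-disk-zone=us-central1-a \
    --secondary-disk-project=$dr_project

The secondary disk simply receives replicated blocks; no VM (and therefore no Windows Server license) is consumed in the DR region until you choose to attach it to an instance.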
In the following example, we examine a small Windows Server environment with Active Directory. Let's assume there is an on-premises component running Active Directory domain controllers, but there are also domain controllers in Google Cloud. This example has a single production region, us-east4, and the DR region is us-central1. Cloud DNS is configured according to best practices for Domain Forwarding and Domain Peering for Active Directory environments. Let's take a look at a sample architecture diagram and dissect it further:
PD Async Replication plays a crucial role in safeguarding data continuity within this DR environment. It ensures the contents of all boot and data disks, including Active Directory information, from the production region (us-east4), are mirrored onto disks in the designated DR region (us-central1). This guarantees data and Active Directory availability even if a catastrophic event renders us-east4 inaccessible.
While storage replication is paramount, recovering quickly hinges on additional configurations. In typical DR scenarios, different Classless Inter-Domain Routing (CIDR) blocks are implemented for both environments. However, this specific example differs by utilizing the same IP Subnet range for both production and DR. This strategic choice facilitates rapid Windows Server recovery by eliminating the need to reconfigure the network interface, on-premises routing adjustments, or Active Directory DNS changes. Notably, the DR-to-shared-svcs peering link remains incomplete, effectively circumventing potential IP conflict issues.
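To make the "incomplete peering" point concrete: a VPC peering only becomes ACTIVE once both sides have been created, and no subnet routes are exchanged until then. You can therefore pre-create just the DR-side peering and leave the shared-svcs side until an actual DR event. A minimal sketch using the network and variable names from this example (the peering name follows the convention used later in this post):

# Pre-create only the DR side of the peering; it stays INACTIVE with no routes exchanged
gcloud compute networks peerings create dr-vpc-to-shared-svcs-vpc \
    --project=$shared_vpc_host_project \
    --network=app-dr \
    --peer-network=shared-svcs

# Verify the state (expect INACTIVE until the shared-svcs side is created during failover)
gcloud compute networks peerings list \
    --project=$shared_vpc_host_project \
    --network=app-dr \
    --flatten="peerings[]" \
    --format="table(peerings.name,peerings.state)"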
During a DR event, the following processes occur in quick succession:
Production VMs are powered off (if possible — depending on the disaster scenario)
PD Async Replication is stopped
VPC Peering from the Production environment is severed
Disaster Recovery VMs are built using the same configuration as the Production VMs
Instance configuration including Network IP will be the same
Replicated Boot and Data disks from Production VMs are used during instance creation
VPC Peering to the DR environment is established
By preserving the IP address along with the low RPO/RTO replication of the disks, the DR VMs boot up with the existing Production configuration without the need to rejoin the Domain. This also means that you can follow this same process and mirror disks in DR back to production for easy failback once your disaster scenario is over, providing a full end-to-end solution.
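At the disk and instance level, the failover boils down to stopping replication and rebuilding each VM from its replicated disks with its original internal IP. Below is a minimal gcloud sketch of that idea for a single server; the instance, disk, IP, and machine-type values are hypothetical, and the repository performs these steps with Terraform:

# Stop replication so the secondary disks become regular, attachable disks
# (this can be issued against the secondary disk if the primary region is unreachable)
gcloud compute disks stop-async-replication my-server-boot-dr \
    --project=$dr_project \
    --zone=us-central1-a

# Rebuild the server in the DR region from its replicated boot and data disks,
# reusing its production internal IP so no domain or DNS changes are needed
gcloud compute instances create my-server \
    --project=$dr_project \
    --zone=us-central1-a \
    --machine-type=n2-standard-4 \
    --subnet=projects/$shared_vpc_host_project/regions/us-central1/subnetworks/dr-app-us-central1 \
    --private-network-ip=10.1.0.10 \
    --disk=name=my-server-boot-dr,boot=yes \
    --disk=name=my-server-data-dr,boot=no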
You can use PD Async Replication to protect a variety of Windows workloads including SQL Server; it is especially useful in cases where the use of native replication technology (such as SQL Server AlwaysOn) might be cost-prohibitive due to the additional licensing required.
You can find the Terraform code for this example deployment, along with detailed instructions on how to use it, in this repository.
This blog post introduces PD Async Replication and its key features. However, for in-depth understanding and to confirm it aligns with your needs, please review the product page. You can find pricing for PD Asynchronous Replication here.
This code facilitates the creation of 10 VMs that auto-join a domain, a DR failover, and a failback. The domain controller was built manually to support this solution, and should be created first if you wish to test with one. This repo contains the code to build the secondary boot disk and establish the asynchronous replication for the manually created domain controller. This solution uses Persistent Disk Asynchronous Replication, and requires the use of an instance template to facilitate the creation of Windows servers. A sample gcloud command is included in /setup/templatefiles.
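The instance template itself is environment-specific, so only a sample command ships with the repo. Purely as an illustration of the shape of that command (the machine type, image family, and subnet below are hypothetical placeholders rather than the repository's values):

gcloud compute instance-templates create windows-prod-template \
    --project=$app_prod_project \
    --machine-type=n2-standard-4 \
    --image-family=windows-2022 \
    --image-project=windows-cloud \
    --subnet=projects/$shared_vpc_host_project/regions/us-east4/subnetworks/prod-app-us-east4

The Terraform code then creates the Windows servers from this template.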
VPC networks, subnets, peerings, and Cloud DNS configurations are not provided with this code at this time, but can be referenced in the architecture diagram below.
Locals are used in each folder for ease of manipulation. These can be extracted into CSV files for customization and use within your own organization. For this demo, the locals contain the server names and some other information needed for DR and failback. Please review the contents and modify, if desired.
/setup: Contains code to spin up 10 Windows Server VMs and join them to a domain, create DR boot disks for all 10 (and the domain controller, if using) in the DR region, and create the asynchronous replication pairs for all disks. The code is written to preserve IP addresses.
/dr: Contains code to spin up DR servers using the replicated disks and IP addresses from production (including the domain controller), create failback boot disks in the production region, and create the failback async replication pairs for all disks. The IP addresses are preserved for failback.
/failback: Contains code to spin up failback/production servers using the replicated disks and IP addresses from DR, and includes code to recreate DR boot disks and async replication pairs to prepare for the next DR event.
As of 01/2024, this repo does not contain the code necessary to build out an entire environment. Some general steps and guidelines are provided here in order to help with this demo.
Note
These instructions assume that you are building out the same environment as shown in the Architecture diagram
This demo uses a Shared VPC architecture which requires the use of a Google Cloud Organization.
The following IAM Roles are required for this demo
gcloud services enable compute.googleapis.com dns.googleapis.com iap.googleapis.com
Create the following VPC networks and subnets:

shared-svcs, with global routing
- sn-shrdsvcs-us-east4 in us-east4, with IP range 10.0.0.0/21 and Private Google Access enabled
- sn-shrdsvcs-us-central1 in us-central1, with IP range 10.20.0.0/21 and Private Google Access enabled

app-prod, with global routing
- prod-app-us-east4 in us-east4, with IP range 10.1.0.0/21 and Private Google Access enabled

app-dr, with global routing
- dr-app-us-central1 in us-central1, with IP range 10.1.0.0/21 and Private Google Access enabled

Note
The IP range in DR is the same as Production.
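These VPC networks and subnets are created manually; they are not part of the Terraform code. As a reference sketch, the shared-svcs network and its subnets could be created along these lines (repeat the pattern for app-prod and app-dr with the ranges listed above):

gcloud compute networks create shared-svcs \
    --project=$shared_vpc_host_project \
    --subnet-mode=custom \
    --bgp-routing-mode=global

gcloud compute networks subnets create sn-shrdsvcs-us-east4 \
    --project=$shared_vpc_host_project \
    --network=shared-svcs \
    --region=us-east4 \
    --range=10.0.0.0/21 \
    --enable-private-ip-google-access

gcloud compute networks subnets create sn-shrdsvcs-us-central1 \
    --project=$shared_vpc_host_project \
    --network=shared-svcs \
    --region=us-central1 \
    --range=10.20.0.0/21 \
    --enable-private-ip-google-access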
Create a VPC Peering between the shared-svcs VPC and app-prod VPC, in both directions (from shared-svcs to app-prod and from app-prod to shared-svcs). You can also pre-create the peering from the app-dr VPC to the shared-svcs VPC to save time in a DR event, but it is not required at this time.

export shared_vpc_host_project="REPLACE_WITH_SHARED_VPC_HOST_PROJECT_PROJECT_ID"

# Create VPC Peering between shared-svcs and prod-vpc
gcloud compute networks peerings create shared-svcs-vpc-to-prod-vpc \
    --project=$shared_vpc_host_project \
    --network=shared-svcs \
    --peer-network=app-prod

# Create VPC Peering between prod-vpc and shared-svcs
gcloud compute networks peerings create prod-vpc-to-shared-svcs-vpc \
    --project=$shared_vpc_host_project \
    --network=app-prod \
    --peer-network=shared-svcs

# Run this command to verify that the new Peerings are showing as ACTIVE
gcloud compute networks peerings list \
    --project=$shared_vpc_host_project \
    --flatten="peerings[]" \
    --format="table(peerings.name,peerings.state)"
Share the Shared VPC subnets with the service projects:
- Share prod-app-us-east4 with the Production Project only
- Share dr-app-us-central1 with the DR Project only

If you are testing with a domain controller, create it manually in the app-prod VPC (it should be created before running the Terraform code).

Important
If you are not using a Domain Controller to test, please ensure that the use-domain-controller variable in terraform.tfvars is set to false
Create an Instance template in the Service Project for Production. A sample gcloud command has been provided in the /setup/templatefiles folder for your convenience.

Navigate to the /setup folder and update the terraform.tfvars file with the appropriate variables for your environment. Update ad-join.tpl with your values.

While in the /setup directory, run the terraform commands:
terraform init
terraform plan -out tf.out (there should be 42 resources to add)
terraform apply tf.out

The default configuration will deploy the resources described above: the 10 domain-joined Windows Server VMs, the DR boot disks (including the domain controller's, if using), and the asynchronous replication pairs.
Note
Please allow 15-20 minutes for initial replication to complete. If using your own systems with larger disks, initial replication time may be longer. The initial replication is complete when the compute.googleapis.com/disk/async_replication/time_since_last_replication
metric is available in Cloud Monitoring.
# --- From the Service Project for Production ---
# Open Cloud Monitoring > Metrics explorer > Click on "< > MQL" on the top right > Paste the following MQL
# If nothing loads it means that replication has not taken place yet.
# You can enable auto-refresh by clicking the button right next to "SAVE CHART"

fetch gce_disk
| metric
    'compute.googleapis.com/disk/async_replication/time_since_last_replication'
| group_by 1m,
    [value_time_since_last_replication_mean:
       mean(value.time_since_last_replication)]
| every 1m
| group_by [],
    [value_time_since_last_replication_mean_aggregate:
       aggregate(value_time_since_last_replication_mean)]
Important
If you are not using a Domain Controller to test, please ensure that the use-domain-controller variable in terraform.tfvars is set to false
Stop the Production VMs:

export app_prod_project="REPLACE_WITH_SERVICE_PROJECT_FOR_PRODUCTION_PROJECT_ID"
export zone=$(gcloud compute instances list --project=$app_prod_project --format="value(zone.basename())" | head -n 1)

for gce_instance in $(gcloud compute instances list --project=$app_prod_project --format="value(selfLink.basename())")
do
  gcloud compute instances stop $gce_instance --zone $zone --project=$app_prod_project
done
Navigate to the /setup folder and rename prod-async-rep.tf to prod-async-rep.tf.dr. While in the /setup directory, run the terraform commands to stop the asynchronous replication:
terraform plan -out tf.out (there should be 10 or 11 resources to destroy)
terraform apply tf.out
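If you prefer gcloud over the Terraform rename, replication can also be stopped per disk from the DR side, which works even when the production region is unreachable. A rough sketch that assumes every disk in the DR zone is a replication secondary (adjust the filtering for your environment); note that doing this outside of Terraform leaves the replication resources in the Terraform state, so the rename/apply step above is the cleaner path for this demo:

export app_dr_project="REPLACE_WITH_SERVICE_PROJECT_FOR_DR_PROJECT_ID"
export dr_zone="us-central1-a"

# Stop replication on each secondary disk so it can be attached to a DR VM
for dr_disk in $(gcloud compute disks list --project=$app_dr_project --zones=$dr_zone --format="value(name)")
do
  gcloud compute disks stop-async-replication $dr_disk --zone=$dr_zone --project=$app_dr_project
done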
Sever the Peering from the shared-svcs VPC to the app-prod VPC and establish a VPC Peering from the shared-svcs VPC to the app-dr VPC:

export shared_vpc_host_project="REPLACE_WITH_SHARED_VPC_HOST_PROJECT_PROJECT_ID"

# Sever the Peering to prod-vpc
gcloud compute networks peerings delete "shared-svcs-vpc-to-prod-vpc" \
    --project=$shared_vpc_host_project \
    --network=shared-svcs

# Create VPC Peering between shared-svcs and dr-vpc
gcloud compute networks peerings create shared-svcs-vpc-to-dr-vpc \
    --project=$shared_vpc_host_project \
    --network=shared-svcs \
    --peer-network=app-dr

# Create VPC Peering between dr-vpc and shared-svcs
gcloud compute networks peerings create dr-vpc-to-shared-svcs-vpc \
    --project=$shared_vpc_host_project \
    --network=app-dr \
    --peer-network=shared-svcs

# Run this command to verify that the new Peerings are showing as ACTIVE
gcloud compute networks peerings list \
    --project=$shared_vpc_host_project \
    --flatten="peerings[]" \
    --format="table(peerings.name,peerings.state)"
Navigate to the /dr folder and update the terraform.tfvars file with the appropriate variables for your environment.

While in the /dr folder, run the terraform commands to create the DR VMs using the replicated disks from Production:
terraform init
terraform plan -out tf.out (there should be 10 or 11 resources to create)
terraform apply tf.out
Validate all servers and applications are back online and connected to the domain
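As a quick infrastructure-level spot check (application and Active Directory validation is still up to you), you can confirm that the DR VMs are running and kept their production internal IPs; the project variable below matches the earlier steps:

export app_dr_project="REPLACE_WITH_SERVICE_PROJECT_FOR_DR_PROJECT_ID"

# Expect STATUS=RUNNING and the same internal IPs as the production VMs
gcloud compute instances list \
    --project=$app_dr_project \
    --format="table(name,status,networkInterfaces[0].networkIP)"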
Delete the old production VMs and their disks:

export app_prod_project="REPLACE_WITH_SERVICE_PROJECT_FOR_PRODUCTION_PROJECT_ID"
export zone=$(gcloud compute instances list --project=$app_prod_project --format="value(zone.basename())" | head -n 1)

for gce_instance in $(gcloud compute instances list --project=$app_prod_project --format="value(selfLink.basename())")
do
  gcloud compute instances delete $gce_instance --zone $zone --project=$app_prod_project --quiet
done
Rename stage-failback-async-boot-disks.tf.dr to stage-failback-async-boot-disks.tf and stage-failback-async-rep.tf.dr to stage-failback-async-rep.tf.

While in the /dr folder, run the terraform commands to create new boot disks in the Production region for failback, and the associated async replication pairs from DR:
terraform plan -out tf.out (should see 22 resources to add)
terraform apply tf.out
Note
Please allow 15-20 minutes for initial replication to complete. If using your own systems with larger disks, initial replication time may be longer. The initial replication is complete when the compute.googleapis.com/disk/async_replication/time_since_last_replication
metric is available in Cloud Monitoring.
# --- From the Service Project for DR ---
# Open Cloud Monitoring > Metrics explorer > Click on "< > MQL" on the top right > Paste the following MQL
# If nothing loads it means that replication has not taken place yet.
# You can enable auto-refresh by clicking the button right next to "SAVE CHART"

fetch gce_disk
| metric
    'compute.googleapis.com/disk/async_replication/time_since_last_replication'
| group_by 1m,
    [value_time_since_last_replication_mean:
       mean(value.time_since_last_replication)]
| every 1m
| group_by [],
    [value_time_since_last_replication_mean_aggregate:
       aggregate(value_time_since_last_replication_mean)]
Navigate to the /failback folder and update the terraform.tfvars file with the appropriate variables for your environment to prepare for production failback.

Optional: While in the /failback folder, run the terraform commands to prepare for failback:
terraform init
terraform plan -out tf.out (there should be 10 or 11 resources to create)

Important
If you are not using a Domain Controller to test, please ensure that the use-domain-controller variable in terraform.tfvars is set to false
Stop the DR VMs:

export app_dr_project="REPLACE_WITH_SERVICE_PROJECT_FOR_DR_PROJECT_ID"
export zone=$(gcloud compute instances list --project=$app_dr_project --format="value(zone.basename())" | head -n 1)

for gce_instance in $(gcloud compute instances list --project=$app_dr_project --format="value(selfLink.basename())")
do
  gcloud compute instances stop $gce_instance --zone $zone --project=$app_dr_project
done
Navigate to the /dr folder and rename stage-failback-async-rep.tf to stage-failback-async-rep.tf.dr.

While in the /dr folder, run the terraform commands to stop replication:
terraform plan -out tf.out (should see 11 resources to destroy)
terraform apply tf.out
In the console, sever the VPC peering from shared-svcs to app-dr, and establish a VPC peering from shared-svcs to app-prod:

export shared_vpc_host_project="REPLACE_WITH_SHARED_VPC_HOST_PROJECT_PROJECT_ID"

# Sever the Peering to dr-vpc
gcloud compute networks peerings delete "shared-svcs-vpc-to-dr-vpc" \
    --project=$shared_vpc_host_project \
    --network=shared-svcs

# Create VPC Peering between shared-svcs and prod-vpc
gcloud compute networks peerings create shared-svcs-vpc-to-prod-vpc \
    --project=$shared_vpc_host_project \
    --network=shared-svcs \
    --peer-network=app-prod

# Run this command to verify that the new Peerings are showing as ACTIVE
gcloud compute networks peerings list \
    --project=$shared_vpc_host_project \
    --flatten="peerings[]" \
    --format="table(peerings.name,peerings.state)"
Navigate to the /failback folder.

While in the /failback folder, run the terraform commands to recover your VMs in the original production region using the replicated disks from DR:
terraform plan -out tf.out (should see 10 or 11 resources to add)
terraform apply tf.out
Validate all servers and applications are back online and connected to the domain
Delete the old DR VMs and their disks:

export app_dr_project="REPLACE_WITH_SERVICE_PROJECT_FOR_DR_PROJECT_ID"
export zone=$(gcloud compute instances list --project=$app_dr_project --format="value(zone.basename())" | head -n 1)

for gce_instance in $(gcloud compute instances list --project=$app_dr_project --format="value(selfLink.basename())")
do
  gcloud compute instances delete $gce_instance --zone $zone --project=$app_dr_project --quiet
done
Rename restage-dr-async-boot-disks.tf.failback to restage-dr-async-boot-disks.tf and restage-dr-async-rep.tf.failback to restage-dr-async-rep.tf.

While in the /failback folder, run the terraform commands to re-create new boot disks in the DR region, and the associated async replication pairs, to prepare for the next DR event:
terraform plan -out tf.out (should see 20 or 22 resources to add)
terraform apply tf.out
Note
Please allow 15-20 minutes for initial replication to complete. If using your own systems with larger disks, initial replication time may be longer. The initial replication is complete when the compute.googleapis.com/disk/async_replication/time_since_last_replication
metric is available in Cloud Monitoring.
# --- From the Service Project for Production ---
# Open Cloud Monitoring > Metrics explorer > Click on "< > MQL" on the top right > Paste the following MQL
# If nothing loads it means that replication has not taken place yet.
# You can enable auto-refresh by clicking the button right next to "SAVE CHART"

fetch gce_disk
| metric
    'compute.googleapis.com/disk/async_replication/time_since_last_replication'
| group_by 1m,
    [value_time_since_last_replication_mean:
       mean(value.time_since_last_replication)]
| every 1m
| group_by [],
    [value_time_since_last_replication_mean_aggregate:
       aggregate(value_time_since_last_replication_mean)]
In the case of future DR events, you would follow the steps in the DR Failover section, with the following exceptions:
- Navigate to the /failback folder (instead of /setup) and rename restage-dr-async-rep.tf to restage-dr-async-rep.tf.failback (instead of renaming prod-async-rep.tf to prod-async-rep.tf.dr)
- Rename stage-failback-async-boot-disks.tf.dr to stage-failback-async-boot-disks.tf and stage-failback-async-rep.tf.dr to stage-failback-async-rep.tf
And similarly, to failback, you would follow the steps in the Production Failback section, with the following exceptions:
- Rename restage-dr-async-boot-disks.tf.failback to restage-dr-async-boot-disks.tf and restage-dr-async-rep.tf.failback to restage-dr-async-rep.tf
To clean up the demo environment, destroy the Terraform-managed resources in each folder:
- Navigate to the /failback folder and run terraform destroy
- Navigate to the /dr folder and run terraform destroy
- Navigate to the /setup folder and run terraform destroy
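If you want to script the teardown, the three destroys can be chained from the repository root; a sketch that assumes the folder layout above and uses -auto-approve to skip the confirmation prompts:

# Destroy in the order listed above: failback, then dr, then setup
(cd failback && terraform destroy -auto-approve)
(cd dr && terraform destroy -auto-approve)
(cd setup && terraform destroy -auto-approve)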