This blog summarizes the general principles of a disaster recovery architecture and strategy for protecting VMware-based workloads that are hosted on-premises, in a data center, or in a VMware-based cloud solution, using Oracle Cloud VMware Solution (OCVS) as the recovery site.
OCVS general overview
Oracle Cloud VMware Solution (OCVS) is a native OCI service that provisions a VMware SDDC (Software Defined Data Center) in a customer’s tenancy. The customer provides all the required networking infrastructure such as VCN, SDDC, and Workload CIDRs to successfully deploy the OCVS SDDC. As a result, OCVS creates compute instances on behalf of the customer, connects VNICs to Subnets & VLANs, and installs VMware components to create a VMware SDDC.
OCVS currently uses the BM.DenseIO2.52 (Intel) and BM.DenseIO.E4.128 (AMD) compute shapes to create a VMware SDDC and offers the option to deploy the SDDC with vSphere versions 6.5, 6.7 & 7.0.
NOTE: General support for vSphere versions 6.5 & 6.7 will end on October 15, 2022.
The OCVS SDDC follows the VMware Cloud Provider Stack (VCPS) to maintain the versions of the SDDC components. The OCVS SDDC offers the following software products from VMware:
- VMware vSphere
- VMware vCenter Server Standard
- VMware Hypervisor (ESXi Enterprise)
- VMware vSAN Enterprise
- VMware NSX-T Data Center Enterprise Plus
- VMware HCX Advanced
HCX Enterprise is optional and is available as a monthly subscription.
Disaster Recovery Scenarios
This section provides a high-level overview of the disaster recovery solutions that can be implemented with the use of Oracle Cloud VMware Solution (OCVS).
- VMware – Remote to Cloud
- VMware – Cloud to Cloud
- Non-VMware – Remote to Cloud
- VMware – OCVS to OCVS
NOTE: For VMware-based workloads, this blog post focuses mainly on VMware vSphere Replication with VMware Site Recovery Manager (SRM). For non-VMware workloads (non-VMware hypervisors and bare metal servers), it focuses on Rackware Disaster Recovery.
VMware HCX is also an option for small disaster recovery deployments that do not require a high level of orchestration.
In the end, it is up to the customer, partner, or managed services provider to choose the disaster recovery tool. The only requirement is that the tooling is compatible with and supported by VMware, so that VMs can be replicated successfully to Oracle Cloud VMware Solution (OCVS).
VMware – Remote to Cloud
For VMware workloads that are running on-premises, in a data center or in a remote location, VMware Site Recovery Manager with VMware vSphere Replication or any other VMware supported disaster recovery solution can be leveraged to replicate and protect workloads to Oracle Cloud VMware Solution (OCVS).
Figure 1 – VMware – Remote to Cloud
VMware – Cloud to Cloud
For VMware workloads that are running on a VMware based public cloud solution, VMware Site Recovery Manager with VMware vSphere Replication or any other VMware supported disaster recovery solution can be leveraged to replicate and protect workloads to Oracle Cloud VMware Solution (OCVS).
Figure 2 – VMware – Cloud to Cloud
Non-VMware – Remote to Cloud
For non-VMware workloads, for example VMs running on Xen Server, Hyper-V, or KVM, and especially bare metal servers, Rackware is a disaster recovery solution that can be leveraged to replicate non-VMware VMs and bare metal instances to Oracle Cloud VMware Solution (OCVS).
Figure 3 – Non-VMware – Remote to Cloud
VMware – OCVS to OCVS
To protect VMware-based workloads running on Oracle Cloud VMware Solution (OCVS) against an outage of an entire OCI Region, the OCI Backbone can be leveraged to replicate the VMs to an OCVS disaster recovery SDDC in another OCI Region.
Figure 4 – VMware – OCVS to OCVS
Disaster Recovery Solutions
There are many disaster recovery solutions on the market that protect mission-critical workloads. Besides the solutions from VMware (VMware vSphere Replication with VMware Site Recovery Manager) and Rackware for VMware and non-VMware based workloads, OCVS supports any VMware-supported disaster recovery solution for protecting workloads from any location to OCVS. A few of them are mentioned below in the 3rd Party Solutions section.
VMware Solutions
3rd Party Solutions
Considerations
When designing a disaster recovery solution, many factors that influence the solution must be taken into consideration to ensure that the critical workloads can be replicated and recovered successfully.
- Identify workloads that require disaster recovery protection
- RPO and RTO of workloads that require disaster recovery protection
- Network Bandwidth requirements to replicate workloads
- Network Failover
- OCVS disaster recovery sizing & scaling
- Oracle Databases
Identify workloads that require disaster recovery protection
Identifying the workloads that require disaster recovery protection is the first and most important task, because not all workloads need it; test and staging systems, for example, often do not. Disaster recovery is always cost intensive, so a detailed inventory of the workloads to protect saves money: the fewer vCPU, vMEM, and storage resources you have to cover, the lower the sizing, scaling, and network requirements of the disaster recovery solution. Also keep in mind that during a disruptive event it may not be necessary to bring all systems back online. In scenarios like that, usually only mission- and business-critical systems are recovered to keep the core business running.
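As a rough illustration of such an inventory, the following Python sketch sums the resources of only those workloads that are flagged for protection. The `Workload` record, the `protect` flag, and all sample entries are hypothetical placeholders and not data from a real environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    tier: str          # e.g. "production", "test", "staging"
    vcpus: int
    vmem_gb: int
    storage_tb: float
    protect: bool      # True if the workload needs disaster recovery protection


def protected_totals(inventory):
    """Sum vCPU, vMEM, and storage over the workloads marked for protection."""
    selected = [w for w in inventory if w.protect]
    return {
        "vms": len(selected),
        "vcpus": sum(w.vcpus for w in selected),
        "vmem_gb": sum(w.vmem_gb for w in selected),
        "storage_tb": sum(w.storage_tb for w in selected),
    }


# Hypothetical sample inventory, for illustration only.
inventory = [
    Workload("erp-app-01", "production", 8, 64, 2.0, True),
    Workload("erp-db-01", "production", 16, 128, 6.0, True),
    Workload("erp-test-01", "test", 4, 32, 1.0, False),
    Workload("web-stage-01", "staging", 2, 16, 0.5, False),
]

totals = protected_totals(inventory)
print(totals)
```

In practice the inventory would come from an export of the source environment, and the totals feed directly into the sizing and bandwidth considerations discussed later.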
RPO and RTO of disaster recovery workloads
The RPO and RTO requirements of the workloads to protect have an impact on the network bandwidth, because they define how often and within what time frame the data must be replicated to the recovery site to meet the data restoration point and the maximum time to recover from a disruptive event.
Figure 5 – RPO & RTO
Recovery point objective (RPO) is defined as the maximum amount of data – as measured by time – that can be lost after a recovery from a disaster, failure, or comparable event before data loss will exceed what is acceptable to an organization.
Recovery time objective (RTO) is the amount of real time a business has to restore its processes to an acceptable service level after a disaster in order to avoid intolerable consequences associated with the disruption.
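To make the two definitions concrete, here is a minimal Python sketch that measures the achieved RPO and RTO of a single event against their targets. All timestamps and target values are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replication: datetime, disaster: datetime) -> timedelta:
    """Data loss window: time between the last completed replication and the event."""
    return disaster - last_replication


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the event and the service being available again."""
    return service_restored - disaster


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    """An objective is met when the achieved value does not exceed the target."""
    return achieved <= objective


# Hypothetical event timeline.
last_replication = datetime(2022, 6, 1, 9, 50)
disaster = datetime(2022, 6, 1, 10, 0)
service_restored = datetime(2022, 6, 1, 13, 30)

rpo = achieved_rpo(last_replication, disaster)
rto = achieved_rto(disaster, service_restored)

# Hypothetical targets: RPO of 15 minutes, RTO of 4 hours.
rpo_ok = meets_objective(rpo, timedelta(minutes=15))
rto_ok = meets_objective(rto, timedelta(hours=4))
print(rpo, rpo_ok, rto, rto_ok)
```

The same comparison can be applied to the results of each disaster recovery test to verify that the agreed objectives are still being met.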
Network Bandwidth requirements
As already mentioned in the RPO and RTO section, network bandwidth is one of the major topics when it comes to disaster recovery solutions as the bandwidth defines how much data can be replicated in a given amount of time.
VMware vSphere Replication Calculator
This tool from VMware can help you calculate the bandwidth required to replicate your workloads to OCVS based on the following parameters:
- Total number of VMs to be replicated
- Average number of virtual disks per VM
- Average size of virtual disks (GB)
- Average capacity utilization of virtual disks (percent)
- Replication compression enabled
- Average daily data change rate (percent)
- Largest data change burst (percent)
- Average recovery point objective (RPO) in minutes
VMware vSphere Replication – Calculator
VMware vSphere Replication can be configured to compress the data that it transfers through the network. Compressing the replication data that is transferred through the network saves network bandwidth and might help reduce the amount of buffer memory used on the vSphere Replication server. However, compressing and decompressing data requires more CPU resources on both the source site and the server that manages the target datastore.
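To show how these parameters interact, the following Python sketch gives a simplified back-of-the-envelope estimate. It is not the formula used by the VMware calculator. It assumes that the daily change is spread evenly over 24 hours, that the largest change burst must be transferred within one RPO window, and that compression reduces the transferred data by a fixed `compression_ratio`. The function name, the ratio, and the sample values are all hypothetical.

```python
def estimate_replication_bandwidth_mbps(
    vm_count,
    disks_per_vm,
    disk_size_gb,
    utilization_pct,
    daily_change_pct,
    burst_change_pct,
    rpo_minutes,
    compression_ratio=1.0,
):
    """Rough bandwidth estimate in Mbit/s (decimal units, 1 GB = 8000 Mbit)."""
    used_gb = vm_count * disks_per_vm * disk_size_gb * utilization_pct / 100
    daily_change_gb = used_gb * daily_change_pct / 100
    burst_gb = used_gb * burst_change_pct / 100

    # Sustained rate: the daily change spread evenly over 24 hours.
    sustained_mbps = daily_change_gb * 8000 / 86_400
    # Burst rate: the largest burst must fit into a single RPO window.
    burst_mbps = burst_gb * 8000 / (rpo_minutes * 60)

    # The link must cover the larger of the two; compression shrinks the payload.
    return max(sustained_mbps, burst_mbps) / compression_ratio


# Hypothetical input values, for illustration only.
uncompressed = estimate_replication_bandwidth_mbps(
    vm_count=100, disks_per_vm=2, disk_size_gb=100, utilization_pct=50,
    daily_change_pct=5, burst_change_pct=1, rpo_minutes=60,
)
compressed = estimate_replication_bandwidth_mbps(
    vm_count=100, disks_per_vm=2, disk_size_gb=100, utilization_pct=50,
    daily_change_pct=5, burst_change_pct=1, rpo_minutes=60,
    compression_ratio=2.0,
)
print(round(uncompressed, 1), round(compressed, 1))
```

In this simplified model a shorter RPO raises the burst rate, which is why tight RPO targets tend to drive the bandwidth requirement. The actual sizing should be done with the VMware calculator and validated with measurements from the source environment.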
Network Failover
One of the most essential parts of disaster recovery is the network failover that takes place when a disruptive event occurs at the primary site and the systems cannot be brought back online there within a given amount of time, for example due to a data center outage or a network connectivity problem.
To protect mission-critical systems, a solid network concept is key. It must ensure that end-users and administrators at headquarters, in remote locations, or working from home have access to the workloads right after the restoration process and the network failover to the disaster recovery site.
Network Interconnectivity
- Redundant FastConnect or IPSec VPN connection from Customer Locations to Customer Data Center with failover detection
- Redundant FastConnect or IPSec VPN connection from Customer Locations to Oracle Cloud Infrastructure with failover detection
- Redundant FastConnect or IPSec VPN connection from Customer Data Center to Oracle Cloud Infrastructure with failover detection
Figure 6 – Network Failover
Disaster Recovery Restoration Plan
The disaster recovery restoration plan is executed when a disaster strikes and the primary site cannot be brought back online within a given amount of time.
In this case, VMware SRM starts the restoration plan, powers on the VMs, and reconfigures the networking and DNS of the VMs to work within the disaster recovery environment. This process must be configured and tested several times a year as part of the disaster recovery test plans to verify that it executes successfully when a disaster strikes.
OCVS disaster recovery sizing & scaling
Right-sizing your Oracle Cloud VMware Solution (OCVS) deployment is another essential part of the disaster recovery strategy. It benefits from clearly defined requirements for the number of VMs, their RPO and RTO, and the priority in which VMs start up during a disruptive event.
For more detailed information about vSAN sizing and scaling, please visit the following blog post: Oracle Cloud VMware Solution – vSAN Sizing & Scaling
Example
A customer needs to replicate all VMs in their environment, with the following resource usage, to Oracle Cloud VMware Solution (OCVS):
- VMs: 400 VMs
- vCPUs total: 500 vCPUs
- vMEM total: 7000 GB
- Storage total: 160 TB
Production, Test and Development VMs will be replicated to OCVS, however, in a disaster recovery event the customer only requires the Production VMs to be powered on at the disaster recovery site. These VMs only require 250 vCPUs, 4500 GB vMEM and 110 TB of Storage.
For this scenario, the base OCVS deployment with a 3-node cluster, as shown in the diagram, is fully sufficient because the disaster recovery workloads only require a subset of the available resources. Because the storage requirement for all VMs is higher than the 129 TB that the OCVS vSAN cluster can offer, an OCI Block Volume was added as an additional VMFS datastore to cover the storage requirements for all VMs. This eliminates the need to provision another OCVS ESXi host just to extend storage capacity.
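The storage arithmetic of this example can be sketched in a few lines of Python. The capacity and requirement figures are taken from the example; the function name is a hypothetical helper.

```python
VSAN_CAPACITY_TB = 129        # vSAN capacity of the 3-node cluster in the example
ALL_VMS_STORAGE_TB = 160      # storage of all replicated VMs
PRODUCTION_STORAGE_TB = 110   # storage of the VMs powered on during a disaster


def block_volume_needed_tb(required_tb, vsan_tb):
    """Storage that does not fit on vSAN and must be covered by OCI Block Volumes."""
    return max(0, required_tb - vsan_tb)


extra_for_all_vms = block_volume_needed_tb(ALL_VMS_STORAGE_TB, VSAN_CAPACITY_TB)
extra_for_production = block_volume_needed_tb(PRODUCTION_STORAGE_TB, VSAN_CAPACITY_TB)

print(f"Block Volume capacity for all replicas: {extra_for_all_vms} TB")
print(f"Block Volume capacity for production only: {extra_for_production} TB")
```

Replicating all VMs leaves a gap of 31 TB that the Block Volume datastore has to cover, while the production VMs alone would fit on vSAN. A real sizing exercise would also need to account for vSAN storage policy overhead and free-space reserves, which this sketch ignores.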
Figure 7 – Sizing & Scaling
If the vCPU and vMEM resources become insufficient in the future due to growth of the source environment, one option is to reserve OCVS ESXi instances so that the environment can be scaled on demand in a disaster recovery scenario, because the ongoing replication of the VMs might not itself require a vCPU and vMEM expansion through another OCVS ESXi host. OCI Block Volumes can also be added for future storage expansion without the need to add OCVS ESXi hosts.
Oracle Databases
Disaster recovery concepts for Oracle Databases from a customer’s VMware environment to Oracle Cloud VMware Solution (OCVS) bring some challenges, such as Oracle Database licensing on OCVS, separation of OCVS clusters for Oracle Databases, database management overhead, maintenance operations, and disaster recovery tests.
The best approach to overcome these challenges is to replicate all VMs to OCVS except the Oracle Database VMs. The Oracle Databases can be replicated via Oracle Data Guard or GoldenGate to native OCI Database services like DBCS, ExaCS or Autonomous Database.
This eliminates the requirement for a separate Oracle cluster within the SDDC, reduces licensing costs, and brings many operational advantages.
Figure 8 – Oracle Databases
Conclusion
This blog summarized the general principles of a disaster recovery architecture and strategy for protecting VMware-based workloads that are hosted on-premises, in a data center, or in a VMware-based cloud solution, using Oracle Cloud VMware Solution (OCVS) as the recovery site.
If you want to read more on this topic, the following links give you more insight about VMware on Oracle Cloud Infrastructure: