Oracle Cloud VMware Solution – Disaster Recovery

This blog post summarizes the general principles of a disaster recovery architecture and strategy that uses Oracle Cloud VMware Solution (OCVS) as the recovery site. The goal is to protect VMware based workloads that are hosted on-premises, in a data center, or in a VMware based cloud solution in case a disaster strikes these environments.

OCVS general overview

Oracle Cloud VMware Solution (OCVS) is a native OCI service that provisions a VMware SDDC (Software Defined Data Center) in a customer’s tenancy. The customer provides all the required networking infrastructure such as VCN, SDDC, and Workload CIDRs to successfully deploy the OCVS SDDC. As a result, OCVS creates compute instances on behalf of the customer, connects VNICs to Subnets & VLANs, and installs VMware components to create a VMware SDDC.

OCVS currently uses the BM.DenseIO2.52 (Intel) and BM.DenseIO.E4.128 (AMD) compute shapes to create a VMware SDDC and offers the option to deploy the SDDC with vSphere version 6.5, 6.7, or 7.0.

NOTE: General support for vSphere versions 6.5 and 6.7 ends on October 15, 2022. The OCVS SDDC follows the VMware Cloud Provider Stack (VCPS) to maintain the versions of the SDDC components. The OCVS SDDC includes the following VMware software products:

  • VMware vSphere
    • VMware vCenter Server Standard
    • VMware Hypervisor (ESXi Enterprise)
  • VMware vSAN Enterprise
  • VMware NSX-T Data Center Enterprise Plus
  • VMware HCX Advanced 

HCX Enterprise is optional and is available as a monthly subscription.

Disaster Recovery Scenarios

This section provides a high-level overview of the disaster recovery solutions that can be implemented with the use of Oracle Cloud VMware Solution (OCVS).

  • VMware – Remote to Cloud
  • VMware – Cloud to Cloud
  • Non-VMware – Remote to Cloud
  • VMware – OCVS to OCVS

NOTE: For VMware based workloads, this blog post mainly focuses on VMware vSphere Replication with VMware Site Recovery Manager (SRM). For non-VMware based workloads (non-VMware hypervisors and bare metal servers), it focuses on Rackware Disaster Recovery.

VMware HCX is another option that can be leveraged for small disaster recovery deployments that do not require a high level of orchestration.

In the end, it is up to the customer, partner, or managed services provider to pick the disaster recovery tool of choice. The only requirement is that the tooling is compatible with and supported by VMware, to guarantee a successful replication of VMs to Oracle Cloud VMware Solution (OCVS).

VMware – Remote to Cloud

For VMware workloads that are running on-premises, in a data center or in a remote location, VMware Site Recovery Manager with VMware vSphere Replication or any other VMware supported disaster recovery solution can be leveraged to replicate and protect workloads to Oracle Cloud VMware Solution (OCVS).

Figure 1 – VMware – Remote to Cloud

VMware – Cloud to Cloud

For VMware workloads that are running on a VMware based public cloud solution, VMware Site Recovery Manager with VMware vSphere Replication or any other VMware supported disaster recovery solution can be leveraged to replicate and protect workloads to Oracle Cloud VMware Solution (OCVS).

Figure 2 – VMware – Cloud to Cloud

Non-VMware – Remote to Cloud

For non-VMware workloads, for example VMs running on XenServer, Hyper-V, or KVM, and especially bare metal servers, Rackware is a disaster recovery solution that can be leveraged to replicate non-VMware VMs and bare metal instances to Oracle Cloud VMware Solution (OCVS).

Figure 3 – Non-VMware – Remote to Cloud

VMware – OCVS to OCVS

To protect VMware based workloads running on Oracle Cloud VMware Solution (OCVS) against an outage of an entire OCI Region, the OCI Backbone can be leveraged to replicate the VMs to an OCVS disaster recovery SDDC in another OCI Region.

Figure 4 – VMware – OCVS to OCVS

Disaster Recovery Solutions

There are many disaster recovery solutions on the market that protect mission critical workloads. Besides VMware vSphere Replication with VMware Site Recovery Manager for VMware based workloads and Rackware for non-VMware based workloads, OCVS supports any VMware supported disaster recovery solution to protect workloads from any location to OCVS. A few of them are mentioned in the 3rd Party Solutions section.

VMware Solutions

3rd Party Solutions

Considerations

When designing a disaster recovery solution, many factors that influence the solution must be taken into consideration to guarantee a successful replication and recovery process for the critical workloads.

  • Identify workloads that require disaster recovery protection
  • RPO and RTO of workloads that require disaster recovery protection
  • Network Bandwidth requirements to replicate workloads
  • Network Failover
  • OCVS disaster recovery sizing & scaling
  • Oracle Databases

Identify workloads that require disaster recovery protection

Identifying the workloads that require disaster recovery protection is the first and most important task, because probably not all workloads require it (test and staging systems, for example). Disaster recovery is always cost intensive, so every vCPU, GB of vMEM, and TB of storage that you can exclude by creating detailed inventory lists of the workloads to protect saves money when it comes to sizing, scaling, and network requirements of the disaster recovery solution. Also keep in mind that during a disruptive event it may not be required to bring all systems back online. In scenarios like that, mostly only mission and business critical systems are recovered to keep the core business running.
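As an illustration of such an inventory list, the following minimal Python sketch totals the resources of the workloads selected for protection. The `Workload` fields, the tier labels, and the sample VMs are hypothetical and only serve as an example; in practice the data would come from an export of your own environment.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    tier: str          # hypothetical labels: "production", "test", "development"
    vcpus: int
    vmem_gb: int
    storage_tb: float


def dr_footprint(inventory, protected_tiers):
    """Sum the resources of all workloads whose tier is in protected_tiers."""
    selected = [w for w in inventory if w.tier in protected_tiers]
    return {
        "vms": len(selected),
        "vcpus": sum(w.vcpus for w in selected),
        "vmem_gb": sum(w.vmem_gb for w in selected),
        "storage_tb": sum(w.storage_tb for w in selected),
    }


# Made-up sample inventory, for illustration only.
inventory = [
    Workload("erp-db", "production", 16, 128, 4.0),
    Workload("erp-app", "production", 8, 64, 1.0),
    Workload("erp-test", "test", 8, 64, 1.5),
    Workload("build-01", "development", 4, 16, 0.5),
]

production_only = dr_footprint(inventory, {"production"})
everything = dr_footprint(inventory, {"production", "test", "development"})
print(production_only)
print(everything)
```

Comparing the two results shows how much capacity is saved by protecting only the production tier instead of the whole environment.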

RPO and RTO of disaster recovery workloads

The RPO and RTO requirements of the workloads to protect have an impact on the network bandwidth. These factors define how often and within what time frame data is replicated to the recovery site in order to meet the acceptable data loss window and the maximum time to recover from a disruptive event.

Figure 5 – RPO & RTO

Recovery point objective (RPO) is defined as the maximum amount of data – as measured by time – that can be lost after a recovery from a disaster, failure, or comparable event before data loss will exceed what is acceptable to an organization.

Recovery time objective (RTO) is the amount of real time a business has to restore its processes to an acceptable service level after a disaster in order to avoid intolerable consequences associated with the disruption.

 

Network Bandwidth requirements

As already mentioned in the RPO and RTO section, network bandwidth is one of the major topics when it comes to disaster recovery solutions as the bandwidth defines how much data can be replicated in a given amount of time.

VMware vSphere Replication Calculator

This tool from VMware can help you calculate the bandwidth required to replicate your workloads to OCVS based on the following parameters:

  • Total number of VMs to be replicated
  • Average number of virtual disks per VM
  • Average size of virtual disks (GB)
  • Average capacity utilization of virtual disks (percent)
  • Replication compression enabled
  • Average daily data change rate (percent)
  • Largest data change burst (percent)
  • Average recovery point objective (RPO) in minutes

VMware vSphere Replication – Calculator

VMware vSphere Replication can be configured to compress the data that it transfers through the network. Compressing the replication data that is transferred through the network saves network bandwidth and might help reduce the amount of buffer memory used on the vSphere Replication server. However, compressing and decompressing data requires more CPU resources on both the source site and the server that manages the target datastore. 
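To get a feeling for how the calculator parameters listed above interact, the following Python sketch runs a simplified back-of-the-envelope estimate. It is not the formula used by the VMware calculator. It assumes that changes are spread evenly over 24 hours, that the largest burst must be shipped within one RPO window, and that compression can be expressed as a single ratio. The function names and the sample values are hypothetical.

```python
SECONDS_PER_DAY = 86_400
MBIT_PER_GB = 8 * 1000  # decimal units: 1 GB = 8000 Mbit


def used_capacity_gb(vm_count, disks_per_vm, disk_size_gb, utilization):
    """Total used capacity across all replicated virtual disks."""
    return vm_count * disks_per_vm * disk_size_gb * utilization


def average_bandwidth_mbps(used_gb, daily_change_rate, compression_ratio=1.0):
    """Mbit/s needed to ship one day's changed data within 24 hours."""
    changed_gb = used_gb * daily_change_rate / compression_ratio
    return changed_gb * MBIT_PER_GB / SECONDS_PER_DAY


def burst_bandwidth_mbps(used_gb, burst_rate, rpo_minutes, compression_ratio=1.0):
    """Mbit/s needed to ship the largest change burst within one RPO window."""
    burst_gb = used_gb * burst_rate / compression_ratio
    return burst_gb * MBIT_PER_GB / (rpo_minutes * 60)


# Made-up sample values: 400 VMs, 2 disks each, 200 GB per disk, 50 % utilized.
used = used_capacity_gb(400, 2, 200, 0.5)
avg = average_bandwidth_mbps(used, daily_change_rate=0.05)
burst = burst_bandwidth_mbps(used, burst_rate=0.01, rpo_minutes=60)
print(f"used capacity: {used:.0f} GB")
print(f"average: {avg:.1f} Mbit/s, burst: {burst:.1f} Mbit/s")
```

The sketch shows why a short RPO combined with a large change burst usually dominates the bandwidth requirement, and why a compression ratio above 1.0 lowers both numbers at the cost of additional CPU.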

Network Failover

One of the most essential parts of a disaster recovery solution is the network failover, which is needed when a disruptive event occurs at the primary site and the systems cannot be brought back online there within a given amount of time due to a data center outage or network connectivity problem.

To protect mission critical systems, a solid network concept is key. End-users and administrators located at the headquarters, in remote locations, or in home offices must have access to the workloads in a disaster recovery scenario right after the restoration process and the network failover to the disaster recovery site.

Network Interconnectivity

  • Redundant FastConnect or IPSec VPN connection from Customer Locations to Customer Data Center with failover detection
  • Redundant FastConnect or IPSec VPN connection from Customer Locations to Oracle Cloud Infrastructure with failover detection
  • Redundant FastConnect or IPSec VPN connection from Customer Data Center to Oracle Cloud Infrastructure with failover detection

Figure 6 – Network Failover

Disaster Recovery Restoration Plan

The disaster recovery restoration plan is executed when a disaster strikes and it is not possible to bring the primary site back online within a given amount of time.

In this case, VMware SRM starts the restoration plan, starts up the VMs, and reconfigures the networking and DNS of the VMs to work within the disaster recovery environment. This process must be configured and tested several times a year as part of the disaster recovery test plans to verify a successful execution when a disaster strikes.

OCVS disaster recovery sizing & scaling

Right-sizing the Oracle Cloud VMware Solution (OCVS) deployment is another essential part of the disaster recovery strategy. Sizing benefits from a well-defined plan that states the number of VMs, the RPO and RTO requirements, and the priority in which VMs start up during a disruptive event.

For more detailed information about vSAN sizing and scaling, please visit the following blog post: Oracle Cloud VMware Solution – vSAN Sizing & Scaling

Example

A customer has the requirement to replicate all VMs in their environment, with the following resource usage, to Oracle Cloud VMware Solution (OCVS):

  • VMs: 400
  • vCPUs total: 500
  • vMEM total: 7000 GB
  • Storage total: 160 TB

Production, Test and Development VMs will be replicated to OCVS, however, in a disaster recovery event the customer only requires the Production VMs to be powered on at the disaster recovery site. These VMs only require 250 vCPUs, 4500 GB vMEM and 110 TB of Storage.

For this scenario, the base OCVS deployment with a 3-node cluster, as shown in the diagram, is fully sufficient because the disaster recovery workloads only require a subset of the available resources. Because the storage requirement for all VMs is higher than the 129 TB that the OCVS vSAN cluster can offer, an OCI Block Volume was added as an additional VMFS datastore to cover the storage requirements for all VMs. This eliminates the need to provision another OCVS ESXi host just to extend storage capacity.

Figure 7 – Sizing & Scaling
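The storage part of this example can be checked with a few lines of Python. This is only a sketch of the arithmetic: it uses the 129 TB vSAN capacity and the storage totals stated above, ignores any additional headroom you may want to keep free, and the function name is hypothetical.

```python
VSAN_CAPACITY_TB = 129  # vSAN capacity of the base cluster, as stated in the example


def block_volume_shortfall_tb(required_tb, vsan_capacity_tb=VSAN_CAPACITY_TB):
    """Capacity that must come from OCI Block Volumes once vSAN is exhausted."""
    return max(0, required_tb - vsan_capacity_tb)


all_vms_tb = 160      # storage for all replicated VMs
production_tb = 110   # storage for the production VMs that are powered on

print(block_volume_shortfall_tb(all_vms_tb))     # 31
print(block_volume_shortfall_tb(production_tb))  # 0
```

With these numbers, at least 31 TB has to be provided through OCI Block Volumes to hold all replicated VMs, while the production VMs alone would fit on the vSAN datastore.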

If the vCPU and vMEM resources are no longer sufficient in the future due to growth of the source environment, a possible solution is to reserve OCVS ESXi instances so that the environment can be scaled on demand in a disaster recovery scenario. Replication of the VMs alone might not require a vCPU and vMEM expansion through another OCVS ESXi host. OCI Block Volumes can also be added for future storage expansion without the need to add OCVS ESXi hosts.

Oracle Databases

Disaster recovery concepts for Oracle Databases from a customer’s VMware environment to Oracle Cloud VMware Solution (OCVS) bring some challenges, such as Oracle Database licensing on OCVS, separation of OCVS clusters for Oracle Databases, database management overhead, maintenance operations, and disaster recovery tests.

The best approach to overcome these challenges is to replicate all VMs to OCVS except the Oracle Database VMs. The Oracle Databases can be replicated via Oracle Data Guard or Oracle GoldenGate to native OCI Database services such as DBCS, ExaCS, or Autonomous Database.

This eliminates the requirement for a separate Oracle cluster within the SDDC, reduces licensing costs, and brings many operational advantages.

Figure 8 – Oracle Databases

Conclusion

This blog post summarized the general principles of a disaster recovery architecture and strategy that uses Oracle Cloud VMware Solution (OCVS) as the recovery site to protect VMware based workloads hosted on-premises, in a data center, or in a VMware based cloud solution in case a disaster strikes these environments.

If you want to read more on this topic, the following links give you more insight about VMware on Oracle Cloud Infrastructure:
