vSphere High Availability (HA)
vSphere High Availability (HA) is a cluster service that provides high availability for the virtual machines running in a cluster. You can enable vSphere HA on a cluster to provide rapid recovery from outages and cost-effective high availability for applications running in virtual machines. vSphere HA provides application availability in the following ways:
It protects against server failure by restarting the virtual machines on other hosts in the cluster when a host failure is detected, as illustrated in Figure 4-2.
FIGURE 4-2 vSphere HA Host Failover
It protects against application failure by continuously monitoring a virtual machine and resetting it if a failure is detected.
It protects against datastore accessibility failures by restarting affected virtual machines on other hosts that still have access to their datastores.
It protects virtual machines against network isolation by restarting them if their host becomes isolated on the management or vSAN network. This protection is provided even if the network has become partitioned.
Benefits of vSphere HA over traditional failover solutions include the following:
Minimal configuration
Reduced hardware cost
Increased application availability
DRS and vMotion integration
vSphere HA can detect the following types of host issues:
Failure: A host stops functioning.
Isolation: A host cannot communicate with any other hosts in the cluster.
Partition: A host loses network connectivity with the primary host.
When you enable vSphere HA on a cluster, the cluster elects one of the hosts to act as the primary host. The primary host communicates with vCenter Server to report cluster health. It monitors the state of all protected virtual machines and secondary hosts. It uses network and datastore heartbeating to detect failed hosts, isolation, and network partitions. vSphere HA takes appropriate actions to respond to host failures, host isolation, and network partitions. For host failures, the typical reaction is to restart the failed virtual machines on surviving hosts in the cluster. If a network partition occurs, a primary host is elected in each partition. If a specific host is isolated, vSphere HA takes the predefined host isolation action, which may be to shut down or power off the host’s virtual machines. If the primary host fails, the surviving hosts elect a new primary host. You can configure vSphere HA to monitor and respond to virtual machine failures, such as guest OS failures, by monitoring heartbeats from VMware Tools.
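For reference, the following is a minimal sketch of enabling vSphere HA on an existing cluster with pyVmomi (the VMware Python SDK). The vCenter address, credentials, and cluster name are placeholders, and error handling is omitted; other HA settings are left unchanged.

```python
# Minimal sketch: enable vSphere HA and host monitoring on a cluster with pyVmomi.
# The vCenter address, credentials, and cluster name below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="********",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Locate the target cluster by name.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster01")
view.DestroyView()

# Turn on vSphere HA and host monitoring; other settings keep their current values.
das_config = vim.cluster.DasConfigInfo(enabled=True, hostMonitoring="enabled")
spec = vim.cluster.ConfigSpecEx(dasConfig=das_config)
WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))

Disconnect(si)
```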
vSphere HA Requirements
When planning a vSphere HA cluster, you need to address the following requirements:
The cluster must contain at least two hosts that are licensed for vSphere HA.
Hosts must use static IP addresses or guarantee that IP addresses assigned by DHCP persist across host reboots.
All hosts must have at least one management network in common (preferably two).
To ensure that virtual machines can run on any host in the cluster, all hosts must have access to the same virtual machine networks and datastores.
To use VM Monitoring, you need to install VMware Tools in each virtual machine.
IPv4 or IPv6 can be used.
vSphere HA Response to Failures
You can configure how a vSphere HA cluster should respond to different types of failures, as described in Table 4-7.
Table 4-7 vSphere HA Response to Failure Settings
Option | Description
---|---
Host Failure Response > Failure Response | If Enabled, the cluster responds to host failures by restarting virtual machines. If Disabled, host monitoring is turned off, and the cluster does not respond to host failures.
Host Failure Response > Default VM Restart Priority | You can indicate the order in which virtual machines are restarted when a host fails (higher-priority machines first).
Host Failure Response > VM Restart Priority Condition | This condition must be met before vSphere HA restarts the next priority group.
Response for Host Isolation | You can indicate the action that you want to occur if a host becomes isolated. You can choose Disabled, Shutdown and Restart VMs, or Power Off and Restart VMs.
VM Monitoring | You can indicate the sensitivity (Low, High, or Custom) with which vSphere HA responds to lost VMware Tools heartbeats.
Application Monitoring | You can indicate the sensitivity (Low, High, or Custom) with which vSphere HA responds to lost application heartbeats.
Heartbeats
The primary host and secondary hosts exchange network heartbeats every second. When the primary host stops receiving these heartbeats from a secondary host, it checks for ping responses or the presence of datastore heartbeats from the secondary host. If the primary host does not receive a response after checking for a secondary host’s network heartbeat, ping, or datastore heartbeats, it declares that the secondary host has failed. If the primary host detects datastore heartbeats for a secondary host but no network heartbeats or ping responses, it assumes that the secondary host is isolated or in a network partition.
If any host is running but no longer observes network heartbeats, it attempts to ping the set of cluster isolation addresses. If those pings also fail, the host declares itself to be isolated from the network.
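The detection logic described in this section can be summarized in a short Python sketch; the function names and inputs are illustrative only and are not part of any VMware API.

```python
# Illustrative sketch of the failure/isolation determination described above.
def classify_secondary(network_heartbeat, ping_response, datastore_heartbeat):
    """How the primary host classifies a secondary host (illustrative only)."""
    if network_heartbeat:
        return "healthy"
    if not ping_response and not datastore_heartbeat:
        return "failed"                   # restart its protected VMs elsewhere
    if datastore_heartbeat and not ping_response:
        return "isolated or partitioned"  # alive, but unreachable on the network
    return "under observation"            # responds to ping; keep checking

def host_self_check(sees_network_heartbeats, can_ping_isolation_addresses):
    """A running host declares itself isolated when it observes no network
    heartbeats and cannot ping any of the cluster isolation addresses."""
    if sees_network_heartbeats:
        return "connected"
    return "isolated" if not can_ping_isolation_addresses else "not isolated"
```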
vSphere HA Admission Control
vSphere uses admission control when you power on a virtual machine. It checks the amount of unreserved compute resources in the cluster and determines whether it can guarantee any reservation configured for the virtual machine. If so, it allows the virtual machine to power on. Otherwise, it generates an “Insufficient Resources” warning.
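As a minimal illustration of this check, the following sketch compares a virtual machine's reservations with the cluster's unreserved capacity; all names and numbers are hypothetical.

```python
# Hypothetical illustration of the basic admission-control check described above.
def can_power_on(vm_cpu_reservation_mhz, vm_mem_reservation_mb,
                 unreserved_cpu_mhz, unreserved_mem_mb):
    """Allow the power-on only if the cluster can guarantee the VM's reservations."""
    return (vm_cpu_reservation_mhz <= unreserved_cpu_mhz and
            vm_mem_reservation_mb <= unreserved_mem_mb)

if not can_power_on(2000, 4096, unreserved_cpu_mhz=1500, unreserved_mem_mb=8192):
    print("Insufficient Resources")  # the CPU reservation cannot be guaranteed
```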
vSphere HA Admission Control is a setting that you can use to specify whether virtual machines can be started if they violate availability constraints. The cluster reserves resources so that failover can occur for all running virtual machines on the specified number of hosts. When you configure vSphere HA admission control, you can set options described in Table 4-8.
Table 4-8 vSphere HA Admission Control Options
Option | Description
---|---
Host Failures Cluster Tolerates | Specifies the maximum number of host failures for which the cluster guarantees failover
Define Host Failover Capacity By set to Cluster Resource Percentage | Specifies the percentage of the cluster’s compute resources to reserve as spare capacity to support failovers
Define Host Failover Capacity By set to Slot Policy (powered-on VMs) | Specifies a slot size policy that covers all powered-on VMs
Define Host Failover Capacity By set to Dedicated Failover Hosts | Specifies the designated hosts to use for failover actions
Define Host Failover Capacity By set to Disabled | Disables admission control
Performance Degradation VMs Tolerate | Specifies the percentage of performance degradation the VMs in a cluster are allowed to tolerate during a failure
If you disable vSphere HA admission control, the cluster allows virtual machines to power on regardless of whether they violate availability constraints. After a host failure, you may then discover that vSphere HA cannot restart some virtual machines.
In vSphere 6.5, the default Admission Control setting is Cluster Resource Percentage, which reserves a percentage of the total available CPU and memory resources in the cluster. For simplicity, the percentage is calculated automatically based on the number of host failures to tolerate (FTT) that you define, and it is recalculated dynamically as hosts are added to or removed from the cluster. Another enhancement introduced in vSphere 6.5 is the Performance Degradation VMs Tolerate setting, which controls the amount of performance reduction that is tolerated after a failure. A value of 0% indicates that no performance degradation is tolerated.
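As a rough illustration of this automatic calculation, assuming hosts of equal capacity, the reserved percentage scales with the number of host failures to tolerate and is recalculated as hosts join or leave the cluster:

```python
# Rough illustration only: with hosts of equal capacity, reserving capacity for
# N host failures corresponds to roughly N / (number of hosts) of the cluster.
def reserved_percentage(host_failures_to_tolerate, host_count):
    return 100 * host_failures_to_tolerate / host_count

print(reserved_percentage(1, 4))  # 25.0 -> 25% of CPU and memory reserved
print(reserved_percentage(1, 5))  # 20.0 -> recalculated after a host is added
```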
With the Slot Policy option, vSphere HA admission control ensures that a specified number of hosts can fail while leaving sufficient resources in the cluster to accommodate the failover of the affected virtual machines. Using the Slot Policy option, when you perform certain operations, such as powering on a virtual machine, vSphere HA applies admission control in the following manner (a sketch of the calculation follows the steps):
Step 1. HA calculates the slot size, which is a logical representation of memory and CPU resources. By default, it is sized to satisfy the requirements for any powered-on virtual machine in the cluster. For example, it is sized to accommodate the virtual machine with the greatest CPU reservation and the virtual machine with the greatest memory reservation.
Step 2. HA determines how many slots each host in the cluster can hold.
Step 3. HA determines the current failover capacity of the cluster, which is the number of hosts that can fail and still leave enough slots to satisfy all the powered-on virtual machines.
Step 4. HA determines whether the current failover capacity is less than the configured failover capacity (provided by the user).
Step 5. If the current failover capacity is less than the configured failover capacity, admission control disallows the operation.
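The following sketch walks through these steps for a hypothetical three-host cluster. The reservations and host capacities are illustrative, and the real calculation also accounts for factors such as virtual machine memory overhead and the slot-size advanced options described later.

```python
# Illustrative walk-through of the Slot Policy steps above (hypothetical values).
# Step 1: slot size = largest CPU reservation x largest memory reservation.
vm_reservations = [(500, 1024), (1000, 2048), (250, 4096)]  # (MHz, MB) per powered-on VM
slot_cpu = max(cpu for cpu, _ in vm_reservations)           # 1000 MHz
slot_mem = max(mem for _, mem in vm_reservations)           # 4096 MB

# Step 2: slots per host = how many slots fit in each host's available capacity.
hosts = [(8000, 32768), (8000, 32768), (4000, 16384)]       # (MHz, MB) per host
slots_per_host = [min(cpu // slot_cpu, mem // slot_mem) for cpu, mem in hosts]

# Step 3: current failover capacity = number of (largest) hosts that can fail
# while the remaining hosts still provide a slot for every powered-on VM.
needed_slots = len(vm_reservations)
ordered = sorted(slots_per_host, reverse=True)
current_capacity = 0
for failed in range(1, len(hosts)):
    if sum(ordered[failed:]) >= needed_slots:
        current_capacity = failed
    else:
        break

# Steps 4-5: disallow the operation if the current failover capacity drops
# below the configured failover capacity.
configured_capacity = 1
print("power-on allowed" if current_capacity >= configured_capacity
      else "power-on disallowed")
```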
If a cluster has a few virtual machines with much larger reservations than the others, they distort the slot size calculation. To remediate this, you can specify an upper bound for the CPU or memory component of the slot size by using advanced options. You can also set a specific slot size (CPU size and memory size). The next section describes the advanced options that affect the slot size.
vSphere HA Advanced Options
You can set vSphere HA advanced options by using the vSphere Client or in the fdm.cfg file on the hosts. Table 4-9 provides some of the advanced vSphere HA options.
Table 4-9 Advanced vSphere HA Options
Option | Description
---|---
das.isolationaddressX | Provides the addresses to use to test for host isolation when no heartbeats are received from other hosts in the cluster. If this option is not specified (the default), the management network default gateway is used to test for isolation. To specify multiple addresses, you can set das.isolationaddressX, where X is a number between 0 and 9.
das.usedefaultisolationaddress | Specifies whether to use the default gateway IP address for isolation tests.
das.isolationshutdowntimeout | For scenarios where the host’s isolation response is to shut down, specifies the period of time that the virtual machine is permitted to shut down before the system powers it off.
das.slotmeminmb | Defines the maximum bound on the memory slot size.
das.slotcpuinmhz | Defines the maximum bound on the CPU slot size.
das.vmmemoryminmb | Defines the default memory resource value assigned to a virtual machine whose memory reservation is not specified or is zero. This is used for the Host Failures Cluster Tolerates admission control policy.
das.vmcpuminmhz | Defines the default CPU resource value assigned to a virtual machine whose CPU reservation is not specified or is zero. This is used for the Host Failures Cluster Tolerates admission control policy. If no value is specified, the default of 32 MHz is used.
das.heartbeatdsperhost | Specifies the number of heartbeat datastores required per host. The default is 2; acceptable values are 2 to 5.
das.config.fdm.isolationPolicyDelaySec | Specifies the number of seconds the system delays before executing the isolation policy after determining that a host is isolated. The minimum is 30; a lower value results in a 30-second delay.
das.respectvmvmantiaffinityrules | Determines whether vSphere HA should enforce VM–VM anti-affinity rules even when DRS is not enabled.
Virtual Machine Settings
To use the Host Isolation Response Shutdown and Restart VMs setting, you must install VMware Tools on the virtual machine. If a guest OS fails to shut down in 300 seconds (or a value specified by das.isolationshutdowntimeout), the virtual machine is powered off.
You can override the cluster’s settings for Restart Priority and Isolation Response for each virtual machine. For example, you might want to prioritize virtual machines providing infrastructure services such as DNS or DHCP.
At the cluster level, you can create dependencies between groups of virtual machines. You can create VM groups, host groups, and dependency rules between the groups. In the rules, you can specify that one VM group cannot be restarted until another specific VM group has been started.
VM Component Protection (VMCP)
Virtual Machine Component Protection (VMCP) is a vSphere HA feature that can detect datastore accessibility issues and provide remediation for affected virtual machines. When a failure occurs such that a host can no longer access the storage path for a specific datastore, vSphere HA can respond by taking actions such as creating event alarms or restarting a virtual machine on other hosts. The main requirements are that vSphere HA is enabled in the cluster and that ESXi 6.0 or later is used on all hosts in the cluster.
The failures VMCP detects are permanent device loss (PDL) and all paths down (APD). PDL is an unrecoverable loss of accessibility to the storage device that cannot be fixed without powering down the virtual machines. APD is a transient accessibility loss or other issue that is recoverable.
For PDL and APD failures, you can set VMCP to either issue event alerts or to power off and restart virtual machines. For APD failures only, you can additionally control the restart policy for virtual machines by setting it to Conservative or Aggressive. With the Conservative setting, the virtual machine is powered off only if HA determines that it can be restarted on another host. With the Aggressive setting, HA powers off the virtual machine regardless of the state of other hosts.
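The difference between the two APD restart policies can be expressed as a simple decision; the following sketch is illustrative only and is not a VMware API.

```python
# Illustrative sketch of the APD restart-policy behavior described above.
def apd_power_off_vm(policy, restart_capacity_available_elsewhere):
    if policy == "Conservative":
        # Power off only when HA determines the VM can restart on another host.
        return restart_capacity_available_elsewhere
    if policy == "Aggressive":
        # Power off regardless of the state of the other hosts.
        return True
    return False  # alerts only; leave the VM running
```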
Virtual Machine and Application Monitoring
VM Monitoring restarts specific virtual machines if their VMware Tools heartbeats are not received within a specified time. Likewise, Application Monitoring can restart a virtual machine if the heartbeats from a specific application in the virtual machine are not received. If you enable these features, you can configure the monitoring settings to control the failure interval and reset period. Table 4-10 lists these settings.
Table 4-10 VM Monitoring Settings
Setting | Failure Interval | Reset Period
---|---|---
High | 30 seconds | 1 hour
Medium | 60 seconds | 24 hours
Low | 120 seconds | 7 days
The Maximum per-VM resets setting can be used to configure the maximum number of times vSphere HA attempts to restart a specific failing virtual machine within the reset period.
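The interplay of the failure interval, the reset period, and the Maximum per-VM resets setting can be sketched as follows; the function and its inputs are illustrative only.

```python
# Illustrative sketch: decide whether vSphere HA should reset a VM whose
# VMware Tools heartbeats have stopped, honoring Maximum per-VM resets.
def should_reset(seconds_since_last_heartbeat, failure_interval,
                 resets_in_current_reset_period, max_per_vm_resets):
    if seconds_since_last_heartbeat < failure_interval:
        return False  # heartbeats have not been missing long enough
    return resets_in_current_reset_period < max_per_vm_resets

# High sensitivity: 30-second failure interval, reset counter clears every hour.
print(should_reset(45, failure_interval=30,
                   resets_in_current_reset_period=2, max_per_vm_resets=3))  # True
```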
vSphere HA Best Practices
You should provide network path redundancy between cluster nodes. To do so, you can use NIC teaming for the virtual switch. You can also create a second management network connection, using a separate virtual switch.
When performing disruptive network maintenance operations on the network used by clustered ESXi hosts, you should suspend the Host Monitoring feature to ensure that vSphere HA does not falsely detect network isolation or host failures. You can reenable host monitoring after completing the work.
To keep vSphere HA agent traffic on the specified network, you should ensure that the VMkernel virtual network adapters used for HA heartbeats (enabled for management traffic) do not share the same subnet as VMkernel adapters used for vMotion and other purposes.
Use the das.isolationaddressX advanced option to add an isolation address for each management network.
Proactive HA
Proactive High Availability (Proactive HA) integrates with select hardware partners to detect degraded components and evacuate VMs from affected vSphere hosts before an incident causes a service interruption. Hardware partners offer a vCenter Server plug-in to provide the health status of the system memory, local storage, power supplies, cooling fans, and network adapters. As hardware components become degraded, Proactive HA determines which hosts are at risk and places them into either Quarantine Mode or Maintenance Mode. When a host enters Maintenance Mode, DRS evacuates its virtual machines to healthy hosts, and the host is not used to run virtual machines. When a host enters Quarantine Mode, DRS leaves the current virtual machines running on the host but avoids placing or migrating virtual machines to the host. If you prefer that Proactive HA simply make evacuation recommendations rather than automatic migrations, you can set Automation Level to Manual.
The vendor-provided health providers read sensor data in the server and provide the health state to vCenter Server. The health states are Healthy, Moderate Degradation, Severe Degradation, and Unknown.