VMware High Availability Constructs
- Isolation Response
- Split-Brain
- Isolation Detection
- Failure Detection Time
This is Chapter 5 from Duncan Epping's upcoming book.
When configuring HA, two major decisions need to be made:
- Isolation Response
- Admission Control
Both are important to how HA behaves, and both have an impact on availability. It is really important to understand these concepts and their specific caveats; without a good understanding of them it is very easy to increase downtime instead of decreasing it.
Isolation Response
One of the first decisions that will need to be made when HA is configured is the “isolation response”. The isolation response refers to the action that HA takes for its VMs when the host has lost its connection with the network. This does not necessarily mean that the whole network is down; it could just be this host's network ports or just the ports that are used by HA for the heartbeat. Even if your virtual machines have a network connection and only your “heartbeat network” is isolated, the isolation response is triggered.
Today there are three isolation responses: “Power off”, “Leave powered on” and “Shut down”. The isolation response answers the question of what a host should do when it has detected that it is isolated from the network.
Whichever of the following three options is chosen as the isolation response, the remaining, non-isolated hosts will always try to restart the virtual machines:
- Power off: When network isolation occurs, all virtual machines are powered off. It is a hard stop, or to put it bluntly, the power cable of the VMs will be pulled out!
- Shut down: When network isolation occurs, all virtual machines running on the host will be shut down using VMware Tools. If this is not successful within 5 minutes, a “power off” will be executed. This timeout value can be adjusted by setting the advanced option das.isolationShutdownTimeout. If VMware Tools is not installed, a “power off” will be initiated immediately.
- Leave powered on: When network isolation occurs on the host, the state of the virtual machines remains unchanged.
This setting can be changed in the cluster settings under virtual machine options.
Figure 1: Cluster default settings
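The three isolation responses described above can be sketched as a small decision function. This is only an illustrative model, not HA's actual implementation: the names `IsolationResponse`, `isolation_actions` and `guest_shutdown_minutes` are hypothetical, and the timeout is expressed in minutes simply because the text states the default as 5 minutes.

```python
from enum import Enum
from typing import List, Optional


class IsolationResponse(Enum):
    POWER_OFF = "Power off"
    SHUT_DOWN = "Shut down"
    LEAVE_POWERED_ON = "Leave powered on"


# Default from the text: 5 minutes before "Shut down" falls back to "Power off".
DEFAULT_SHUTDOWN_TIMEOUT_MINUTES = 5


def isolation_actions(
    response: IsolationResponse,
    tools_installed: bool,
    guest_shutdown_minutes: Optional[float] = None,
    timeout_minutes: float = DEFAULT_SHUTDOWN_TIMEOUT_MINUTES,
) -> List[str]:
    """Return the actions an isolated host takes for one VM (hypothetical model).

    guest_shutdown_minutes is how long the guest needs to shut down cleanly;
    None means the guest shutdown never completes.
    """
    if response is IsolationResponse.LEAVE_POWERED_ON:
        return []  # state of the VM remains unchanged
    if response is IsolationResponse.POWER_OFF:
        return ["power off"]  # hard stop
    # Remaining case: SHUT_DOWN
    if not tools_installed:
        return ["power off"]  # no VMware Tools: power off immediately
    if guest_shutdown_minutes is not None and guest_shutdown_minutes <= timeout_minutes:
        return ["guest shutdown"]  # clean shutdown finished in time
    return ["guest shutdown", "power off"]  # timeout reached, fall back


if __name__ == "__main__":
    print(isolation_actions(IsolationResponse.SHUT_DOWN, True, guest_shutdown_minutes=2))
    print(isolation_actions(IsolationResponse.SHUT_DOWN, True, guest_shutdown_minutes=9))
    print(isolation_actions(IsolationResponse.SHUT_DOWN, False))
```

The sketch makes the one subtle point visible: “Shut down” can end in a hard power off, either after the timeout or immediately when VMware Tools is missing.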
The default setting for the isolation response has changed multiple times over the last couple of years. Up to ESX 3.5 U2 / vCenter 2.5 U2, the default isolation response when creating a new cluster was “Power off”. This changed to “Leave powered on” as of ESX 3.5 U3 / vCenter 2.5 U3. With vSphere 4.0, however, it changed again: the default setting for newly created clusters is “Shut down”. When installing a new environment, you might want to change the default setting based on your customer’s requirements or constraints.
The question remains, which setting should you use? The obvious answer applies here: it depends. We prefer “Shut down” because we do not want to use a degraded host to run our virtual machines on, and it will shut down your virtual machines in a clean manner. Many people, however, prefer to use “Leave powered on” because it eliminates the chance of a false positive and the downtime associated with it. A false positive in this case is an isolated heartbeat network combined with a non-isolated virtual machine network and a non-isolated iSCSI / NFS network.
That leaves the question of how the other HA nodes know whether the host is isolated or has failed.
HA actually does not know the difference. The other HA nodes will try to restart the affected virtual machines in either case. When the host has failed, a restart attempt will take place no matter which isolation response has been selected. If a host is merely isolated, the non-isolated hosts will not be able to restart the affected virtual machines. This is caused by the lock on the VMDK and swap files. None of the hosts will be able to boot a virtual machine when the files are locked. For those who don’t know, ESX locks files to prevent the possibility of multiple ESX hosts starting the same virtual machine. However, when a host fails, this lock expires and a restart can occur.
To reiterate, the remaining nodes will always try to restart the “failed” virtual machines. The possible lock on the VMDK files belonging to these virtual machines, in the case of an isolation event, prevents them from being started. This assumes that the isolated host can still reach the files, which might not be true if the files are accessed through the network on iSCSI, NFS, or FCoE. HA, however, will repeatedly try to start the “failed” virtual machines when a restart is unsuccessful.
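The role of the file lock in the paragraphs above can be summarised in a few lines of code. This is a deliberately simplified sketch: the function names are hypothetical, and it assumes the lock is held only while the virtual machine is still running on its original host and that host can still reach the files. It does not model how long a lock takes to expire.

```python
def lock_is_held(vm_running_on_original_host: bool,
                 original_host_reaches_storage: bool) -> bool:
    """Simplified assumption: the lock on the VMDK and swap files stays in
    place only while the VM runs and its host can still reach the files."""
    return vm_running_on_original_host and original_host_reaches_storage


def restart_can_succeed(vm_running_on_original_host: bool,
                        original_host_reaches_storage: bool) -> bool:
    """The remaining hosts always attempt a restart; it can only succeed
    once the files are no longer locked."""
    return not lock_is_held(vm_running_on_original_host,
                            original_host_reaches_storage)


if __name__ == "__main__":
    scenarios = {
        "host failed": (False, False),
        "host isolated, VMs left running, storage reachable": (True, True),
        "host isolated, VMs left running, storage unreachable": (True, False),
        "host isolated, VMs powered off": (False, True),
    }
    for name, args in scenarios.items():
        outcome = "can succeed" if restart_can_succeed(*args) else "blocked by lock"
        print(f"{name}: restart {outcome}")
```

Only the second scenario, an isolated host that keeps its virtual machines running and still reaches its storage, leaves the restart attempts blocked.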
The number of retries is configurable as of vCenter 2.5 U4 with the advanced option “das.maxvmrestartcount”. The default value is 5. Before vCenter 2.5 U4, HA would keep retrying forever, which could lead to serious problems as described in KB article 1009625, where multiple virtual machines would be registered on multiple hosts simultaneously, leading to a confusing and inconsistent state. (http://kb.vmware.com/kb/1009625)
HA will try to start the virtual machine on one of the hosts in the affected cluster; if this is unsuccessful on that host, the restart count is increased by 1. The next restart attempt will then occur after two minutes. If that one fails, the next will occur after 4 minutes, and if that one fails the following will occur after 8 minutes, until the “das.maxvmrestartcount” has been reached.
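To make it clearer, look at the following list. Each “T+n” value is the wait in minutes since the previous attempt, not the time elapsed since the first restart: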
- T+0 Restart
- T+2 Restart retry 1
- T+4 Restart retry 2
- T+8 Restart retry 3
- T+8 Restart retry 4
- T+8 Restart retry 5
As shown in the list above and depicted in the diagram below, a successful power-on attempt could take up to 30 minutes (2 + 4 + 8 + 8 + 8) if multiple power-on attempts are unsuccessful. However, HA gives no guarantee, and a successful power-on attempt might never take place.
Figure 2: High Availability restart timeline
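The restart timeline can be reproduced with a few lines of arithmetic. The sketch below assumes, based on the list above, that the wait doubles from 2 minutes and is capped at 8 minutes; the function names and the `cap_minutes` parameter are hypothetical and only serve the illustration.

```python
from typing import List


def retry_delays(max_restart_count: int = 5, cap_minutes: int = 8) -> List[int]:
    """Wait in minutes before each retry: 2, 4, 8, then capped at 8.

    max_restart_count mirrors the default of das.maxvmrestartcount (5).
    The doubling-with-cap rule is inferred from the list in the text.
    """
    return [min(2 ** n, cap_minutes) for n in range(1, max_restart_count + 1)]


def attempt_times(max_restart_count: int = 5) -> List[int]:
    """Minutes elapsed since the first restart attempt, for every attempt."""
    times = [0]  # the initial restart happens at T+0
    for delay in retry_delays(max_restart_count):
        times.append(times[-1] + delay)
    return times


if __name__ == "__main__":
    print(retry_delays())   # [2, 4, 8, 8, 8]
    print(attempt_times())  # [0, 2, 6, 14, 22, 30]
```

With the default of five retries, the last attempt starts 30 minutes after the first one, which matches the worst case mentioned in the text.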