- Isolation Response
- Split-Brain
- Isolation Detection
- Failure Detection Time
Failure Detection Time
Failure Detection Time seems to be a concept which is often misunderstood but is critical when designing a virtual infrastructure. Failure Detection Time is basically the time it takes before the “isolation response” is triggered. There are two primary concepts when we are talking about failure detection time:
- The time the host will detect it is isolated
- The time the non-isolated hosts will mark the host as isolated and initiate the failover
The following diagram depicts the timeline for both concepts:
Figure 5: High Availability failure detection time
The default value for isolation failure detection is 15 seconds. (das.failuredetectiontime) In other words the failed or isolated host will be declared dead by the other hosts in the HA cluster on the fifteenth second and a restart will be initiated by the failover coordinator.
For now let’s assume the isolation response is “Power off”. The isolation response “Power off” will be triggered by the isolated host 1 second before the das.failuredetectiontime elapses. In other words a “Power off” will be initiated on the thirteenth second. A restart will be initiated on the fifteenth second by the failover coordinator.
Does this mean that you can end up with your virtual machines being down and HA not restarting them?
Yes, when the heartbeat returns between the 14th and 15th second the “Power off” might have already been initiated. The restart however will not be initiated because the received heartbeat indicates that the host is not isolated anymore.
How can you avoid this?
Selecting “Leave VM powered on” as an isolation response is one option. Increasing the das.failuredetectiontime will also decrease the chances of running into issues like these, and with ESX 3.5 it was a standard best practice to increase the failure detection time to 30 seconds.
At the time of writing (vSphere) this is not a best practice anymore as with any value the “1-second” gap exists and the likelihood of running into this issue is small. We recommend keeping das.failuredetectiontime as low as possible.