Split-Brain
When creating your design, make sure you understand the isolation response setting. For instance, when using an iSCSI array or NFS-based storage, choosing “Leave powered on” as your default isolation response might lead to a split-brain situation.
A split-brain situation can occur when the VMDK file lock times out. This could happen when the iSCSI, FCoE or NFS network is also unavailable. In this case the virtual machine is restarted on a different host while it is not powered off on the original host, because the selected isolation response is “Leave powered on”. This could potentially leave vCenter in an inconsistent state, as two VMs with the same UUID would be reported as running, one on each host. This would cause a “ping-pong” effect where the VM would appear to live on ESX host 1 at one moment and on ESX host 2 soon after.
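The sequence described above can be illustrated with a small toy model. This is only a sketch of the reasoning, not VMware code: the `Host` class, the host names, the UUID and the `simulate_isolation` function are all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for an ESX host; not a VMware API object."""
    name: str
    running: set = field(default_factory=set)  # UUIDs of VMs active in memory


def hosts_reporting(vm_uuid, hosts):
    """Return the names of all hosts that report the VM as running."""
    return [h.name for h in hosts if vm_uuid in h.running]


def simulate_isolation(isolation_response, lock_timed_out):
    """Model one VM on esx1 at the moment esx1 becomes isolated."""
    vm_uuid = "vm-uuid-0001"  # made-up UUID for illustration
    esx1 = Host("esx1", {vm_uuid})
    esx2 = Host("esx2")

    # The isolated host applies its configured isolation response.
    if isolation_response in ("Power off", "Shut down"):
        esx1.running.discard(vm_uuid)

    # A restart elsewhere only succeeds once the VMDK lock is gone:
    # either the VM was stopped, or the lock timed out.
    if vm_uuid not in esx1.running or lock_timed_out:
        esx2.running.add(vm_uuid)

    return hosts_reporting(vm_uuid, [esx1, esx2])
```

With “Leave powered on” and a timed-out lock, `simulate_isolation("Leave powered on", True)` returns both hosts, which is the split-brain case. With “Power off” the model returns only `esx2`, and with “Leave powered on” and an intact lock it returns only `esx1`.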
VMware’s engineers have recognized this as a potential risk and will come up with a solution for this unwanted situation, as explained by one of the engineers on the VMTN Community forums (http://communities.vmware.com/message/1488426#1488426).
In short, as of version 4.0 Update 2, ESX detects that the lock on the VMDK has been lost, issues a question asking whether the virtual machine should be powered off, and auto-answers the question with yes. However, you will only see this question if you connect directly to the ESX host. HA will generate an event for this auto-answer though, which is viewable within vCenter. Below you can find a screenshot of this question.
Figure 3: Virtual machine message
As stated above, as of ESX 4.0 Update 2 the question will be auto-answered and the virtual machine will be powered off to recover from the split-brain scenario.
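The behavior on the original host can be summarized in a few lines of code. The sketch below is a simplified model under our own assumptions: the function name, the version tuple encoding and the event text are hypothetical, and it does not reflect how ESX implements the check internally.

```python
# (major, minor, update): ESX 4.0 Update 2, the version named in the text.
AUTO_ANSWER_SINCE = (4, 0, 2)


def handle_lost_lock(esx_version, vm_running, lock_held):
    """Return (vm_still_running, event) for one VM on the original host.

    esx_version is a (major, minor, update) tuple; the event string is
    made up for illustration and is not the real event text.
    """
    if not vm_running or lock_held:
        # Nothing to recover from: the VM is off or still owns its lock.
        return vm_running, None
    if esx_version >= AUTO_ANSWER_SINCE:
        # The host raises the power-off question and answers it with yes.
        return False, "question auto-answered: virtual machine powered off"
    # Earlier versions: the stale copy keeps running in memory.
    return True, None
```

For example, `handle_lost_lock((4, 0, 2), True, False)` reports the virtual machine as powered off together with an event, while `handle_lost_lock((4, 0, 1), True, False)` leaves it running with no event.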
The question still remains: with iSCSI or NFS, should you power off virtual machines or leave them powered on?
As described above, in earlier versions “Leave powered on” could lead to a split-brain scenario. You would end up seeing multiple virtual machines ping-ponging between hosts, as vCenter would not know where they resided because they were active in memory on two hosts. As of ESX 4.0 Update 2, however, this is no longer the case and it should be safe to use “Leave powered on”.
We recommend avoiding the chance of a split-brain scenario. Configure a secondary Service Console on the same vSwitch and network as the iSCSI or NFS VMkernel portgroup and, on versions prior to vSphere 4.0 Update 2, select either “Power off” or “Shut down” as the isolation response. By doing this you will be able to detect an outage on the storage network. We will discuss the options you have for Service Console / Management Network redundancy more extensively later on in this book.
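As a quick reference, the recommendation above can be written down as a small decision helper. This is a sketch under stated assumptions: the function name and the version tuple encoding are our own, and storage types this section does not discuss simply yield `None`.

```python
IP_STORAGE = {"iSCSI", "NFS"}  # the storage types this recommendation covers
FIXED_IN = (4, 0, 2)           # (major, minor, update): 4.0 Update 2


def suggested_isolation_responses(esx_version, storage):
    """Return the isolation responses this section suggests, or None.

    None means the section makes no statement about that storage type.
    """
    if storage not in IP_STORAGE:
        return None
    if esx_version < FIXED_IN:
        # Before 4.0 Update 2, "Leave powered on" risks a split brain.
        return ["Power off", "Shut down"]
    # From 4.0 Update 2 on, "Leave powered on" should be safe as well.
    return ["Leave powered on", "Power off", "Shut down"]
```

Calling `suggested_isolation_responses((4, 0, 1), "iSCSI")` returns only “Power off” and “Shut down”, whereas `(4, 0, 2)` or later also includes “Leave powered on”.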