Wednesday, 3 December 2008

HA Slot Size Calculations in the Absence of Resource Reservations

Ahhh, the ol' HA slot size calculations. Many a post has been written and many a document published trying to explain just how HA slot sizes are calculated. But the one thing missing from most of them, certainly from the VMware documentation, is what behaviour to expect when there are no resource reservations on individual VMs. For example, in many large-scale VDI deployments that I know of, share-based resource pools are used to assert VM priority rather than setting resource reservations per VM.

So here's the rub.

In the absence of any VM resource reservations, HA calculates the slot size based on a _minimum_ of 256MHz CPU and 256MB RAM. The RAM amount varies, however, not by something intelligent like average guest memory allocation or average guest memory usage, but by the table on pages 136 / 137 of the Resource Management Guide, which lists the memory overhead associated with a guest depending on how much memory is allocated to it, how many vCPUs it has, and what the vCPU architecture is. So let's be straight on this. If 256MB > 'VM memory overhead', then 256MB is what HA uses for the slot size calculation. If 'VM memory overhead' > 256MB, then 'VM memory overhead' is what is used for the slot size calculation.
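To make that rule concrete, here's a minimal sketch in Python of how the slot size falls out when no reservations are set. This is not VMware's actual code, the helper name and constants are mine, and the overhead figure is just an illustrative number of the kind you'd look up in that table:

```python
# Hypothetical sketch of the slot size rule described above - illustrative only.
DEFAULT_CPU_SLOT_MHZ = 256   # CPU floor used when there is no CPU reservation
DEFAULT_MEM_SLOT_MB = 256    # memory floor used when there is no memory reservation

def slot_size(cpu_reservation_mhz, mem_reservation_mb, mem_overhead_mb):
    """Return (cpu_slot_mhz, mem_slot_mb) for a single VM."""
    cpu_slot = max(DEFAULT_CPU_SLOT_MHZ, cpu_reservation_mhz)
    # Memory slot is whichever is larger: the 256MB floor, or the
    # reservation plus the per-VM memory overhead from the table.
    mem_slot = max(DEFAULT_MEM_SLOT_MB, mem_reservation_mb + mem_overhead_mb)
    return cpu_slot, mem_slot

# A 1 vCPU 32-bit guest with no reservations and a small overhead (e.g. ~80MB):
print(slot_size(0, 0, 80))   # -> (256, 256): the defaults win
```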

So for example, a cluster full of single vCPU 32-bit Windows XP VMs will have the default HA slot size of 256MHz / 256MB RAM in the absence of any VM resource reservations. Net result? A cluster of 8 identical hosts, each with 21GHz CPU and 32GB RAM, and HA configured for 1 host failure, will result in HA thinking it can safely run something like 7 x (21GHz / 256MHz), or roughly 574, guests in the cluster!!!
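Working the numbers through (a rough sketch only, using the host capacities quoted above rather than anything measured):

```python
# Rough arithmetic for the example cluster: 8 hosts of 21GHz / 32GB each,
# a 256MHz / 256MB slot size, and tolerance for 1 host failure.
hosts = 8
host_cpu_mhz = 21000
host_mem_mb = 32 * 1024
slot_cpu_mhz = 256
slot_mem_mb = 256

# Slots per host are limited by whichever resource runs out first.
slots_per_host = min(host_cpu_mhz // slot_cpu_mhz, host_mem_mb // slot_mem_mb)

# With 1 host failure tolerated, HA effectively counts the slots of the remaining 7 hosts.
usable_slots = (hosts - 1) * slots_per_host

print(slots_per_host, usable_slots)   # 82 slots per host, 574 usable slots
```

So CPU is the constraint here (82 slots per host versus 128 on memory), and HA happily reports room for hundreds of guests regardless of what they actually consume.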

Which led to a situation I came across recently whereby HA thought it still had 4 hosts' worth of failover capacity in an 8 host cluster with over 300 VMs running on it, even though the average cluster memory utilisation was close to 90%. Clearly, a single host failure would impact this cluster, let alone 4 host failures. 80-90% resource utilisation in the cluster is a pretty sweet spot imho; the question is simply whether you want that kind of utilisation under normal operating conditions or under failure conditions. As an architect, I should not be making that call - the business owners of the platform should be. In these dark financial times, I'm sure you can guess what they'll opt for. But the bottom line is the business signs off on the risk acceptance - not me, and certainly not the support organisation. But I digress...

I hope HA can become more intelligent in how it calculates slot sizes in the future. Giving us the das.vmMemoryMinMB and das.vmCpuMinMHz advanced HA settings in vCenter 2.5 was a start, but something more fluid would be most welcome.
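For reference, the idea with those two advanced options is to set a sensible floor yourself when no per-VM reservations exist. The option names are the real ones; the values below are purely an example of the sort of thing you might choose, not a recommendation:

```python
# Illustrative values for the cluster-level HA advanced options that
# override the default 256MHz / 256MB minimum slot size.
ha_advanced_options = {
    "das.vmCpuMinMHz": 500,      # minimum CPU slot size in MHz
    "das.vmMemoryMinMB": 1024,   # minimum memory slot size in MB
}
```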

PS. I'd like to thank a certain VMware TAM who is assigned to the account of a certain tier 1 global investment bank that employs a certain blogger, for helping to shed light on this information. You know who you are ;-)