Recently I had to troubleshoot an environment that appeared to have a DRS load-balancing problem. Every time a host was brought out of maintenance mode, DRS did not migrate virtual machines to the empty host right away. Virtual machines were eventually migrated to the empty host, but only after a couple of hours had passed. After a restart of vCenter, however, DRS immediately started migrating virtual machines to the empty host.
Restarting vCenter removes the cached historical information about vMotion impact. This vMotion impact information is part of the cost-benefit-risk analysis. DRS uses this cost-benefit metric to determine the return on investment of a migration. By comparing the cost, benefit and risk of each migration, DRS tries to avoid migrations that yield insufficient improvement in the load balance of the cluster.
When the historical information is removed, a big part of the cost segment is lost, leading to a more positive ROI calculation, which in turn results in a more “aggressive” load-balance operation.
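To make the effect of the lost history concrete, here is a minimal sketch in Python. It is not the actual DRS algorithm: the class, the field names, the simple "benefit minus cost and risk" formula and all numbers are assumptions made up for illustration. It only shows how dropping a cached cost term can turn a rejected migration into a recommended one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MigrationCandidate:
    """Hypothetical model of one candidate migration (arbitrary units)."""
    benefit: float                           # estimated load-balance improvement
    base_cost: float                         # cost known without any history
    risk: float                              # penalty for workload uncertainty
    historical_cost: Optional[float] = None  # cached vMotion impact, None if lost


def net_gain(candidate: MigrationCandidate) -> float:
    """Benefit minus all known costs and risk (assumed formula, not DRS's own)."""
    history = candidate.historical_cost if candidate.historical_cost is not None else 0.0
    return candidate.benefit - (candidate.base_cost + history + candidate.risk)


def is_recommended(candidate: MigrationCandidate, threshold: float = 0.0) -> bool:
    """Recommend the move only if the net gain exceeds the threshold."""
    return net_gain(candidate) > threshold


# Same migration, evaluated with and without the cached history.
with_history = MigrationCandidate(benefit=10.0, base_cost=2.0, risk=1.0, historical_cost=8.0)
after_restart = MigrationCandidate(benefit=10.0, base_cost=2.0, risk=1.0, historical_cost=None)

print(is_recommended(with_history))   # history makes the move look too expensive
print(is_recommended(after_restart))  # cost segment shrinks, so the move is accepted
```

With the cached history in place the net gain in this toy example is negative and the move is skipped; once the history is gone, the same move shows a positive return and is recommended.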
April 8th, 2011 at 15:36
Interesting to note. Is there a less disruptive way to clear the cost-benefit cache than restarting the entire vCenter service?
April 8th, 2011 at 15:38
Good stuff Frank,
April 8th, 2011 at 16:09
So was this due to a restart of the application, or a restart of the underlying vCenter service?
April 8th, 2011 at 16:57
Wow. I never thought about that. Do you think it would make sense to have vCenter cache that information more persistently in the future? At least when vCenter gets shut down cleanly?
April 8th, 2011 at 17:04
I have seen this behavior since the VirtualCenter 2.5 days. Thanks for the explanation.
However, I would like DRS to start load balancing the hosts only 10-15 minutes after the vCenter service starts, not immediately. Many times during troubleshooting or upgrades, when several restarts of the vCenter service are required, I find myself unable to connect to vCenter because DRS is performing multiple vMotions at once.
In this situation, a “postponed DRS start” feature would be nice and would save a lot of time.
Michael.
April 8th, 2011 at 17:05
Thanks Frank, it explains a lot of things indeed
April 8th, 2011 at 19:16
Great post, and yeah, that explains why it doesn’t happen faster.
April 8th, 2011 at 22:28
That explains it then! I have run into this for years, it seems, and I just accepted the vMotion of tons of VMs after a vCenter restart. Thanks Frank, I can sleep better tonight.
April 9th, 2011 at 04:12
How about artificially creating a load on the cluster by shutting down some non-critical VMs, or through some other non-destructive activity, and seeing whether DRS is triggered again?
Needs to be tested for sure – I will take some time for it once I get back to work on Tuesday and keep you/Duncan posted.
April 11th, 2011 at 14:57
So I am not crazy! Thanks for the explanation!
April 16th, 2011 at 10:49
I have accepted this behavior as part of a less aggressively set DRS, but it always seemed weird that one host out of four would be left completely empty for a good while, since losing a loaded host to a fault would have a larger impact on the entire cluster.
June 30th, 2011 at 15:20
Frank,
Many thanks for the explanation. Are you aware of any advanced configuration options with which we can alter this behavior post vCenter Server service restart?
Many thanks!
December 26th, 2011 at 01:14
According to the vCenter 4.1 U2 release notes, it’s resolved: http://www.vmware.com/support/vsphere4/doc/vsp_vc41_u2_rel_notes.html
When DRS is set to automatic, restarting the VMware VirtualCenter Server service might generate a large number of vMotion tasks leading to unnecessary movement of virtual machines. The vMotion tasks are queued and make the management of virtual machines difficult until the tasks are completed.
This issue is resolved in this release.
February 23rd, 2012 at 02:34