Recently I had to troubleshoot an environment that appeared to have a DRS load-balancing problem. Whenever a host was brought out of maintenance mode, DRS didn’t migrate virtual machines to the empty host. Virtual machines were eventually migrated to it, but only after a couple of hours had passed. After a restart of vCenter, however, DRS immediately started migrating virtual machines to the empty host.
Restarting vCenter removes the cached historical information of the vMotion impact. vMotion impact information is a part of the Cost-Benefit Risk analysis. DRS uses this Cost-Benefit Metric to determine the return on investment of a migration. By comparing the cost, benefit and risks of each migration, DRS tries to avoid migrations with insufficient improvement on the load balance of the cluster.
Removing the historical information discards a large part of the cost segment, which leads to a more positive ROI calculation and in turn to a more “aggressive” load-balance operation.
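To make the effect concrete, here is a toy sketch in Python of how losing the cached cost can flip a migration decision. It is not the actual DRS algorithm, whose internals are not described in this post. The `Migration` fields, the subtraction-based `net_gain`, the threshold and all numbers are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Migration:
    """Hypothetical candidate move; all values are arbitrary illustrative units."""
    vm: str
    benefit: float                    # expected load-balance improvement
    base_cost: float                  # cost that is known without any history
    historical_cost: Optional[float]  # cached vMotion impact; None if the cache is empty


def net_gain(m: Migration) -> float:
    """Toy ROI: benefit minus total cost. An empty cache contributes no cost."""
    cost = m.base_cost + (m.historical_cost or 0.0)
    return m.benefit - cost


def recommend(migrations: List[Migration], threshold: float = 0.0) -> List[str]:
    """Return the VMs whose toy ROI exceeds the (assumed) threshold."""
    return [m.vm for m in migrations if net_gain(m) > threshold]


def clear_history(migrations: List[Migration]) -> List[Migration]:
    """Model a vCenter restart: the cached historical cost is gone."""
    return [Migration(m.vm, m.benefit, m.base_cost, None) for m in migrations]


candidates = [
    Migration("vm-a", benefit=10.0, base_cost=2.0, historical_cost=9.0),
    Migration("vm-b", benefit=6.0, base_cost=1.0, historical_cost=7.0),
]

before_restart = recommend(candidates)
after_restart = recommend(clear_history(candidates))
print("before restart:", before_restart)
print("after restart: ", after_restart)
```

In this toy model neither move clears the threshold while the historical cost is cached, and both do once it is cleared, which mirrors the burst of migrations seen after the restart.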
Restart vCenter results in DRS load balancing
Interesting to note. Is there a less disruptive way than restarting the entire vCenter service to effect the cost-benefit cache clearing?
Good stuff Frank,
So was this due to a restart of the application, or a restart of the underlying vCenter service?
Wow. I never thought about that. Do you think it would make sense to have vCenter cache that information more persistently in the future? At least when vCenter gets shut down cleanly?
I have seen this behavior since the VirtualCenter 2.5 days. Thanks for the explanation.
However, I would like DRS to start load balancing the hosts only 10-15 minutes after the vCenter service starts, not immediately. Many times during troubleshooting or upgrades, when several restarts of the vCenter service are required, I find myself unable to connect to vCenter because DRS is performing multiple vMotions at once.
In this situation, a “postponed DRS start” feature would be nice and would save a lot of time.
Michael.
Thanks Frank, it explains a lot of things indeed
Great post and yeah, that explains why it doesn’t happen faster.
That explains it then! I have run into this for years it seems like and I just accepted the vmotion of tons of VMs after a vcenter restart. Thanks Frank, I can sleep better tonight :).
How about artificially creating a load on the cluster by shutting down some non-critical VMs, or through some other non-destructive activity, and seeing whether DRS is triggered again?
Needs to be tested for sure – will take some time for me once I get back to work on Tuesday and keep you/Duncan posted.
So I am not crazy! Thanks for the explanation!
I had accepted this behavior as part of a less aggressively set DRS, but it always seemed weird that one host out of four would be left completely empty for a good while, since losing a loaded host to a fault will have a larger impact on the entire cluster.
Frank,
Many thanks for the explanation. Are you aware of any advanced configuration options with which we can alter this behavior post vCenter Server service restart?
Many thanks!
According to the vCenter 4.1 U2 release notes, it’s resolved: http://www.vmware.com/support/vsphere4/doc/vsp_vc41_u2_rel_notes.html
When DRS is set to automatic, restarting the VMware VirtualCenter Server service might generate a large number of vMotion tasks leading to unnecessary movement of virtual machines. The vMotion tasks are queued and make the management of virtual machines difficult until the tasks are completed.
This issue is resolved in this release.
Frank, just curious if anyone figured out a way to force DRS to flush its cache without restarting vCenter? I have some smaller clusters which seem to have given up attempting to truly balance the load, even though one of the resources is definitely out of balance.
For now restarting vCenter is doable, but at some point it may become a nightmare. And with the issue being fixed in 4.1 U2, I wonder if that will even be a valid method to refresh the cache.