Reservations and CPU scheduling

Most of my resource management articles focus on memory management rather than CPU management, mainly because the memory scheduler within ESX is such an interesting, complex system: it comprises memory allocation and reclamation, with algorithms such as Idle Memory Tax and mechanisms like ballooning and swapping. Lately, however, CPU scheduling has been attracting more and more of my attention. The discussion Duncan and I had prior to the posting of his article about CPU limits sparked my interest in how CPU scheduling works when reservations are set. So, in addition to Duncan's excellent article, I want to take a closer look at how the ESX CPU scheduler handles CPU reservations and shares, and show why CPU scheduling is fairer than memory management.

Similar to memory, the resource allocation settings (reservations, shares and limits) can be set at the CPU level. Limits and shares behave similarly for CPU and memory; reservations act differently. Let's take a quick look at the resource allocation settings:

Shares: Shares indicate the proportional value of the entity relative to other entities on the same hierarchical level. If everything else is equal (reservations, limits and active utilization), a virtual machine that is allocated twice as many shares as another virtual machine is entitled to consume twice as many CPU cycles.

Limit: A limit is a mechanism to restrict physical resource usage of the virtual machine. A limit ensures that the VM will never receive more CPU cycles than specified, even if extra cycles are available on the host.

Reservation: A reservation is a guarantee of the specified amount of physical resources, regardless of the total number of shares in its environment.
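
To make the three settings concrete, here is a minimal sketch in Python. It is not how the VMkernel implements them: the function names and the two example virtual machines are hypothetical, full contention is assumed, and the clamp only illustrates that a reservation acts as a floor and a limit as a ceiling on the share-based entitlement.

```python
def proportional_split(capacity_mhz, shares):
    """Divide capacity in proportion to shares (assumes every VM wants CPU)."""
    total_shares = sum(shares.values())
    return {vm: capacity_mhz * s / total_shares for vm, s in shares.items()}


def clamp_entitlement(share_based_mhz, reservation_mhz=0.0, limit_mhz=None):
    """Raise the value to the reservation floor, then cap it at the limit."""
    entitled = max(share_based_mhz, reservation_mhz)
    if limit_mhz is not None:
        entitled = min(entitled, limit_mhz)
    return entitled


# Hypothetical 6000 MHz host: vm_a has twice the shares of vm_b.
split = proportional_split(6000, {"vm_a": 2000, "vm_b": 1000})
print(split)  # vm_a is entitled to twice as many MHz as vm_b

# A 2250 MHz reservation lifts vm_b above its share-based value.
print(clamp_entitlement(split["vm_b"], reservation_mhz=2250))

# A 3000 MHz limit caps vm_a below its share-based value.
print(clamp_entitlement(split["vm_a"], limit_mhz=3000))
```

The sketch deliberately ignores what happens to the cycles that a clamp frees up or takes away; that redistribution is what the rest of this article is about.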

Now, a reservation acts differently when set on CPU than when set on memory. When the virtual machine does not use its reserved CPU cycles, these cycles are redistributed to other active virtual machines, so unused reservations are not wasted. This is contrary to memory management, where memory is not reclaimed by the scheduler once the virtual machine has touched the pages.

By redistributing available CPU cycles and not letting the virtual machine hoard CPU resources, the VMkernel tries to properly divide the resources and achieve better fairness among virtual machines and improve utilization of the resources. To achieve both goals and divide the CPU resources among virtual machines the CPU scheduler calculates a MHzPerShare metric. This metric tries to identify which virtual machines are "ahead" of their entitlement and which virtual machines are "behind" and do not fully utilize their entitlement.

MHzPerShare = MHzUsed / Shares

MHzUsed is the current utilization of the virtual machine measured in Megahertz.
Shares is the current configured amount of shares of the virtual machine.

For example, if a virtual machine is using 2500 MHz and has 1000 shares, its MHzPerShare value is 2.5. The VMkernel calculates the MHzPerShare value of each active virtual machine, and the virtual machine with the lowest MHzPerShare value has the highest priority for running on the CPU. If the virtual machine with the lowest MHzPerShare value decides not to use its right to allocate the cycles, the cycles can be used by the virtual machine with the next-lowest MHzPerShare value.
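
The ordering described above can be sketched in a few lines of Python. This is only an illustration of the metric, not VMkernel code: the VM class, the wants_more flag and the three example virtual machines are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    used_mhz: float
    shares: int
    wants_more: bool = True  # hypothetical flag: does the VM still want cycles?

    @property
    def mhz_per_share(self):
        # MHzPerShare = MHzUsed / Shares
        return self.used_mhz / self.shares


def offer_order(vms):
    """Lowest MHzPerShare first, i.e. highest priority first."""
    return sorted(vms, key=lambda vm: vm.mhz_per_share)


def first_taker(vms):
    """Offer cycles in priority order; return the first VM that accepts."""
    for vm in offer_order(vms):
        if vm.wants_more:
            return vm
    return None


vms = [
    VM("a", used_mhz=2500, shares=1000),                   # 2.5
    VM("b", used_mhz=600, shares=2000, wants_more=False),  # 0.3, satisfied
    VM("c", used_mhz=900, shares=1000),                    # 0.9
]
print([vm.name for vm in offer_order(vms)])
print(first_taker(vms).name)
```

Here "b" has the lowest value and is asked first; because it declines, the cycles go to "c", the virtual machine with the next-lowest value.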

[Figure: ESX CPU scheduler MHzPerShare distribution]

Although not shown, reservations play an important part in this calculation. As mentioned before, reservations overrule shares and guarantee the amount of physical resources regardless of the number of shares. This means that the virtual machine can always use the CPU cycles specified in its reservation, even if the virtual machine has a greater MHzPerShare value. So how exactly do reservations and shares interact with each other when it comes to calculating the MHzPerShare value?

For example:

In a 6 GHz system, three virtual machines are powered on. VM1 is running a memory-intensive app and doesn't really care much about CPU cycles; it is configured with 1000 CPU shares and no reservation. The other two virtual machines run CPU-intensive apps and are currently competing for resources. VM2 has a reservation of 2250 MHz and the default share setting of 1000 shares. The other CPU-intensive virtual machine, VM3, is equipped with 2 vCPUs and therefore receives 2000 shares, but the administrator didn't set a reservation.

Now VM1 is running at 500 MHz; with its 1000 shares, its MHzPerShare value equals 0.5. Because VM2 is in need of CPU cycles, it immediately utilizes its reservation and "occupies" all 2250 MHz, so its MHzPerShare value equals 2.25 (2250/1000).
[Figure: ESX CPU scheduler free MHz distribution]

Now, because VM3 doesn't have a reservation and is in need of CPU cycles, the VMkernel looks at its MHzPerShare value to decide how many CPU cycles it can use before excess CPU cycles are distributed to other virtual machines. The kernel will distribute cycles to VM3 until it reaches the same MHzPerShare value as VM2, which is 2.25. In theory this means that the VMkernel would allocate 2000 x 2.25 = 4500 MHz before looking at another VM. Because the CPU scheduler has already allocated 500 MHz to VM1 and 2250 MHz to VM2 of the available 6 GHz, it can allocate 3250 MHz to VM3.
[Figure: ESX CPU scheduler MHzPerShare value]
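
The arithmetic of this first scenario can be checked with a short Python sketch. The variable names are mine and the code only replays the numbers from the example; it does not model the scheduler itself.

```python
HOST_MHZ = 6000          # 6 GHz system from the example

vm1_used_mhz = 500       # VM1: 1000 shares, no reservation
vm2_reserved_mhz = 2250  # VM2: 1000 shares, fully uses its reservation
vm2_shares = 1000
vm3_shares = 2000        # VM3: 2 vCPUs, no reservation

# VM2's MHzPerShare value once it occupies its reservation.
vm2_mhz_per_share = vm2_reserved_mhz / vm2_shares

# VM3 may receive cycles until it reaches the same MHzPerShare value.
vm3_fair_cap_mhz = vm3_shares * vm2_mhz_per_share

# What is physically left on the host after VM1 and VM2.
remaining_mhz = HOST_MHZ - vm1_used_mhz - vm2_reserved_mhz

# VM3 gets the smaller of the two.
vm3_alloc_mhz = min(vm3_fair_cap_mhz, remaining_mhz)

print(vm2_mhz_per_share)  # 2.25
print(vm3_fair_cap_mhz)   # 4500.0
print(vm3_alloc_mhz)      # 3250
```

The theoretical 4500 MHz is never reached because the host runs out of cycles first, which is why VM3 ends up with 3250 MHz.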

Because VM2 has a reservation, it can allocate up to its reservation even though VM3 initially has a lower MHzPerShare value (0) and the CPU cycle requirements of VM1 are met at 500 MHz. However, due to the fairness principle, VM2's own MHzPerShare value influences the VMkernel's decision on how many cycles to allocate to VM3 before it considers allocating additional cycles to VM2 again.

Now, for some reason the application in VM3 levels out at 2000 MHz, VM1 is still using 500 MHz and VM2 is in desperate need of extra CPU cycles. No settings are changed, so VM1 and VM2 have 1000 shares each and VM2 has a reservation of 2250 MHz; VM3 has 2000 shares and no reservation.

The VMkernel will satisfy the request of VM1, resulting in a MHzPerShare value of 0.5. VM2 claims its reservation and utilizes 2250 MHz, resulting in a MHzPerShare value of 2.25. VM3 can allocate up to 4500 MHz before reaching the MHzPerShare value of VM2, but stops consuming at 2000 MHz, ending up with a MHzPerShare value of 2000/2000 = 1. This means that 1250 MHz of the 6 GHz host is still available.
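
As a quick check of the numbers in this second scenario, the following Python sketch recomputes each MHzPerShare value and the spare capacity. The dictionary names are mine; the figures are taken directly from the example.

```python
HOST_MHZ = 6000

usage_mhz = {"VM1": 500, "VM2": 2250, "VM3": 2000}
shares = {"VM1": 1000, "VM2": 1000, "VM3": 2000}

# MHzPerShare = MHzUsed / Shares, per virtual machine.
mhz_per_share = {vm: usage_mhz[vm] / shares[vm] for vm in usage_mhz}

# Cycles nobody is using yet.
spare_mhz = HOST_MHZ - sum(usage_mhz.values())

print(mhz_per_share)  # {'VM1': 0.5, 'VM2': 2.25, 'VM3': 1.0}
print(spare_mhz)      # 1250
```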

The CPU scheduler will shop around with these available cycles and see which VM is interested. The VMkernel offers the cycles to the virtual machines in increasing order of MHzPerShare, so it first asks VM1 (0.5). Because its CPU request is satisfied, VM1 forfeits its claim. VM2 also forfeits its claim, so VM3 happily accepts the remaining cycles and its resource usage increases to 3500 MHz.

So there you have it: shares and reservations interact, or even battle, with each other to allocate CPU cycles for the virtual machines. Shares are perceived by many as an inferior resource allocation setting. Hopefully this demonstrates the power of shares: in combination with utilization, they can become a very important factor in ESX resource management.

Frank Denneman

19 Responses

  1. duncan says:

    Now the real question is how this works with scheduling at the CPU layer. If VM3 gets assigned those 3500 MHz, each core is 2200 MHz and it only has a single vCPU, will it get scheduled twice in a row, with the second scheduling option being limited to 1300 MHz?

    Seriously complicated.

  2. Another very insightful article on the internal workings of the ESX scheduler

    A couple of quick questions, Frank. In the following section of the article:

    "The VMkernel will calculate the MHzPerShare number of each active virtual machine and the virtual machine with the lowest MHzPerShare value will have the highest priority of running on the CPU. If the virtual machine with the lowest MHzPerShare value decides not to use it right to allocate the cycles, the cycles can be used by the virtual machine with the next -- lower -- MHzPerShare value."

    Should the -- lower -- actually be -- higher -- MHzPerShare, i.e. it passes up through the levels of MHzPerShare. i.e. the lowest MHzPerShare has highest priority, doesn't utilise the cycles so passes it to the lower priority VM which has the next highest MHzPerShare.

    I'm also trying to understand the following statement from your 3 VM example

    "this means that inside the 6GHz host 1250 cycles are available."

    Where does this figure come from and what is its significance?

    In your example 4750 MHz are in use with a grand total of 3.75 MHzPerShare across the 3 VM's. dividing MHZ / MHZperShare gives 1266, is this where that number comes from?

    I might have this totally wrong but I was always taking the MHz value as the number of cycles.

    Regards

    Craig

  3. Hi Craig,

    Should the — lower — actually be — higher — MHzPerShare, i.e. it passes up through the levels of MHzPerShare. i.e. the lowest MHzPerShare has highest priority, doesn’t utilise the cycles so passes it to the lower priority VM which has the next highest MHzPerShare.

    The sentence you are referring to took me a while to formulate, but I suppose it still isn't clear. What I'm trying to say is that the scheduler offers the MHz to the virtual machine with the lowest value, let's say VM1. If VM1 declines the offer, the CPU scheduler offers the CPU cycles to the VM with the next-lowest value, for instance VM2. So the MHzPerShare value of VM2 is higher than that of VM1, but lower than that of VM3. Does this make sense?

    In your example 4750 MHz are in use with a grand total of 3.75 MHzPerShare across the 3 VM's. dividing MHZ / MHZperShare gives 1266, is this where that number comes from?

    So inside the 6 GHz box, VM1 is using 500 MHz, which leaves the CPU scheduler with 5500 MHz to divide. VM2 claims its reservation of 2250 MHz, creating a load of 2750 MHz. VM3 comes along and can claim 4500 MHz based on its MHzPerShare level, due to its increased share value. Because the application is satisfied with 2000 MHz, the claim of VM3 stops at 2000 MHz, making the total allocation 4750 MHz. This leaves the CPU scheduler with excess cycles of 1250 MHz.

    The MHzPerShare value is used by the CPU scheduler per VM, not as an overall number. It indicates the relative priority of VMs within the ESX box.

    I used MHz and cycles to indicate the same unit of computation.

  4. Louw Pretorius says:

    Thx Frank for this article, but I always have a headache after reading your posts...

  5. Is my writing THAT bad?

  6. Hi Frank

    Thanks for coming back on my points, I've re-read it again and I totally see what you're saying, it makes sense.

    I must have had one of those headaches that Louw was talking about when reading it the first time. It's not your writing it's the sheer complexity and in depth nature of the articles. If you don't work in depth with this stuff day in day out it can take a little time to get your head round it.

    However, it is worthwhile getting your head round it so that you fully understand the impact of things like reservations and shares. A lack of understanding of the full impact may explain why, at the recent Scottish VMUG, hardly anyone in the audience was using shares as a means of resource control; they worry about the added complexity of setting them manually.

    I spent a lot of time reading up about resource groups and shares before I implemented our new vSphere cluster at work. Since then multiple articles have come out from yourself and Duncan Epping regarding CPU / Memory / Shares / Scheduling which have caused me to revisit my design and think about the points you're making. I haven't found anything wrong with what I implemented but have learnt a lot along the way about what could go wrong.

    So thanks for that :o)

  7. Sudharsan says:

    Hi Frank ,

    I have the same question that has been asked by Duncan .
    "if VM3 gets assigned those 3500Mhz and each core is 2200Mhz and it only has a single vCPU will it get scheduled twice in a row with the second scheduling option being limited to 1300 Mhz?"

    Also , What happens when there is only one VM in a ESX host , with 2000 shares with no shares and no limitations on Dual CPU Quad Core Server having a total compute power of 20 GHz. How frequently will it be scheduled ?

    Can you please clarify.

  8. I have the same question that has been asked by Duncan .
    “if VM3 gets assigned those 3500Mhz and each core is 2200Mhz and it only has a single vCPU will it get scheduled twice in a row with the second scheduling option being limited to 1300 Mhz?”

    A vCPU can only use one core at a time; this means that the vCPU can never exceed the physical speed of the pCPU. The instructions pushed down from the virtual machine need to “fit” inside the pCPU. If you have one vCPU you cannot assign it more CPU cycles than the pCPU speed. To answer your question, the vCPU will be scheduled at a maximum of 2200 MHz. If an application on the vCPU is busy and hoarding the vCPU, it can be scheduled by the CPU dispatcher again, depending on its resource entitlement and the level of contention on the ESX server.

    Coming back to the maximum speed of a vCPU: the Unlimited option of the limit setting is quite a strange setting. With it you can configure a virtual machine to have a limit greater than the speed of a pCPU. vCenter allows you to set a higher limit because a vCPU has additional worldlets, which the VMkernel CPU scheduler can run in parallel with the vCPU. Because these worldlets run in parallel on different pCPUs, the vCPU can effectively consume more than one core (and thus more speed).

    Also , What happens when there is only one VM in a ESX host , with 2000 shares with no shares and no limitations on Dual CPU Quad Core Server having a total compute power of 20 GHz. How frequently will it be scheduled?

    It will be scheduled whenever the VMkernel gets instructions; when a vCPU stops sending instructions it will be descheduled. There is a time slice of a certain number of milliseconds (I don't know if I can make the exact number public). This means that a vCPU will run for that amount of time before the dispatcher is called again. But based on the resource entitlement and the amount of contention in the system, which in this example is none, the dispatcher will immediately schedule the vCPU again. The impact of the deschedule/schedule is negligible and invisible to virtual machines.

  9. Felipe Espiritu says:

    Frank,

    Are these articles explained in detail in your HA & DRS book? I already bought it =)

    Regards from Mexico

  10. Arun Raju says:

    We have been having problems with SAP VMs hosted in vSphere 4.1 clusters. These hosts are running ESXi v4.1 U1. Whenever a VM is vMotioned using DRS to another host, time drifts are happening. The time on the VM drifts by about 50 secs when compared to the correct time. There are no CPU or Memory Reservations for any of the hosted VMs.

    I strongly suspect that, even though these VMs are configured with 8 vCPUs and 32 GB RAM each, the VMkernel is taking the idle CPU cycles when the VM is not using them, and that this is the reason the time drift occurs when a vMotion is done.

    Please correct me if I am wrong.

  11. Sebastian says:

    Hi Frank,

    Even when the post is "old", it's really helpful for people like me. So, first of all, a big thank you for sharing this information in such a great article!

    I always get confused by this subject, so I will try to make my question as clear as possible. Can the VMkernel "split" a logical processor in order to accommodate two or more VMs on it at the same time? For instance, if I have a 1000 MHz processor (no hyperthreading), can the VMkernel run 10 VMs at 100 MHz each at the same time?

    I know the question is pretty basic, but if you could tell me the answer, that would be terrific! From my understanding of the Performance Best Practices and your post, the answer should be "yes", but I have also heard that processor time is "absolute", in the sense that you can control the time spent on each VM but not how many MHz you really assign to it.

    Thanks!
    Sebas

  12. Sjoerd Hooft says:

    Hi Frank,
    You state:
    VM2 also forfeits this claim, so VM3 will happily accepts the remaining cycles

    But shouldn't it be the other way around? VM3 forfeits its claim (leveled at 2000) and VM2 happily accepts the remaining cycles (since it is in desperate need of extra cycles).

    Typo?

    Kind regards,
    Sjoerd Hooft
    PS. Sorry for bothering you about an old post but it has been in my "ToRead" folder for quite some time now...

  13. Sjoerd,

    That is possible, but in this scenario VM2 doesn't need any more cycles and therefore VM3 gets them. VM2 already got its cycles.

  14. Martin says:

    Hi Frank,

    Excellent article! It gave me a perfect overview of how to do the math with those MHz on a server.
    Do I understand correctly that setting shares to a high value (8000) is better?
    More cycles for the whole system, but immediately allocated to the VM when needed?

    MHzPerShare will always be lower when shares are set to a high value.

    Thanks
    Cheers
    Martin
