Introduction 2016 NUMA Deep Dive Series

July 6, 2016 by frankdenneman

Recently I’ve been analyzing traffic to my site, and it appears that a lot of CPU and memory articles are still very popular. Even my first article about NUMA, published in February 2010, is still in high demand. And although you see a lot of talk about the upper levels and overlay technology today, the focus on proper host design and management remains. After all, it’s the correct selection and configuration of these physical components that produces a consistently high-performing platform. And it’s this platform that lays the foundation for the higher services and increased consolidation ratios.

Most of my NUMA content published throughout the years still applies to the modern datacenter, yet I believe the content should be refreshed and expanded with the advancements made in the software and hardware layers since 2009.

To avoid ambiguity, this deep dive is geared towards configuring and deploying dual-socket systems using recent Intel Xeon server processors. After analyzing a dataset of more than 25,000 ESXi host configurations collected from virtual datacenters worldwide, we discovered that more than 80% of ESXi host configurations are dual-socket systems. Today, according to IDC, Intel controls 99 percent of the server chip market.
Despite the strong focus of this series on the Xeon E5 processor in a dual-socket setup, the VMkernel and VM content is applicable to systems running AMD processors or to systems with more than two sockets. No additional research was done on AMD hardware configurations or on the performance impact of high-density CPU configurations.

The 2016 NUMA Deep Dive Series

The 2016 NUMA Deep Dive Series consists of 7 parts, split into three main categories: Physical, VMkernel, and Virtual Machine.

Part 1: From UMA to NUMA
Part 1 covers the history of multi-processor system design and clarifies why modern NUMA systems can no longer behave as UMA systems.

Part 2: System Architecture
The system architecture part covers the Intel Xeon microarchitecture and zooms in on the Uncore, primarily focusing on Uncore frequency management and QPI design decisions.

Part 3: Cache Coherency
The unsung hero of today’s NUMA architecture. Part 3 zooms in on cache coherency protocols and the importance of selecting the proper snoop mode.

Part 4: Local Memory Optimization
Memory density impacts the overall performance of the NUMA system; part 4 dives into the intricacies of channel balancing and DIMM-per-channel configuration.

Part 5: ESXi VMkernel NUMA Constructs
The VMkernel has to distribute the virtual machines to provide the best performance. This part explores the NUMA constructs that are subject to initial placement and load-balancing operations.

Part 6: NUMA Initial Placement and Load Balancing Operations
Building on the constructs described in part 5, this part explores the NUMA initial placement and load-balancing operations. (not yet released)

Part 7: From NUMA to UMA
The world of IT moves in loops of iteration. In the last 15 years we moved from UMA to NUMA systems; with today’s focus on latency and the looming licensing pressure, some forward-thinking architects are looking into creating high-performing UMA systems. (not yet released)

The articles will be published on a daily basis to avoid saturation. Similar to other deep dives, the articles are lengthy and contain lots of detail. Up next is Part 1: From UMA to NUMA.

Filed Under: NUMA, VMware

vSphere 6.x host resource deep dive session (8430) accepted for VMworld US and Europe

June 16, 2016 by frankdenneman

Yesterday both Niels and I received the congratulatory message from the VMworld team, informing us that our session has been accepted for both VMworld US and Europe. We are both very excited that our session was selected and we are looking forward to presenting to the VMworld audience. Our session is called the vSphere 6.x host resource deep dive (session ID 8430) and is an abstract of our similarly titled book (the publish date will be disclosed soon).
Session Outline
Today’s focus is on the upper levels/overlays (SDDC stack, NSX, Cloud), but proper host design and management remains the foundation of success. With the introduction of these new ‘overlay’ services, we are presented with a new consumer of host resources. Ironically, it’s the attention to these abstraction layers that returns us to focusing on individual host components. Correct selection and configuration of these physical components creates a stable, high-performing platform that lays the foundation for the higher services and increased consolidation ratios.
Topics we will address in this presentation are:
The introduction of NUMA (Non-Uniform Memory Access) required changes in memory management. Host physical memory is now split into local and remote memory structures for CPUs, which can impact virtual machine performance. We will discuss how to right-size your VM’s CPU and memory configuration with regard to NUMA and vNUMA VMkernel CPU scheduler characteristics. Processor speed and core counts are important factors when designing a new server platform. However, on virtualization platforms the memory subsystem can have an equal or sometimes even greater impact on application performance than processor speed.
In this talk we focus on physical memory configurations. Providing consistent performance is key to predictable application behavior. It benefits day-to-day customer satisfaction and helps reduce application performance troubleshooting. This talk covers flash architecture and highlights the differences between the predominant types of local storage technologies. We also look closer at recurring questions about virtual networking. For example, how many resources does the VMkernel claim for networking, and what impact does a vNIC type have on resource consumption? Such info allows you to get a better grip on sizing your virtual datacenter for NFV workloads.
Key Takeaway 1:
Identifying how proper NUMA and physical memory configuration allows for increased VM performance
Key Takeaway 2:
Understanding the impact of virtual network services on the consumption of host compute resources.
Key Takeaway 3:
How next-gen storage components lead to lower latency, higher bandwidth and increased scalability.
Key dates:
VMworld US takes place at Mandalay Bay Hotel & Convention Center in Las Vegas, NV from August 28 – September 1, 2016
VMworld Europe takes place at Fira Barcelona Gran Via in Barcelona, Spain from 17 – 20 October, 2016
Repeat the feat
Five years ago Duncan and I got the room completely full with our vSphere Clustering Deepdive Q&A. I would love to repeat that feat with a host deep dive session. I hope to see you all in our session!

Filed Under: VMware

Home Lab Fundamentals: DNS Reverse Lookup Zones

June 13, 2016 by frankdenneman

When starting your home lab, all hints and tips are welcome. The community is full of wisdom, yet sometimes certain topics are taken for granted or perceived as common knowledge. The Home Lab Fundamentals series focuses on these subjects, helping you avoid common pitfalls that cause headaches and waste incredible amounts of time.
One thing we keep learning about vSphere is that both time and DNS need to be correct. DNS resolution is important to many vSphere components. You can go a long way without DNS and use IP addresses within your lab, but at some point you will experience weird behavior, or installs will just stop without any clear explanation. In reality, vSphere is built for professional environments where a proper networking structure, physical and logical, is expected to be in place. Reviewing a lot of community questions, blog posts and tweets, it appears that DNS is often only partially set up, i.e. only forward lookup zones are configured. And although that appears to be “just enough DNS” to get things going, many have experienced that their labs start to behave differently when no reverse lookup zones are present. Time-outs or delays become more frequent, and the whole environment isn’t snappy anymore. Ill-configured DNS might give you the idea that the software is crap, but in reality it’s the environment that is configured crappily. When using DNS, follow the four golden rules: forward, reverse, short and full. DNS in a lab environment isn’t difficult to set up, and if you want to simulate a properly working vSphere environment, invest the time in setting up a DNS structure. It’s worth it! Besides expanding your knowledge, your systems will feel more robust and, believe me, you will wait a lot less on systems to respond.
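A quick way to apply the four golden rules is an nslookup round-trip from any machine in the lab. A minimal sketch, assuming a hypothetical lab domain homelab.com with the VCSA at 192.168.1.10 (substitute your own names and addresses):

nslookup vcsa.homelab.com   # full: the FQDN resolves to the IP address
nslookup vcsa               # short: relies on the DNS search domain
nslookup 192.168.1.10       # reverse: the IP resolves back to the FQDN via a PTR record

If all three queries return consistent answers, forward, reverse, short and full are covered.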
vCenter and DNS
vCenter inventory and search rely heavily on DNS. And since the introduction of the vCenter Single Sign-On service (SSO) as part of the vCenter Server management infrastructure, DNS has become a crucial element. SSO is an authentication broker and security token exchange infrastructure. As described in the KB article Upgrading to vCenter Server 5.5 best practices (2053132):

With vCenter Single Sign-On, local operating system users become far less important than the users in a directory service such as Active Directory. As a result, it is not always possible, or even desirable, to keep local operating system users as authenticated users.

This means that you are somewhat pressured into using an ‘external’ identity source for user authentication, even in your lab environment. One of the most popular configurations is the use of Active Directory as an identity source. Active Directory itself uses DNS as the location mechanism for domain controllers and services. If you have configured SSO to use Microsoft Active Directory for authentication, you might have seen some weird behavior when you haven’t created a reverse DNS lookup zone.

Installation of vCenter Server (Appliance) fails if the FQDN and IP addresses used are not resolvable by the DNS server specified during the deployment process. The vSphere DNS requirements in the vSphere 6.0 Documentation Center state the following:

Ensure that DNS reverse lookup returns a Fully Qualified Domain Name (FQDN) when queried with the IP address of the host machine on which vCenter Server is installed. When you install or upgrade vCenter Server, the installation or upgrade of the Web server component that supports the vSphere Web Client fails if the installer cannot look up the fully qualified domain name of the vCenter Server host machine from its IP address. Reverse lookup is implemented using PTR records.

Before deploying vCenter, I recommend deploying a virtual machine running a DNS server on the first host. The ESXi Embedded Host Client allows you to deploy a virtual machine on an ESXi host without needing an operational vCenter first. As I use Active Directory as the identity source for authentication, I deploy a Windows AD server with DNS before deploying the vCenter Server Appliance (VCSA). Tom’s IT Pro has a great article on how to configure DNS on a Windows 2012 server, but if you want to configure a lightweight DNS server running on Linux, follow the steps Brandon Lee has documented. If you want to explore the interesting world of DNS, you can also opt to use Dynamic DNS to automatically register both the VCSA and the ESXi hosts in the DNS server. Dynamic DNS registration is the process by which a DHCP client registers its DNS record with a name server. For more information, check out William Lam’s article “Does ESXi Support DDNS (Dynamic DNS)?”. Although he published it in 2013, it’s still a valid configuration in ESXi 6.0.
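On a Windows DNS server, the complete structure (forward zone, reverse zone and paired A/PTR records) takes just a few lines of PowerShell. A minimal sketch using the DnsServer module that ships with Windows Server 2012 and later; the zone, subnet and host names are hypothetical lab values:

# Forward lookup zone for the lab domain
Add-DnsServerPrimaryZone -Name "homelab.com" -ReplicationScope "Domain"
# Reverse lookup zone for the lab subnet (creates 1.168.192.in-addr.arpa)
Add-DnsServerPrimaryZone -NetworkId "192.168.1.0/24" -ReplicationScope "Domain"
# A records with matching PTR records created in one go
Add-DnsServerResourceRecordA -Name "vcsa" -ZoneName "homelab.com" -IPv4Address "192.168.1.10" -CreatePtr
Add-DnsServerResourceRecordA -Name "esxi01" -ZoneName "homelab.com" -IPv4Address "192.168.1.11" -CreatePtr

Note that -CreatePtr only works when the reverse lookup zone already exists, which is another reason to create both zones before registering any records.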
Flexibility of using DNS
Interestingly enough, having a proper DNS structure in place before deploying the virtual infrastructure provides future flexibility. One of the more annoying time wasters is the result of using an IP address instead of an FQDN during setup of the VCSA. When you use only an IP address instead of a Fully Qualified Domain Name (FQDN) during setup, changing the hostname or IP address will produce this error:
IPv4 configuration for nic0 of this node cannot be edited post deployment.
KB article 2124422 states the following:

Attempting to change the IP address of the VMware vCenter Server Appliance 6.0 fails with the error: IPv4 configuration for nic0 of this node cannot be edited post deployment. (2124422)

This occurs when the VMware vCenter Server Appliance 6.0 is deployed using an IP address. During the initial configuration of the VMware vCenter Server Appliance, the system name is used as the Primary Network Identifier. If the Primary Network Identifier is an IP address, it cannot be changed after deployment.
This is an expected behavior of the VMware vCenter Server Appliance 6.0. To change the IP address for the VMware vCenter Server Appliance 6.0 that was deployed using an IP address, not a Fully Qualified Domain Name, you must redeploy the appliance with the new IP address information.

Changing the hostname will cause the Platform Services Controller (responsible for SSO) to fail. According to the KB article:

Changing the IP address or host name of the vCenter Server or Platform Service controller cause services to fail (2130599)

Changing the Primary Network Identifier (PNID) of the vCenter Server or PSC is currently not supported and will cause the vSphere services to fail to start. If the vCenter Server or PSC has been deployed with an FQDN or IP as the PNID, you will not be able to change this configuration.
To resolve this issue, use one of these options:

  • Revert to a snapshot or backup prior to the IP address or hostname change.
  • Redeploy the vSphere environment.

This means that you cannot change the IP address or the hostname of the vCenter Appliance. Yet another reason to deploy a proper DNS structure before deploying the VCSA in your lab.
FQDN and vCenter permissions
Even when you have managed to install vCenter without a reverse lookup zone, the absence of DNS pointer records can obstruct proper permission configuration, according to KB article 2127213:

Unable to add Active Directory users or groups to vCenter Server Appliance or vRealize Automation permissions 

Attempting to browse and add users to the vCenter Server permissions (Local Permission: Hosts and Clusters > vCenter >Manage >Permissions)(Global Permissions: Administration > Global Permissions) fails with the error:

Cannot load the users for the selected domain

A workaround for this issue is to ensure that all DNS servers have the Reverse Lookup Zone configured as well as Active Directory Domain Controller (AD DC) Pointer (PTR) records present. Please note that allowing domain authentication (assuming AD) on the ESXi host does not automatically add it to an AD managed DNS zone. You’ll need to manually create the forward lookup (which will give the option for the reverse lookup creation too).
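If the domain controller’s PTR record is missing, you can add it by hand. A sketch with the DnsServer PowerShell module, reusing the hypothetical lab zone from earlier and a domain controller at 192.168.1.5:

# PTR record for the domain controller in the reverse zone
Add-DnsServerResourceRecordPtr -Name "5" -ZoneName "1.168.192.in-addr.arpa" -PtrDomainName "dc01.homelab.com"
# Verify the reverse lookup
nslookup 192.168.1.5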
SSH session password delay
When running multiple hosts, most of you will recognize the waste of time when (quickly) wanting to log into ESXi via an SSH session. Typically this happens when you start a test and want to monitor ESXTOP output. You start your SSH session, and to save time you type ssh root@esxi.homelab.com on the command line, and then you have to wait more than 30 seconds to get a password prompt. Especially funny when you are chasing a VM and DRS decided to move it to another server while you weren’t paying attention. To get rid of this annoying time waster forever:

DNS name resolution using nslookup takes up to 40 seconds on an ESXi host (KB article 2070192)

When you do not have a reverse lookup zone configured, you may experience a delay of several seconds when logging in to hosts via SSH.

When your management machine is not using the same DNS structure, you can apply the quick hack of adding “useDNS no” to the /etc/ssh/sshd_config file on the ESXi host to avoid the 30-second password delay.
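A minimal sketch of that hack from the ESXi shell (the OpenSSH option is spelled UseDNS; the init script path below is valid on ESXi 5.x/6.x):

# Append the setting to the SSH daemon configuration
echo "UseDNS no" >> /etc/ssh/sshd_config
# Restart the SSH service so the change takes effect
/etc/init.d/SSH restart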
Troubleshoot DNS
BuildVirtual.net published an excellent article on how to troubleshoot ESXi host DNS and routing related issues. For more information about setting the DNS configuration from the command line, review this section of the VMware vSphere 6.0 Documentation Center; a few of the relevant esxcli commands are shown below.
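To review or correct the DNS client configuration of a host straight from the ESXi shell, esxcli provides a dedicated namespace. A short sketch; the server and domain values are hypothetical lab values:

# List the DNS servers the host currently uses
esxcli network ip dns server list
# Add a DNS server and a search domain
esxcli network ip dns server add --server=192.168.1.5
esxcli network ip dns search add --domain=homelab.com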
vSphere components moving away from DNS
As DNS is an extra dependency, a lot of newer technologies try to avoid incorporating DNS dependencies. One of those is VMware HA. HA has been redesigned, and the new FDM architecture avoids DNS dependencies. Unfortunately, not all official VMware documentation has been updated with this notion: https://kb.vmware.com/kb/1003735 states that ESX 5.x also has this problem, but that is not true. Simply put, VMware HA in vSphere 5.x and above does not depend on DNS for operations or configuration.
Home Lab Fundamentals Series:

  • Time Sync
  • DNS Reverse Lookup Zones

Up next in this series: vSwitch0 routing

Filed Under: Home Lab, VMware

Home Lab Fundamentals: Time Sync

June 3, 2016 by frankdenneman

First rule of Home Lab club: don’t talk about time sync! Or so it seems. When starting your home lab, all hints and tips are welcome. The community is full of wisdom, however sometimes certain topics are taken for granted or perceived as common knowledge. The Home Lab Fundamentals series focuses on these subjects, helping you avoid the most common pitfalls that cause headaches and waste incredible amounts of time. A ‘time-consuming’ pitfall is dealing with improper time synchronization between the various components in your lab environment.
Most often, the need for time synchronization is seen as an enterprise requirement and not really necessary for lab environments, maybe because most think time synchronization is solely needed for troubleshooting purposes. In some cases this is true, as ensuring correct time notation allows for proper correlation of events. Interestingly enough, this alone should be enough reason to maintain synchronized clocks throughout your lab, but most home labs are just rebuilt when troubleshooting becomes too time-consuming. However, time sync does much more than expedite troubleshooting, and ignoring time drift is a straight path into the rabbit hole. Time synchronization utilities such as NTP are necessary to correct the drift introduced by hardware clocks and by guest operating system timekeeping imprecision. When time differs too much between systems, it can lead to installation and authentication errors. Unfortunately, time issues are not always easily identifiable. To provide a great example:

“[400] An error occurred while sending an authentication request to the vCenter Single Sign-On server – An error occurred when processing the metadata during vCenter Single Sign-On setup – null.”

This particular issue occurs due to a time skew between the vCenter Server Appliance 6.0 and the external Platform Services Controller. Here are just a few other examples of what can go wrong in your lab due to time skew issues:

  • Adding a host in vCenter Server fails with the error: Failed to configure the VIM account on the host (1029863): Time skew between the ESXi host hardware clock and vCenter Server system time. https://kb.vmware.com/kb/1029863
  • After joining the Virtual Center Server Appliance to a domain you cannot see the domain when adding user permissions (2011965): This issue occurs when the time skew between the Virtual Center Server Appliance (VCSA) and a related Domain Controller is greater than 5 minutes. https://kb.vmware.com/kb/2011965
  • Cluster level performance graphs show the most recent value as 0: This metric is susceptible to clock skew between the vSphere Client, vCenter Server, and ESX hosts. If any of the hosts have a skewed clock, the entire cluster shows as 0. https://kb.vmware.com/kb/2009550
  • The vCenter Server Appliance installation fails when connecting to an External Platform Services Controller: This issue occurs when the system time on the system hosting the PSC does not match the time of the system where vCenter Server is installed. https://kb.vmware.com/kb/2128811
  • Configuring the NSX SSO Lookup Service fails (2102041): Connectivity issues between the NSX Manager to vCenter Server due to time skew between NSX Manager and vCenter Server. https://kb.vmware.com/kb/2102041
  • Authentication Errors are Caused by Unsynchronized Clocks: If there is too great a time difference between the KDC and a client requesting tickets, the KDC cannot determine whether the request is legitimate or a replay. Therefore, it is vital that the time on all of the computers on a network be synchronized in order for Kerberos authentication to function properly. https://technet.microsoft.com/en-us/library/cc780011(v=ws.10).aspx

Timekeeping best practices by VMware
Simply put, when weird behavior occurs during setup or authentication, check the time between the various components first. VMware has released multiple Knowledge Base articles and technical documents that contain detailed information and instructions on timekeeping within the various components of the virtual datacenter:

  • Timekeeping in VMware Virtual Machines: http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf
  • Timekeeping best practices for Windows, including NTP (1318): https://kb.vmware.com/kb/1318
  • Timekeeping best practices for Linux guests (1006427): https://kb.vmware.com/kb/1006427
  • ESX and ESXi host timekeeping best practices: https://kb.vmware.com/kb/2004453

VMware doesn’t provide a separate timekeeping best practices document for vCenter, but provides multiple guidelines in the vCenter Server Appliance configuration guide. When installing vCenter on a Windows machine, it’s recommended to sync to the PDC emulator within the Active Directory domain. In general, VMware recommends using native time synchronization software, such as the Network Time Protocol (NTP), with the various vSphere components. NTP is typically more accurate than VMware Tools periodic time synchronization and is therefore preferred.
Time synchronization design
There are multiple schools of thought when it comes to time sync in a virtual datacenter. One of the most common is to synchronize the virtual datacenter infrastructure components, such as the ESXi hosts and the VCSA, to a collection of external NTP servers, typically provided by http://www.pool.ntp.org/en/ or the US Naval Observatory: http://tycho.usno.navy.mil/NTP/. Windows virtual machines sync their time to the Active Directory domain controller running the PDC emulator FSMO role (see Time Synchronization in Active Directory Forests: http://social.technet.microsoft.com/wiki/contents/articles/18573.time-synchronization-in-active-directory-forests.aspx). It’s recommended to point the ESXi hosts to the same time source as the PDC emulator of the Active Directory domain. For Linux machines, best practice is to sync these systems with an NTP server.
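Pointing every host in the lab to the same external time source is quickly scripted. A minimal PowerCLI sketch; the vCenter name is a hypothetical lab value:

# Configure the same NTP servers on all hosts, then start ntpd and set it to start with the host
Connect-VIServer vcsa.homelab.com
Get-VMHost | Add-VMHostNtpServer -NtpServer "0.pool.ntp.org","1.pool.ntp.org"
Get-VMHost | Get-VMHostService | Where-Object {$_.Key -eq "ntpd"} |
    ForEach-Object { Set-VMHostService $_ -Policy "On"; Start-VMHostService $_ }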
Another widely adopted design is to sync the ESXi servers to the Active Directory domain controller running the PDC emulator FSMO role. The VCSA timekeeping configuration provides two valid options: NTP and host. In this scenario, select the host option to ensure the time between the host and the VCSA is in sync. If the VCSA uses a different time source than the ESXi host, a race condition can occur between time sync operations, which can cause the vpxd service to fail.
Source: VMware vCenter Server 6.0 Update 1b Release Notes: http://pubs.vmware.com/Release_Notes/en/vsphere/60/vsphere-vcenter-server-60u1b-release-notes.html
But the most interesting thing I witnessed, which can easily become a wild-goose chase, is VMware Tools time synchronization when the time on an ESXi host is incorrect. As described earlier, enabling VMware Tools time sync on virtual machines was a best practice for a long time. The shift towards native time synchronization software led VMware to disable the periodic time synchronization option by default. The keyword in the last sentence is PERIODIC. By default, VMware Tools synchronizes time with the host during the following events:

  • Resuming a suspended virtual machine
  • Migrating a virtual machine using vMotion
  • Taking a snapshot
  • Restoring a snapshot
  • Shrinking the virtual disk
  • Restarting the VMware tools service inside the VM
  • Rebooting the virtual machine

The time synchronization checkbox controls only whether time is periodically resynchronized while the virtual machine is running. Even if this box is unselected, by default VMware Tools synchronizes the virtual machine’s time to the ESXi host clock after the listed events. If the ESXi host time is incorrect, it is likely that “unexplainable” errors will occur. I experienced this behavior after migrating a VM with vMotion: I couldn’t log on to a Windows server as the time skew prevented me from authenticating.
You can either disable these events by adding settings to the VMX file of each VM, or just ensure that the ESXi host syncs its time with a proper external time source. For more information, see Disabling Time Synchronization (1189): https://kb.vmware.com/kb/1189.
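For reference, these are the settings KB 1189 lists for disabling the event-driven synchronization as well; add them to the VM’s .vmx file while the VM is powered off (a sketch, verify against the KB for your ESXi version):

tools.syncTime = "0"
time.synchronize.continue = "0"
time.synchronize.restore = "0"
time.synchronize.resume.disk = "0"
time.synchronize.shrink = "0"
time.synchronize.tools.startup = "0"

The first line corresponds to the periodic synchronization checkbox; the remaining lines cover the one-off events listed above.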
No time zone for ESXi
Be aware that as of vSphere 4.1, ESXi hosts are set to Coordinated Universal Time (UTC). UTC is interesting as it is the successor of Greenwich Mean Time (GMT), but UTC itself is not a time zone, it is a time standard. There are plenty of articles about UTC, but the key thing to understand is that it never observes Daylight Saving Time. As UTC is not a time zone, you cannot change the time notation in ESXi itself. The vSphere Client, Web Client and HTML5 client automatically display the time in your local time zone and take the UTC setting on the host into account. This isn’t bad behavior, just be aware of it so you don’t freak out when you check the time via the command line.
[Screenshot: date output on the ESXi command line]
[Screenshot: NTP settings in the vSphere Web Client]
CMOS clock
ESXi synchronizes its system time with the hardware clock (CMOS/BIOS/ACPI) of the server if the NTP service is not running on the ESXi host. SuperMicro boards allow for NTP synchronization, but most home lab motherboards just provide the time as configured in the BIOS. When the NTP daemon is started on the ESXi host, it synchronizes its system time to the external time source AND updates the hardware clock as well. I ran a test to verify this behavior. At the time of testing it was 12:37 (GMT+1 | UTC 10:37); I turned NTP off and set the time in the BIOS to 6:37 UTC. After booting the machine, the command esxcli system time get confirmed that the ESXi system time was retrieved from the hardware clock. After starting the NTP service, the system time was set to the correct time: 10:37. The command esxcli hardware clock get demonstrated that NTP also corrected the BIOS time, and a quick BIOS check confirmed that esxcli hardware clock get was displaying the BIOS configuration.
If your lab is not connected to the internet, confirm the BIOS time with the command esxcli hardware clock get and, if necessary, use the command esxcli hardware clock set -d (day) -H (hour) -m (minute) -M (month) -s (second) -y (year) to set the correct time.
Please note that ESXCLI reports time with the Z (Zulu) notation, this is the military name for UTC.
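For example, replicating the verification sequence above from the ESXi shell (the date values are hypothetical):

# Compare the ESXi system time with the hardware clock
esxcli system time get
esxcli hardware clock get
# Manually set the hardware clock, here to June 3, 2016, 10:37:00 UTC
esxcli hardware clock set -y 2016 -M 6 -d 3 -H 10 -m 37 -s 0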
Raspberry Pi as a Stratum-1 NTP Server
When you have a home lab, you usually face the age-old dilemma of common sense versus ‘exciting new stuff that you might not need but would like to have’. You can update your CMOS clock manually or scripted, you can connect to an array of external NTP servers, or you can build your own Stratum-1 NTP server using a Raspberry Pi with a GPS add-on board.
Up next in this series: Home Lab Fundamentals: Reverse DNS

Filed Under: Home Lab, VMware

ntpq -p connection refused error message

May 30, 2016 by frankdenneman

Sometimes a small misconfiguration can cause havoc in a complex distributed system. It becomes really annoying when no proper output is provided by log files and status reports. While investigating time issues in my lab, I ran into the following error message while executing the ntpq -p command:
[Screenshot: ntpq -p returns a ‘connection refused’ error]
TL;DR
The NTP client is disabled; enable it via the GUI.
The standard NTP query program (ntpq) is one of the quickest ways to verify that the Network Time Protocol daemon (ntpd) is up and running. The command ntpq -p prints a list of peers known to the ESXi host as well as a summary of their state. Running the command on another ESXi host provided the following output:
[Screenshot: ntpq -p output on a healthy ESXi host, listing its NTP peers]
Requesting the NTPD status on the host with the weird time issues shows it’s not running. No proper feedback is provided on the command line other than that it’s starting; no failure code is returned.
[Screenshot: ntpd service status output]
Management service initialization events, such as ntpd starts, are logged in the file /var/log/syslog.log in ESXi 5.1 and up. Unfortunately, nothing useful is logged in this logfile either.
[Screenshot: /var/log/syslog.log output]
I couldn’t find a command that accurately reports whether the NTP client is enabled or not. Time to open up the Web Client. The host time configuration can be found by selecting the ESXi host, Manage, Time Configuration. Apparently, NTP was not enabled.
[Screenshot: Web Client Time Configuration showing the NTP client disabled]
A simple problem to fix; unfortunately, there is no simple command line function on the host itself that lets you verify whether the NTP client is enabled (sans PowerCLI, shown below).
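For completeness, the PowerCLI route looks like this. A minimal sketch; esxi01.homelab.com is a hypothetical host name:

# Check whether the ntpd service is enabled and running
Get-VMHostService -VMHost esxi01.homelab.com | Where-Object {$_.Key -eq "ntpd"}
# Enable the NTP client and start it
Get-VMHostService -VMHost esxi01.homelab.com | Where-Object {$_.Key -eq "ntpd"} |
    ForEach-Object { Set-VMHostService $_ -Policy "On"; Start-VMHostService $_ }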

Filed Under: VMware
