VCDX: You cannot abstract your way out of things indefinitely

The amount of abstraction in IT is amazing. Every level in the software and hardware stack attempts to abstract operations and details, and the industry is craving more. Look at the impact “All Things Software Defined” has on today’s datacenter. It touches almost every aspect, from design to operations. The user provides the bare minimum of inputs and the underlying structure automagically tunes itself to a working solution. Brilliant! However, sometimes I get the feeling that this level of abstraction becomes an excuse not to understand the underlying technology. As an architect you need to do your due diligence. You need to understand the wheels and cogs that are turning when you dial a specific knob at the abstracted layer.

But sometimes it seems that the abstraction level becomes the right to refuse to answer questions. This was always an interesting discussion during a VCDX defense session, when candidates argued that they weren’t aware of the details because other groups were responsible for that part of the design. I tend to disagree.

What level of abstraction is sufficient?
I am in the lucky position to work with PernixData R&D engineers and, before that, VMware R&D engineers. They tend to go deep, right down to the core of things, discussing every little step of a process. Does an architect need to understand the applied technology and solutions at that level? I don’t think so. It’s interesting to know, but on a day-to-day basis you don’t have to understand the function of ceiling when DRS calculates priority levels of recommendations. What is worth understanding is what happens if you place a virtual machine at the same hierarchical level as a resource pool filled with virtual machines. What is the impact on the service levels of these various entities?
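
To make the resource pool question concrete, here is a minimal sketch of how shares divide among siblings under contention. The share values are assumptions chosen for illustration (a pool at 4000 shares, a 2-vCPU virtual machine at 2000, pool members at 1000 each), and the model ignores reservations, limits and actual demand, so treat it as a back-of-the-envelope calculation and not as the way DRS computes entitlement.

```python
from fractions import Fraction


def sibling_entitlements(siblings):
    """Split a parent's resources among siblings in proportion to their shares."""
    total = sum(siblings.values())
    return {name: Fraction(shares, total) for name, shares in siblings.items()}


# Assumed share values: one resource pool and one virtual machine as siblings.
root_level = {"resource_pool": 4000, "standalone_vm": 2000}
top = sibling_entitlements(root_level)

# Eight virtual machines inside the pool, each with an assumed 1000 shares.
pool_members = {f"vm{i}": 1000 for i in range(1, 9)}
inside = sibling_entitlements(pool_members)

# A pool member receives its slice of the pool's slice.
per_pool_vm = top["resource_pool"] * inside["vm1"]

print("standalone VM:", top["standalone_vm"])  # 1/3 of the cluster
print("each pool VM: ", per_pool_vm)           # 1/12 of the cluster
```

Under these assumptions the single virtual machine next to the pool is entitled to four times what each virtual machine inside the pool gets, which is exactly the kind of service-level impact an architect should be able to explain.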

Something in the middle might be the NFS series by Josh Odgers. Josh goes in-depth about the technology involved in using NFS datastores. Virtual SCSI hard drives are presented to virtual machines, even when ESXi is connected to an NFS datastore. How does this impact the integrity of I/Os? How does the SCSI protocol emulation process affect the write ordering of I/Os of business-critical applications? You, as the virtual datacenter architect, should be able to discuss the impact of using this technology with application owners. You should understand the potential impact a selected technology has on the various levels throughout the stack and what impact it has on the service it provides.

Recently I published a series on databases and what impact their workload characteristics have on storage architecture design. Understanding the position of a solution in the business process allows an architect to design a suitable solution. Let’s use the OLTP example. Typically OLTP databases are at the front of the process, the customer-facing part; dramatically put, they are in the line of fire. When the OLTP database is performing slowly or is unavailable, it will typically impact revenue-generating processes. This means that latency is a priority, but so are concurrency and availability. You can then tailor your design to provide the best services to this application. This is just a simplified example, but it shows that you have to understand multiple aspects of the technology, not just the behavior of a single component. The idea is to get a holistic view and then design your environment to cater to the needs of the business, because that’s why we get hired.

Circling back to abstraction and the power of software defined, I thought the post from Bart Heungens was interesting. Bart argues that Software Defined Storage is not the panacea for all storage-related challenges, which is true. Bart illustrates an architecture that is comprised of heterogeneous components. In his example, he shows what happens when you combine two HP DL380 servers from different generations. The generational differences are primarily noticeable from a storage controller perspective, and especially in the way the software behaves. This is interesting on so many levels, and it would be a very interesting discussion if this were a VCDX defense session.

SDS abstracts many things, but it still relies on the underlying structure to provide the services. From a VCDX defense perspective, Bart has a constraint: the already available hardware and the requirement to use hardware of different generations in his design. VCDX is not about providing the ideal design, but about showing how you deal with constraints and requirements, and demonstrating your expertise in the technology and how it impacts the requested solution. He didn’t solve the problem entirely, but by digging in deeper he managed to squeeze out performance to provide a better architecture to service the customer’s applications. He states the following:

Conclusion: the design and the components of the solution is very important to make this SDS based SAN a success. I hear often companies and people telling that hardware is more and more commodity and so not important in the Software Defined Datacenter, well I am not convinced at all.
I like the idea of VMware that states that, to enable VSAN, you need and SAS and SSD storage (HCL is quite restricted), just to be sure that they can guarantee performance. The HP VSA however is much more open and has lower requirements, however do not start complaining that your SAN is slow. Because you should understand this is not the fault of the VSA but from your hardware.

So be cognizant of the fact that while you are not responsible for every decision being made when creating an architecture for a virtual datacenter, you should be able to understand the impact various components, software settings and business requirements have on your part of the design. We are moving faster and faster towards abstracting everything. However, this abstraction process does not exonerate you from understanding the potential impact it has on your area of responsibility.

VCDX Defence: Are you planning to use a fictitious design?

This week the following tweet caught my eye:

Apparently Marc Brunstad (VCDX program manager) stated this fact during the PEX VCDX workshop. But what does this stat mean, and to what extent do you need to take it into account when submitting your own design?

During my days as a panel member, I’ve seen only a handful of fictitious designs, and although they were technically sound, the reasoning and defense were usually not that strong. Be aware that the VCDX program was not created to find the best design ever. It determines whether the candidate has aligned the technical functionality with the customers’ requirements, the constraints provided by the environment and the assumptions the team made about, for example, future workloads or organizational growth.

But does that mean that you shouldn’t use any fictitious element in your design? Are fictitious elements inherently bad? I don’t think so. Speaking from my own experience, I made some adjustments to the design I submitted.

My submitted design was largely based on the environment that I worked on for a couple of years. At that time the customer used rack-based systems, while my design contained a blade architecture. I changed this because it allowed me to demonstrate my knowledge of the HA stack featured in vSphere 4.1. Some might argue that I deliberately made my design more complex, but I was comfortable enough to defend my choices and explain High Availability Primary and Secondary node interaction and how to mitigate risk.

Moreover, it allowed me to demonstrate the pros and cons of such a design on various levels, such as the impact it had on operational processes, the influence on scalability and the alignment of availability policies to org-defined failure domains. Did I have these discussions in real life? Yes, with many other customers, just not with the specific customer that this design was based on.

And that’s why completely fictitious designs fail and why most of their reasoning is incomplete. The candidate only focused on the alignment of technical specs and workload, not on the “softer” side of things.

Arguing that this design element was just the wish of a customer doesn’t cut it. Sure, we have all met customers that were set on having a particular setting configured in the way they saw fit, but it’s your responsibility to explain to the panel which steps you took to inform the customer about the risk and potential impact that setting had. Try to explain which setting you would have used and why. Demonstrate your knowledge about feasible alternatives.

My recommendation to future candidates: when incorporating a specific fictitious design element in your design, make sure you have had a conversation with a customer about that element at least once. You can easily align this with the main design, and it helps you recollect the specifics during your defense.

VCDX defend clinic: Choosing between Multi-NIC vMotion and LBT

A new round of VCDX defenses will kick off soon and I want to wish everyone who participates in a panel session good luck. Usually when VCDX panels are near, I receive questions on how to prepare for a panel. One recommendation I usually provide is:

“Know why you used a specific configuration of a feature and especially know why you haven’t used the available alternatives”.

Let’s have some fun with this and go through a “defend clinic”. The point of this clinic is to provide you with an exercise model that you can use for any configuration, not only for a vMotion configuration. It helps you understand the relationship between the pieces of information you provide throughout your documentation set, and helps you explain how you worked through every decision to arrive at this design.

To give you some background: when a panel member is provided with the participant’s documentation set, he enters a game of connecting the dots. This set of documents is his only view into your world while you were creating the design and dealing with your customer. He needs to take your design and compare it to the requirements of the customer, the uncertainties you dealt with in the form of assumptions, and the constraints that were given. Reviewing the design on technical accuracy is only a small portion of the process. That’s basically just checking to see if you are using your tools and materials correctly. The remaining part is to understand whether you built the house to the specification of the customer while dealing with regional laws and the available space and layout of the land. Building a 90,000 square foot single-floor villa might provide you with the most easily accessible space, but if you want to build that thing in downtown Manhattan you’re gonna have a bad time. ;)

Structure of the article
This exercise lists the design goals and their influencers: requirements, constraints and assumptions. The normal printed text is the architect’s (technical) argument, while the paragraphs displayed in italics can be seen as questions or thoughts of a reviewer/panel member.

Is this a blueprint on how to beat the panel? No! It’s just an exhibition on how to connect and correlate certain statements made in various documents. Now let’s have some fun exploring and connecting the dots in this exercise.

Design goal and influencers
Your design needs to contain a vMotion network as the customer wants to leverage DRS load balancing, maintenance mode and overall enjoy the fantastic ability of VM mobility. How will you design your vMotion network?

In your application form you have stated that the customer wants to see a design that reduces complexity, increases scalability and provides the best performance possible. The financial budget and the number of IP addresses are constraints, and the level of expertise of the virtualization management team is an assumption.

Listing the technical requirements
Since you are planning to use vSphere 5.x, you have the choice to create a traditional single vMotion-enabled VMKnic, a Multi-NIC vMotion setup, or a vMotion configuration that uses the “Route based on physical NIC load” load balancing algorithm (commonly known as LBT) to distribute vMotion traffic amongst multiple active NICs. As the customer prefers not to use link aggregation, IP-hash based / EtherChannel configurations are not valid.

First let’s review the newer vMotion configurations and how they differ from the traditional vMotion configuration, where you have one single VMKnic with a single IP address, connected to a single Portgroup which is configured to use an active and a standby NIC.

Multi-NIC vMotion
• Multiple VMKnics required
• Multiple IP-addresses required
• Consistent configuration of NIC failover order required
• Multiple physical NICs required

Route based on physical NIC load
• Distributed vSwitch required
• Multiple physical NICs required
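
The two lists above can be captured as data and filtered against the constraints from the application form. The following is a minimal sketch, not a sizing tool: the class and field names are hypothetical, and the choice of two VMKnics for Multi-NIC vMotion is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VMotionOption:
    name: str
    vmknics_per_host: int            # each VMKnic consumes one IP address
    needs_distributed_switch: bool   # implies the matching license level
    needs_consistent_failover: bool  # NIC failover order must match per host


# Requirements taken from the lists above; two VMKnics is an assumed example.
OPTIONS = [
    VMotionOption("Traditional single VMKnic", 1, False, False),
    VMotionOption("Multi-NIC vMotion", 2, False, True),
    VMotionOption("Route based on physical NIC load (LBT)", 1, True, False),
]


def feasible(options, distributed_switch_licensed, ips_per_host):
    """Return the names of the options that fit the stated constraints."""
    return [
        o.name
        for o in options
        if (distributed_switch_licensed or not o.needs_distributed_switch)
        and o.vmknics_per_host <= ips_per_host
    ]


# No distributed switch license and one vMotion IP address per host:
print(feasible(OPTIONS, distributed_switch_licensed=False, ips_per_host=1))
```

Writing the options down this way makes it obvious which constraint eliminates which configuration, which is the reasoning a panel member will look for in your documents.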

It goes without saying that you want to provide the best performance possible, which leads you to consider using multiple NICs to increase bandwidth. But which one will be better? A simple performance test will determine that.

VCDX application form: Requirements
In your application document you stated that one of the customer requirements was “Reducing complexity”. Which of the two configurations do you choose now, what are your arguments? How do you balance or prioritize performance over complexity reduction?

If Multi-NIC vMotion beats LBT configuration in performance, leading to faster maintenance mode operations, better DRS load balance operations and overall reduction in lead time of a manual vMotion process, would you still choose the simpler configuration over the complex one?

Simplicity is LBT’s forte: just enable vMotion on a VMKnic, add multiple uplinks, set them to active and you’re good to go. Multi-NIC vMotion consists of more intricate steps to get a proper configuration up and running. Multiple vMotion-enabled VMKnics are necessary, each with its own IP address configuration. Secondly, vMotion requires deterministic path control, meaning that it wants to know which path it selects to send traffic across.

As the vMotion load balancing process is higher up in the stack, NIC failover orders are transparent to vMotion. It selects a VMKnic and assumes it represents a different physical path than the other available VMKnics. That means it’s up to the administrator to provide these unique and deterministic paths.

Are they capable of doing this? You mentioned the level of expertise of the admin team as an assumption. How do you guarantee that they can execute this design, properly manage it for a long period and expand the design without the use of external resources?

Automation to the rescue
Complexity of technology by itself should not pose a problem; it’s how you (are required to) interact with it that can lead to challenges. As mentioned before, Multi-NIC vMotion requires multiple IP addresses to function. On a side note, this could put pressure on the IP ranges, as all vMotion-enabled VMKnics inside the cluster are required to be part of the same network. Unfortunately routed vMotion is not supported yet. Every vMotion VMKnic needs to be configured properly. Pair this with availability requirements, and the active and standby NIC configuration of each VMKnic can cause headaches if you want to have a consistent and identical network configuration across the cluster. PowerCLI and Host Profiles can help tremendously in this area.
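
As an illustration of the kind of check such automation could perform, here is a minimal sketch that validates a Multi-NIC vMotion layout. It works on a plain dictionary that you would have to export yourself; the data structure, host names and vmnic names are all hypothetical, and no real PowerCLI or vSphere API calls are used.

```python
def check_cluster(cluster):
    """Return a list of problems found in the vMotion VMKnic configuration.

    cluster maps host name -> {vmknic name -> {"active": nic, "standby": nic}}.
    """
    problems = []
    reference = None
    for host, vmknics in sorted(cluster.items()):
        # Each vMotion VMKnic should own a unique active uplink (unique path).
        actives = [cfg["active"] for cfg in vmknics.values()]
        if len(set(actives)) != len(actives):
            problems.append(f"{host}: vMotion VMKnics share an active uplink")
        # Every host should match the first host (consistent configuration).
        if reference is None:
            reference = vmknics
        elif vmknics != reference:
            problems.append(f"{host}: configuration differs from the first host")
    return problems


good = {
    "vmk1": {"active": "vmnic2", "standby": "vmnic3"},
    "vmk2": {"active": "vmnic3", "standby": "vmnic2"},
}
bad = {
    "vmk1": {"active": "vmnic2", "standby": "vmnic3"},
    "vmk2": {"active": "vmnic2", "standby": "vmnic3"},  # same active uplink
}
cluster = {"esx01": good, "esx02": dict(good), "esx03": bad}

problems = check_cluster(cluster)
for line in problems:
    print(line)
```

A check like this is easy to run after every host addition, which speaks directly to the question of whether the admin team can expand the design on its own.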

Supporting documents
Now have you included these scripts in your documentation? Have you covered the installation steps on how to configure vMotion on a distributed switch? Make sure that these elements are included in your supporting documents!

What about the constraints and limitations?

Licensing
Unfortunately LBT is only available in distributed vSwitches, resulting in a top-tier licensing requirement if LBT is selected. The LBT configuration might be preferred over the Multi-NIC vMotion configuration because it adds the least complexity compared to the traditional configuration.

How does this intersect with the listed budget constraint if the customer is not able, or willing, to invest in enterprise licenses?

IPv4 pressure
One of the listed constraints in the application form is the limited number of IP addresses in the available IP range destined for the virtual infrastructure. This could impact your decision on which configuration to select. Would you “sacrifice” IP addresses to get better vMotion performance and all the related improvements in the dependent features, or are scalability and future expansion of your cluster more important? Remember, scalability is also listed in the application form as a requirement.
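
The trade-off is simple arithmetic, sketched below. The subnet is a documentation range picked purely as an example, and the calculation assumes the whole range is available for vMotion VMKnics, which will rarely be true in practice.

```python
import ipaddress


def vmotion_ips_needed(hosts, vmknics_per_host):
    """Every vMotion-enabled VMKnic consumes one IP address."""
    return hosts * vmknics_per_host


def max_hosts(subnet, vmknics_per_host):
    """Largest cluster that fits if the whole subnet is used for vMotion."""
    network = ipaddress.ip_network(subnet)
    usable = network.num_addresses - 2  # minus network and broadcast address
    return usable // vmknics_per_host


subnet = "192.0.2.0/26"  # example range: 62 usable addresses
print(max_hosts(subnet, 1))  # single VMKnic per host -> 62 hosts
print(max_hosts(subnet, 2))  # two VMKnics per host   -> 31 hosts
```

Doubling the VMKnics per host halves the number of hosts the same range can accommodate, so the performance gain has to be weighed against the scalability requirement.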

Try this at home!
These are just examples of questions that can be asked during a defense. Try to find these answers when preparing for your VCDX panel. When finalizing the document set, try to do this exercise. Even better, find a group of peers and review each other’s designs together with the application form and the supporting set of documents. At the Nordic VMUG, Duncan and I spoke with a group of people who are setting up a VCDX study group. I think this is a great way not only to prepare for a VCDX panel, but also to learn and improve skills you can use in your daily profession.

VCDX tip: The application form

Last week I reviewed some recently submitted designs and it appears that the requirements stated in the application form are too ambiguous. During this year I’ve seen many application forms and the same errors are made by many candidates. Let’s go over the sections which contain the most errors and try to remove any doubts for future candidates.

The VMware VCDX Handbook and application form are subject to change, so note that this article is based on version 1.0.5. The application form is available to candidates enrolled in the VCDX program.

Section 4 Project References
What deliverables were provided? (This should represent a comprehensive design package and include, at a minimum, the design, blueprints, test plan, assembly and configuration guide, and operations guide.)

Ok so this requirement is not understood clearly by some. To meet this requirement you MUST submit at least:

1. the VMware VI 3.5 or vSphere design document.
2. blueprints (Visio drawings of physical and logical layout)
3. a documented test plan
4. an assembly and configuration guide
5. and an operations guide.

This means you are required to submit those five listed documents otherwise your application is rejected (bad) or returned for rework (Still bad, but it doesn’t cost you 300 bucks and you might have a chance to defend during the upcoming defense panels).

Section 5 Design Development Activities
This section requires you to submit five requirements, assumptions and constraints that had to be followed within this design.

This means you must submit at least five requirements, five assumptions and five constraints you encountered when working on the design. I’ve seen some application forms with requirements such as enough power, enough floor space and enough cables. These are all genuine requirements if you are a project manager. We are requesting a list of requirements, assumptions and constraints which you as a virtual infrastructure architect had to deal with. The submitted design needs to align and deal with the requirements and constraints listed in the application form.

Design Deliverable Documentation:
A small error made by many. It is no big deal if you miss this, but it makes our lives much easier if you do it correctly. This section requires you to list the page numbers where the diagrams can be found, not how many pages the document has.

Design Decisions
In this section you must provide four decision criteria for each of the decision areas, this means if you leave one field empty the application will be rejected.

It’s really simple: your application form is NOT complete when a field is empty. Incomplete forms get rejected.
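
If you want to self-check before submitting, the rules described in this article can be expressed as a small script. This is only a sketch: the dictionary layout and field names are hypothetical and do not reflect the actual structure of the VCDX application form.

```python
REQUIRED_DELIVERABLES = {
    "design document",
    "blueprints",
    "test plan",
    "assembly and configuration guide",
    "operations guide",
}
MIN_ITEMS = 5     # requirements, assumptions and constraints
MIN_CRITERIA = 4  # decision criteria per decision area


def form_problems(form):
    """Return a list of reasons why the form would count as incomplete."""
    problems = []
    missing = REQUIRED_DELIVERABLES - set(form.get("deliverables", []))
    if missing:
        problems.append("missing deliverables: " + ", ".join(sorted(missing)))
    for section in ("requirements", "assumptions", "constraints"):
        count = len(form.get(section, []))
        if count < MIN_ITEMS:
            problems.append(f"{section}: {count} of {MIN_ITEMS} listed")
    for area, criteria in form.get("design_decisions", {}).items():
        filled = [c for c in criteria if c and c.strip()]  # skip empty fields
        if len(filled) < MIN_CRITERIA:
            problems.append(f"{area}: {len(filled)} of {MIN_CRITERIA} criteria")
    return problems


example_form = {
    "deliverables": ["design document", "blueprints", "test plan"],
    "requirements": ["r1", "r2", "r3", "r4", "r5"],
    "assumptions": ["a1", "a2", "a3", "a4", "a5"],
    "constraints": ["c1", "c2", "c3"],
    "design_decisions": {"storage": ["d1", "d2", "d3", ""]},
}
for problem in form_problems(example_form):
    print(problem)
```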

Application form does not equal design document
The application form is not a substitute for the design document. It is a part of the VCDX certification program and not a part of the VMware virtual infrastructure design. The two are not complementary to each other. Everything stated in the application form must be included in the design document or in one of the other documents. Just remember you are submitting a design you have delivered to a real or imaginary customer! Ask yourself: have you ever submitted a VCDX application form to your customer during a design project?

VCDX tip: VMware Tools increases TimeOutValue

This is just a small heads-up post for all the VCDX candidates.

Almost every VCDX application I read mentions that the Disk TimeOutValue (HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Disk) needed to be increased to 60 seconds on Windows machines.

The truth is that the VMware Tools installation (ESX version 3.0.2 and up) will change this registry value automatically. You might want to check your operational procedures documentation and update this!
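
If you want to verify this on a guest before rewriting your procedures, a minimal sketch along the following lines could read the value. It uses Python’s standard winreg module, only works when run on the Windows machine itself, and the helper names are made up for this example.

```python
import sys

DISK_KEY = r"SYSTEM\CurrentControlSet\Services\Disk"
VALUE_NAME = "TimeOutValue"


def read_disk_timeout():
    """Return the Disk TimeOutValue in seconds, or None if unavailable."""
    if sys.platform != "win32":
        return None  # the registry only exists on Windows
    import winreg

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, DISK_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return int(value)
    except FileNotFoundError:
        return None  # key or value is not present


def timeout_needs_attention(value, expected=60):
    """True when the value is missing or lower than the expected seconds."""
    return value is None or value < expected


current = read_disk_timeout()
print("TimeOutValue:", current, "needs attention:", timeout_needs_attention(current))
```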

VMware KB 1014