frankdenneman.nl

Machine Learning on VMware Platform – Part 3 – Training versus Inference

June 30, 2022 by frankdenneman

Machine Learning on VMware Cloud Platform – Part 1 covered the three distinct phases: concept, training, and deployment. Part 2 explored the data streams, the infrastructure components needed, and how vSphere can help increase the resource utilization efficiency of ML platforms. In this part, I want to go a little deeper into the territory of training and inference workloads.

It would be best to consider the platform’s purpose when building an ML infrastructure. Are you building it for serving inference workloads, or are you building a training platform? Are there data science teams inside the organization that create and train the models themselves? Or will pre-trained models be acquired? Where will the trained (converged) model be deployed? Will it be in the data center, industrial sites, or retail locations?

From an IT architecture perspective, training and inference workloads differ in computational power and data stream requirements. One of the platform architect’s tasks is to create a platform that reduces the time to train. It’s the data scientist’s skill and knowledge that uses the platform’s technology to reduce that time even further without sacrificing accuracy.

This part will dive into the key differences between training and inference workloads and their requirements. It helps you get acquainted with terminology and concepts used by data scientists and apply that knowledge to your domain of expertise. Ultimately, this overview helps set the stage for presenting an overview of the technical solutions of the vSphere platform that accelerate machine learning workloads.

Types of machine learning algorithms

When reviewing popular machine learning outlets and podcasts, you typically only hear about training large models. For many, machine learning equals deep learning with giant models and massive networks that require endless training days. But in reality, that is not the case. We do not all work at the most prominent US bank. We do not all need to do real-time fleet and route management for worldwide shipping companies or calculate all possible trajectories of five simultaneously incoming tornadoes. The reality is that most companies work on simple models with simple algorithms. Simple models are easier to train. Simple models are easier to test. Simple models are not resource hogs, and above all, simple models are simpler to update and keep aligned with the rapidly changing world. As a result, not every company is deploying a deep-learning GPT-3 model or a massive ResNet to solve their business needs. They are looking at “simpler” machine learning algorithms or neural networks that can help increase revenue or decrease business costs without running on 400 GPUs.

In the following articles, I will cover neural networks, but if you are interested in understanding the basics of machine learning algorithms, I recommend looking at the following popular ones (a minimal scikit-learn sketch follows the list):

Support Vector Machines (SVM)

Decision trees

Logistic Regression

Random forest

k-means
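To make the scale of these “simpler” approaches concrete, here is a minimal sketch of my own (not from the original post) that trains two of the algorithms listed above with scikit-learn on a toy dataset; it runs in seconds on a laptop CPU, no GPU required.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy dataset and two of the "simpler" algorithms mentioned above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=100)):
    model.fit(X_train, y_train)                      # trains in seconds on a CPU
    print(type(model).__name__, model.score(X_test, y_test))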

Data Flow

Training produces a neural network model that generates a classification, detection, recommendation, or any other service with the highest level of accuracy. The golden rule for training is that the more data you can use, the higher the accuracy you achieve. That means the data scientist will unleash copious amounts of data on the system. Understanding the data flow and the components involved helps you design a platform that can significantly reduce training time.

Most neural networks are trained via the (offline) batch learning method, but the online training method is also used. With online training, the model is trained by feeding it smaller batches of data while it is active; it learns on the fly. Whether this is less resource-intensive than batch learning (often referred to as offline training) is debatable, as the model trains itself continuously. It needs to be monitored very carefully, as it can be sensitive to new data that can quickly influence the model. Specific stock price systems deploy ML models that use online training to respond to market trends quickly.

The inference service is about latency, for example, pedestrian identification in autonomous vehicles, packages flying across high-speed conveyor belts, product recommendations, or speech-to-text translations. In some cases, you simply cannot afford to wait on a response from the system. Most of these workloads process a single data sample or, at best, a small number of samples batched up. The data flow of inference is considered streaming in nature. As a result, the overall compute load of inference is much lower than that of the training workload.

                          Training          Inference
Data Flow                 Batch Data        Streaming Data

Data sets and Batches

During model training, models train with various data sets: training sets, validation sets, and testing sets. The training set helps the model recognize what it is supposed to learn. The validation dataset helps the data scientist understand the effect of tuning particular hyperparameters, such as the number of hidden layers or the network layer size. The third dataset is the testing set, which proves how well the trained neural network performs on unseen data before being put into production.
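As a minimal sketch of how these three sets are typically carved out of a single dataset (the 80/10/10 ratio and the random tensors below are my own illustration, not a recommendation from this post), PyTorch offers random_split:

import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset; a real project would load images, tabular data, etc.
dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 2, (10_000,)))

n_train = int(0.8 * len(dataset))                 # training set
n_val = int(0.1 * len(dataset))                   # validation set (hyperparameter tuning)
n_test = len(dataset) - n_train - n_val           # testing set (unseen data)

train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42),  # reproducible split
)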

A dataset provides the samples used for training. These data sets can be created from company data, acquired from third parties, or a combination of both; sometimes businesses acquire extra data on top of their own to get better insights into their customers. These datasets can be quite large. An example is a ResNet-50 model with the ImageNet-1K dataset. ResNet-50 is an image classification network, and the ImageNet-1K dataset contains 1.28 million images (155.84 GiB).

Even the latest NVIDIA GPU generations (Ampere and Hopper) offer GPU devices with at most 80 GB of memory, which cannot fit that dataset entirely. As a result, the dataset is split into smaller batches, a training technique called mini-batch gradient descent. Batch size plays a significant role in the training of neural network models. Besides circumventing the practical memory limitation, it impacts the accuracy of models, as well as the performance of the training process. If you’re curious about batch sizing, read the research paper “Revisiting Small Batch Training for Deep Neural Networks”. Let’s cover some more nomenclature while we are at it.

During the training cycle, the neural network processes all of the dataset’s examples; this cycle is called an epoch. A data scientist splits the entire dataset into smaller batches. The number of training examples in a batch is called the batch size, and an iteration is a complete pass over one batch. The number of iterations is how many batches are needed to complete a single epoch. For example, the ImageNet-1K dataset contains 1.28 million images, and a commonly recommended batch size is 32 images, so it takes 1,280,000 / 32 = 40,000 iterations to complete a single epoch of the dataset. How fast an epoch completes depends on multiple factors, and a training run typically invokes multiple epochs.
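Here is a minimal sketch of this nomenclature in code (the tiny linear model and random data are placeholders of my own, purely to make the loop runnable): every pass through the inner loop is one iteration, every pass through the outer loop is one epoch, and the optimizer step is the mini-batch gradient descent update.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and dataset so the loop below actually runs.
dataset = TensorDataset(torch.randn(1_024, 16), torch.randint(0, 10, (1_024,)))
model = nn.Linear(16, 10)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

batch_size = 32
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
iterations_per_epoch = len(dataset) // batch_size  # ImageNet-1K: 1,280,000 / 32 = 40,000

for epoch in range(3):                        # a real training run invokes many more epochs
    for batch, labels in train_loader:        # one pass through this body = one iteration
        optimizer.zero_grad()
        loss = loss_fn(model(batch), labels)  # forward pass
        loss.backward()                       # backward pass
        optimizer.step()                      # mini-batch gradient descent update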

Both training and inference use batch sizes. In most use cases, inference focuses on responding as quickly as possible. Many inference use cases ingest and transform data in real time and generate a prediction, classification, or recommendation. Translating this into real-life use cases, we are talking about anything from predicting stock prices to counting cars in a drive-through. The request needs to be processed the moment it comes in. As a result, little to no batching occurs; it depends on how much workload the system receives when it is operational. With batches of 1-4 examples, inference classifies as a streaming workload.
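A sketch of what that streaming behavior looks like in code (the toy model is a stand-in of my own; a real service would load a trained, converged model): each incoming request is a batch of one and is served the moment it arrives.

import torch
from torch import nn

model = nn.Linear(16, 10)                      # stand-in for a trained, converged model
model.eval()                                   # inference mode, no training-specific behavior

def handle_request(sample: torch.Tensor) -> int:
    with torch.no_grad():                      # no gradients: lower latency and memory use
        logits = model(sample.unsqueeze(0))    # batch size of 1, processed immediately
        return int(logits.argmax(dim=1))

print(handle_request(torch.randn(16)))         # one prediction per incoming sample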

Determining the correct batch size for training is a science in itself. Many research papers and Medium articles exist about the sweet spot for batch sizes, and there are benefits and disadvantages to be found at any point in the spectrum. Smaller batch sizes can lead to a lower memory footprint and improved throughput, while larger batch sizes can increase parallelism and decrease the computational cost.

This last factor might not be relevant for a data scientist when training in an on-premises environment, but it’s good to understand. When batches are moved from storage or host memory to GPU device memory, CPU cycles are needed. If you use larger batches, you reduce the number of compute calls needed to move data, ultimately reducing your CPU footprint. Of course, there is a downside to this as well, primarily on the performance side of the algorithm, something the data scientist needs to figure out how to solve. Therefore, depending on the use case, you see different batch sizes per model. Two excellent papers highlight both ends of the spectrum: “Friends don’t let friends use mini-batches larger than 32” and “Scaling TensorFlow to 300 million predictions per second”.

The takeaway for the platform architect is that inference is primarily latency-focused. If the inference workload is a video-streaming-based workload for image classification or object detection, the system should be able to provide a particular level of throughput. Training is predominantly throughput-based. Batch sizing is a domain-specific (hyper)parameter for the data scientist, but it can ultimately affect the overall CPU footprint and whether efficient distributed training is used. Depending on the dataset size, the data scientist can opt for distributed training, dispatching the batches across multiple GPUs.

                          Training                                        Inference
Storage Characteristics   Throughput-based                                Latency-based, occasionally throughput
Batch Size                Many recommendations between 1-32.              1-4
                          Smaller batch sizes reduce the memory
                          footprint and increase algorithm performance
                          (generalization); larger batch sizes increase
                          compute efficiency and parallelization
                          (multi-GPU).

Data Pipeline and Access Patterns

Data loading is essential to building a deep learning pipeline and training a model. Remember that everything you do with data takes up memory. Let’s go over the architecture and look at all the “moving parts” before diving into each one.

The dataset is stored on a storage device. It can be a vSAN datastore or any other supported storage platform (NFS, VMFS, vVols). The batch is retrieved from the datastore and stored in host memory before it loads into GPU device memory (Host to Device, HtoD). Once the model algorithm completes the batch, it copies the output back to host memory (Device to Host, DtoH). Please note that the diagram shows the simplest possible data flow. Typically, we have a dual-socket system, meaning there are interconnects involved, multiple PCI controllers involved, and we have to deal with VM placement regarding the NUMA locality of the GPU. But these complex topics are discussed in a later article; one step at a time.
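In PyTorch terms, that HtoD/DtoH round trip looks roughly like the sketch below (my own simplified illustration; it assumes a CUDA-capable GPU is available and falls back to the CPU otherwise):

import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(16, 10).to(device)               # model weights live in GPU device memory

batch = torch.randn(32, 16)                        # batch staged in host memory
batch_gpu = batch.to(device, non_blocking=True)    # Host to Device (HtoD) copy
output = model(batch_gpu)                          # compute on the GPU
output_host = output.cpu()                         # Device to Host (DtoH) copy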

Without going into the details of NUMA madness, we immediately notice the length of the path. Data scientists prefer that the dataset is stored as close to the accelerator as possible, on a fast storage device. Why? Faster data loading can reduce the training time tremendously. Quoting Gorkem Polat, who did some research on his test environment:

One iteration of the ResNet18 Model on ImageNet data with 32 batch size takes 0.44 seconds. For 100 epochs, it takes 20 days! When we measure the timing of the functions, data loading+preprocessing takes 0.38 seconds (where 90% of this time belongs to the data loading part) while the optimization (forward+backward pass) time takes only 0.055 seconds. If the data loading time is reduced to a reasonable time, full training can be easily reduced to 2.5 days! Source

Most datasets are too large to fit into GPU memory, and most of the time, it does not make sense to preload the entire dataset into host memory. The best practice is to prefetch multiple batches and thereby mask the latency of the network. Most ML frameworks provide built-in solutions for data loading. The data pipeline can run asynchronously with training as long as the pipeline prefetches several batches to keep it full. The trick is to keep multiple pipelines full, which is where fast storage and low-latency, high-throughput networks come into play. According to the paper “ImageNet Training in Minutes,” it takes an NVIDIA M40 GPU 14 days to finish just one 90-epoch ResNet-50 training run on the ImageNet-1K dataset. The M40 was released in 2015 and had 24 GB of memory. As a result, data scientists are looking at parallelization, distributing the workload across multiple GPUs. These GPUs need to access that dataset as fast as possible, and they need to communicate with each other as well. There are multiple methods to achieve multi-GPU accelerator setups; this is a topic I happily reserve for the next part.
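As one example of such a built-in solution (PyTorch’s DataLoader; the worker, prefetch, and dataset values below are illustrative assumptions of mine, not tuned recommendations), prefetching is configured rather than hand-coded:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for an image dataset; a real pipeline would read from the datastore.
dataset = TensorDataset(torch.randn(1_000, 3, 64, 64), torch.randint(0, 1000, (1_000,)))

train_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,      # background workers load and preprocess batches asynchronously
    prefetch_factor=2,  # each worker keeps two batches ready, masking storage/network latency
    pin_memory=True,    # pinned host memory speeds up the later HtoD copy
)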

Dataset Random Read Access

To add insult to injury, training batch reads are entirely random. The API lets the data scientist specify the number of samples, and that’s it. Using a PyTorch example:

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

The process extracts 32 random examples from the dataset and sends them over as a batch. The shuffle=True argument determines what happens after the epoch completes: the next epoch won’t see the same images in the same order. Extracting 32 random examples from a large dataset on a slow medium won’t help reduce the training time. Placing the dataset on a bunch of spindles would drive (pun intended) your data science team crazy. Keeping the dataset on a fast medium, possibly as close to the GPU device as possible, is recommended.

                          Training                                        Inference
Data Access               Random access on a large dataset.               Streaming data
                          Multiple batches are prefetched to keep the
                          pipeline full. A fast storage medium is
                          recommended; fast storage and network are
                          recommended for distributed training.

The next part will cover the memory footprint of the model and numerical precision.

Filed Under: AI & ML

Unexplored Territory Podcast Episode 18 – Not just artificially intelligent featuring Mazhar Memon

June 13, 2022 by frankdenneman

In this week’s Unexplored Territory Podcast, we have Mazhar Memon as our guest. Mazhar is one of the founders of VMware Bitfusion and the principal inventor of Project Radium. In this episode, we talk to him about the start of Bitfusion, what challenges Project Radium solves, and what role the CPU has in an ML world. If you like deep-dive podcast episodes, grab a nice cup of coffee or any other beverage of your liking, open your favorite podcast app, strap in and press play.  

Listen to the full Unexplored Territory Podcast episode via Spotify – https://spoti.fi/3QdnXlX, Apple – https://apple.co/3O7TsMj, or with your favorite podcast app.

Links and topics discussed during the episode:

  • Techcrunch demo – https://www.youtube.com/watch?v=p3cAzt1PLBA
  • Intro to Radium – https://octo.vmware.com/introducing-project-radium/
  • IPUs and Radium – https://octo.vmware.com/vmware-and-graphcore-collaborate-to-bring-virtualized-ipus-to-enterprise-environments/

You can follow us on Twitter for updates and news about upcoming episodes: https://twitter.com/UnexploredPod.

Also, make sure to hit that subscribe button, rate wherever possible, and share the episode with your friends and colleagues! And for those who haven’t seen it, we made the Top 15 Podcast list on Feedspot, the first non-corp-branded podcast on the list!

Filed Under: AI & ML

Machine Learning on VMware Platform – Part 2

June 8, 2022 by frankdenneman

Resource Utilization Efficiency

Machine learning, especially deep learning, is notorious for consuming large amounts of GPU resources during training. However, as the last part already highlighted, machine learning is more than just training a model, and the other components within the machine learning workflow require large amounts of CPU, memory, storage, and network resources.

Machine Learning on VMware Cloud Platform – Part 1 covered the three distinct phases: concept, training, and deployment. Existing “known data” is required to explore and train the model in both the concept and training phases. During the development of the model, it is common to use three different data sets: the training set, the validation set, and the testing set. Creating data sets is not only about getting as much data as possible. It is even more critical to get meaningful, high-quality data, because the accuracy of the recommendations produced by the model is highly dependent on the quality of the dataset used for training and validation. The data science team needs to “wrangle” existing raw data into shape to get such a high-quality dataset. Data wrangling transforms the raw data into more valuable data that can be used as a dataset “downstream” to train a model. And all this wrangling requires a lot of collateral infrastructure and services besides just a bunch of GPUs.

Massive amounts of data are not new to any enterprise IT organization. Systems of record have always been the backbone of business processes. Think back to mainframes in the ’70s and the rise of on-premises ERP systems like SAP and Oracle in the ’80s and ’90s. And today, databases, data warehouses, and data lakes contain petabytes of data. A 100-gigabyte dataset used for training purposes might sound unimpressive. But there is a difference: this dataset is constantly on the move. Bits and bytes do not slowly accrue in a database and sit there waiting to be cherry-picked by a SQL statement. No, these datasets are pulled and pushed through the infrastructure. They are extracted from multiple sources, transformed by different platforms, and stored and versioned many times over. For example, many ML projects capture and store massive amounts of unstructured data (video, audio, log files) in its native format, and that data still needs to be structured, transformed, and analyzed. These processes generate a lot of data movement, which, in virtualized platforms, burns (a lot of) x86 cycles and exposes data on the network.

Below is a diagram highlighting some of the structural components of such an environment, omitting a lot of essential data science team tools, such as the various collaborative Jupyter notebook solutions, artifact stores, or complete data science or MLOps platforms such as Databricks, Domino, H2O.ai, DataRobot, or Dataiku.

If we look at the processes after training, they belong to the deployment phase. In this phase, the data science team, or the MLOps team, takes a converged model and integrates it into a system or platform that engages with the customer or an end system, like a robot arm or factory installation. A converged model is a model that is trained up to a state where additional training will not improve it. Why not say finished model? Because as the world changes, the model might no longer reflect the current state of the world. Think about what happened during COVID and the ML models deployed in the hotel, airline, and entertainment industries. They needed some readjustments to reflect the current situation in the world. And because of this, some models need retraining. This retraining does not happen in real time; it depends on the use case and the data flow. Think about self-scanning registers, holiday season packaging changes for supermarkets, or additional items in a warehouse. In general, most teams start with manual retraining. Still, they want to move to automation and look at CI/CD pipelines, create a cyclical deployment process, retrain new models, and capture unseen data as new training sets.

As you can imagine, running such a platform and its (pipeline) components requires a lot of processing power. On top of that, in many organizations, AI/ML projects are initiated by individual business units attempting to solve their pressing business challenges. Each team runs its own solution tech stack and develops its models along its own development lifecycle with its own resource utilization. What we see today is the start of ML-project sprawl. Many ML projects start small, on laptops with small datasets, and slowly grow to become essential to the business. And we attach the same organization and process models to this phenomenon. Someone will detect this, start to ask for more consolidation, start a Machine Learning Center of Excellence, and start to think about centralizing resource utilization.

And this is the right way of thinking. You do not want each team to erect its own isolated platform. We’ve seen this happen in the ’90s when organizations moved away from mainframes to decentralized x86 servers, and a lot of those machines were underutilized. We see now that a lot of data scientists are simply not aware of the power of virtualization. To them, it’s cloud or bare metal. Cloud platforms are great to start with but can be too opinionated, and thus they turn to bare metal. They leave out the greatest thing since sliced bread (now, I might be a bit biased here).

Let’s use two examples from the concept and training phases. When the data science team performs a “hyperparameter search,” they use multiple smaller GPU-equipped machines and smaller datasets to find the correct ML model architecture. Or, when they are using their Spark cluster for intensive data processing tasks, they typically only concentrate on the pre-processing task. The key here is that an ML platform consists of many parts, but at any given time only one part is heavily utilized. From a resource utilization perspective, we have to deal with load-correlated utilization spikes per model development effort, since many platform parts were originally designed as distributed architectures.

I can imagine that all of the above can sound extremely complex, but in many cases, it isn’t. If organizations centralize their efforts onto a single platform, we have to deal with this workload’s noisy-neighbor aspect. But we have all dealt with noisy neighbors before, and the best part is that a lot of the core parts of vSphere have been updated to handle new distributed workloads. For example, DRS 2.0 runs every 60 seconds instead of every 5 minutes to handle containerized workloads and focuses on workload happiness instead of balancing host utilization across a vSphere cluster at a high level. The partnership with NVIDIA that brought us NVIDIA AI Enterprise allows us to spatially partition the GPU to isolate compute resources and allow full multi-tenancy.

See the recent blog post by Lan Vu about how the different vGPU technologies can best suit the ML use case.

But we are also working with our partners on newer technologies to handle the heavy IO streams that ML model development will introduce to the vSphere platform. Project Monterey introduces the Data Processing Unit (DPU) into our cluster architecture.

There are many use cases, but the one that excites me is the sheer amount of network IO offload you can achieve with DPUs, and potentially more intelligent things with NVIDIA’s BlueField architecture. Nowadays, every network IO within a virtualization platform consumes x86 cycles of the ESXi host—more than one cycle per IO. Pulling and pushing datasets for many different models through the platform will therefore impact the x86 cycles left over to run the rest of the virtual machines and containers, used for the data scientists’ toolchain and for the other employees and services consuming the virtualization platform. By introducing DPUs, you isolate these network IO streams from the compute layer.

Another project with immense potential for the ML platform is Project Capitola, or as the vSphere team calls it, Software-Defined Memory. You want copious amounts of memory capacity during the data pre-processing phase, and you also want it fast, but you might not want to break the bank and spend your entire annual IT budget on RAM modules. Project Capitola allows you to use different memory technologies and offer different tiers of memory capacity to your workload without rewriting your applications.

The critical challenge is to have a platform that can provide the right resources and attach and detach the resources efficiently and economically so that the data science team and the organization benefit from this. The platform needs to provide the workload environment as quickly as possible (self-service) and be resilient. I’ll dive into risk mitigation in the next part.

If you want to learn more about Project Monterey, Project Capitola, or the self-service part, please join Cormac Hogan, Duncan Epping, and me at our (Virtual) Roadshow “The ever-evolving VMware Infrastructure”. Ask your VMUG leader about the possibilities.

Filed Under: AI & ML

Machine Learning on VMware Platform – Part 1

May 25, 2022 by frankdenneman

Machine Learning is reshaping modern business. Most VMware customers look at machine learning to increase revenue or decrease cost. When talking to customers, we mainly discuss the (vertical) training and inference stack details. The stack runs a machine learning model inside a container or a VM, preferably on an accelerator device like a general-purpose GPU. And I think that is mostly due to our company DNA, which lets us relate a machine learning workload directly to a hardware resource.

But Machine Learning (ML) allows us to think much broader than that. One of the most cited machine learning papers, “Machine Learning: The High-Interest Credit Card of Technical Debt,” shows us that running ML code is just a tiny part of the extensive and complex hardware and software engineering system. 

Machine Learning Infrastructure Platform

As of today, VMware’s platform already offers much functionality to ML practitioners. We can break down a machine learning platform into horizontal fabrics connecting the vertical training and inference stack. An example of a horizontal fabric can be a data pipeline that runs from ingesting data all the way to feeding the training stack a training data set. Or it can be a pipeline that streams data from a sensor to an ISV pre-trained model and goes all the way to controlling the actions of a robot arm. All the functions, applications, and services are backed by containers and VMs that could run on a VMware platform.

Looking beyond the vertical training stack, we immediately see familiar constructs we have spent years caring for as platform architects. The majority of databases run on vSphere platforms, and many of these contain the data that data scientists want to use to train their models. Your vSphere platform can efficiently and economically run all the services needed to extract and transform the data. The storage services safely and securely store the data close to the computational power. The networking and security services, combined with the core services, can offer intrinsic security to the data stream from the point of data ingestion all the way to the edge location where the model is served.

Using self-service marketplace services, such as the VMware Application Catalog (formerly known as Bitnami), allows IT organizations to work together with the head of data science to curate their ML infrastructure toolchains. This service allows each data science team to pull software into their development environment whenever they need it.

Improving ROI of low utilized resources

If we focus on the vertical training and inference stack first, we can apply the same story I (at least) have been telling for a long time. Instead of talking about x86 cycles and memory utilization, we now focus on accelerator resources. Twenty years ago, we revolutionized the world by detaching the workload from the metal by introducing the virtual machine. The compelling story was that the server hardware was vastly underutilized, typically around 4%. This convinced customers to start virtualizing. By consolidating workloads and using the different peak times of these workloads, the customer could enjoy a better return on investment on the metal. This story applies to today’s accelerators. During GTC Fall, it was quoted that today’s utilization of accelerators is below 15%.

Accelerators are expensive and, with the current supply chain problems, very challenging to obtain, let alone scale out.

The availability and distribution of accelerator resources within the organizations might become a focal point once the organization realizes that there are multiple active projects and the current approach of dedicated resources is not economically sustainable. Or the IT organization is looking to provide a machine learning platform service. The pooling of accelerator resources is beneficial to the IT organization from an economic perspective, plus it has a positive effect on the data science teams. The key to convincing the data science teams is understanding the functional requirements of the phases of the model development lifecycle and deploying an infrastructure that can facilitate those needs.

A machine learning model follows a particular development lifecycle: Concept – Training – Deployment (inference).

Concept phase: The data science team determines what framework and algorithm to use during the concept phase. During this stage, the DS team explores what data is available, where the data lives and how they can access it. They study the idea’s feasibility by running some tests using small data sets. As a result, the team will run and test some code, with lots of idle time in between. The run time will be seconds to minutes when the code is tested. As you can imagine, a collection of bare metal machines assigned to individual data scientists or teams with dedicated expensive GPUs might be overkill for this scenario.

In many cases, the CPU provides enough power to run these tests. Still, if the data science team wants to research the effect and behavior of the combination of the model and the GPU architecture, virtualization can be beneficial. This moment is where pooling and abstraction, two core tenets of VMware’s DNA, come into play. We can consolidate the efforts of different data science teams in a centralized environment and offer fractional GPUs (NVIDIA vGPU) or remoting with VMware Bitfusion (intercept and remoting of CUDA API). Or use multi-instance GPU (MIG) functionality that allows for stricter resource isolation and thus predictable performance. These functionalities allow the organization to efficiently and economically provide an infrastructure for the data science teams to run their workload when they are in the testing phase of developing a particular model. 

Training phase: A virtualized platform is also advantageous for the training phase of the model. During the training phase of the model development lifecycle, the data science team pushes workloads for longer times, which is throughput-oriented. Typically, they opt for larger pools of parallel processing power. In this phase, CPU resources aren’t cutting it anymore, and most of the time, a fraction of a single physical GPU isn’t enough either. Depending on their performance needs, the customer can use remoting technology (Bitfusion) to access pools of GPU cards or use PCI passthrough technology to run the workload directly on one or more physical devices.

But the hungry nature of the workload still should not warrant a dedicated infrastructure, as training jobs are transient in nature. Sometimes a training job is 20 hours, and sometimes 200 hours. The key point is that idle time of GPU resources occurs after the training job is finished. The data science team reviews the results of the training job and discusses which hyperparameter to tune. This situation is very inefficient: the costly accelerators are idling while other data science teams struggle for resources. Virtualization platforms can democratize the resources and help scale out resources for the data science teams. Convincing data science teams to “give up” their dedicated infrastructure is a matter of helping them understand the nature of virtualization and how pooling and abstraction can benefit their needs.

Instead of having isolated pools of GPUs scattered throughout the organizations, virtualization allows data science teams to have a centralized pool of resources at their disposal. When they need to run a training job, they can allocate far more (spare and unused) resources from that pool than the number of resources they could have allocated in their dedicated workstation. In return for giving up their dedicated resources, they have a larger pool of resources at their disposal, reducing the duration of training time.

Deployment phase: The deployment phase is typically performance-focused. This is where we meet the age-old discussion of hypervisor overhead. The hypervisor introduces some overhead, but the performance teams are working on reducing this gap. We are now at the single-digit mark (±6%), and in some cases near bare-metal performance, so other benefits should be highlighted, such as the ability to deal with heterogeneous architectures by using virtual machines for workloads or running Kubernetes in VMs. The VMware ecosystem allows for deploying and managing at scale and focuses on delivering lifecycle management at scale. The portability and mobility of VMware’s workload constructs, and all these other unique capabilities, matter to mature organizations and offset a single-digit performance loss.

The Data Science Team

In general, the data science team ultimately focuses on delivering machine learning models; their MBOs state that these models should be delivered as fast as possible, with the highest level of accuracy. Although light coding (Python, R, Scala) is part of their daily activities, most data scientists do not have a software engineering background and thus can be seen as a step above developers. As a result, our services should not treat them as developers but align more with the low-code, no-code trend seen in the AI world today. Data scientists do not and should not have a deep understanding of infrastructure architectures. Their lowest level of abstraction is predominantly a Docker container; Kubernetes is, in most cases, a “vendor” requirement. Any infrastructure layer underneath the container should be of no concern to the data scientist, and our platform should align with that thought.

The data scientist should not have to learn how to optimally configure their data and machine learning platform to operate on the VMware platform. The VMware Cloud platform should abstract these decisions away from the data scientist and provide (automatic) constructs for MLOps teams, IT architects, and central operations teams to run these applications and services optimally.

Data scientists are primarily unaware of IT policies. They have machines and software often not included in the corporate IT service catalog. This situation can be a massive security liability, as they consistently work with the crown jewels of the company: highly sensitive company data. That data is exposed and usually not well secured, a prime target for ransomware or other malicious practices. One example that our field teams repeatedly see is that this data is often stored on very expensive, shiny Alienware laptops. This data needs to be protected and should not be stored on high-risk target devices.

These devices are bought because the current IT platforms lack the hardware accelerators or any collaborative containerized service platform. Theft or loss of these devices leaks company data and halts model development progress. Besides both of these problems, an organization does not want its highly paid data science team twiddling its thumbs. Having a platform ready to cater to the needs of the data science teams, in both software and hardware requirements, eliminates these threats and helps the organization reap the many benefits of a centralized platform.

The current method of providing a data science team with highly specialized, portable equipment is an extreme liability from a security perspective. Many data science teams don’t see the centralized IT team as an enabler due to the lack of service alignment. Providing a centralized platform with the right services and equipment can help the organization significantly reduce the security risk while improving the data science teams’ capability to deliver a model by leveraging all the advantages of a centralized platform.

Centralizing ML Infrastructure Platforms

If we take it one step further, you can obtain better economics by running the ML infrastructure platforms on shared infrastructure instead of dedicated physical machines. As described in the previous paragraph, data science teams operate with full autonomy in many organizations and have dedicated resources at their disposal. Individual business groups within the organization are forming their own data science teams, and this decentralized effort scatters high-powered machines around the organization, creating pools of fragmented compute and storage resources. By centralizing the ML infrastructure services and workloads of different teams on one platform, you benefit from the many advantages the VMware ecosystem provides:

  • Resource Utilization Efficiency
  • Risk Mitigation
  • Workload Construct Standardization and Life Cycle Management
  • Data adjacency
  • Intrinsic security for data streams
  • Open platform

But I’ll leave these topics for part 2 of this series.

Filed Under: AI & ML

Stop designing your server platform with solely the CPU roadmap in mind

December 20, 2021 by frankdenneman

Over the last 20 years, we designed our core data center platform following the CPU roadmap. But in today’s world, the devices attached to the processor make radical and revolutionary improvements, catering to the needs of new workloads. I’m talking about devices like the GPU, the network adapter, and its natural offspring, the data processing unit (DPU). In the article “Project Monterey and the need for network cycles offload for ML workloads”, I zoom in on what’s in store for us data center architects in the upcoming years.

To service the requests of these new workloads, we need to move away from designing a platform solely based on a CPU roadmap and plugging these devices into a server as an afterthought. When designing a platform for these new workloads, we have to start designing data center systems holistically.

Together with Luke Wignall from NVIDIA, we (Duncan Epping, Johan van Amersfoort, and I) discuss DPU technology and other efforts to run and manage modern workloads in episode 5 of the Unexplored Territory podcast.

Apple: apple.co/3lYZGCF

Google: bit.ly/3oQVarH

Spotify: spoti.fi/3INgN3R

Or anywhere else where you get your podcasts!

Filed Under: AI & ML Tagged With: AI, ML, NVIDIA
