
frankdenneman.nl


VMware Private AI Foundation – Privacy and Security Best Practices white paper

July 1, 2024 by frankdenneman

I’m excited to announce the release of my latest white paper, “VMware Private AI Foundation – Privacy and Security Best Practices.” As many of you know, the world of artificial intelligence is rapidly evolving, and with that comes a new set of challenges, particularly around privacy and security.

This white paper is not just about theory. It’s a practical guide introducing the foundational concepts, frameworks, and models underpinning private AI security. It’s a deep dive into the critical aspects of privacy and security in the context of AI, providing you with the tools to implement these principles in your own work. You’ll learn about the principle of shared responsibility, threat modeling for Gen-AI applications, and the CIA triad – confidentiality, integrity, and availability – as a guiding model for information security.

And that’s not all. In the near future, we’ll be taking these concepts to the next level with a follow-up white paper focused on VMware Cloud Foundation (VCF) settings. This paper will be your go-to resource for detailed guidance on configuring and optimizing VCF to establish a robust and secure private AI environment.

Stay tuned for this next installment, where we’ll bridge the gap between theory and practice, empowering you to build and deploy private AI solutions confidently.

Thank you for your continued support, and I look forward to hearing your feedback!

Filed Under: Uncategorized

The misconception of self-learning capabilities of Large Language Models during Production

February 22, 2024 by frankdenneman

At VMware Explore, I enjoyed engaging with many customers about bringing Gen-AI to the on-prem data center. Many customers want to keep their data and IP within the four walls of their organization, and rightly so.

With VMware Private AI Foundation, we aim to utilize foundation models, such as Llama 2, StarCoder, and Mistral 7B, and build upon the great work of many smart data scientists. Instead of building and training a large language model (LLM) from the ground up, which can be time-consuming and computationally expensive, organizations can leverage foundation models pre-trained on a massive dataset of text and code. If necessary, organizations can further fine-tune a foundation model on specific tasks and data in a short period of time.

At VMware, we believe in using Vector DBs with Retrieval Augmented Generation (RAG) to decouple the IP and data from the foundation model. Using RAG, you offload the knowledge updates to another system so the Gen-AI application is always up to date. The vector DB is used for memorizing facts, while the foundation model is used for reasoning functionality. If necessary, the foundation model can be replaced by a newer version. Typically, this doesn’t happen, but if a data science team thinks it will improve the Gen-AI application’s reasoning or generative capability, they can do that without losing any IP.

This particular point, that no IP is lost by replacing the model, got me some pushback from the people I spoke to. Digging into the topic a bit more, I discovered a widespread misconception about the learning ability of a neural network (LLM) model.

When you use an LLM, i.e., ask it a question (a prompt), the model does NOT learn from your question. LLMs have no self-learning mechanisms during the deployment phase (inference). Let’s dive a bit deeper into the difference between inference and training.

Inference: When you ask a model a question, you ask it to make a prediction: the model feeds the input data through the network, using its weights (parameters) to compute the output. This sequence is also called a forward pass.

Data scientists freeze the parameters once the neural network is accurate enough, and the model uses those same parameters for every question during inference.
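As a minimal sketch of what this looks like in practice (PyTorch with a Hugging Face-style `generate` API; `model` and `tokenizer` are hypothetical placeholders for a loaded, converged checkpoint):

```python
import torch

# Minimal inference sketch. `model` and `tokenizer` are hypothetical
# placeholders for a loaded, converged LLM checkpoint.
model.eval()           # switch layers such as dropout to inference behavior

with torch.no_grad():  # gradients are disabled: the weights cannot change
    inputs = tokenizer("Who is the CEO of VMware?", return_tensors="pt")
    output_ids = model.generate(inputs.input_ids, max_new_tokens=32)  # forward passes only

print(tokenizer.decode(output_ids[0]))
# Every prompt is answered with the exact same frozen parameters; nothing is learned.
```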

Training: When training a neural network, the first step is the forward pass, which is the same as inference. The forward pass calculates the model’s prediction for the input data. After the forward pass, the loss is calculated by comparing the prediction to the expected result. The loss is used to calculate a gradient, which guides the framework to increase or decrease the value of each parameter. The backpropagation pass then adjusts the parameters layer by layer to minimize the loss.

The training process repeats the forward pass and backpropagation until the model converges, meaning the loss no longer decreases. Once the model converges, a checkpoint is created, and the parameters are frozen. The Gen-AI application uses this version of the model to generate answers.
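A minimal training-loop sketch (PyTorch again; `model` and `dataloader` are hypothetical placeholders) ties these steps together and shows where the parameters actually change:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

for inputs, targets in dataloader:
    predictions = model(inputs)           # forward pass (identical to inference)
    loss = loss_fn(predictions, targets)  # compare prediction to the expected result
    loss.backward()                       # backpropagation: compute the gradients
    optimizer.step()                      # adjust parameters to minimize the loss
    optimizer.zero_grad()

# Once the loss no longer decreases (convergence), the parameters are frozen
# by saving a checkpoint; inference serves this exact version of the model.
torch.save(model.state_dict(), "checkpoint.pt")
```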

I believe the misconception of self-learning capabilities arises from thinking about recommender systems (Netflix proposing which series to watch next, or Amazon telling you other customers bought item X along with item Y). But a recommender system uses a converged model (frozen weights) with an online feature store. An online feature store provides real-time access to essential features for generating accurate and personalized recommendations. Amazon uses an online feature store to store features about its products and users.

  • Product features: These describe the products themselves, such as their price, category, popularity, brand, color, size, rating, and reviews.
  • User features: These describe the users, such as their past purchase history, demographics (age, gender, location), interests, and browsing behavior.

Suppose an Amazon customer has purchased many books in the past. In that case, the recommender system might recommend other books to the user based on collaborative or content-based filtering. With collaborative filtering, the algorithm identifies users with similar tastes to the target user and recommends items that those users have liked. With content-based filtering, the algorithm recommends items similar to items the target user has liked in the past. It seems like the model is learning as you use it. In reality, the model simply calculates predictions using its frozen parameters and data (features) from the online feature store (a database).
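A toy sketch of that split (plain Python; the dict-based store, the feature values, and the weights are made up purely for illustration) shows how a frozen model produces “fresh” recommendations from updated features alone:

```python
import numpy as np

# The online feature store is simulated with a dict; real systems update
# these features continuously as users browse and buy.
online_feature_store = {
    "user_42":  np.array([0.9, 0.1, 0.3]),   # purchase history, interests, ...
    "book_101": np.array([0.8, 0.2, 0.1]),   # category, popularity, rating, ...
}

frozen_weights = np.array([1.2, 0.4, 0.7])   # learned during training, never updated

def score(user_id: str, item_id: str) -> float:
    """Predict affinity from the current features; the weights stay constant."""
    features = online_feature_store[user_id] * online_feature_store[item_id]
    return float(features @ frozen_weights)

print(score("user_42", "book_101"))
# Recommendations change because the features change, not the model.
```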

The context window is another example of why a Gen-AI application like ChatGPT seems to be learning while you use it. A model can appear to “learn” during inference within the same conversation, but only because the information is still part of its context window. The best analogy for an LLM context window is the working memory of a human. For example, if I ask Llama 2, “Who is the CEO of VMware?” and the answer it returns is “Pat Gelsinger,” and I respond with, “No, it’s Raghu,” it responds with, “Yes, that is correct, the CEO of VMware is Raghu Raghuram.” Then, if I ask it, “Who is the CEO of VMware?” it will respond with “Raghu Raghuram,” but only because the answer is still in the same context. Google’s Bard supports up to 2,048 tokens, and ChatGPT supports up to 4,096 tokens. By default, Llama 2 has a context window of 4,096 tokens, but Huge Llama 2 supports up to 32K tokens. A token is a unit of text that the model uses to process and generate text. Tokens can be words, subwords, or characters.
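A small sketch of why this works (plain Python; `llm_generate` is a hypothetical stand-in for any inference endpoint): the client re-sends the accumulated conversation on every turn, so the “memory” lives entirely in the prompt:

```python
context = []  # the conversation so far, bounded by the context window

def ask(question: str) -> str:
    context.append(f"user: {question}")
    answer = llm_generate("\n".join(context))  # stateless forward passes
    context.append(f"assistant: {answer}")
    return answer

ask("Who is the CEO of VMware?")   # answered from frozen weights
ask("No, it's Raghu.")             # the correction lives only in `context`
ask("Who is the CEO of VMware?")   # now correct, because the history is re-sent
# Start a new conversation (empty `context`) and the correction is gone.
```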

Everything comes at a price: the larger the context window, the more memory and compute inference consumes. Outside of the same conversational context, models are completely stateless and cannot permanently “learn” anything in inference mode. What most LLM-as-a-Service providers do is capture the user prompts, which might contain sensitive data, and use them to form training data sets. On top of that, they use a technique called Reinforcement Learning with Human Feedback (RLHF), which allows the model to be retrained more frequently. Chip Huyen published an excellent article describing RLHF in detail while remaining understandable for non-data-scientists.

Why does this distinction about self-learning matter for non-data scientists? By understanding that default foundation models do not alter their state during use, VI-admins, architects, and application developers can design an infrastructure and application that offer high availability along with a proper life cycle management strategy for the Gen-AI application.

For example, the checkpoint of a model can be loaded and exposed by an inference framework running in separate VMs on separate accelerated ESXi hosts, with an anti-affinity rule ensuring that the model API endpoints never share the same physical infrastructure, reducing the blast radius. A load balancer in front of the inference frameworks offers the flexibility to take one inference framework offline during maintenance without any change in behavior. Because the model is frozen, no instance will have learned more than another during its service time.

We are currently building and developing VMware Private AI Foundation to allow VMware customers to deploy LLMs on-prem and keep their data and Gen-AI application access secure. Private data should be kept private, and by using state-of-the-art foundation models, organizations can safely build Gen-AI applications that support their business goals.

A special thanks goes out to @Steve Liang for making this article more informative.

Filed Under: Machine Learning

Gen AI Sessions at Explore Barcelona 2023

November 1, 2023 by frankdenneman

I’m looking forward to next week’s VMware Explore conference in Barcelona. It’s going to be a busy week. Hopefully, I will meet many old friends, make new friends, and talk about Gen AI all week. I’m presenting a few sessions, listed below, and meeting with customers to talk about VMware Private AI Foundation. If you see me, come over and have a talk with me.

Hopefully, I will see you at one of the following sessions:

Monday, Nov 6:
Executive Summit
For invite only
Time: 11:00 AM – 1:00 PM CET

For VMware Certified Instructors only: VCI Forum Keynote.
Time: 2:15 PM – 3:00 PM CET
Location: Las Arenas II (2), Hotel Porta Fira, Pl. d’Europa, 45, 08908 L’Hospitalet de Llobregat, Barcelona, Spain (15 min walk from the conference, 3 min by taxi)

Meet the Experts Sessions
Machine Learning Accelerator Deep Dive [CEIM1199BCN]
Time: 4:30 PM – 5:00 PM CET
Location: Hall 8.0, Meet the Experts, Table 4

Tuesday, Nov 7:
Meet the Experts Sessions
Machine Learning Accelerator Deep Dive [CEIM1199BCN]
Time: 11:30 AM – 12:00 PM CET
Location: Hall 8.0, Meet the Experts, Table 1

Wednesday, Nov 8:
AI and ML Accelerator Deep Dive [CEIB1197BCN]
Time: 9:00 AM – 9:45 AM CET
Location: Hall 8.0, Room 31

CTEX: Building an LLM Deployment Architecture: Five Lessons Learned
Speakers: Shawn Kelly & Frank Denneman
Time: 4:00 PM – 5:00 PM CET
Location: Room CC5 501
Register here: https://lnkd.in/eiTbZusU

Recommended AI Sessions:
Monday, Nov 6:
‘Til the Last Drop of GPU: Run Large Language Models Efficiently [VIT2101BCN] (Tutorial)
Speakers: Agustin Malanco – Triple VCDX
Time: 11:00 AM – 12:30 PM CET
Location: Hall 8.0, Room 18

Tuesday, Nov 7:
Using VMware Private AI for Large Language Models and Generative AI on VMware Cloud Foundation and VMware vSphere [CEIB2050BCN]
Speakers: Justin Murray, Shawn Kelly
Time: 10:30 AM – 11:15 AM CET 
Location: Hall 8.0, Room 12

Empowering Business Growth with Generative AI [VIB2368BCN]
Speakers: Robbie Jerrom, Serge Palaric, Shobhit Bhutani
Time: 2:15 PM – 3:00 PM CET 
Location: Hall 8.0, Room 14

Why do I need AI in my Data Center? Does this help me become a differentiator?
Speaker: Gareth Edwards.
Time: 3:30 PM – 4:40 PM CET
Location: CTEX: Room CC5 501

Meet the Experts:
ML/AI and Large Language Models – Implications for VMware Infrastructure [CEIM2282BCN]
Expert: Justin Murray
Time: 12:30 PM – 1:00 PM CET
Location: Hall 8.0, Meet the Experts, Table 2

Wednesday, Nov 8
AI Without GPUs: Run AI/ML Workloads on Intel AMX CPUs with vSphere 8 and Tanzu [MAPB2367BCN]
Speakers: Chris J Gully, Earl Ruby
Time: 12:45 PM – 1:30 PM CET
Location: Hall 8.0, Room 33

Meet the Experts:
ML/AI and Large Language Models – Implications for VMware Infrastructure [CEIM2282BCN]
Expert: Justin Murray
Time: 4:30 PM – 5:00 PM CET
Location: Hall 8.0, Meet the Experts, Table 4

Filed Under: Uncategorized

Basic Terminologies Large Language Models

August 18, 2023 by frankdenneman

Many organizations are in the process of deploying large language models to apply to their use cases. Publicly available Large Language Models (LLMs), such as ChatGPT, are trained on publicly available data through September 2021. However, they are unaware of proprietary private data. Such information is critical to the majority of enterprise processes. To help an LLM become a useful tool in the enterprise space, it is further trained or fine-tuned on proprietary data to adapt to organization-specific concepts.

This process introduces terminology that is used more often outside the data science community. Having a better understanding of the following concepts should enhance your ability to navigate the data science team’s requirements when building an LLM deployment architecture.

Neural Network: A Large Language Model (LLM) is a sophisticated neural network architecture designed for natural language processing tasks. LLMs use multiple layers of interconnected nodes to learn language patterns from vast amounts of text data. Parameters, specifically weights and biases, are crucial components that define how the model processes information.

Weights govern the strength of connections between nodes, influencing the LLM’s ability to capture linguistic nuances. Biases adjust the activation levels of nodes, allowing the model to adapt its responses. During forward propagation, the input text is transformed into tokens and flows through the network, undergoing contextual analysis and predicting subsequent words or phrases.

Backward propagation, an integral part of training, calculates gradients to adjust weights and biases. This process refines the model’s parameters, aligning its language generation with human-written text. Through continuous learning, LLMs become adept at tasks like completing text, translating, and generating new content.

Large Language Model: A Large Language Model is a specific Natural Language Processing (NLP) model that predicts the next word or token. Compared to other NLP models (which are based on smaller language models, LMs), LLMs are characterized by their size (a parameter count >1B) and are typically trained on vast amounts of text data from the internet. This enables them to learn a broad range of language features and general knowledge. LLMs can handle a number of tasks, from summarization to content generation to question answering, whereas NLP models are typically focused on sentiment analysis and text classification. LLMs can be further fine-tuned for various tasks with minimal additional training, while traditional NLP models are less versatile and require extensive task-specific training.

Foundation model: A foundation model is an LLM, sometimes referred to as a Pretrained Language Model (PLM), that is robust enough to act as a “foundation” that can be used as-is or fine-tuned/adapted for newer domains. In general, it is trained on diverse data and capable of being customized and fine-tuned for various applications and tasks. For example, the open-source LLaMA 2 models are trained on 2 trillion tokens primarily sourced from publicly available online data sources. Meta required 368,640 GPU hours (A100 80 GB) to train the 13B LLaMA-2 model.

Token: A token is a fundamental unit of text that an LLM uses to process and understand language. In English, a token can be as short as a single character or as long as a word. For example, in the sentence “I love vSphere,” there are three tokens: “I,” “love,” and “vSphere.” However, the word “vSphere” might be broken down into two tokens depending on the model’s tokenizer. Token length is important because it can impact the overall size of the input data, computational resources required for processing, and the model’s ability to perform linguistically. Tokens help break down the text into manageable pieces for analysis and are essential for language generation. They serve as the building blocks that allow LLMs to comprehend human language. 
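To see tokenization in action, here is a quick sketch using OpenAI’s open-source tiktoken tokenizer (the exact split differs per model’s tokenizer, so treat the output as illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("I love vSphere")
print(tokens)                             # the token IDs
print([enc.decode([t]) for t in tokens])  # the subword pieces; "vSphere" may split into several
```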

Embeddings: Embeddings are numerical representations of tokens. Each token is transformed into a vector of numbers. These vectors encode the semantic meaning of the text and facilitate computer-based processing and analysis of human language. This language-to-number translation equips machine-learning processes to interact with language data efficiently.

Vector: Vectors are commonly used to represent data points in a multi-dimensional space. Each element of a vector corresponds to a specific dimension, and the combination of these elements defines the position of the vector in that space. Vectors are essential for measuring similarities and transforming data. The article “Explaining Vector Databases in 3 Levels of Difficulty” provides a great primer on vectors, embeddings, and Vector databases.

Image by Leonie Monigatti
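A toy sketch of embeddings as vectors, compared with cosine similarity (the 3-D values are made up purely for illustration; real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

king  = np.array([0.8, 0.3, 0.1])
queen = np.array([0.7, 0.4, 0.1])
pizza = np.array([0.1, 0.1, 0.9])

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(king, queen))  # high: semantically close
print(cosine_similarity(king, pizza))  # low: semantically distant
# A vector database indexes millions of such vectors for fast similarity search.
```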

Parameters: Parameters are the learned values that a model acquires through training to facilitate predictions or classifications on new data. In neural networks, these parameters are commonly denoted as weights and biases, dictating how input data undergoes transformation into output predictions.    

An LLM’s model size is typically expressed as a parameter count in billions, such as LLaMA-2 13B, which contains roughly 13 billion parameters. The memory consumption depends on the floating-point format (precision); typically, an LLM uses BF16 or FP16. A parameter stored in BF16 consumes 2 bytes, and 1 billion bytes equal a gigabyte, so we can easily calculate the “static” model memory consumption: LLaMA-2 13B consumes 26 GB of GPU memory when loaded into the GPU. Streaming data into the LLM or fine-tuning the model increases memory consumption.
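The back-of-the-envelope calculation, spelled out:

```python
params = 13e9                 # LLaMA-2 13B: roughly 13 billion parameters
bytes_per_param = 2           # BF16/FP16: 2 bytes per parameter

static_gb = params * bytes_per_param / 1e9
print(f"{static_gb:.0f} GB")  # 26 GB of GPU memory just to load the model
```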

Transformer Architecture: The transformer architecture is a foundational framework for training and utilizing large neural networks in natural language processing (NLP). It revolutionized the field by introducing the attention mechanism, which allows the model to weigh the importance of different words in a sentence. Its self-attention mechanism processes input data in parallel, allowing the model to understand the context of a word based on other words placed further away in the sentence (long-range dependencies). The transformer architecture has encoder and decoder functionality. The encoder creates a “contextualized representation” of a prompt, capturing the meaning and significance of the words in the prompt by considering how they relate to each other within the given context. The decoder receives and processes the contextualized representation, using its understanding of the context to guide output generation. Typically, it generates one word of the output at a time: the first word is based on the input context, and each next word is based on the input context and the previously generated words. This is called autoregressive generation.
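A bare-bones sketch of the scaled dot-product attention at the heart of the transformer (Q, K, and V would be learned projections of the token embeddings; here they are random stand-ins for 4 tokens of dimension 8):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))

scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each token attends to the others
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                     # context-weighted mix of the value vectors

print(weights.round(2))  # row i: attention of token i over all tokens, including distant ones
```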

(Downstream) Task: An LLM is typically refined further to achieve a specific goal or perform a particular downstream task. Many foundation models today can perform a wide range of NLP tasks without refinement for a particular downstream task. A “task” refers to a specific job or activity that the model is designed to perform using its language understanding and generation capabilities. Tasks can include a wide range of natural language processing objectives. The most common ones are text generation (predicting the next word/token), summarization (given an input, return a short output), question answering (predicting the next word/token based on search results in the prompt), sentiment analysis, translation, and instruction following. When a new downstream task is needed, we still need to fine-tune the foundation LLM to respond to domain-specific prompts (e.g., ensure it favors a particular term, “Tanzu,” over others).

Prompt: A prompt is a specific input given to the model to guide its behavior and generate desired text output. The prompt serves as an instruction or query that helps the LLM understand the task or context it needs to respond to. It can be a sentence, a paragraph, or even just a few keywords, depending on the task at hand. For example, if technical marketing wants an LLM to generate a technical specification overview, they can use the prompt “Write a detailed description of our latest update release, highlighting its unique features and capabilities.”

Prompt engineering: The quality and specificity of the prompt can greatly influence the LLM’s output. A well-crafted prompt provides clear guidance to the model, leading to more relevant and accurate generated text. Prompt engineering involves designing prompts effectively to achieve the desired results for various tasks such as translation, summarization, question answering, and more. Prompt engineering can also be used to induce LLMs to say undesirable things, for example, with a prompt that tells an LLM to forget its existing rules before responding. 

Zero-shot: Zero-shot refers to the model’s ability to perform tasks or generate text about new topics without additional training or examples in the prompt. It relies on its existing knowledge to tackle new challenges without requiring specific learning for each one.

P-tuning: This technique involves fine-tuning a small trainable model prior to engaging the LLM. The small model encodes the text prompt and crafts task-specific virtual tokens, which are then added to the prompt and fed into the LLM. Once the fine-tuning is done, these virtual tokens are stored in a lookup table and used during inference, replacing the smaller model. P-tuning is far more resource-efficient than other forms of fine-tuning an LLM: the time required to tune a smaller model can often be measured in minutes instead of the days needed to fine-tune the LLM itself. This is a great talk about P-tuning and what exactly virtual tokens are.

Parameter Efficient Fine-Tuning: PEFT aims to make the fine-tuning process more efficient by focusing on updating only a subset of the model’s parameters (e.g., 100M parameters for a 15B model), rather than retraining the entire model from the ground up. The idea behind PEFT is to balance retaining the knowledge captured by the pre-trained model and tailoring it to perform well on a specific task. By carefully selecting and adjusting certain parameters, you can achieve good performance on the task while reducing the computational cost and time required for fine-tuning. PEFT also overcomes catastrophic forgetting, a behavior observed during full fine-tuning of LLMs. There are multiple PEFT methods:

Adapter-based PEFT: Adapters are new modules added to the pre-trained network, and only the new parameters are trained, while the original LLM-trained parameters are left untouched. As a result, only a small proportion of parameters relative to the original LLM is trained. This means that the model keeps remembering the previous tasks and uses a small number of new parameters to learn the new task. The downside of adding these new layers, however, is increased inference latency. This issue appears unavoidable because the adapter layers are added sequentially to the LLM: they must be processed sequentially, and there is no way to process them in parallel.

LoRA: Low-Rank Adaptation of Large Language Models: LoRA also freezes the pre-trained parameters, but instead of adding additional layers to the neural network, it adds values to the parameters. As a result, the model can be executed fully in parallel, avoiding additional inference latency. In addition, LoRA applies a very intelligent method to reduce the number of trainable parameters, reducing fine-tuning time and memory consumption. A further reduction in memory consumption can be achieved by quantizing the majority of (non-outlier) trainable parameters, using an integer data type (INT8) instead of a floating point. A parameter stored in INT8 consumes 8 bits, cutting the memory footprint in half compared to BF16. This is the most popular PEFT method.
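A conceptual sketch of the LoRA idea (dimensions and initialization chosen purely for illustration): the pre-trained weight matrix W stays frozen, and only the small low-rank factors A and B are trained, with their product added to W’s output in parallel:

```python
import torch

d, r = 1024, 8                                    # hidden size, low rank (r << d)

W = torch.randn(d, d)                             # frozen pre-trained parameters
A = torch.nn.Parameter(torch.randn(r, d) * 0.01)  # trainable, r*d values
B = torch.nn.Parameter(torch.zeros(d, r))         # trainable; zeros make the update a no-op at start

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    # The low-rank path runs in parallel with the frozen path, so inference
    # latency does not increase the way it does with sequential adapters.
    return x @ W.T + (x @ A.T) @ B.T

# Trainable parameters: 2*d*r = 16,384 vs. d*d = 1,048,576 for full fine-tuning.
```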

Quantized Low-Rank Adaptation (QLoRA): QLoRA takes it one step further and compresses the weights and activations to 4-bit precision. QLoRA uses one data type for storing the base model weights (4-bit NormalFloat) and another to perform computations (16-bit BFloat16). During computation, QLoRA “dequantizes” the weights from 4-bit precision into the 16-bit format. The weights are only decompressed when needed; therefore, QLoRA allows large models to run on GPUs with smaller memory capacities.

IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations): IA3 is a newer PEFT technique intended to improve over LoRA. It offers the same benefits as the LoRA. However, it’s only tested on really “small” LLM models (3B) currently, and most backends do not support this yet. The last update on the GitHub IA3 repo was done in September 2022, which might hint at a lack of popularity of this PEFT method in the data science community. 

Fine Tuning: Fine-tuning is the mechanism to improve LLM models for a specific task (e.g., summarizing legal documents) or domain (e.g., more knowledge about virtualization). During the training of an LLM, the neural network is exposed to unlabeled data and learns through a form of self-supervision (e.g., predicting the next word or sentence entailment). That means the algorithm explores the patterns, structures, and relationships within the dataset on its own, without being provided with specific labeled output for every task; the robustness of its dataset may mean that it is already good at some tasks (e.g., generating coherent-sounding sentences). Fine-tuning always involves supervised learning, where human labelers validate and curate data for a specific task (e.g., to fine-tune the model’s question-answering capabilities, we would have question-and-answer pairs as the input).

Reinforcement Learning with Human Feedback (RLHF): When validating the accuracy of a model designed for an image recognition task, we can quickly determine its accuracy: it either classifies the cat correctly or not. With LLMs, it’s a bit more challenging, as it is difficult to define what makes a “good” text; quality is subjective and context-dependent.

The RLHF method uses human feedback on generated text to measure model performance. It involves deploying multiple models during the training process: a pretrained language model (PLM) and, typically, a smaller reward model. The PLM generates multiple responses to a prompt, and the reward model assigns each response a numerical score reflecting how well humans perceive the text. The responses are ranked according to human preference, and reinforcement learning then trains the PLM further to prioritize responses with higher scores.

RLHF systems are complex and challenging, as gathering human preference data is expensive. RLHF performance is only as good as the quality of the human annotations, and people tend to disagree; ground truth is often lacking due to variance in opinions within the annotating team. There are multiple methods of human feedback besides preference ordering, such as:

  • Corrections: Upvoting or downvoting the model output (also known as prompt completion) 
  • Demonstrations: Humans write the preferred answer to a prompt 
  • Natural Language Input: Humans are asked to provide feedback on the model output in natural language 

Supervised fine-tuning (SFT): SFT involves adapting a PLM to a specific downstream task using validated training examples. Validated training examples are pairs of input data and output labels that have been carefully checked and confirmed to be accurate and reliable. SFT tunes the model to a specific task, such as responding to customer support questions. Using validated training examples, the model can be trained to adapt to a specific knowledge base or to demonstrate a particular persona or empathy.

Is PEFT considered supervised learning?:  PEFT is not strictly categorized as supervised learning; rather, it is a technique that can be applied within the context of supervised learning. Supervised learning involves training a model on labeled data pairs, where the labels serve as the ground truth for training. PEFT, on the other hand, is a method that aims to fine-tune a pre-trained model using a limited amount of new data specific to a task to adapt the model’s parameters while leveraging its existing knowledge. PEFT is characterized by its emphasis on efficiency and parameter reuse. It does not require extensive retraining on the entire dataset, as it focuses on updating only a subset of the model’s parameters to adapt it to a new task.

In-Context Learning 

Retrieval Augmented Generation (RAG): Once an LLM is trained, it is ignorant of new data. When an organization launches a new product or service, the customer-service-focused LLM needs to be retrained to incorporate these new data points. RAG allows the LLM to act as a conversational interface while “grounding” it with information that aligns with its use case and reduces hallucinations. RAG allows LLMs to access knowledge sources outside the trained model and augment the completion of the prompt with relevant information found in external sources. These sources can be the organization’s proprietary data sources, like knowledge bases, Bugzilla, Confluence, internal documents, etc. RAG typically works together with a Vector Database or a search engine.

When the user provides the prompt to the LLM, the RAG framework performs a “contextual search” on the Vector database and generates the context. The RAG framework augments the original prompt by injecting the context into the prompt. The LLM receives the enriched prompt and can generate a better response as it has access to factual data. The LLM sends the generated response back to the user.  
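A minimal sketch of this flow (`embed`, `vector_db.search`, and `llm_generate` are hypothetical stand-ins for an embedding model, a vector database client, and an inference endpoint):

```python
def answer_with_rag(question: str) -> str:
    # 1. Contextual search: find the most relevant internal documents.
    documents = vector_db.search(embed(question), top_k=3)

    # 2. Augment: inject the retrieved context into the original prompt.
    context = "\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate: the LLM reasons over up-to-date facts it was never trained on.
    return llm_generate(prompt)
```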

RAG frameworks and Vector DBs can be critical differentiators for organizations with rapidly changing knowledge bases or other sources of information. These organizations cannot change the LLM and redeploy it at the same velocity as the demand for up-to-date information; keeping the model’s answers in lockstep with the velocity of their service offering is challenging. A good example is the FAQ for a chatbot at a call center handling a new product or service. The data science team can add up-to-date question-and-answer data to the Vector DB and ensure that the RAG framework prioritizes the Vector DB over the LLM for information retrieval.

Grounding: Grounding refers to the process of connecting the model’s prompt completions (responses) to real-world knowledge. Grounding is about ensuring that the model’s outputs are coherent, accurate, and relevant to the information available in the world. Pretraining, prompt engineering, fine-tuning, and in-context learning all help to ground the LLM.

Hallucination: A hallucination is a confident response by the LLM that appears coherent and contextually relevant but is not based on accurate information from the input data or is not logically grounded in reality. An example is an LLM generating URLs that do not exist.

Guardrails: Guardrails restrict LLMs to responding in a particular (safe) manner. These guardrails guide the LLM to stay on topic and to avoid hallucinations, toxic responses, or executing malicious code. Chatbots can become an attack surface, and a security guardrail can protect LLM platforms. Guardrails are programmable constraints placed between the chatbot and the LLM. The NVIDIA NeMo framework offers a guardrails workflow to easily apply constraints to the LLM.

Filed Under: Machine Learning

My Sessions at VMware Explore 2023 Las Vegas

August 17, 2023 by frankdenneman

Next week we are back in Las Vegas. Busy times ahead: meeting customers and old friends, making new friends, and presenting a few sessions. I will present at Customer Technical Exchange (CTEX) and {code}, and host two meet-the-expert sessions. I will also participate as a part-time judge at the {code} hackathon.

Breakout Sessions

45 Minutes of NUMA (A CPU is not a CPU Anymore) [CODEB2761LV]

Tuesday, Aug 22, 2:45 PM – 3:30 PM PDT, Level 4, Delfino 4003

Yu Wang and I will dive deep into the new multi-chip CPU architecture, on-board accelerators, and Sub-NUMA Clustering, and highlight cool new vSphere 8 features that will make your life as a VI-admin easier.

Building an LLM Deployment Architecture – 5 lessons learned [CTEX]

Tuesday, Aug 22, 4:00 PM – 5:00 PM PDT, Zeno 4708

Shawn Kelly and I will go over the details of how to create a deployment architecture for fine-tuning and deploying Large Language Models in your vSphere environment. Shawn and I have been deploying and developing a chatbot application with a data science team within VMware, and we want to share our lessons learned.

What’s new with VMware+NVIDIA AI-Ready Enterprise Platform [CEIB3051LV]

Thursday, Aug 24, 10:00 AM – 10:45 AM PDT, Level 2, Titian 2305

Watch Raghu’s keynote, and then come to this session later next week. Due to NDA rules, we cannot disclose any content in the description of the current Explore Content Catalog. If you are planning to run Machine Learning workloads in your organization, join this session to hear about our latest offering that NVIDIA and VMware have created together.

Meet The Expert Sessions

Machine Learning Accelerator Deep Dive [CEIM1849LV]

Monday, Aug 21, 1:00 PM – 1:30 PM PDT, Meet the Experts, Level 2, Ballroom G, Table 7

Wednesday, Aug 23, 11:00 AM – 11:30 AM PDT, Meet the Experts, Level 2, Ballroom G, Table 7

Have questions about your ML workload, or not sure whether vSphere is the right platform for it? Sign up for the MtE session, and let’s discuss your challenges.

If you see me walk by, say hi!

Filed Under: Machine Learning

