How to Avoid a $100 Million Mistake
Leveraging Foundational Models
Companies looking to incorporate AI models into their workplaces and products may be forced to pay even higher prices for an already significant expense in order to increase the accuracy and reliability of the AI they use. For AI models to perform well in business settings, they must be trained on high-quality data specific to the business and the use case. The AI models most of us are familiar with are large language models (LLMs), and they mainly come in two types: generative models, which use their training data to predict likely sequences of words, and discriminative models, which use their training data to classify inputs based on learned feature weights.
The most foolproof way to create an AI model that produces reliable, correct outputs (thus fostering customers' use and trust) is simply to build a custom model from scratch. Unfortunately, that's a bit like saying the best way to buy the optimal wine for your restaurant is to fly to Italy or France and start your own vineyard: an option that technically exists, but is out of budget for most. Even assuming you have the right people and the right training data, creating your own AI model requires thousands of graphics processing units (GPUs) to handle the massive datasets. (Forbes estimated that each training run for OpenAI's GPT-3 required at least $5 million worth of GPUs; OpenAI's founder said the costs of training were more than $100 million.) Of course, some enterprises can finance custom AI models for their workplace. But while a custom model may give a company more accurate results, it comes at serious cost to the bottom line, and the maintenance costs are also steep.
In response, many Chief Information Officers (CIOs) at leading companies are scrambling to find cost-effective ways to "upskill" broad foundational AI models to make AI work in their business context. Experts differ on the best way to upskill a broad foundational model for business use. As always, the starting point is to align the business problem with the AI use case, so the application fits the company as tightly as possible.
Below, we dive into a few common methods for upskilling AI models. What works best for one company may not be the best choice for another, because (as you can imagine) the best-quality AI tends to come at a similarly premium price point.
Prompt engineering. One option is to refine the prompts (the instructions) given to an AI model in order to generate more desirable outputs. Prompt engineering takes many forms. It can be instruction-based ("Provide an answer only from the attached documents."). It can be contextual: instead of the original prompt "Who won the Super Bowl?", an engineered prompt might ask, "Who won the Super Bowl in 2019, who performed at the halftime show in 2019, and how often do Super Bowls happen?" The process does little to protect against hallucinations (false but realistic-sounding outputs), which are a major risk for companies that purchase AI-enabled technology and services. Prompt engineering also lacks the transferability and versatility needed to produce accurate outputs over the long term, because models drift and company goals change. To develop and deploy trustworthy AI models that can push company goals forward in the long run, the industry standard needs to evolve past prompt engineering and leverage tools such as Retrieval Augmented Generation (RAG).
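The two styles above can be sketched in a few lines of code. This is a minimal illustration, not any vendor's API: the function names are hypothetical, and the resulting strings would simply be sent to whatever model a company uses.

```python
# Minimal sketch of two prompt-engineering styles; function names are
# illustrative only, not part of any real SDK.

def instruction_prompt(question: str, source_text: str) -> str:
    """Instruction-based prompting: constrain the model to a given source."""
    return (
        "Provide an answer only from the attached documents.\n\n"
        f"Documents:\n{source_text}\n\n"
        f"Question: {question}"
    )

def contextual_prompt(question: str, clarifications: list[str]) -> str:
    """Contextual prompting: add disambiguating detail to a vague question."""
    return " ".join([question] + clarifications)

# Mirrors the Super Bowl example in the text.
prompt = contextual_prompt(
    "Who won the Super Bowl in 2019,",
    ["who performed at the halftime show in 2019,",
     "and how often do Super Bowls happen?"],
)
```

The extra context narrows the space of plausible completions, which is why engineered prompts tend to yield more desirable outputs even though the model itself is unchanged.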
Retrieval Augmented Generation (RAG). RAG is a process in which a large language model (like GPT-4, to name the most famous) consults a pre-authorized source separate from its training data before providing the user with an output. That source can include private or proprietary enterprise data the AI would not otherwise have access to. In this way, RAG models can return more accurate (or more applicable) answers and complement prompt engineering. Many AI tools for lawyers and other professional services use RAG, yet empirical research has found that "the hallucination problem persists at significant levels": up to 33% of the time. While RAG can produce more accurate responses than prompt engineering alone, hallucinations can continue, because an LLM is still trained only to estimate the probability of the next word in a text sequence or to detect a pattern and its outcome. It is important to understand that LLMs are not designed for, or capable of, reasoning; they cannot identify knowledge in text.
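The RAG flow, stripped to its essentials, is: retrieve relevant passages from a pre-authorized source, then prepend them to the model's prompt. The toy sketch below assumes a tiny in-memory document list and scores relevance by simple word overlap; production systems use vector embeddings and an actual LLM call, both omitted here to keep the example self-contained.

```python
# Toy RAG sketch: keyword-overlap retrieval plus prompt augmentation.
# Real systems use embedding-based retrieval and then call an LLM.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by shared words with the query; return the top k."""
    q_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Augment the user's question with retrieved company data."""
    context = "\n".join(retrieve(query, documents))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Hypothetical proprietary documents the base model was never trained on.
company_docs = [
    "Refund requests must be filed within 30 days of purchase.",
    "Our offices are closed on national holidays.",
    "Enterprise plans include priority support and a dedicated manager.",
]
prompt = build_rag_prompt("What is the refund window for a purchase?", company_docs)
```

Because the refund policy now appears verbatim in the prompt, the model can ground its answer in it; but as the text notes, nothing in this mechanism forces the model to use the context faithfully, which is why hallucinations persist.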
Fine-tuning with Expert Input. Another option, almost (but not quite) as expensive as custom building, is to fine-tune an LLM using expert input. While RAG gives the model access to more data, fine-tuning modifies the model's parameters. Fine-tuning is a supervised learning method: humans select, organize, and label the data used for training. The model is exposed to a dataset of labeled examples and improves on its initial training by updating its model weights (probabilities) based on the new data. Fine-tuned models tend to produce outcomes that are more reliable and accurate than models improved through prompt engineering or RAG alone. Fine-tuning, however, is still expensive and complicated. To do it effectively, a company should expect a serious investment in its technology budget to cover model development and deployment, on top of paying for top-notch expert data and talent.
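The core idea, updating model weights from expert-labeled examples, can be shown at toy scale. In the sketch below the "model" is a two-weight classifier and each labeled example nudges the weights toward the expert's label; real LLM fine-tuning applies the same principle across billions of parameters. The ticket-triage scenario and all names are hypothetical.

```python
import math

# Toy illustration of supervised fine-tuning: a tiny model's weights are
# updated from expert-labeled data. Hypothetical scenario: deciding whether
# a support ticket should be escalated.

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(weights, labeled_examples, lr=0.5, epochs=200):
    """Update model weights based on labeled examples (gradient steps)."""
    w = list(weights)
    for _ in range(epochs):
        for features, label in labeled_examples:
            pred = sigmoid(sum(wi * xi for wi, xi in zip(w, features)))
            error = label - pred              # how far off the model was
            for i, xi in enumerate(features):
                w[i] += lr * error * xi       # nudge toward the expert label
    return w

# Expert-labeled data: features are [bias, is_urgent]; label 1 = escalate.
data = [
    ([1.0, 1.0], 1),
    ([1.0, 0.0], 0),
    ([1.0, 1.0], 1),
    ([1.0, 0.0], 0),
]
tuned = fine_tune([0.0, 0.0], data)
```

After training, the tuned weights separate urgent from non-urgent tickets, which is the same payoff the text describes at enterprise scale: behavior learned from expert labels, baked into the parameters rather than supplied at prompt time.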