The journey from proof of concept to proof of value: part II

Choosing and Optimizing Your LLMs

In our previous post, we discussed several ways to enhance the performance of your AI applications. One of those approaches is evaluating and choosing a GenAI model. With the proliferation of modern language models, you could be forgiven for feeling somewhat lost. If you’re creating a GenAI-forward project using LLMs, you have a huge number of options, and choosing the right one can be daunting. The landscape is teeming with commercial offerings, cloud endpoints with ever-expanding catalogs, and a plethora of free “BYOHardware” solutions, many allowing fine-tuning, function-calling, and every other new development. Perhaps most difficult of all is navigating “performance data”, with each model claiming to be tested on the ultimate benchmark set, and none of those benchmarks fully expressing real-world performance. How do you begin to sift through it all?

While everyone needs a good mix of cost and performance, how published performance metrics correspond to real use cases can be open to interpretation. For example, the Mistral/Mixtral model set uses several benchmarks to measure its performance, which look very useful for comparing models. Likewise, OpenHermes 2.5 (a derivative of the Mistral family) also publishes performance metrics. But comparing these directly is tough because this set of metrics is just one “family” out of an even larger set of sets. In particular, the set used for Mixtral is called GPT4All, which seems fairly comprehensive by itself. But TruthfulQA, AGIEval, BigBench, and others are also “performance benchmarks”, and each comprises additional datasets and measurements that cast a model’s performance in a slightly different light.

And lest you think testing against all of these would be sufficient, consider the Starling-LM model. This model uses the “MT-Bench” performance evaluation set based on a “chatbot arena”, where users can input prompts of their choice into two anonymized LLMs and then pick a winner, developing a kind of “Elo rating” for LLMs. The Starling team used GPT-4 as a judge for MT-Bench and considered a broad set of queries. While creative and impressive, this is yet another set of benchmarks we would have to consider.
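To make the pairwise-vote idea concrete, here is a minimal sketch of the textbook Elo update applied to head-to-head votes between two models. It is an illustration only: the model names, the starting rating of 1000, and the K-factor of 32 are assumptions chosen for the example, and a real leaderboard may compute its ratings differently.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one head-to-head vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating pool stays constant.
    return rating_a + delta, rating_b - delta


# Hypothetical votes as (winner, loser) pairs from anonymized comparisons.
votes = [("model_x", "model_y"), ("model_x", "model_y"), ("model_y", "model_x")]

# Assumed starting point: every model begins at the same rating.
ratings = {"model_x": 1000.0, "model_y": 1000.0}

for winner, loser in votes:
    ratings[winner], ratings[loser] = update_elo(
        ratings[winner], ratings[loser], a_won=True
    )

print(ratings)
```

A rating produced this way only says how often one model was preferred over another on whatever prompts the voters happened to try, which is part of why it may not transfer to your own task.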

Researchers deeply value complete, exhaustive, and carefully constructed tests that speak to the differences between one model and another, as these published results show. But rather than focusing on globally defined model performance, the practical developer tends to have a specific use case, drawn from a well-defined set of tasks:

Reading text or chats and pulling out summaries, or noting when desired topics are mentioned
Double-checking another prompt result (often to check hallucination or missed details)
Knowing when something is being asked for (calling functions at the right times)
Generating creative text while still maintaining adherence to some kind of plan or predetermined overview
Structuring existing information (a paragraph into bullet points, a set of loosely-defined data into tightly-defined JSON), and
Doing several of these in a chain without going off the rails.
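Several of the tasks above produce output that can be checked mechanically. As a rough sketch, the snippet below measures how often a model returns well-formed JSON with the expected keys for a structuring task. The `fake_model` function is a hypothetical stand-in for a real model call, and the prompts and key names are invented for illustration.

```python
import json
from typing import Callable


def is_valid_record(raw: str, required_keys: set[str]) -> bool:
    """True if raw parses as a JSON object containing every required key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())


def pass_rate(
    model: Callable[[str], str], prompts: list[str], required_keys: set[str]
) -> float:
    """Fraction of prompts whose output passes the structural check."""
    if not prompts:
        raise ValueError("need at least one prompt")
    passed = sum(is_valid_record(model(p), required_keys) for p in prompts)
    return passed / len(prompts)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call; it fails on one kind of input."""
    if "garbled" in prompt:
        return "Sure! Here is the JSON you asked for"
    return json.dumps({"name": "Ada", "topic": "airline travel"})


# Invented test prompts for a "loose notes in, tight JSON out" task.
prompts = [
    "extract: note 1",
    "extract: note 2",
    "extract: garbled note",
    "extract: note 4",
]

rate = pass_rate(fake_model, prompts, {"name", "topic"})
print(f"structural pass rate: {rate:.0%}")
```

A structural check like this says nothing about whether the extracted values are correct, so it works best as a first gate in front of task-specific checks.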

This isn’t strictly every possible task, but the point is that your actual needs are fairly well-defined: First, you need the simplest high-accuracy AI structure that always does your specific task, and second, the cheapest and fastest resource and model available that still meets that “every-time” requirement. Crucially, you need these things in that order - low-accuracy and failed prompt results are a non-starter no matter how inexpensive or quick they are. Simultaneously trying to engineer prompts or other solutions while also exploring the model space can lead to a lot of lost time and effort. But using the built-up confidence you probably already have with popular commodity models, you can use a prompt-first approach that avoids a lot of this effort:

As mentioned in the last article, choose a model to be “your benchmark” for a while. This is your “daily driver” model, which you use to assist in code or creative writing, and which you’ll accumulate a lot of experience with. I’ve picked GPT-3.5/ChatGPT as my starting point, but yours could be different. This model should allow easy prompt testing, be inexpensive enough to qualify even if you used it for your entire workflow, and sit somewhere between the most expensive, slowest, and most powerful models out there and the ultra-lightweight models that could be run locally. This middle position is important because developing there leaves you room to move in both directions (more powerful or simpler), based on the project's needs. Remember, you’ll be spending a lot of time here, basically training your intuition on this model - you’ll start to feel when it’s being strained to produce a result, how to best partition and chain prompts for it, and so on.
Write out the real AI tasks required for your project, at the level of the data transformations required (e.g. “A person’s free-form interactive text is answered by a model” or “A list of documents goes in, and the locations of every mention of airline travel come out”). These should be specific actions, such as extracting pertinent details from documents and converting them into a JSON object.
Attempt these tasks using a single prompt written for your benchmark model, and get an idea of how well that works. Sometimes this is a fine solution by itself, but more commonly you’ll find an “80% solution”, where the approach doesn’t work well enough to enter production.
Using your benchmark model, expand this “one prompt” approach in whatever way makes sense. Consider breaking your prompt into a short prompt chain, each prompt of which does a different piece of your task. Or define an agent to handle dynamic input if that’s closer to your use case. At this stage, you are trying to make your system as accurate as possible, using a model you understand well. The performance of this system, rather than published statistics, becomes your functional basis for comparison.
Once performance has reached a plateau, explore alternative model options. If your system is sufficiently accurate, consider dropping weight and cost - attempt the same actions on a local LLM, or perhaps deploy a customized local embedder tailored specifically for a RAG system. And if your system isn’t accurate enough, consider adding model power - move parts of your system to GPT-4/4o and determine if that closes the gap, or try a different heavier-weight model. Crucially, consider other models in the same family as your benchmark. The point is to prioritize achieving functionality first, separating optimization efforts from the team's broader pipeline development, even if the team consists solely of you.
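The steps above can be sketched as a small harness: run the same prompt chain and the same test cases through each candidate model, then keep the cheapest one that still meets your accuracy bar. Everything here is hypothetical and exists only for illustration: the stub models stand in for real API or local calls, the two-step chain and test cases are invented, and `cost_per_call` is a relative number rather than a real price.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Model = Callable[[str], str]


@dataclass
class Candidate:
    name: str
    call: Model
    cost_per_call: float  # assumed relative cost, not a real price


def run_chain(model: Model, steps: list[str], text: str) -> str:
    """Feed text through each prompt template in order."""
    for template in steps:
        text = model(template.format(input=text))
    return text


def accuracy(model: Model, steps: list[str], cases: list[tuple[str, str]]) -> float:
    """Fraction of test cases where the chain output matches exactly."""
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(run_chain(model, steps, src) == want for src, want in cases)
    return hits / len(cases)


def pick_model(
    candidates: list[Candidate],
    steps: list[str],
    cases: list[tuple[str, str]],
    required_accuracy: float,
) -> Optional[Candidate]:
    """Cheapest candidate that meets the accuracy bar, or None if none does."""
    passing = [
        c for c in candidates if accuracy(c.call, steps, cases) >= required_accuracy
    ]
    return min(passing, key=lambda c: c.cost_per_call, default=None)


def benchmark_model(prompt: str) -> str:
    """Stub for the 'daily driver' model: handles both chain steps correctly."""
    task, _, body = prompt.partition(": ")
    if task == "EXTRACT":
        return body.split()[-1]
    return body.upper()


def small_model(prompt: str) -> str:
    """Stub for a lighter model: simulated failure on longer inputs."""
    task, _, body = prompt.partition(": ")
    if task == "EXTRACT" and len(body.split()) > 5:
        return body.split()[0]
    return benchmark_model(prompt)


STEPS = ["EXTRACT: {input}", "UPPER: {input}"]
CASES = [
    ("flight to Paris", "PARIS"),
    ("we took a long train ride to Oslo", "OSLO"),
]
CANDIDATES = [
    Candidate("small", small_model, cost_per_call=1.0),
    Candidate("benchmark", benchmark_model, cost_per_call=10.0),
    Candidate("large", benchmark_model, cost_per_call=100.0),
]

choice = pick_model(CANDIDATES, STEPS, CASES, required_accuracy=1.0)
print(choice.name if choice else "no candidate meets the bar")
```

Filtering on accuracy before sorting by cost mirrors the ordering argued for earlier: a cheaper model is only considered once it clears the same bar as your benchmark.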

This isn’t the only implementation doctrine for Gen AI. Some prefer to begin with the most powerful models available, and then try to reduce time and cost once that approach is functional. But this can require restructuring some of the pipeline again - offloading prompt functionality to smaller models is generally harder than developing on more lightweight models in the first place, and then using additional power where needed. This “intermediate model prototyping” approach, followed by optimization based on a known benchmark, is a useful way to avoid getting lost in the “model performance” woods. And if you have a set of business needs like the ones discussed here, hopefully, it can save you some time as well.



By Sean Robinson, with Jay Bartot and Keith Rosema • May 23, 2024
