CHEAT CODES

By Sean Robinson, with Jay Bartot and Keith Rosema • May 14, 2024

Understanding the COGS of Gen AI: Part I

Gen AI has quickly emerged as a transformative technology, sparking widespread excitement among entrepreneurs, investors, enterprises, and consumers. Over the last year, many tech innovators have developed Gen AI proof-of-concepts to explore new capabilities and create advanced applications using top models from OpenAI and its partners. Many startups are now advancing to the proof-of-value stage, which includes essential initiatives such as customer discovery, iterative product development, and securing investments.

However, transitioning from proof-of-concept to proof-of-value also brings new, unseen operational challenges, such as evaluating large language model (LLM) costs, reliability, and performance. This is complicated by the need to assess the viability of quickly evolving models, both commercial and open-source, and the economic implications of each. At MVL, we're actively immersed in these challenges, learning this new landscape as we build alongside our entrepreneurs.

In this cheat codes series, we assume you, too, are moving on from proof-of-concept to proof-of-value and are encountering similar questions and challenges as you look to launch Gen AI-driven products in 2024.

In these posts, we’ll be delving into topics such as:

  • Estimating and managing initial costs from commercial models
  • The journey and never-ending cycle of cost reduction and optimization
  • Understanding the accuracy, stability, and reliability of your production models

Tokens - The Currency of LLMs

If you’re working on a Gen AI project, whether on the business, product, or developer side, you’ve probably heard the terms “token,” “token rate,” “input tokens,” and “output tokens.” These terms are not just technical jargon: they are fundamental concepts that can have a significant impact on your bottom line.

Let's begin with a cautionary tale. A friend once found themselves staring at an eye-opening $300 bill from OpenAI after just an hour of experimentation. No, they weren't running a large-scale operation or fine-tuning a groundbreaking product. They were simply tinkering, probing the boundaries of possibility. This tale underscores the need for your whole team to understand the fundamentals of Gen AI COGS (cost of goods sold). Early-stage Gen AI product decisions can pivot on token use. While it isn’t necessary to delve into all the technical minutiae, understanding tokens and token counting is important for informed decision-making.

So, what exactly is a “token” in the context of a Large Language Model (LLM)? At its core, a token is akin to a word: the fundamental unit of information that an LLM processes, both as input and output. But tokens are also flexible enough to handle arbitrary input that includes more than just words. The technique that preserves meaning while accommodating open-ended text like this is called Byte Pair Encoding.

Byte Pair Encoding, one of the “quiet triumphs” that made modern Gen AI possible, distills arbitrary text into efficient tokens by “building up” the most common character combinations found over a large set of text. It’s instructive (and fun) to have a look at OpenAI’s tokenizer demo tool to get a feel for how these tokens work together to encode real text.

Tokenization of simple text, more unusual words, and outright gibberish.  Note how commonly repeated sequences of letters are used to “build up” complex terms and conjunctions rather than reserving a token for every possible word.
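To make the “building up” concrete, here is a minimal sketch of how BPE learns its merges in pure Python. The toy corpus, merge count, and function name are illustrative only; real tokenizers like OpenAI's operate on bytes over enormous corpora.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Start with each word split into individual characters.
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the winning pair with a fused symbol.
        fused = "".join(best)
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(fused)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges, words

merges, tokenized = bpe_merges("low lower lowest low low", 3)
print(merges)     # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(tokenized)  # 'low' becomes one token; rarer words stay split up
```

Note how the common stem “low” gets fused into a single token after two merges, while the rarer suffixes of “lower” and “lowest” remain as smaller pieces, exactly the behavior visible in the tokenizer demo.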

The 133% Approximation Rule

But how does this relate to cost management? Enter the 133% approximation rule. On average, a token equates to roughly ¾ of a word (equivalently, expect about 1.33 tokens per word, hence the name). Simple text may take a bit less, while complex or unusual words may take more. Nevertheless, this rule of thumb serves as a compass that can guide us through the process of token-based cost estimation.
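In code, the rule of thumb reduces to a one-line estimator. The sample prompt below is made up for illustration:

```python
def estimate_tokens(text):
    """Rough token estimate via the 133% rule: tokens ~= words * 4/3."""
    return round(len(text.split()) * 4 / 3)

prompt = "Summarize the following customer review in one sentence."
print(estimate_tokens(prompt))  # 8 words -> roughly 11 tokens
```

For back-of-envelope planning this is usually close enough; for exact counts you would run your provider's actual tokenizer.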

So, let’s dissect the anatomy of cost calculation. Input tokens and output tokens generally carry substantially different prices on API services, with the latter commanding a premium due to additional computational work. Tokenizing and embedding the input (akin to “setting the stage”) happens once, up front, but the entire LLM infrastructure must run token-by-token to produce output, which often makes output the greater part of total expense. API prices tend to fall over time, so it’s worth keeping track of current rates.
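A minimal sketch of that arithmetic, using hypothetical per-1,000-token prices (always check your provider's current pricing page before relying on numbers like these):

```python
def call_cost(input_tokens, output_tokens,
              input_price_per_1k, output_price_per_1k):
    """Cost of a single LLM API call, priced per 1,000 tokens."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# Hypothetical prices for illustration only.
cost = call_cost(input_tokens=2_000, output_tokens=500,
                 input_price_per_1k=0.01, output_price_per_1k=0.03)
print(f"${cost:.4f}")  # $0.0350
```

Notice that even though the output here is a quarter the size of the input, it accounts for nearly half the bill, which is the premium in action.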

These fundamental cost calculations motivate the need to bake cost estimation and management into the earliest parts of your R&D workflow. With examples like the above, and an understanding of what costs are likely to be, you can avoid unexpected bills in testing, as well as the greater issue of building a pipeline and product that costs more than the value it creates.

One approach to cost estimation involves running a set of examples through the LLM to gauge costs upfront. Monitoring usage statements in real time, early and often, provides real-world insight into your expected expenditure at scale.
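For the running-tally variant, here is a small sketch assuming OpenAI-style `usage` metadata with `prompt_tokens` and `completion_tokens` fields; other providers may name these differently, and the prices below are placeholders:

```python
class TokenTally:
    """Accumulate token usage from API response metadata."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def add(self, usage):
        # `usage` mirrors the OpenAI-style response metadata dict.
        self.prompt_tokens += usage["prompt_tokens"]
        self.completion_tokens += usage["completion_tokens"]

    def cost(self, input_price_per_1k, output_price_per_1k):
        return (self.prompt_tokens / 1000) * input_price_per_1k \
             + (self.completion_tokens / 1000) * output_price_per_1k

tally = TokenTally()
tally.add({"prompt_tokens": 1200, "completion_tokens": 300})
tally.add({"prompt_tokens": 900, "completion_tokens": 250})
print(tally.prompt_tokens, tally.completion_tokens)  # 2100 550
```

Calling `tally.add(...)` after every API response gives you a live per-pipeline-run total instead of waiting for the monthly statement.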

And if you’re less inclined to (or not yet ready for) end-to-end testing, it’s a good idea to keep a small set of examples that represent your intended pipeline use and count up the input/output words each expected task would need under reasonable conditions. Applying the 133% rule and the input/output token prices then lets you keep a weather eye on what the full process will cost. I’ve found this to be a great driver of technical decision-making early in the process. For example, one larger, more complex call to an expensive model and a chain of several prompts to a cheaper one each carry their own costs; these calculations can make the best answer clear and save you a lot in development time and cost (which tends to far exceed API cost prior to product deployment).
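As a sketch of that comparison, with hypothetical prices and token counts standing in for your own measurements:

```python
# Hypothetical per-1K-token prices for a large and a small model.
LARGE = {"in": 0.01, "out": 0.03}
SMALL = {"in": 0.0005, "out": 0.0015}

def cost(tokens_in, tokens_out, price):
    """Cost of one call given (input, output) token counts and prices."""
    return tokens_in / 1000 * price["in"] + tokens_out / 1000 * price["out"]

# One complex call to the large model...
single = cost(3000, 800, LARGE)

# ...versus a chain of three simpler calls to the small model.
chained = sum(cost(t_in, t_out, SMALL)
              for t_in, t_out in [(1200, 300), (1500, 300), (1000, 400)])

print(f"single: ${single:.4f}, chained: ${chained:.4f}")
```

With these made-up numbers the chain wins by more than an order of magnitude per run, but the answer flips easily with different token counts or quality requirements, which is exactly why it's worth doing the arithmetic before committing to an architecture.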

I hope I’ve made the case that the journey into Gen AI development (and, by extension, these articles) should begin with the least exciting step imaginable: acknowledging the role of cost management. By integrating cost estimation into the fabric of our R&D workflow, we mitigate the risk of financial surprises, keep our pipelines efficient, and sometimes spare ourselves re-development entirely. A bit of “boring” early on can save a lot of the “bad kind of exciting” down the road.

What should you do if a particular implementation comes at an untenable cost? There are several ways to bring costs down, from prompt optimization and prompt-chaining (using simpler models) to model choice and fine-tuning. We'll be discussing all of these in upcoming posts.

1—In some earlier language models, every possible dictionary word was assigned one “location” in an array, a technique called “one-hot encoding.” This still has an important place in predictive ML and NLP techniques, but the size of the English language makes it sub-optimal for generative tasks today.

2—There are a few ways to accomplish this. One is to monitor an API usage dashboard and “count up the cost” for a single run (or a few runs) of the pipeline. The other is to keep a running tally of tokens in code by reading the API response metadata.

To learn more about MVL, read our manifesto and let’s build a company together.
