Large Language Models (LLMs) like ChatGPT have reshaped the AI landscape with their impressive ability to understand and generate human-like text across a wide range of topics and applications. Since the release of OpenAI’s ChatGPT, a flurry of companies, ranging from analytics vendors, to business management software packages, to the mighty Microsoft, have been promising us that LLM-enabled interfaces can give instant answers to all our questions. But as we continue to task LLMs with processing increasingly large amounts of data, we’ve noticed a recurring problem: latency. When processing ever larger requests and generating outputs, LLMs continue to present lackluster performance characteristics. This latency will become increasingly problematic for applications that require real-time responses, such as GPT-enabled chatbots and complex AI search engines.
The need for speed is as important as ever, particularly for websites that deploy LLMs. Over a decade ago, analysts found that the slower the site, the more likely users are to simply click away and go to a competitor: Google found that an extra 0.5 seconds in search page generation time dropped traffic by 20%. Ten years ago, Amazon found that every 100ms of latency cost them 1% in sales. Since latency hurts usability and subsequently costs enterprises sales, LLM-enabled systems are likely to be impacted in the same way. Plus, given the high cost of production, including hardware costs and the fact that LLM providers typically charge per token, it is all the more imperative for LLM-enabled systems to generate revenue.
This issue isn’t new; organizations have been challenged with reducing the time it takes to process information for decades (actually, centuries). In this post, we’ll explore the factors that impact the latency of LLM requests and responses. We explain why we think latency is an inherent concern for LLM technology and predict that longer prompts, which require more tokens that are handled under the constraints of sequential generation, are a significant driver of latency. By benchmarking how OpenAI’s API response time varies with prompt length, we explore the relationship between response time and prompt size.
In subsequent posts, we’ll dive into solutions: how to actually optimize prompt size. We’ll explore what kinds of techniques might work without negatively impacting accuracy. But first, let’s talk about tokens.
The factors influencing latency: Why token generation produces a speed tradeoff
Language models are powered by a transformer architecture that predicts the next word or sub-word, called a token, based on the text it has observed so far. These tokens act as a bridge between the raw text data and the numerical representations that enable LLMs to work. The language model predicts one token at a time by assigning probabilities to candidate tokens based on the weights it learned during training. Typically, the token with the highest probability is appended to the text and becomes part of the input for the next prediction. Tokens enable fine-grained operations on text data. By generating tokens, replacing them, or masking them, LLMs can modify text in meaningful ways, with applications like machine translation, sentiment analysis, and text summarization.
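To make the "pick the most probable token" step concrete, here is a minimal sketch in Python. The vocabulary and the raw scores are invented for illustration and do not come from any real model; a real LLM scores tens of thousands of tokens, and production systems often sample from the distribution rather than always taking the top choice.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next_token(vocab, logits):
    """Greedy choice: return the token with the highest probability."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]

# Hypothetical vocabulary and scores for continuing "We went to"
vocab = ["a", "the", "banana", "<end>"]
logits = [2.1, 1.9, -3.0, 0.2]

token, prob = pick_next_token(vocab, logits)
print(token, round(prob, 3))
```

The chosen token is then appended to the text, and the whole procedure runs again for the following token.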
This token-by-token process presents a weakness: transformers are constrained when generating sequences because each item in the output sequence can only be predicted one at a time, and the LLM needs more tokens to predict text accurately. Sequence to Sequence (Seq2Seq) LLM-enabled systems create an internal representation of the input sentence and then decode the output sentence. Seq2Seq models estimate the most likely words to complete the sentence based on the high-level representation of the input sentence, as well as the words that have already been decoded. For example, say we want to generate the continuation of the sentence “We went to…”. The model breaks the input into “we, went, to” and generates the next word: “a.” Using the individual words “we, went, to, a,” the model generates the next word in the sentence, repeating until an “end token” is generated. We require more tokens to clarify complex inquiries or to create more nuanced prompts. We predict that it is precisely these longer token lengths that increase response time. And given that OpenAI bills for each token, including tokens used for prompts and outputs, longer prompts, which require more tokens, will also result in higher costs.
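The “We went to…” walkthrough can be sketched as a loop that makes one prediction per output token. The lookup table below is a toy stand-in for a model, and the word “concert” is invented to finish the sentence. The cost helper simply multiplies token counts by per-1,000-token prices that the caller supplies; the prices in the example are hypothetical, not OpenAI’s actual rates.

```python
# Toy stand-in for a language model: maps the tokens so far to the next token.
TOY_MODEL = {
    ("we", "went", "to"): "a",
    ("we", "went", "to", "a"): "concert",  # hypothetical continuation
    ("we", "went", "to", "a", "concert"): "<end>",
}

def generate(prompt_tokens, max_steps=20):
    """Generate tokens one at a time until the end token appears."""
    tokens = list(prompt_tokens)
    steps = 0
    for _ in range(max_steps):
        next_token = TOY_MODEL.get(tuple(tokens), "<end>")
        steps += 1  # one prediction step per generated token
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens, steps

def estimate_cost(prompt_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Per-token billing: both prompt and output tokens are charged."""
    return (prompt_tokens / 1000) * price_in_per_1k \
        + (output_tokens / 1000) * price_out_per_1k

tokens, steps = generate(["we", "went", "to"])
print(tokens, steps)

# Hypothetical prices, for illustration only
print(estimate_cost(1000, 500, price_in_per_1k=0.002, price_out_per_1k=0.004))
```

Because each step depends on the tokens produced before it, the steps cannot run in parallel, which is why the number of tokens matters for both time and cost.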
To show the dependency between token count and response time for input processing, we benchmarked how OpenAI’s API processing time varies in response to different prompt lengths. We gave the GPT-3.5-turbo model a Wikipedia article of a predefined length (in tokens) and prompted the model with a question that it could answer by examining the article. We ran 10 trials for each token length (from 250 to 4,000, in steps of 250).
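A benchmark of this shape could be sketched as follows. This is not the original harness: `send_prompt` and `build_prompt` are hypothetical callables that you would replace with a real API call and a real article-plus-question builder, and the choice of the median as the summary statistic is an assumption. The stand-ins below make no network requests, so the sketch runs offline.

```python
import statistics
import time

TOKEN_LENGTHS = list(range(250, 4001, 250))  # 250, 500, ..., 4000
TRIALS = 10

def time_call(send_prompt, prompt):
    """Return the wall-clock seconds one request takes."""
    start = time.perf_counter()
    send_prompt(prompt)
    return time.perf_counter() - start

def run_benchmark(send_prompt, build_prompt, lengths=TOKEN_LENGTHS, trials=TRIALS):
    """Return the median response time for each prompt length."""
    results = {}
    for n_tokens in lengths:
        prompt = build_prompt(n_tokens)
        timings = [time_call(send_prompt, prompt) for _ in range(trials)]
        results[n_tokens] = statistics.median(timings)
    return results

# Offline stand-ins (hypothetical): swap in a real prompt builder and API client.
def fake_build_prompt(n_tokens):
    return "word " * n_tokens

def fake_send_prompt(prompt):
    return "ok"

medians = run_benchmark(fake_send_prompt, fake_build_prompt, lengths=[250, 500], trials=3)
print(medians)
```

Passing the request function in as a parameter keeps the timing logic separate from any particular client library.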
The time shown on the plot is normalized against a simple and short prompt: for each token length, we subtracted the time it took to execute the short prompt from the time it took to process the longer prompt. First, we wanted to know how long it would take to process a prompt in which the text and question together totaled 500 tokens. It took 1.75 seconds. Next, we asked the model to process a no-op; we asked it to “Say Hello,” a short prompt that takes 0.75 seconds on average and doesn’t really have a practical application. We normalized the 500-token prompt against the no-op by subtracting 0.75 seconds from 1.75 seconds, which leaves one second. Normalizing against simple and short prompts removed the impact of network latency and captured the time the API spends on provisioning resources. We continued this process up to the 4,000 token mark. For prompts that used up to 2,500 tokens, the median total (without subtracting no-op time) processing time was roughly 1.25 seconds. The median response time continued to increase with added tokens and longer prompts. Compared to the “no-op” requests, longer prompts required extra time. We conclude that the more tokens used, the slower the response time.
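The normalization step is plain subtraction, and it can be written out as a short sketch. The function name and the data layout are assumptions made for illustration; the only figures used are the 1.75-second and 0.75-second values quoted above, repeated as if every trial had returned the same time.

```python
import statistics

def normalize(raw_times_by_length, noop_times):
    """Subtract the average no-op time from the median time at each length."""
    baseline = statistics.mean(noop_times)
    return {
        n_tokens: statistics.median(times) - baseline
        for n_tokens, times in raw_times_by_length.items()
    }

# Values from the text: a 500-token prompt at 1.75 s, the no-op at 0.75 s
raw = {500: [1.75, 1.75, 1.75]}
noop = [0.75, 0.75, 0.75]

normalized = normalize(raw, noop)
print(normalized)  # {500: 1.0}
```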
Though the frontiers of artificial intelligence are rapidly advancing, when it comes to response speed, LLMs still have a ways to go. The sequential nature of token generation is always going to cause LLM-enabled models to process input and output requests of longer prompts more slowly. But by confirming the correlation between prompt length and response time, we know that the next step for improving latency relies on our ability to optimize prompt size without negatively impacting accuracy.