Written by: Dmytro Tkachenko
All of us (data scientists, medical professionals, engineers, CEOs, students, grandmas, employees, and everyone in between) want quick, accurate, and relevant answers to increasingly complex questions, and that demand is forcing new and revised strategies and algorithms to be applied to large language models (LLMs). Producing answers to conversation-style questions generally requires very large amounts of data and documentation.
This blog will explore the technical challenges that arise from our need for insightful information, and how modern retrieval-augmented generation (RAG) frameworks can help in both narrow and broad contexts (think medical documentation versus Encyclopedia Britannica). Our goal, unsurprisingly, is to apply these models to reduce costs, speed up responses, and improve the accuracy of answers.
Problem and Consequences
LLMs have a property called the ‘context window’, which determines the number of tokens a model can process, including input (prompt) and output (completion) tokens (for more information on tokens, read our previous blog). Recent R&D is pushing context windows to be longer, and industry-leading LLMs like Claude from Anthropic can process up to 100k tokens (roughly 75,000 words).
Still, many business and research applications might benefit from processing even larger volumes of text with an LLM, with use cases such as information retrieval, summarization, and question answering. For example, a company might want to index its entire internal knowledge base of documents, reports, customer conversations, and so on, and use an LLM to quickly access this information in a conversation-style format.
Even when a corpus of text fits within a context window, it is unreasonably slow to tokenize and transmit the entire set of documents to the LLM every time it needs to answer a question based on that text. It’s expensive too, with the cost currently standing at $11 per million input tokens. If you asked a question using Shakespeare’s works (approximately 885K words) as the primary source, each inquiry would cost about $13!
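The $13 figure can be checked with a quick back-of-the-envelope calculation. This is a rough sketch that assumes the ratio implied earlier (100k tokens for roughly 75,000 words, or about 0.75 words per token); the real ratio varies by tokenizer and text, and the function and constant names are ours.

```python
# Back-of-the-envelope cost of sending a whole corpus with every question.
WORDS_PER_TOKEN = 0.75            # assumption: 100k tokens ~ 75,000 words
PRICE_PER_MILLION_TOKENS = 11.0   # USD, the input price quoted in the text


def query_cost_usd(corpus_words):
    """Estimate the input cost of one query that includes the full corpus."""
    tokens = corpus_words / WORDS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


shakespeare_words = 885_000
shakespeare_tokens = shakespeare_words / WORDS_PER_TOKEN
cost = query_cost_usd(shakespeare_words)
print(f"~{shakespeare_tokens:,.0f} tokens, ~${cost:.2f} per query")
```

Under these assumptions the corpus comes to roughly 1.18 million tokens, which is where the "about $13" per inquiry comes from.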
This problem is addressed by retrieval-augmented generation (RAG), an approach recently developed at Facebook. The idea is to use embedding vectors and a similarity search (both defined further below) to avoid passing all of the context (an entire knowledge base) to the LLM, and instead pass only the parts relevant to answering a query.
Below, we provide a high-level view of the steps involved in retrieval-augmented generation, divided into two phases. This is followed by a deeper dive into the technology and models that play significant roles in the RAG solution. Lastly, we provide detailed technical recommendations on how to adapt or modify elements of the solution.
The described approach is visualized in the following high-level diagram (parallel with Solution):
1. Indexing flow:
- Collect the entire knowledge base that contains documents (often in different formats such as PDF, Excel, relational databases, etc.)
- Extract plain text from the documents (for example, parse PDFs to retrieve text)
- Split the resulting corpus of text into chunks (chunking strategies may vary and lead to different results)
- Use an embedding model to transform each chunk of text into an embedding vector
- Save all of the embedding vectors into a vector store (a specialized data structure or database designed for efficient nearest-neighbor queries in high-dimensional spaces, commonly used in machine learning applications), along with the index (a mapping from an embedding vector back to a document reference, e.g., a filename)
2. Query flow:
- Upon receiving a user query (such as a question), instruct the LLM to clean/summarize it. If it’s a conversation, at this stage, the LLM can also rephrase the query by considering the chat history.
- Use an embedding model (the same one used for indexing) to vectorize the query
- Query the vector store for the top K chunks that are most similar to the query vector (K is a configurable parameter)
- Query the LLM giving it the list of retrieved chunks of text and the question, and instruct it to inspect the given text and answer the question
In this approach, the vector similarity query functions as a pre-filtering step, so the LLM only needs to process the text that the similarity search has already flagged as relevant to the question. The results should be highly relevant to the query (as long as the applicable information has been indexed), quicker to process (using fewer tokens), and potentially less expensive to produce.
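The two flows can be sketched end to end in a few dozen lines. This is a toy illustration under loud assumptions: `embed` is a hashed bag-of-words stand-in for a real embedding model, `VectorStore` is a brute-force in-memory list rather than an ANN index, the documents and filenames are made up, and the final call to an LLM is left out (we only build the prompt).

```python
import math
import re
import zlib

DIM = 256  # toy dimensionality; a real embedding model defines its own


def chunk_text(text, size=40, overlap=10):
    """Split text into chunks of `size` words, sharing `overlap` words."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Toy stand-in for an embedding model (NOT semantic): hashed word counts."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # L2-normalized


class VectorStore:
    """Brute-force store: keeps (vector, chunk, source) and scans on search."""

    def __init__(self):
        self.entries = []

    def add(self, chunk, source):
        self.entries.append((embed(chunk), chunk, source))

    def search(self, query, k):
        q = embed(query)
        # Dot product equals cosine similarity because vectors are normalized.
        scored = [(sum(a * b for a, b in zip(q, vec)), chunk, source)
                  for vec, chunk, source in self.entries]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


def build_prompt(question, hits):
    """Assemble the text that would be sent to the LLM."""
    context = "\n\n".join(f"[{source}] {chunk}" for _, chunk, source in hits)
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")


# 1. Indexing flow (made-up documents)
documents = {
    "hr_policy.txt": "Employees accrue vacation days monthly. "
                     "Unused vacation days expire in March.",
    "it_guide.txt": "Reset your password from the login page. "
                    "Passwords must contain twelve characters.",
}
store = VectorStore()
for source, text in documents.items():
    for chunk in chunk_text(text):
        store.add(chunk, source)

# 2. Query flow
question = "When do unused vacation days expire?"
hits = store.search(question, k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

Only the chunk from the HR document ends up in the prompt, so the LLM never sees the IT guide. In a real system you would swap `embed` for an actual model and `VectorStore` for one of the vector libraries or databases discussed below.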
Below is a detailed diagram depicting the steps described above:
And now the same diagram, annotated with details of each step and tech used:
Technology and Models
(Advisory for readers: this is getting deep, though not necessarily Mariana Trench deep - that’s coming up next.)
Vector similarity search
The technologies that implement vector search (also called ANN – approximate nearest neighbor) can be divided into vector libraries and vector databases.
Vector libraries typically provide bare-bones vector search functionality and tend to store vectors in memory only. Their index is usually immutable, meaning it doesn’t support deletions and updates (e.g., replacing a vector associated with a document with another vector). One of the most popular and best-performing algorithms for approximate nearest neighbor search is HNSW (Hierarchical Navigable Small World). Examples of well-known and mature vector libraries include hnswlib and faiss.
Vector databases, on the other hand, are vector stores with richer functionality that implement operations typically supported by traditional databases, such as mutations (updates/deletions). Vector databases also allow storing other data besides vectors, such as the objects a vector is associated with; in contrast, vector libraries respond to search queries with object IDs only, so a secondary store is needed to retrieve the source object by its ID. Vector databases are either built from scratch or built on top of vector libraries with more functionality added.
Vector databases are preferred for building production systems that require horizontal scaling and reliability guarantees (backups, fault tolerance, monitoring/alerting). They also support replication (for increasing reliability) and sharding (to support larger datasets). Examples of popular/mature vector databases include:
- Milvus (open source with managed offering)
- Weaviate (open source with managed offering)
- Qdrant (open source with managed offering)
- Vespa (open source)
- Pinecone (closed source)
There are also established databases and search engines that have vector search capabilities enabled via plugins:
- PostgreSQL (open source; plugins: pgvector or pgvecto.rs)
- OpenSearch / Elasticsearch (open source with managed offering)
You can compare these and more vector databases here: https://objectbox.io/vector-database/
All of the databases listed above are distributed, meaning they can operate across multiple servers.
Embedding models turn a piece of text into a numeric representation (a vector) that is generally 50 to 1000+ numbers in length. Choosing a model is the most important part of ensuring the quality of results in a retrieval-augmented generation system. A high-quality embedding model has an efficient architecture with a sufficient number of parameters and is trained on a large and diverse corpus of text. For similarity search purposes, a ‘good’ model produces vectors that properly capture semantic meaning; in other words, texts that are similar in meaning should be close to each other in the vector space under the chosen distance measure. The quality of embedding models is evaluated with benchmarks, for example, the Massive Text Embedding Benchmark (MTEB). Most RAG systems use a model pre-trained on a certain dataset instead of training one from scratch.
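To make "close to each other in the vector space" concrete, here is a minimal sketch of cosine similarity, one commonly used measure. The three 4-dimensional vectors are hand-made for illustration only; they are not the output of any real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors (hand-made, not produced by a model)
cat = [0.9, 0.8, 0.1, 0.0]
kitten = [0.85, 0.9, 0.2, 0.05]
invoice = [0.0, 0.1, 0.9, 0.8]

print("cat vs kitten: ", round(cosine_similarity(cat, kitten), 3))
print("cat vs invoice:", round(cosine_similarity(cat, invoice), 3))
```

With a good model, texts with related meaning behave like `cat` and `kitten` here (similarity near 1), while unrelated texts behave like `cat` and `invoice` (similarity near 0).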
Let’s look at the model landscape:
Open-source models can be run locally with publicly available weights. A good example of a library that facilitates access to multiple open-source models is Sentence-Transformers. Among recent developments, there is a family of models from Jina. There are also many models available on the Hugging Face Hub.
The models mentioned above are general purpose, meaning they were trained on a wide variety of text from different domains (e.g., books, news, blog posts, etc.). Some applications may require or benefit from domain-specific models. These are trained on text bodies from specific areas. For example, medical models may be trained on scientific papers, drug trials, etc.; financial models may be based on financial reports. Examples: financial domain - FinGPT (open source), BloombergGPT (proprietary); medical domain - BioMedLM.
If a problem requires processing documents in multiple languages, it is worth considering multilingual embedding models instead of treating each language separately. In these models, the same words (or sentences) in different languages have embedding vectors that are close to each other in the vector space. As an example, the vector of ‘cat’ is close in the vector space to the vector of ‘gato’ (‘cat’ in Spanish). There are both open-source (e.g., models in the Sentence-Transformers library) and proprietary (e.g., Cohere) multilingual models.
Tuning and Considerations in Technical Detail
(Secondary advisory for readers: we’re delving deep with details here - it’s pretty much the Mariana Trench.)
When choosing an embedding model, it is important to consider the dimensionality of the vectors it produces for the following reasons:
- It has a major impact on search and indexing performance. In HNSW-based vector stores, the computational time required to perform the search is proportional to the number of dimensions. The complexity of constructing the index is proportional to the dimensionality multiplied by the number of entries in the index.
- An embedding vector's size determines a search index's memory consumption. High memory requirements may limit the ability to store large quantities of vectors, especially in systems that use memory-only, non-distributed vector stores.
- The embedding dimension should not be too small, either, since this may reduce the ability to capture meaning in a vector.
For example, embeddings provided by OpenAI (text-embedding-ada-002 model) use 1,536 dimensions. This dimensionality is fairly large, and using these vectors may incur a considerable computational cost. Depending on the use case, a small model (such as all-MiniLM-L12-v2 from Sentence-Transformers, with 384 dimensions) may perform just as well accuracy-wise with a significant reduction in cost and latency.
It is important to recognize that each dimension in an embedding vector captures some nuance of meaning. When considering models for analyzing text coming from multiple domains, higher dimensionality may be preferable; applications in narrow domains (such as medical) that operate with smaller and/or standardized sets of terms might work well with fewer dimensions.
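The memory impact of dimensionality is easy to estimate. The sketch below assumes vectors are stored as 32-bit floats (4 bytes per dimension) and counts only the raw vectors; an HNSW index adds graph links on top of this, so treat the numbers as a lower bound. The function name is ours.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats


def raw_vector_memory_gb(num_vectors, dim):
    """Memory for the raw vectors only, excluding any index overhead."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 1e9


for name, dim in [("text-embedding-ada-002", 1536), ("all-MiniLM-L12-v2", 384)]:
    gb = raw_vector_memory_gb(1_000_000, dim)
    print(f"{name} ({dim} dims): {gb:.2f} GB per million vectors")
```

Under these assumptions, a million 1,536-dimensional vectors need about 6.1 GB, while a million 384-dimensional vectors need about 1.5 GB, a fourfold difference before any index overhead.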
Vector store tuning
Vector stores, and vector libraries in particular, expose multiple parameters controlling how the underlying index (most commonly based on HNSW) is built and queried.
The primary purpose of these parameters is to balance quality against performance. Performance can be measured with query latency, the number of queries processed per second, or both. Search quality in approximate nearest neighbor search is measured with recall: the number of returned results that are true nearest neighbors of the query vector, divided by the number of true nearest neighbors (which equals the number of returned results when both are k).
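As a small illustration of how recall is computed, the sketch below compares the IDs returned by an approximate search against the IDs an exact (brute-force) search would return. The ID lists are made up; when both lists contain k entries, dividing by the number of true neighbors or by the number of returned results gives the same value.

```python
def recall_at_k(returned_ids, true_ids):
    """Fraction of the true nearest neighbors that the search returned."""
    true_set = set(true_ids)
    return len(true_set.intersection(returned_ids)) / len(true_set)


# Hypothetical results for one query with k = 5
exact_neighbors = [3, 7, 12, 18, 25]    # from an exhaustive search
approx_neighbors = [3, 7, 12, 40, 25]   # from an ANN index; 40 is a miss

recall = recall_at_k(approx_neighbors, exact_neighbors)
print(f"recall@5 = {recall}")
```

In practice, recall is averaged over a set of test queries, and the exact neighbors are obtained once with a brute-force scan.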
The parameters below impact the search behavior:
- k - the maximum number of nearest neighbors to be returned as a result.
- ef - the size of the dynamic list for the nearest neighbors (used during the search). Higher ef leads to more accurate but slower search. ef cannot be set lower than the number of queried nearest neighbors k. The value ef can be anything between k and the size of the dataset.
The following parameters control index construction:
- M - the number of bi-directional links created for every new element during construction. Higher M works better on datasets with high intrinsic dimensionality and/or when high recall is required, while low M works better for datasets with low intrinsic dimensionality and/or when lower recall is acceptable. In other words, high-dimensional embedding vectors (with 100+ dimensions) require higher M for optimal performance at high recall.
- efConstruction - the parameter has the same meaning as ef, but controls the indexing time and the indexing accuracy. Bigger efConstruction leads to longer construction but better index quality.
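The constraints between these parameters can be captured in a small configuration object. This is a hypothetical helper of our own, not part of any vector library, and the default values are illustrative starting points rather than recommendations; it only encodes the rule stated above that ef must lie between k and the size of the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HnswParams:
    """Hypothetical container for the HNSW parameters described above."""
    M: int = 16                 # links per element (construction)
    ef_construction: int = 200  # candidate list size (construction)
    ef: int = 50                # candidate list size (search)
    k: int = 10                 # neighbors returned per query

    def validate(self, dataset_size):
        """Raise ValueError if the parameters violate the stated constraints."""
        if self.M < 1 or self.ef_construction < 1 or self.k < 1:
            raise ValueError("M, ef_construction and k must be positive")
        if not (self.k <= self.ef <= dataset_size):
            raise ValueError(
                f"ef={self.ef} must be between k={self.k} "
                f"and dataset size={dataset_size}"
            )


params = HnswParams(ef=100, k=10)
params.validate(dataset_size=50_000)  # passes silently
print(params)
```

A check like this is cheap to run before building an index, which matters because construction parameters such as M and efConstruction generally cannot be changed without rebuilding.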
Two parameters matter most for a retrieval-augmented generation system:
- Number of similar documents retrieved per query (commonly referred to as K as in “top K most similar”). This parameter is an upper bound on the number of entries retrieved when searching for text chunks most similar to the user query. It is applied to the set of results ordered by the distance measure (most similar first), so it is the top K most similar results. It is application-specific and should be tuned according to some evaluation criteria used for the RAG system. For example, for question answering, the evaluation can measure the accuracy and completeness of answering a set of questions. If K is set too low, some relevant text chunks will not be considered; if it is too high, more irrelevant chunks will be considered when answering the query.
- Minimum similarity. When retrieving a set of chunks most similar to the user query, you can place a limit on the minimum similarity to avoid considering results below this threshold.
Both of these parameters can be configured so that many chunks are included in the context when the LLM is instructed to answer the query. Generally, this might not be detrimental to accuracy, since LLMs can be expected to “figure out” which parts of the context are irrelevant, but every extra chunk adds tokens and therefore cost and latency.
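The two parameters can be combined in a single selection step. The sketch below assumes the vector store has already returned (similarity, chunk) pairs; the scores, chunk names, and function name are made up for illustration.

```python
def select_chunks(scored_chunks, k, min_similarity):
    """Keep at most k chunks, most similar first, dropping weak matches."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [(score, chunk) for score, chunk in ranked[:k]
            if score >= min_similarity]


# Hypothetical similarity scores for one query
candidates = [
    (0.82, "chunk A"),
    (0.41, "chunk C"),
    (0.77, "chunk B"),
    (0.12, "chunk D"),
]

selected = select_chunks(candidates, k=3, min_similarity=0.5)
print(selected)
```

Here K alone would admit three chunks, but the similarity threshold removes "chunk C", so only two chunks reach the LLM. Both values are best tuned against an evaluation set for the specific application.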
Managing the speed and cost of responding to increasingly complex queries made against LLMs is a priority for most organizations delivering this capability. Users of information demand accuracy, most without recognizing the volume of materials needed to provide it (why not ingest every known book ever published?) or the resources involved. We believe there are opportunities to meet much of our users', customers', and inquisitors' needs AND balance cost and latency by using the retrieval-augmented generation approach and modifying it as necessary to produce the results best suited to the audience making the inquiries. In our next posts, we’ll be reviewing parsing and experimenting with parsers to efficiently interpret large quantities of complex data in a variety of formats and document structures, an additional strategy to reduce processing costs and accurately convey insights.