Advanced Retrieval-Augmented Generation: From Concept to LlamaIndex Implementation | by Leonie Monigatti | Feb, 2024

For more ideas on how to improve the performance of your RAG pipeline to make it production-ready, continue reading here:

This section covers the packages and API keys you need to follow along with this article.

Required Packages

This article will guide you through implementing a naive and an advanced RAG pipeline using LlamaIndex in Python.

pip install llama-index

In this article, we will be using LlamaIndex v0.10. If you are upgrading from an older LlamaIndex version, you need to run the following commands to install and run LlamaIndex properly:

pip uninstall llama-index
pip install llama-index --upgrade --no-cache-dir --force-reinstall

LlamaIndex offers an option to store vector embeddings locally in JSON files for persistent storage, which is great for quickly prototyping an idea. However, we will use a vector database for persistent storage since advanced RAG techniques aim for production-ready applications.

Since we will need metadata storage and hybrid search capabilities in addition to storing the vector embeddings, we will use the open source vector database Weaviate (v3.26.2), which supports these features.

pip install weaviate-client llama-index-vector-stores-weaviate

API Keys

We will be using Weaviate Embedded, which you can use for free without registering for an API key. However, this tutorial uses an embedding model and LLM from OpenAI, for which you will need an OpenAI API key. To obtain one, you need an OpenAI account; then select “Create new secret key” under API keys.

Next, create a local .env file in your root directory and define your API keys in it:

OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"

Afterwards, you can load your API keys with the following code:

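# !pip install python-dotenv
import os
from dotenv import load_dotenv, find_dotenv

# Locate the nearest .env file and load its variables into the environment
load_dotenv(find_dotenv())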
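A quick sanity check before calling OpenAI can save you from a confusing authentication error later. The sketch below assumes your .env file defines a variable named OPENAI_API_KEY (the name the OpenAI client reads by default); the helper name require_env is made up for this illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value


# Usage (after load_dotenv has run):
# openai_api_key = require_env("OPENAI_API_KEY")
```

Failing early with an explicit message is usually easier to debug than an error raised deep inside the first embedding call.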


This section discusses how to implement a naive RAG pipeline using LlamaIndex. You can find the entire naive RAG pipeline in this Jupyter Notebook. For the implementation using LangChain, you can continue in this article (naive RAG pipeline using LangChain).

Step 1: Define the embedding model and LLM

First, you can define an embedding model and LLM in a global settings object. Doing this means you don’t have to specify the models explicitly in the code again.

  • Embedding model: used to generate vector embeddings for the document chunks and the query.
  • LLM: used to generate an answer based on the user query and the relevant context.

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
Settings.embed_model = OpenAIEmbedding()

Step 2: Load data

Next, you will create a local directory named data in your root directory and download some example data from the LlamaIndex GitHub repository (MIT license).

!mkdir -p 'data'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham_essay.txt'

Afterward, you can load the data for further processing:

from llama_index.core import SimpleDirectoryReader

# Load data
documents = SimpleDirectoryReader(
    input_files=["./data/paul_graham_essay.txt"]
).load_data()

Step 3: Chunk documents into nodes

As the entire document is too large to fit into the context window of the LLM, you will need to partition it into smaller text chunks, which are called Nodes in LlamaIndex. You can parse the loaded documents into nodes using the SimpleNodeParser with a defined chunk size of 1024.

from llama_index.core.node_parser import SimpleNodeParser

node_parser = SimpleNodeParser.from_defaults(chunk_size=1024)

# Extract nodes from documents
nodes = node_parser.get_nodes_from_documents(documents)
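To make the idea of chunking concrete, here is a minimal sketch that splits text into overlapping fixed-size chunks. It is a simplification and not how SimpleNodeParser works internally: it counts whitespace-separated words rather than tokens and ignores sentence boundaries. The function name and the overlap default are assumptions for illustration.

```python
def chunk_words(text: str, chunk_size: int = 1024, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps a little shared context between neighboring chunks, so a statement that straddles a chunk boundary is less likely to be cut off from the words that explain it.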

Step 4: Build index

Next, you will build the index that stores all the external knowledge in Weaviate, an open source vector database.

First, you will need to connect to a Weaviate instance. In this case, we’re using Weaviate Embedded, which allows you to experiment in Notebooks for free without an API key. For a production-ready solution, deploying Weaviate yourself, e.g., via Docker, or using a managed service is recommended.

import weaviate

# Connect to your Weaviate instance (embedded mode, v3 Python client)
client = weaviate.Client(
    embedded_options=weaviate.embedded.EmbeddedOptions(),
)

Next, you will build a VectorStoreIndex from the Weaviate client to store your data in and interact with.

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.weaviate import WeaviateVectorStore

index_name = "MyExternalContext"

# Construct vector store
vector_store = WeaviateVectorStore(
    weaviate_client = client,
    index_name = index_name
)

# Set up the storage for the embeddings
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Set up the index
# build VectorStoreIndex that takes care of
# encoding the nodes to embeddings for future retrieval
index = VectorStoreIndex(
    nodes,
    storage_context = storage_context,
)
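Conceptually, a vector index stores one embedding per node and answers a query by returning the nodes whose embeddings are most similar to the query embedding. The toy class below illustrates that idea in plain Python with hand-written 3-dimensional vectors. It is not how Weaviate or VectorStoreIndex are implemented (real systems use approximate nearest-neighbor indexes and model-generated embeddings), and all names in it are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorIndex:
    """Brute-force in-memory index that stores (embedding, text) pairs."""

    def __init__(self) -> None:
        self._entries: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], text: str) -> None:
        self._entries.append((embedding, text))

    def query(self, query_embedding: list[float], top_k: int = 2) -> list[tuple[float, str]]:
        """Return the `top_k` most similar entries as (score, text), best first."""
        scored = [
            (cosine_similarity(query_embedding, emb), text)
            for emb, text in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


toy_index = ToyVectorIndex()
toy_index.add([1.0, 0.0, 0.0], "chunk about startups")
toy_index.add([0.0, 1.0, 0.0], "chunk about painting")
toy_index.add([0.9, 0.1, 0.0], "chunk about Interleaf")

for score, text in toy_index.query([1.0, 0.0, 0.0], top_k=2):
    print(f"{score:.3f}  {text}")
```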

Step 5: Set up query engine

Finally, you will set up the index as the query engine.

# The QueryEngine class is equipped with the generator
# and facilitates the retrieval and generation steps
query_engine = index.as_query_engine()

Step 6: Run a naive RAG query on your data

Now, you can run a naive RAG query on your data, as shown below:

# Run your naive RAG query
response = query_engine.query(
    "What happened at Interleaf?"
)

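Behind the scenes, a naive RAG query retrieves the most relevant chunks and places them into a prompt together with the question. The sketch below shows that augmentation step in isolation. The prompt wording and the function name are assumptions for illustration, not LlamaIndex's actual template.

```python
def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into a single prompt string."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks if chunk.strip())
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Answer the question using only the context above.\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "What happened at Interleaf?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Every advanced technique in the rest of this article changes what ends up in the context part of such a prompt; the generation step itself stays the same.
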
In this section, we will cover some simple adjustments you can make to turn the above naive RAG pipeline into an advanced one. This walkthrough will cover the following selection of advanced RAG techniques: sentence window retrieval, hybrid search, and reranking.

As we will only cover the modifications here, you can find the full end-to-end advanced RAG pipeline in this Jupyter Notebook.

For the sentence window retrieval technique, you need to make two adjustments: First, you must adjust how you store and post-process your data. Instead of the SimpleNodeParser, we will use the SentenceWindowNodeParser.

from llama_index.core.node_parser import SentenceWindowNodeParser

# create the sentence window node parser w/ default settings
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

The SentenceWindowNodeParser does two things:

  1. It separates the document into single sentences, which will be embedded.
  2. For each sentence, it creates a context window. The window_size parameter sets how many neighboring sentences are captured on each side of the embedded sentence, so with window_size = 3 the window spans from up to three sentences before the embedded sentence to up to three sentences after it. The window is stored as metadata.
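The two steps above can be sketched in plain Python. This is a simplified illustration, not the parser's implementation: sentences are split with a naive regular expression, each node is a plain dict, and the window is assumed to take up to window_size sentences on each side of the embedded sentence. The helper name and the dict layout are hypothetical.

```python
import re


def sentence_window_nodes(text: str, window_size: int = 3) -> list[dict]:
    """Create one node per sentence, with its surrounding sentences stored as metadata."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,  # this is what would be embedded
                "metadata": {"window": " ".join(sentences[start:end])},
            }
        )
    return nodes


essay = "I joined Interleaf. It made document software. I left after a year. Then I painted."
for node in sentence_window_nodes(essay, window_size=1):
    print(node["text"], "->", node["metadata"]["window"])
```

Embedding single sentences keeps the retrieval match precise, while the stored window gives the LLM enough surrounding context to answer well.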

During retrieval, the sentence that most closely matches the query is returned. After retrieval, you need to replace the sentence with the entire window from the metadata by defining a MetadataReplacementPostProcessor and using it in the list of node_postprocessors.

from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# The target key defaults to `window` to match the node_parser's default
postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

query_engine = index.as_query_engine(
    node_postprocessors = [postproc],
)
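The replacement step itself is conceptually simple: after retrieval, each node's text is swapped for the value stored under the target metadata key. A minimal sketch with plain dicts standing in for LlamaIndex nodes (the function name and node layout are hypothetical) could look like this:

```python
def replace_with_metadata(nodes: list[dict], target_key: str = "window") -> list[dict]:
    """Return copies of the nodes whose text is replaced by metadata[target_key]."""
    replaced = []
    for node in nodes:
        metadata = node.get("metadata", {})
        new_node = dict(node)  # shallow copy so the retrieved nodes stay untouched
        # Fall back to the original text if the key is missing.
        new_node["text"] = metadata.get(target_key, node["text"])
        replaced.append(new_node)
    return replaced


retrieved = [
    {
        "text": "It made document software.",
        "metadata": {
            "window": "I joined Interleaf. It made document software. I left after a year."
        },
    }
]
for node in replace_with_metadata(retrieved):
    print(node["text"])
```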

Implementing a hybrid search in LlamaIndex is as easy as two parameter changes to the query_engine if the underlying vector database supports hybrid search queries. The alpha parameter specifies the weighting between vector search and keyword-based search, where alpha=0 means pure keyword-based search and alpha=1 means pure vector search.

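# alpha = 0.5 is an example value that weights vector and keyword search equally
query_engine = index.as_query_engine(vector_store_query_mode = "hybrid", alpha = 0.5)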
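To build intuition for alpha, the sketch below blends a vector-similarity score and a keyword score for each document with a weighted sum. This is a deliberate simplification: the scores are made-up values assumed to be normalized to the range 0 to 1, and Weaviate's actual fusion of the two result lists is more involved. Function and variable names are hypothetical.

```python
def hybrid_scores(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Blend two score sets: alpha=1 is pure vector search, alpha=0 pure keyword search."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    doc_ids = set(vector_scores) | set(keyword_scores)
    blended = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1.0 - alpha) * keyword_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Sort by score (descending), then by id so ties are deterministic.
    return sorted(blended.items(), key=lambda item: (-item[1], item[0]))


vector_scores = {"doc_a": 0.9, "doc_b": 0.4}
keyword_scores = {"doc_b": 1.0, "doc_c": 0.6}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_scores(vector_scores, keyword_scores, alpha))
```

Running the loop shows the ranking shift from keyword-driven to vector-driven as alpha grows, which is a useful mental model when tuning the parameter on your own data.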

Adding a reranker to your advanced RAG pipeline only takes three simple steps:

  1. First, define a reranker model. Here, we are using the BAAI/bge-reranker-base from Hugging Face.
  2. In the query engine, add the reranker model to the list of node_postprocessors.
  3. Increase the similarity_top_k in the query engine to retrieve more context passages, which can be reduced to top_n after reranking.

# !pip install torch sentence-transformers
from llama_index.core.postprocessor import SentenceTransformerRerank

# Define reranker model
rerank = SentenceTransformerRerank(
    top_n = 2,
    model = "BAAI/bge-reranker-base"
)

# Add reranker to query engine
query_engine = index.as_query_engine(
    similarity_top_k = 6,
    node_postprocessors = [rerank],
)
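The retrieve-then-rerank pattern can be illustrated without any model. In the sketch below, a toy word-overlap scorer stands in for the cross-encoder reranker (an assumption made purely so the example runs offline); the point is the flow: retrieve a generous candidate list, rescore every candidate against the query, and keep only the top_n. All function names are hypothetical.

```python
def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


def rerank_candidates(query: str, candidates: list[str], top_n: int = 2) -> list[str]:
    """Rescore the retrieved candidates and keep the `top_n` highest-scoring ones."""
    # sorted() is stable, so equally scored candidates keep their retrieval order.
    ranked = sorted(candidates, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:top_n]


candidates = [  # pretend these came back from retrieval with similarity_top_k = 6
    "the author studied painting in florence",
    "interleaf added a scripting language",
    "what happened at interleaf was a lesson in management",
    "y combinator funded many startups",
    "the essay describes early programming",
    "viaweb was sold to yahoo",
]
print(rerank_candidates("what happened at interleaf", candidates, top_n=2))
```

A real cross-encoder scores each query and passage pair jointly, which is slower than vector search; that cost is why it is applied only to the small candidate list rather than to the whole index.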
