Let's implement a fundamental RAG system using RetrievalQA to perform queries against a local vector database.
This example demonstrates the core concepts without unnecessary complexity, making it ideal for understanding the basic RAG workflow.
I used a collection of 10 BBC news headlines covering diverse topics including science, technology, sports, and environmental issues.
In a real-world application, this corpus would typically be much larger and could contain full documents rather than just headlines.
The variety in topics helps demonstrate how RAG can retrieve relevant information from different domains based on the user's query.
LLM Selection and Configuration:
The Phi-3-mini model is used here, chosen for its balance of performance and resource efficiency.
The model's characteristics make it particularly suitable for RAG applications:
- Quantization: The q4 version (4-bit quantization) significantly reduces memory requirements while maintaining most of the model's performance capabilities.
- Size: As a "mini" model with 3.8 billion parameters, it offers a good balance between speed and capability, making it practical for local deployment.
- Instruction tuning: The "instruct" version is fine-tuned to follow instructions and respond appropriately to prompts, which is ideal for RAG workflows where the model needs to understand and follow specific formatting instructions.
LLM Configuration Parameters:
We initialize the LlamaCpp model with carefully chosen parameters that balance quality and performance:
- model_path: Path to the quantized Phi-3-mini model (4-bit quantization for reduced memory footprint).
- max_tokens: Limiting responses to 50 tokens for concise answers, preventing overly verbose responses that might dilute the key information.
- temperature: Set to 0.8, which allows some variety in wording. Lower values (0.1-0.3) are more deterministic and generally preferable for strictly factual answers, while higher values (0.9-1.0) increase creativity but can reduce accuracy.
- top_p: Nucleus sampling parameter set to 0.95, allowing for varied but relevant token selection while maintaining coherence.
- n_ctx: Context window size of 512 tokens, sufficient for this simple RAG implementation but may need adjustment for longer documents (a more conservative configuration is sketched after this list).
- seed: Fixed seed for reproducible results, essential for debugging and consistent behavior during development.
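For comparison, the following sketch (not part of the example script below) shows a more conservative configuration aimed at strictly factual answers over longer documents; the lower temperature and larger context window are illustrative values only:
from langchain_community.llms.llamacpp import LlamaCpp

# hypothetical alternative settings: near-deterministic sampling and a larger context window
factual_llm = LlamaCpp(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",  # same quantized model as in the script below
    max_tokens=100,      # allow slightly longer answers
    temperature=0.2,     # far more deterministic than 0.8
    top_p=0.95,
    n_ctx=2048,          # room for longer retrieved passages
    seed=50,
    verbose=False
)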
Embedding Model Selection:
The all-MiniLM-L6-v2 model from HuggingFace is used; it is designed for semantic similarity tasks and offers excellent performance for RAG applications (a quick check of its output dimension follows this list):
- Dimension: It produces 384-dimensional vectors, balancing expressiveness with storage requirements. Higher dimensions provide more nuanced representations but require more storage and computation.
- Speed: It's optimized for efficiency, allowing for quick embedding generation during both indexing and query time, crucial for responsive RAG systems.
- Quality: Despite its compact size (only 80MB), it performs competitively on general-purpose semantic similarity tasks. For specialized domains, consider fine-tuning or using larger models like all-mpnet-base-v2.
- Resource usage: It can run effectively on CPU, making it accessible for development environments without requiring expensive GPU resources.
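As a quick sanity check of the 384-dimensional output, you can embed a single sentence and inspect the vector length; this minimal sketch uses the same HuggingFaceEmbeddings wrapper as the full script below:
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings

# load the sentence-transformers model and embed one query
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector = embedding_model.embed_query("Barcelona win La Liga")
print(len(vector))  # 384 for all-MiniLM-L6-v2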
Vector Store Implementation:
FAISS (Facebook AI Similarity Search) serves as the vector database, providing efficient similarity search capabilities:
- Efficiency: Optimized for fast similarity search operations using advanced indexing algorithms like IVF (Inverted File) and HNSW (Hierarchical Navigable Small World).
- Scalability: Can handle from thousands to billions of vectors, making it suitable for both prototyping and production deployments.
- In-memory operation: Perfect for examples and small to medium datasets, though it also supports disk-based storage for larger collections (a persistence sketch follows this list).
- Indexing options: Supports various indexing methods for different performance profiles, from exact search to approximate nearest neighbor with configurable trade-offs between speed and accuracy.
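The disk-based persistence mentioned above is available through FAISS's save_local and load_local helpers. This is a sketch that assumes the vector_store and embedding_model objects built in the script below; the allow_dangerous_deserialization flag is required by recent langchain_community releases because the stored index metadata uses pickle:
# persist the index to disk and reload it later
vector_store.save_local("faiss_index")
reloaded_store = FAISS.load_local(
    "faiss_index",
    embedding_model,
    allow_dangerous_deserialization=True
)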
RetrievalQA Chain Workflow:
The RetrievalQA chain orchestrates the entire RAG process through several steps (a configurable variant is sketched after the list):
- Take a user query and convert it into an embedding using the same model used for document indexing.
- Search the vector store for the most semantically similar documents based on cosine similarity or other distance metrics.
- Retrieve the top-k most relevant documents.
- Format these documents along with the query into a structured prompt.
- Send the formatted input to the LLM for response generation.
- Return the generated response along with metadata about the retrieval process.
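The script below uses the chain with its defaults. As a sketch of the retrieval options mentioned above (assuming the llm and vector_store objects defined in the script), the number of retrieved documents and the returned metadata can be configured like this:
rqa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vector_store.as_retriever(search_kwargs={"k": 2}),  # retrieve the top-2 documents
    return_source_documents=True  # include the retrieved documents in the output
)
output = rqa_chain.invoke({"query": "Who won La Liga?"})
print(output["result"])
for doc in output["source_documents"]:
    print(doc.page_content)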
Install the required modules (the script also depends on langchain, langchain_community, llama-cpp-python, and sentence-transformers if they are not already installed):
$ pip install langchain langchain_community langchain_huggingface
$ pip install llama-cpp-python sentence-transformers faiss-cpu
Python code:
$ vi rag-query.py
# these libraries provide the essential functionality for embedding generation (HuggingFace), vector storage (FAISS), and LLM capabilities (LlamaCpp).
from langchain_community.llms.llamacpp import LlamaCpp
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.faiss import FAISS
from langchain.chains import RetrievalQA
# corpus (article titles from BBC)
corpus = [
    "Researchers in Japan and the US have unlocked the 60-year mystery of what gives these cats their orange colour.",
    "Astronomers have spotted around a dozen of these weird, rare blasts. Could they be signs of a special kind of black hole?",
    "The world's largest cloud computing company plans to spend £8bn on new data centres in the UK over the next four years.",
    "The Caribbean island is building a power station that will use steam naturally heated by volcanic rock.",
    "As Barcelona celebrate winning La Liga, Spanish football expert Guillem Balague looks at how manager Hansi Flick turned his young side into champions.",
    "Venezuela's Jhonattan Vegas leads the US PGA Championship with several European players close behind, but Rory McIlroy endures a tough start.",
    "Locals and ecologists are troubled by the potential impacts a looming seawall could have on the biodiverse Japanese island of Amami Ōshima.",
    "The government has made little progress in preparing the UK for rising temperatures, climate watchdog the CCC says.",
    "Half a century after the world's first deep sea mining tests picked nodules from the seafloor off the US east coast, the damage has barely begun to heal.",
    "The Cuyahoga River was so polluted it regularly went up in flames. Images of one dramatic blaze in 1952 shaped the US's nascent environmental movement, long after the flames went out."
]
llm = LlamaCpp(
    model_path="./Phi-3-mini-4k-instruct-q4.gguf",
    max_tokens=50,
    temperature=0.8,
    top_p=0.95,
    n_ctx=512,
    seed=50,
    verbose=False
)
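# embedding model: all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings and runs well on CPU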
model_name = "all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embedding_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
# indexing: vector database
vector_store = FAISS.from_texts(corpus, embedding_model)
# RetrievalQA chain
rqa_chain = RetrievalQA.from_chain_type(llm, retriever=vector_store.as_retriever())
question = "What's the major achievement?"
output = rqa_chain.invoke({"query": question})
print(output)
Run the Python script:
$ python3 rag-query.py
Output:
{
'query': "What's the major achievement?",
'result': '\n===\nThe major achievement mentioned in the context is Barcelona winning La Liga, and how manager Hansi Flick turned his young side into champions.'
}
The model correctly identified Barcelona's La Liga victory as the major achievement mentioned in the corpus.
It successfully retrieved the relevant information from the vector store and generated a response based on that information.
This demonstrates the effectiveness of semantic search: even though the query used the word "achievement" and the document mentioned "winning" and "champions," the embedding model was able to understand the semantic relationship between these concepts, as the short check below illustrates.
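You can observe the same match directly by querying the vector store without the LLM; this short check reuses the vector_store object from the script above:
hits = vector_store.similarity_search_with_score("What's the major achievement?", k=1)
for doc, score in hits:
    print(score, doc.page_content)  # the La Liga article comes back as the closest match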