Local RAG from scratch using PyTorch
This project uses PyTorch to build a local RAG pipeline from scratch.
We're going to build a chat-with-PDF system: you provide any PDF of your choice (a book, an article, etc.), ask any question about it, and get a custom response.
Excellent frameworks such as LangChain and LlamaIndex make it easy to work with LLMs and to build this kind of pipeline. But our goal is to create everything from scratch.
01. Retrieval-Augmented Generation (RAG)
1.1 What's RAG?
RAG stands for Retrieval-Augmented Generation.
It was introduced by Facebook AI Research in the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
Let's break RAG down:
- Retrieval - Seeking relevant information from a database given a query, e.g. retrieving relevant passages from Wikipedia when given a question.
- Augmented - Using the retrieved information to modify the input (the prompt) to a generative model (e.g. an LLM).
- Generation - Generating output from the augmented input, using a generative model such as an LLM. (A minimal sketch of all three steps follows this list.)
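To make the three steps concrete, here is a minimal sketch of the whole loop. It assumes the sentence-transformers library for embeddings; the model name (all-MiniLM-L6-v2), the toy passages, and the generation placeholder are illustrative choices, not the code we build later in the project.

```python
# Minimal RAG sketch: retrieval -> augmentation -> generation.
# Assumes the sentence-transformers library; the model name and
# passages are illustrative placeholders.
import torch
from sentence_transformers import SentenceTransformer

# Toy "database" of passages to retrieve from.
passages = [
    "RAG retrieves relevant passages before generating an answer.",
    "PyTorch is an open-source deep learning framework.",
    "Wikipedia is a free online encyclopedia.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
passage_embeddings = embedder.encode(passages, convert_to_tensor=True)

def rag_prompt(query: str, top_k: int = 2) -> str:
    # 1. Retrieval: score every passage against the query.
    # all-MiniLM-L6-v2 outputs unit-normalized vectors, so the dot
    # product equals cosine similarity.
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    scores = torch.matmul(passage_embeddings, query_embedding)
    top_indices = torch.topk(scores, k=top_k).indices
    context = "\n".join(passages[i] for i in top_indices)

    # 2. Augmentation: inject the retrieved context into the prompt.
    # 3. Generation would pass this prompt to a local LLM (added later).
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(rag_prompt("What does RAG do?"))
```

Swapping the toy passages for chunks of a PDF and the final print for a call to a local generative model turns this sketch into the pipeline we build over the rest of the project.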
1.2 Why is RAG important?
LLMs excel at language modeling and demonstrate a deep understanding of language, producing responses by leveraging the text they were trained on. But although LLMs generate fluent text, that doesn't mean the text is factual.
So, let's cover why RAG is essential for LLMs.
Prevent Hallucination: LLMs are probabilistic models, which means they can produce incorrect information. For some use cases, getting factual answers is critical. This is where RAG comes in: it retrieves facts relevant to the user's query to ground the generation.
Custom Data: LLMs are first pre-trained to understand language and then fine-tuned to adapt to specific tasks. With that approach, every time we receive new data we have to train the model again, which costs both time and money. RAG instead supplies the LLM with the relevant information at query time, with no fine-tuning needed.
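As a hypothetical illustration of that last point (the function name and signature are ours, not part of the project yet): when new documents arrive, we only embed and append them to the retrieval index; the LLM's weights are never touched.

```python
# Sketch of the "custom data" advantage: new documents are added to
# the index by embedding them, with no retraining of the LLM.
# The helper name and signature are illustrative assumptions.
import torch

def add_documents(index: torch.Tensor, passages: list[str],
                  new_passages: list[str], embedder) -> torch.Tensor:
    # Embed only the new passages and stack them onto the existing index.
    new_embeddings = embedder.encode(new_passages, convert_to_tensor=True)
    passages.extend(new_passages)
    return torch.cat([index, new_embeddings], dim=0)
```

Compare this with fine-tuning, where the same update would mean another full training run.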
1.3 Why Local?
All the work is done locally. Why?
Privacy, speed, and cost. Let's break them down in detail:
- Privacy: running locally, we don't need to send sensitive data to a third-party API, e.g. the ChatGPT API.
- Speed: we don't have to wait out API latency or downtime. If our hardware is running, the pipeline is running.
- Cost: APIs charge for every request you make. On your own hardware, there's nothing extra to pay per request.