Commit 6548033

Add additional doc, better citations
1 parent bdc9dff commit 6548033

5 files changed: +43254 -13955 lines changed

README.md

+1 -1
@@ -23,7 +23,7 @@ These scripts for RAG:
 * [`rag_queryrewrite.py`](./rag_queryrewrite.py): Adds a query rewriting step to the RAG process, where the user's question is rewritten to improve the retrieval results.
 * [`rag_documents_ingestion.py`](./rag_ingestion.py): Ingests PDFs by using pymupdf to convert to markdown, then using Langchain to split into chunks, then using OpenAI to embed the chunks, and finally storing in a local JSON file.
 * [`rag_documents_flow.py`](./rag_pdfs.py): A RAG flow that retrieves matching results from the local JSON file created by `rag_documents_ingestion.py`.
-* [`rag_hybrid.py`](./rag_hybrid.py): A RAG flow that implements a hybrid retrieval with both vector and keyword search, merging with Reciprocal Rank Fusion (RRF), and semantic re-ranking with a cross-encoder model.
+* [`rag_documents_hybrid.py`](./rag_documents_hybrid.py): A RAG flow that implements a hybrid retrieval with both vector and keyword search, merging with Reciprocal Rank Fusion (RRF), and semantic re-ranking with a cross-encoder model.

 ## Setting up the environment
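The README entry for `rag_documents_hybrid.py` mentions merging vector and keyword results with Reciprocal Rank Fusion (RRF). That script is not shown in this commit view, so the following is only a minimal sketch of the general technique and not the repository's code. The function name, the sample document IDs, and the constant `k = 60` (a commonly used default) are all assumptions made for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranked list.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1), and the contributions are summed.
    `k` is an assumed smoothing constant; 60 is a common choice.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from a vector search and a keyword search.
vector_results = ["a", "b", "c"]
keyword_results = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_results, keyword_results])
print(fused)
```

Documents that rank well in both lists ("b" and "a" here) rise to the top, which is the property that makes RRF a convenient merge step before any re-ranking.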

data/Aphideater_hoverfly.pdf

255 KB
Binary file not shown.
File renamed without changes.

rag_documents_ingestion.py

+4 -3
@@ -1,5 +1,6 @@
 import json
 import os
+import pathlib

 import azure.identity
 import openai
@@ -34,12 +35,12 @@
 client = openai.OpenAI(api_key=os.environ["OPENAI_KEY"])
 MODEL_NAME = os.environ["OPENAI_MODEL"]

-
-filenames = ["data/California_carpenter_bee.pdf", "data/Centris_pallida.pdf", "data/Western_honey_bee.pdf"]
+data_dir = pathlib.Path(os.path.dirname(__file__)) / "data"
+filenames = ["California_carpenter_bee.pdf", "Centris_pallida.pdf", "Western_honey_bee.pdf", "Aphideater_hoverfly.pdf"]
 all_chunks = []
 for filename in filenames:
     # Extract text from the PDF file
-    md_text = pymupdf4llm.to_markdown(filename)
+    md_text = pymupdf4llm.to_markdown(data_dir / filename)

     # Split the text into smaller chunks
     text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
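The second hunk replaces paths relative to the current working directory (`"data/..."`) with paths anchored at the script's own directory, so the script should find its PDFs regardless of where it is launched from. A minimal sketch of that idea follows; the helper `data_path` and the example script location `/repo/rag_documents_ingestion.py` are hypothetical, and the helper takes the script path as a parameter (instead of reading `__file__` directly) only so it can be tried in isolation.

```python
import os
import pathlib


def data_path(script_file, filename):
    """Return the path of `filename` inside the `data` folder next to `script_file`.

    Mirrors the expression in the diff:
    pathlib.Path(os.path.dirname(__file__)) / "data"
    """
    data_dir = pathlib.Path(os.path.dirname(script_file)) / "data"
    return data_dir / filename


# Hypothetical script location, for illustration only.
resolved = data_path("/repo/rag_documents_ingestion.py", "Centris_pallida.pdf")
print(resolved)
```

`pathlib.Path(script_file).parent / "data"` builds the same directory without going through `os.path`, which may read slightly more idiomatically if the `os` import is not otherwise needed for paths.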
