
Commit b90b703

Update README, add demo gif
1 parent a45baf2 commit b90b703

File tree

2 files changed: +44, -8 lines


README.md

Lines changed: 44 additions & 8 deletions
@@ -5,10 +5,14 @@ tangerine is a slim and light-weight RAG (Retrieval Augmented Generation) system u
Each agent is intended to answer questions related to a set of documents known as a knowledge base (KB).

![Demo video](docs/demo.gif)

- [Overview](#overview)
- [Architecture](#architecture)
- [Data Preparation](#data-preparation)
- [Retrieval Augmented Generation (RAG)](#retrieval-augmented-generation-rag)
  - [Document Processing](#document-processing)
  - [Document Processing Logic](#document-processing-logic)
- [Purpose of Backend Service](#purpose-of-backend-service)
- [Related Frontends](#related-frontends)
- [Use of Hosted AI Services](#use-of-hosted-ai-services)
@@ -44,13 +48,7 @@ It was born out of a hack-a-thon and is still a work in progress. You will find
- **A:** Documents are uploaded to the backend service
  - (alternatively, they can be sync'd from an AWS S3 bucket)
- **B:** The documents are processed/converted/cleaned up, see [Document Processing](#document-processing) below
- **C:** The documents are split into separate text chunks
- **D:** Embeddings are created for each text chunk and inserted into the vector database
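As a rough illustration of this A through D flow, here is a minimal sketch; the helper callables and the `add_texts`-style store method are hypothetical names, not taken from tangerine itself:

```python
from typing import Callable

def ingest(raw_doc: str, agent_id: int, store,
           process: Callable[[str], str],        # B: convert/clean up (hypothetical)
           split: Callable[[str], list[str]]):   # C: split into chunks (hypothetical)
    text = process(raw_doc)   # B: processed/converted/cleaned up
    chunks = split(text)      # C: separate text chunks
    # D: the store creates one embedding per chunk and inserts it into the
    # vector database, tagged with the owning agent's knowledge base
    store.add_texts(chunks, metadatas=[{"agent_id": agent_id}] * len(chunks))
```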

@@ -61,7 +59,45 @@ It was born out of a hack-a-thon and is still a work in progress. You will find
- **3:** A similarity search and a max marginal relevance search are performed against the vector DB to find the top N most relevant document chunks
  - The document set searched is scoped only to that specific agent
- **4:** The LLM is prompted to answer the question using only the context found within the relevant document chunks
- **5:** The LLM response is streamed by the backend service to the user. Metadata containing the document chunks is also returned to be used as citations.
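To make this concrete, here is a minimal sketch of steps 3 through 5, assuming a LangChain-style vector store that exposes `similarity_search` and `max_marginal_relevance_search`. The `store` and `llm` objects, the prompt wording, and the agent-scoping `filter` argument are illustrative assumptions, not tangerine's actual interfaces:

```python
def answer(question: str, agent_id: int, store, llm):
    # step 3: gather the top N most relevant chunks via a similarity search
    # plus a max marginal relevance search, scoped to this agent's documents
    docs = store.similarity_search(question, k=4, filter={"agent_id": agent_id})
    docs += store.max_marginal_relevance_search(question, k=4, filter={"agent_id": agent_id})

    # step 4: prompt the LLM to answer using only the retrieved context
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # step 5: stream tokens back to the caller; the chunk metadata would
    # accompany the response so the frontend can render citations
    for token in llm.stream(prompt):
        yield token
```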
#### Document Processing

Document processing is arguably the most important part of a good RAG solution. The quality of the data stored within each text chunk is key to yielding accurate search results that will be passed to the LLM to "help it" answer a user's question.

Our documentation set initially focused on pages compiled using `mkdocs` or `antora`, so our processing logic is geared toward improving the data from those sources.

- Currently, the well-supported document formats are .md and .html pages compiled with `mkdocs` or `antora`.
- Support for .pdf, .txt, and .rst exists, but the parsing is not yet well-optimized. Results may vary.
- Support for .adoc is a work-in-progress and relies on the ability of [docling](https://ds4sd.github.io/docling/) to parse the content.
#### Document Processing Logic

For markdown content, we:

1. Replace very large code blocks with text that says "This is a large code block, go read the documentation for more information"
   - Large code blocks have a tendency to fill text chunks with "useless information" that does not help with answering a user's question
2. Convert tables into plain text, with each row expressed as "header: value" statements
   - This is intended to preserve the context of a large table across text chunks
3. Fix relative links by replacing them with absolute URLs
   - This allows links within documentation to work when users review citation snippets
4. Make some formatting optimizations, such as removing extra whitespace, removing non-printable characters, etc.
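The sketch below shows how steps 1 through 4 might look in Python. The regexes, the `LARGE_CODE_BLOCK_CHARS` threshold, and all function names are illustrative assumptions, not tangerine's actual implementation:

```python
import re
from urllib.parse import urljoin

# Illustrative values; the real thresholds may differ.
LARGE_CODE_BLOCK_CHARS = 1000
FENCE = "`" * 3  # a literal triple-backtick code fence marker

def replace_large_code_blocks(md: str) -> str:
    """Step 1: swap very large fenced code blocks for a short placeholder."""
    placeholder = ("This is a large code block, "
                   "go read the documentation for more information")
    pattern = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
    return pattern.sub(
        lambda m: placeholder if len(m.group(0)) > LARGE_CODE_BLOCK_CHARS else m.group(0),
        md,
    )

def table_to_text(md_table: str) -> str:
    """Step 2: flatten a markdown table into "header: value" lines per row."""
    rows = [line for line in md_table.strip().splitlines()
            if not re.match(r"^\s*\|[\s:|-]+\|\s*$", line)]  # drop the |---| separator
    headers = [c.strip() for c in rows[0].strip("|").split("|")]
    out = []
    for row in rows[1:]:
        cells = [c.strip() for c in row.strip("|").split("|")]
        out.append(", ".join(f"{h}: {v}" for h, v in zip(headers, cells)))
    return "\n".join(out)

def fix_relative_links(md: str, base_url: str) -> str:
    """Step 3: rewrite relative markdown links as absolute URLs."""
    def repl(m: re.Match) -> str:
        text, target = m.group(1), m.group(2)
        if target.startswith(("http://", "https://", "#")):
            return m.group(0)  # already absolute, or an in-page anchor
        return f"[{text}]({urljoin(base_url, target)})"
    return re.sub(r"\[([^\]]*)\]\(([^)\s]+)\)", repl, md)

def tidy(md: str) -> str:
    """Step 4: drop non-printable characters and collapse extra blank lines."""
    md = "".join(ch for ch in md if ch.isprintable() or ch in "\n\t")
    return re.sub(r"\n{3,}", "\n\n", md)
```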
If we detect a .html page, we:

1. Check if it was created with mkdocs or antora, and if so extract only the 'content' from the page body (removing the header/footer/nav/etc.)
2. Convert the page into markdown using `html2text`, then process it as a markdown document as described above
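A minimal sketch of that flow, assuming `BeautifulSoup` for the parsing step. The CSS selectors used to locate the mkdocs/antora content container are guesses for illustration, not necessarily the ones tangerine uses:

```python
import html2text
from bs4 import BeautifulSoup

def html_page_to_markdown(page: str) -> str:
    soup = BeautifulSoup(page, "html.parser")
    # mkdocs and antora themes typically wrap the page body in a main
    # content container; fall back to <body> if we can't find one
    content = (
        soup.select_one("div[role=main]")  # common in mkdocs themes (assumption)
        or soup.select_one("article.doc")  # antora's default UI (assumption)
        or soup.body
        or soup
    )
    converter = html2text.HTML2Text()
    converter.body_width = 0  # don't hard-wrap lines in the output
    markdown = converter.handle(str(content))
    return markdown  # then processed as a markdown document, per above
```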
When creating text chunks, we split documents into chunks of about 2000 characters with no overlap.

- Sometimes the text splitter will create a very small chunk
- In this case, we will "roll" the text from the small chunk into the next one
- The goal is to fit as much "quality content" into a chunk as possible
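A minimal sketch of that chunking behavior; the paragraph-based splitting and the `MIN_CHUNK_SIZE` floor are illustrative assumptions:

```python
CHUNK_SIZE = 2000     # target size from the README
MIN_CHUNK_SIZE = 200  # hypothetical floor for what counts as "very small"

def chunk_text(text: str) -> list[str]:
    # naive splitter: cut on paragraph boundaries, pack into ~2000 chars
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > CHUNK_SIZE:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)

    # "roll" any very small chunk forward into the one that follows it
    rolled: list[str] = []
    carry = ""
    for chunk in chunks:
        chunk = f"{carry}\n\n{chunk}" if carry else chunk
        carry = ""
        if len(chunk) < MIN_CHUNK_SIZE:
            carry = chunk  # too small on its own; merge into the next chunk
        else:
            rolled.append(chunk)
    if carry:  # a small trailing chunk has no "next"; append it to the last
        if rolled:
            rolled[-1] = f"{rolled[-1]}\n\n{carry}"
        else:
            rolled.append(carry)
    return rolled
```

Rolling undersized chunks forward keeps each entry in the vector DB dense with searchable content rather than spending embeddings on fragments.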

### Purpose of Backend Service

docs/demo.gif

3.3 MB

0 commit comments
