- [Purpose of Backend Service](#purpose-of-backend-service)
- [Related Frontends](#related-frontends)
- [Use of Hosted AI Services](#use-of-hosted-ai-services)
- **A:** Documents are uploaded to the backend service
  (alternatively, they can be sync'd from an AWS S3 bucket)
- **B:** The documents are processed/converted/cleaned up; see [Document Processing](#document-processing) below
- **C:** The documents are split into separate text chunks
- **D:** Embeddings are created for each text chunk and inserted into the vector database
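
As a rough mental model of steps A-D, here is a self-contained toy pipeline. Every name in it (`process_document`, `split_into_chunks`, `embed`, `vector_db`, `ingest`) is a hypothetical stand-in invented for this example, not the project's real API; a real deployment would call an actual embedding model and vector database.

```python
# Toy, end-to-end sketch of steps A-D. All names are illustrative stand-ins.

def process_document(raw: str) -> str:
    # Step B: convert/clean the document (see "Document Processing" below).
    # Identity here for brevity.
    return raw

def split_into_chunks(text: str, size: int = 2000) -> list[str]:
    # Step C: naive fixed-size split; the real splitter also "rolls" small
    # chunks forward (described below).
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk: str) -> list[float]:
    # Step D (part 1): stand-in embedding; a real system calls an embedding model.
    return [float(ord(c)) for c in chunk[:8]]

# Stand-in for the vector database: (embedding, chunk) pairs kept in memory.
vector_db: list[tuple[list[float], str]] = []

def ingest(raw_document: str) -> None:
    # Step A: a document arrives (uploaded, or sync'd from S3).
    cleaned = process_document(raw_document)
    for chunk in split_into_chunks(cleaned):
        # Step D (part 2): insert the embedding into the vector database.
        vector_db.append((embed(chunk), chunk))

ingest("example document text " * 200)
print(f"{len(vector_db)} chunks stored")
```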
- **3:** A similarity search and a max marginal relevance search are performed against the vector DB to find the top N most relevant document chunks (illustrated in the sketch after this list)
  - The document set searched is scoped only to that specific agent
- **4:** The LLM is prompted to answer the question using only the context found within the relevant document chunks
- **5:** The LLM response is streamed by the backend service to the user. Metadata containing the document chunks is also returned to be used as citations.
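
To illustrate the two searches in step 3: plain cosine similarity ranks chunks purely by closeness to the query, while max marginal relevance (MMR) also penalizes chunks that duplicate ones already picked, so near-identical chunks don't crowd out other useful context. The toy vectors and the `lam` weight below are assumptions made for illustration, and the per-agent scoping is omitted.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query: np.ndarray, docs: np.ndarray, k: int = 2, lam: float = 0.5) -> list[int]:
    # Greedily pick k chunks, trading query relevance against similarity to
    # chunks already selected (lam=1.0 reduces to plain similarity search).
    selected: list[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = np.array([1.0, 1.0, 0.0])
chunk_vectors = np.array([
    [1.0, 0.0, 0.0],    # relevant
    [0.95, 0.05, 0.0],  # relevant, but nearly a duplicate of the first
    [0.0, 1.0, 0.0],    # relevant in a different way
])

top_n = sorted(range(len(chunk_vectors)),
               key=lambda i: cosine(query, chunk_vectors[i]), reverse=True)[:2]
print("similarity search:", top_n)            # [1, 0] -- the two near-duplicates
print("mmr search:", mmr(query, chunk_vectors))  # [1, 2] -- a more diverse pair
```
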
#### Document Processing

Document processing is arguably the most important part of a good RAG solution. The quality of the data stored within each text chunk is key to yielding accurate search results that will be passed to the LLM to "help it" answer a user's question.

Our documentation set has initially focused on pages that have been compiled using `mkdocs` or `antora`. Therefore, our processing logic has been highly focused on improving the data from those sources.

- Currently, the well-supported document formats include .md and .html pages compiled with `mkdocs` or `antora`.
- Support for .pdf, .txt, and .rst exists, but the parsing is not yet well-optimized. Results may vary.
- Support for .adoc is a work-in-progress and relies on the ability of [docling](https://ds4sd.github.io/docling/) to parse the content.

#### Document Processing Logic

- For markdown content (see the first sketch after this list), we:
  1. Replace very large code blocks with text that says "This is a large code block, go read the documentation for more information"
     - Large code blocks have a tendency to fill text chunks with "useless information" that does not help with answering a user's question
  2. Convert tables into plain text, with each row written as "header: value" statements
     - This is intended to preserve the context of a large table across text chunks
  3. Fix relative links by replacing them with absolute URLs
     - This allows links within documentation to work when users review citation snippets
  4. Make some formatting optimizations, such as removing extra whitespace, removing non-printable characters, etc.
- If we detect a .html page (see the second sketch after this list), we:
  1. Check if it was created with mkdocs or antora, and if so, extract only the 'content' from the page body (removing the header/footer/nav/etc.)
  2. Convert the page into markdown using `html2text`, then process it as a markdown document as described above
- When creating text chunks, we split documents into chunks of about 2000 characters with no overlap (see the final sketch after this list).
  - Sometimes the text splitter will create a very small chunk
  - In this case, we will "roll" the text from the small chunk into the next one
  - The goal is to fit as much "quality content" into a chunk as possible
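
The markdown clean-up steps can be sketched roughly as below. This is a hedged illustration, not the project's actual code: the regexes, the 1000-character "very large" threshold, and the link-rewriting rule are all assumptions made for the example (the numbered comments map to the list above).

```python
import re

def clean_markdown(md: str, base_url: str) -> str:
    # 1. Replace very large fenced code blocks with a pointer to the docs.
    def replace_big_block(match: re.Match) -> str:
        block = match.group(0)
        if len(block) > 1000:  # assumed size threshold
            return "This is a large code block, go read the documentation for more information"
        return block
    md = re.sub(r"```.*?```", replace_big_block, md, flags=re.DOTALL)

    # 3. Rewrite relative markdown links as absolute URLs.
    md = re.sub(
        r"\]\((?!https?://|#)([^)]+)\)",
        lambda m: f"]({base_url.rstrip('/')}/{m.group(1).lstrip('./')})",
        md,
    )

    # 4. Formatting optimizations: collapse blank-line runs, drop non-printables.
    md = re.sub(r"\n{3,}", "\n\n", md)
    return "".join(ch for ch in md if ch.isprintable() or ch in "\n\t")

def flatten_table(md_table: str) -> str:
    # 2. Turn a markdown table into "header: value" text, one line per row.
    rows = [r.strip().strip("|").split("|") for r in md_table.strip().splitlines()]
    headers = [h.strip() for h in rows[0]]
    return "\n".join(
        ", ".join(f"{h}: {c.strip()}" for h, c in zip(headers, row))
        for row in rows[2:]  # rows[1] is the |---|---| separator
    )
```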
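
The HTML path might look like the following sketch. `html2text` is the library named above; the CSS selectors used to find the mkdocs/antora content node are illustrative guesses, not verified against those generators' markup.

```python
import html2text                 # pip install html2text
from bs4 import BeautifulSoup    # pip install beautifulsoup4

def html_page_to_markdown(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Keep only the main content node; drop header/footer/nav chrome.
    content = (
        soup.select_one("div.md-content")   # mkdocs-material (assumed selector)
        or soup.select_one("article.doc")   # antora (assumed selector)
        or soup.body
        or soup
    )
    # Convert to markdown, then feed into the markdown clean-up path above.
    return html2text.html2text(str(content))
```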
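
Finally, a minimal sketch of the chunking rule: ~2000-character chunks with no overlap, where a too-small chunk gets rolled into the next one. The paragraph-based splitter and the 200-character "too small" cutoff are assumptions for illustration.

```python
def split_paragraph_chunks(text: str, size: int = 2000) -> list[str]:
    # Pack whole paragraphs into chunks of roughly `size` characters, no overlap.
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > size:
            chunks.append(current)
            current = ""
        current = (current + "\n\n" + para).strip()
    if current:
        chunks.append(current)
    return chunks

def roll_small_chunks(chunks: list[str], min_size: int = 200) -> list[str]:
    # "Roll" any too-small chunk into the next one instead of keeping it alone.
    rolled: list[str] = []
    carry = ""
    for chunk in chunks:
        merged = f"{carry}\n\n{chunk}".strip() if carry else chunk
        if len(merged) < min_size:
            carry = merged
        else:
            rolled.append(merged)
            carry = ""
    if carry:  # a small final chunk has nothing after it; keep it as-is
        rolled.append(carry)
    return rolled

text = "intro paragraph\n\n" + "x" * 2500 + "\n\nshort tail"
print([len(c) for c in roll_small_chunks(split_paragraph_chunks(text))])
```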