This repository was archived by the owner on Jun 8, 2024. It is now read-only.

Commit a7e78fd

Make embedchain js package live.
Port python functionality of creating a simple app to JS.
- Supports pdf file, webpage and local qna pair as datatypes.
- Uses Chroma database as the vector database
1 parent 7297bfc commit a7e78fd

19 files changed: +3721 -1 lines changed

Diff for: .gitignore

+3
@@ -128,3 +128,6 @@ dist
 .yarn/build-state.yml
 .yarn/install-state.gz
 .pnp.*
+
+.ideas.md
+.todos.md

Diff for: README.md

+209-1
@@ -1 +1,209 @@
-# embedchainjs
+# embedchainjs
+
+[![](https://dcbadge.vercel.app/api/server/nhvCbCtKV?style=flat)](https://discord.gg/nhvCbCtKV)
+
+embedchain is a framework for easily creating LLM-powered bots over any dataset. embedchainjs is the JavaScript version of embedchain. If you want the Python version, check out [embedchain-python](https://github.com/embedchain/embedchain).
+
+It abstracts the entire process of loading a dataset, chunking it, creating embeddings, and storing them in a vector database.
+
+You can add one or more datasets using the `.add` and `.add_local` functions, and then use the `.query` function to find an answer from the added datasets.
+
+If you want to create a Naval Ravikant bot that has two of his blog posts, as well as a question-and-answer pair you supply, all you need to do is add the links to the blog posts and the QnA pair, and embedchain will create a bot for you.
+
+```javascript
+const dotenv = require("dotenv");
+dotenv.config();
+const { App } = require("embedchain");
+
+// Run the app commands inside an async function only
+async function testApp() {
+  const naval_chat_bot = await App();
+
+  // Embed Online Resources
+  await naval_chat_bot.add("web_page", "https://nav.al/feedback");
+  await naval_chat_bot.add("web_page", "https://nav.al/agi");
+  await naval_chat_bot.add(
+    "pdf_file",
+    "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf"
+  );
+
+  // Embed Local Resources
+  await naval_chat_bot.add_local("qna_pair", [
+    "Who is Naval Ravikant?",
+    "Naval Ravikant is an Indian-American entrepreneur and investor.",
+  ]);
+
+  const result = await naval_chat_bot.query(
+    "What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"
+  );
+  console.log(result);
+  // answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.
+}
+
+testApp();
+```
+
+# Getting Started
+
+## Installation
+
+- First make sure that you have the package installed. If not, install it using `npm`:
+
+```bash
+npm install embedchain
+```
+
+- Make sure that the dotenv package is installed and that your `OPENAI_API_KEY` is set in a file called `.env` in the root folder. You can install dotenv with:
+
+```bash
+npm install dotenv
+```
+
+- Download and install Docker on your device by visiting [this link](https://www.docker.com/). You will need it to run the Chroma vector database on your machine.
+
+- Run the following commands to set up the Chroma container in Docker:
+
+```bash
+git clone https://github.com/chroma-core/chroma.git
+cd chroma
+docker-compose up -d --build
+```
+
+- Once the Chroma container has been set up, run it inside Docker.
+
+## Usage
+
+- We use OpenAI's embedding model to create embeddings for chunks and the ChatGPT API as the LLM to get an answer given the relevant docs. Make sure that you have an OpenAI account and an API key. If you don't have an API key, you can create one by visiting [this link](https://platform.openai.com/account/api-keys).
+
+- Once you have the API key, set it in an environment variable called `OPENAI_API_KEY`:
+
+```bash
+# Set this inside your .env file
+OPENAI_API_KEY="sk-xxxx"
+```
+
+- Load the environment variables inside your .js file using the following commands:
+
+```js
+const dotenv = require("dotenv");
+dotenv.config();
+```
+
+- Next, import the `App` class from embedchain and use the `.add` function to add any dataset.
+- Now your app is created. You can use the `.query` function to get the answer to any query.
+
+```js
+const dotenv = require("dotenv");
+dotenv.config();
+const { App } = require("embedchain");
+
+async function testApp() {
+  const naval_chat_bot = await App();
+
+  // Embed Online Resources
+  await naval_chat_bot.add("web_page", "https://nav.al/feedback");
+  await naval_chat_bot.add("web_page", "https://nav.al/agi");
+  await naval_chat_bot.add(
+    "pdf_file",
+    "https://navalmanack.s3.amazonaws.com/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf"
+  );
+
+  // Embed Local Resources
+  await naval_chat_bot.add_local("qna_pair", [
+    "Who is Naval Ravikant?",
+    "Naval Ravikant is an Indian-American entrepreneur and investor.",
+  ]);
+
+  const result = await naval_chat_bot.query(
+    "What unique capacity does Naval argue humans possess when it comes to understanding explanations or concepts?"
+  );
+  console.log(result);
+  // answer: Naval argues that humans possess the unique capacity to understand explanations or concepts to the maximum extent possible in this physical reality.
+}
+
+testApp();
+```
+
+- If there is any other `App` instance in your script or app, you can rename the import:
+
+```javascript
+const { App: EmbedChainApp } = require("embedchain");
+
+// or
+
+const { App: ECApp } = require("embedchain");
+```
+
+## Formats supported
+
+We support the following formats:
+
+### PDF File
+
+To add any PDF file, use the data_type `pdf_file`. For example:
+
+```javascript
+await app.add("pdf_file", "a_valid_url_where_pdf_file_can_be_accessed");
+```
+
+### Web Page
+
+To add any web page, use the data_type `web_page`. For example:
+
+```javascript
+await app.add("web_page", "a_valid_web_page_url");
+```
+
+### QnA Pair
+
+To supply your own QnA pair, use the data_type `qna_pair` and pass a two-element array. For example:
+
+```javascript
+await app.add_local("qna_pair", ["Question", "Answer"]);
+```
+
+### More Formats coming soon
+
+- If you want to add any other format, please create an [issue](https://github.com/embedchain/embedchainjs/issues) and we will add it to the list of supported formats.
+
+# How does it work?
+
+Creating a chat bot over any dataset requires the following steps:
+
+- load the data
+- create meaningful chunks
+- create embeddings for each chunk
+- store the chunks in a vector database
+
+Whenever a user asks a query, the following process finds the answer:
+
+- create the embedding for the query
+- find similar documents for this query in the vector database
+- pass the similar documents as context to the LLM to get the final answer
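
The indexing and query steps listed above can be sketched end to end without any external service. This is only an illustration under stated assumptions: `embed` is a toy word-count embedding over a hand-picked vocabulary (standing in for OpenAI's embedding model), the "vector database" is a plain in-memory array (standing in for Chroma), and `buildPrompt` stops short of calling an LLM. All function names here are hypothetical and are not part of embedchain.

```javascript
// Toy stand-in for an embedding model: counts occurrences of each vocabulary term.
// Assumption: a real setup would call an embedding API here instead.
function embed(text, vocabulary) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return vocabulary.map((term) => words.filter((w) => w === term).length);
}

// Cosine similarity between two equal-length vectors; 0 if either is all zeros.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Indexing: embed each chunk and keep the vector next to its text.
function buildIndex(chunks, vocabulary) {
  return chunks.map((text) => ({ text, vector: embed(text, vocabulary) }));
}

// Querying: embed the query, rank chunks by similarity, return the top k texts.
function findSimilar(index, query, vocabulary, k) {
  const queryVector = embed(query, vocabulary);
  return index
    .map((entry) => ({
      text: entry.text,
      score: cosineSimilarity(entry.vector, queryVector),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.text);
}

// Final step: the retrieved chunks become the context of the LLM prompt.
function buildPrompt(contextChunks, query) {
  return `Context:\n${contextChunks.join("\n")}\n\nQuery: ${query}`;
}

const vocabulary = ["chroma", "vector", "pdf", "chunk", "docker"];
const index = buildIndex(
  [
    "Chroma is the vector database used to store embeddings.",
    "A pdf file is split into one chunk after another.",
    "Docker runs the Chroma container.",
  ],
  vocabulary
);
const question = "Which vector database stores the embeddings?";
const top = findSimilar(index, question, vocabulary, 1);
console.log(buildPrompt(top, question));
```

Ranking by cosine similarity is one common choice for the "find similar documents" step; the source does not say which distance Chroma is configured with here.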
+
+The process of loading the dataset and then querying involves multiple steps, and each step has nuances of its own:
+
+- How should I chunk the data? What is a meaningful chunk size?
+- How should I create embeddings for each chunk? Which embedding model should I use?
+- How should I store the chunks in the vector database? Which vector database should I use?
+- Should I store metadata along with the embeddings?
+- How should I find similar documents for a query? Which ranking model should I use?
+
+These questions may be trivial for some, but for a lot of us, finding good answers takes research, experimentation and time.
+
+embedchain is a framework that takes care of all these nuances and provides a simple interface to create bots over any dataset.
+
+In the first release, we are making it easier for anyone to get a chatbot over any dataset up and running in less than a minute. All you need to do is create an app instance, add the datasets using the `.add` function, and then use the `.query` function to get the relevant answer.
+
+# Tech Stack
+
+embedchain is built on the following stack:
+
+- [Langchain](https://github.com/hwchase17/langchain) as an LLM framework to load, chunk and index data
+- [OpenAI's Ada embedding model](https://platform.openai.com/docs/guides/embeddings) to create embeddings
+- [OpenAI's ChatGPT API](https://platform.openai.com/docs/guides/gpt/chat-completions-api) as LLM to get answers given the context
+- [Chroma](https://github.com/chroma-core/chroma) as the vector database to store embeddings
+
+# Author
+
+- Taranjeet Singh ([@taranjeetio](https://twitter.com/taranjeetio))

Diff for: embedchain/chunkers/base_chunker.js

+35
@@ -0,0 +1,35 @@
+const { createHash } = require("crypto");
+
+class BaseChunker {
+  constructor(text_splitter) {
+    this.text_splitter = text_splitter;
+  }
+
+  async create_chunks(loader, url) {
+    const documents = [];
+    const ids = [];
+    const datas = await loader.load_data(url);
+    const metadatas = [];
+    for (const data of datas) {
+      const content = data["content"];
+      const meta_data = data["meta_data"];
+      const chunks = await this.text_splitter.splitText(content);
+      const url = meta_data["url"];
+      for (const chunk of chunks) {
+        const chunk_id = createHash("sha256")
+          .update(chunk + url)
+          .digest("hex");
+        ids.push(chunk_id);
+        documents.push(chunk);
+        metadatas.push(meta_data);
+      }
+    }
+    return {
+      documents: documents,
+      ids: ids,
+      metadatas: metadatas,
+    };
+  }
+}
+
+module.exports = { BaseChunker };
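
The ID scheme used in `create_chunks` (SHA-256 over the chunk text concatenated with its source URL) can be tried in isolation. The sketch below reuses that recipe in a standalone `chunkId` function. The `dedupe` helper is hypothetical and not part of this commit; it only illustrates one thing such deterministic IDs make possible, namely skipping chunks that were already added from the same URL.

```javascript
const { createHash } = require("crypto");

// Same recipe as create_chunks: hash the chunk text concatenated with its URL.
function chunkId(chunk, url) {
  return createHash("sha256").update(chunk + url).digest("hex");
}

// Hypothetical helper (not in the commit): keep only chunks whose ID is new.
function dedupe(chunks, url, seenIds = new Set()) {
  const fresh = [];
  for (const chunk of chunks) {
    const id = chunkId(chunk, url);
    if (!seenIds.has(id)) {
      seenIds.add(id);
      fresh.push({ id, chunk });
    }
  }
  return fresh;
}

const seen = new Set();
const firstPass = dedupe(["alpha", "beta"], "https://example.com/a", seen);
const secondPass = dedupe(["alpha", "gamma"], "https://example.com/a", seen);
console.log(firstPass.length, secondPass.length); // prints "2 1"
```

Because the chunk and URL are concatenated without a separator, two different (chunk, URL) pairs whose concatenations happen to match would share an ID. That is unlikely with real URLs, but it is worth knowing when reusing the scheme elsewhere.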

Diff for: embedchain/chunkers/index.js

+9
@@ -0,0 +1,9 @@
+const { QnaPairChunker } = require("./qna_pair");
+const { WebPageChunker } = require("./web_page");
+const { PdfFileChunker } = require("./pdf_file")
+
+module.exports = {
+  QnaPairChunker,
+  WebPageChunker,
+  PdfFileChunker,
+};

Diff for: embedchain/chunkers/pdf_file.js

+19
@@ -0,0 +1,19 @@
+const { BaseChunker } = require("./base_chunker");
+const { RecursiveCharacterTextSplitter } = require("langchain/text_splitter");
+
+const TEXT_SPLITTER_CHUNK_PARAMS = {
+  chunkSize: 1000,
+  chunkOverlap: 0,
+  keepSeparator: false,
+};
+
+class PdfFileChunker extends BaseChunker {
+  constructor() {
+    const text_splitter = new RecursiveCharacterTextSplitter(
+      TEXT_SPLITTER_CHUNK_PARAMS
+    );
+    super(text_splitter);
+  }
+}
+
+module.exports = { PdfFileChunker };

Diff for: embedchain/chunkers/qna_pair.js

+19
@@ -0,0 +1,19 @@
+const { BaseChunker } = require("./base_chunker");
+const { RecursiveCharacterTextSplitter } = require("langchain/text_splitter");
+
+const TEXT_SPLITTER_CHUNK_PARAMS = {
+  chunkSize: 300,
+  chunkOverlap: 0,
+  keepSeparator: false,
+};
+
+class QnaPairChunker extends BaseChunker {
+  constructor() {
+    const text_splitter = new RecursiveCharacterTextSplitter(
+      TEXT_SPLITTER_CHUNK_PARAMS
+    );
+    super(text_splitter);
+  }
+}
+
+module.exports = { QnaPairChunker };

Diff for: embedchain/chunkers/web_page.js

+19
@@ -0,0 +1,19 @@
+const { BaseChunker } = require("./base_chunker");
+const { RecursiveCharacterTextSplitter } = require("langchain/text_splitter");
+
+const TEXT_SPLITTER_CHUNK_PARAMS = {
+  chunkSize: 500,
+  chunkOverlap: 0,
+  keepSeparator: false,
+};
+
+class WebPageChunker extends BaseChunker {
+  constructor() {
+    const text_splitter = new RecursiveCharacterTextSplitter(
+      TEXT_SPLITTER_CHUNK_PARAMS
+    );
+    super(text_splitter);
+  }
+}
+
+module.exports = { WebPageChunker };
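
The three chunkers in this commit differ only in `chunkSize` (1000 for PDF files, 300 for QnA pairs, 500 for web pages), all with `chunkOverlap: 0`. To get a feel for what those numbers mean, here is a rough sketch using a naive fixed-width splitter. It is deliberately not the `RecursiveCharacterTextSplitter` from the diff: `splitFixed` is a hypothetical helper that cuts at exact character offsets, so it only approximates how many chunks a text of a given length would yield.

```javascript
// Chunk sizes taken from the three chunker files in this commit.
const CHUNK_SIZES = { pdf_file: 1000, qna_pair: 300, web_page: 500 };

// Naive fixed-width splitter, for illustration only. Unlike the splitter in
// the diff, it ignores separators and cuts at exact character offsets.
function splitFixed(text, chunkSize, chunkOverlap = 0) {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("need chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const chunks = [];
  const step = chunkSize - chunkOverlap; // how far each new chunk advances
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

// A 2500-character sample yields 3, 9 and 5 chunks respectively.
const sample = "x".repeat(2500);
for (const [dataType, chunkSize] of Object.entries(CHUNK_SIZES)) {
  console.log(dataType, splitFixed(sample, chunkSize).length);
}
```

Smaller chunks mean more embeddings to create and store per document, while larger chunks put more text into each retrieved context passage.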
