feat: support uploading pdf, docx, txt #140

Merged
merged 78 commits into from
Jul 12, 2024
8db5405
feat: upload pdf and send content to LLM
thucpn Jun 24, 2024
0d140f4
Merge branch 'main' into feat/upload-pdf
thucpn Jun 24, 2024
5a06cf9
fix: lint
thucpn Jun 26, 2024
97af04e
refactor: move embed to chat api folder
thucpn Jun 26, 2024
312b6d3
refactor: file content preview component
thucpn Jun 26, 2024
d73ee43
refactor: use content file type for all text files
thucpn Jun 26, 2024
d458783
Merge branch 'main' into feat/upload-pdf
thucpn Jun 26, 2024
035e96e
fix: lint
thucpn Jun 26, 2024
f452027
use pipeline transformation & upgrade llamaindex latest
thucpn Jun 26, 2024
f707c6f
use svg image for pdf, docx, txt
thucpn Jun 26, 2024
2d19560
fix: body collapsed when open dialog
thucpn Jun 26, 2024
04a4dc9
return backend for useClientConfig only
thucpn Jun 26, 2024
4ba4c64
fix: lint
thucpn Jun 26, 2024
fa991f4
Create late-weeks-sneeze.md
thucpn Jun 26, 2024
9c8192d
refactor: rename ContentFile to DocumentFile
thucpn Jun 26, 2024
7ff4843
refactor: rename annotation type
thucpn Jun 26, 2024
3ca2f65
refactor: document preview component
thucpn Jun 26, 2024
689ad9b
fix: lint
thucpn Jun 26, 2024
ee8cb00
refactor: move upload logic to useFile
thucpn Jun 26, 2024
28696d5
refactor: use PDFReader and TextNode
marcusschiesser Jun 26, 2024
cca49c5
fix: next.config
marcusschiesser Jun 26, 2024
afb5405
feat: support uploading docx, pdf, txt
thucpn Jun 27, 2024
4b66d29
move embeddng to chat folder
thucpn Jun 27, 2024
d07ffe9
feat: add embed api for express
thucpn Jun 27, 2024
948b1b6
fix: lint
thucpn Jun 27, 2024
0a195f8
feat: use local index
marcusschiesser Jun 27, 2024
d22310d
Merge branch 'main' into feat/upload-pdf
marcusschiesser Jul 3, 2024
aff87bb
add todos for using doc ids
marcusschiesser Jul 3, 2024
38f231c
add support for fastapi
leehuwuj Jul 5, 2024
2629c88
add filters
leehuwuj Jul 8, 2024
c21e843
fix chat filering python
leehuwuj Jul 8, 2024
34ab445
fix instantiate reader
leehuwuj Jul 8, 2024
321d77d
add save file and fix issues
leehuwuj Jul 8, 2024
259c3ec
change to /chat/upload route
leehuwuj Jul 8, 2024
d6afe28
refactor(frontend): support ref type for document content
thucpn Jul 8, 2024
6298e4b
refactor(nextjs): rename route /embed to /upload
thucpn Jul 8, 2024
608a338
feat: save document when uploading document
thucpn Jul 8, 2024
a9fa5cd
feat: add nodes to vectorstore and query
thucpn Jul 8, 2024
a00cb3d
feat: update document metadata from uploaded file infor
thucpn Jul 8, 2024
425580d
feat: get query filters from document ids
thucpn Jul 8, 2024
3b1c743
refactor(express): use new llamaindex for sharing chat logic between …
thucpn Jul 8, 2024
7a0ce3f
docs: remove useless log
thucpn Jul 8, 2024
d5f4395
fix: wrong import embedding path
thucpn Jul 8, 2024
cc15059
fix: persist vectordb
thucpn Jul 8, 2024
efdd43f
feat: don't open preview dialog for ref content
thucpn Jul 8, 2024
07e0821
use in filter operator
leehuwuj Jul 9, 2024
709ef1f
use mimetypes lib and change private file folder
leehuwuj Jul 9, 2024
43e8035
change tool-output to tool/output
leehuwuj Jul 9, 2024
fdb32b7
add back csv handler
leehuwuj Jul 9, 2024
445b4cc
add a prefix message if user uploaded a file
leehuwuj Jul 10, 2024
a2787ae
add txt reader and fix typo
leehuwuj Jul 10, 2024
d48bcb8
improve code
leehuwuj Jul 10, 2024
93fde20
fix: construct file url from private and file_name
thucpn Jul 10, 2024
884bc6d
refactor: rename embeddings to documents
thucpn Jul 10, 2024
2cc21ac
refactor: split llamaindex folder
thucpn Jul 10, 2024
ce9ce5e
fix: import path to llamaindex ts folder
thucpn Jul 10, 2024
27332e6
fix: wrong engine path
thucpn Jul 10, 2024
c3f70a1
remove redundant log
leehuwuj Jul 10, 2024
20a58c1
remove wrong log
leehuwuj Jul 10, 2024
ab279c6
update file upload to only send text content instead of list
leehuwuj Jul 10, 2024
b638eae
add missing fe and use flatreader
leehuwuj Jul 10, 2024
6aa7d57
improve code
leehuwuj Jul 10, 2024
1f85358
remove adding message prefix
leehuwuj Jul 10, 2024
498723a
update milvus package
leehuwuj Jul 11, 2024
be45b3f
improve log
leehuwuj Jul 11, 2024
5c8e79c
Merge remote-tracking branch 'origin/main' into feat/upload-pdf
leehuwuj Jul 11, 2024
695923c
update code comments
leehuwuj Jul 11, 2024
0e8786b
fix: use makeDir function with default recursive option
thucpn Jul 11, 2024
3404554
fix: lint
thucpn Jul 11, 2024
0ccb51e
fix: set request body size
thucpn Jul 11, 2024
55684b2
fix: lint
thucpn Jul 11, 2024
197cc90
cleanup PR
marcusschiesser Jul 11, 2024
e9ad3ed
fix: DocumentFileContent value can be string in backend ts
thucpn Jul 11, 2024
503141f
Update templates/components/vectordbs/python/none/generate.py
marcusschiesser Jul 11, 2024
30ebbe3
improve code
leehuwuj Jul 11, 2024
f670b1a
fix: while testing fastapi contextengine
marcusschiesser Jul 11, 2024
43149bf
refactor: clean streaming
marcusschiesser Jul 11, 2024
8068ad5
fix: use all annotations in TS code
marcusschiesser Jul 11, 2024
5 changes: 5 additions & 0 deletions .changeset/late-weeks-sneeze.md
@@ -0,0 +1,5 @@
+---
+"create-llama": patch
+---
+
+Support upload document files: pdf, docx, txt
6 changes: 4 additions & 2 deletions helpers/index.ts
@@ -8,6 +8,7 @@ import { writeLoadersConfig } from "./datasources";
 import { createBackendEnvFile, createFrontendEnvFile } from "./env-variables";
 import { PackageManager } from "./get-pkg-manager";
 import { installLlamapackProject } from "./llama-pack";
+import { makeDir } from "./make-dir";
 import { isHavingPoetryLockFile, tryPoetryRun } from "./poetry";
 import { installPythonTemplate } from "./python";
 import { downloadAndExtractRepo } from "./repo";
@@ -175,9 +176,10 @@ export const installTemplate = async (
       }
     }
 
-    // Create tool-output directory
+    // Create outputs directory
     if (props.tools && props.tools.length > 0) {
-      await fsExtra.mkdir(path.join(props.root, "tool-output"));
+      await makeDir(path.join(props.root, "output/tools"));
+      await makeDir(path.join(props.root, "output/uploaded"));
     }
   } else {
     // this is a frontend for a full-stack app, create .env file with model information
4 changes: 2 additions & 2 deletions helpers/python.ts
@@ -55,11 +55,11 @@ const getAdditionalDependencies = (
       case "milvus": {
         dependencies.push({
           name: "llama-index-vector-stores-milvus",
-          version: "^0.1.6",
+          version: "^0.1.20",
         });
         dependencies.push({
           name: "pymilvus",
-          version: "2.3.7",
+          version: "2.4.4",
         });
         break;
       }
6 changes: 6 additions & 0 deletions helpers/typescript.ts
@@ -104,6 +104,12 @@ export const installTSTemplate = async ({
       : path.join("src", "controllers");
   const enginePath = path.join(root, relativeEngineDestPath, "engine");
 
+  // copy llamaindex code for TS templates
+  await copy("**", path.join(root, relativeEngineDestPath, "llamaindex"), {
+    parents: true,
+    cwd: path.join(compPath, "llamaindex", "typescript"),
+  });
+
   // copy vector db component
   if (vectorDb === "llamacloud") {
     console.log(
2 changes: 1 addition & 1 deletion questions.ts
@@ -141,7 +141,7 @@ export const getDataSourceChoices = (
   if (selectedDataSource === undefined || selectedDataSource.length === 0) {
     if (template !== "multiagent") {
       choices.push({
-        title: "No data, just a simple chat or agent",
+        title: "No datasource",
         value: "none",
       });
     }
6 changes: 4 additions & 2 deletions templates/components/engines/python/agent/__init__.py
@@ -6,15 +6,17 @@
 from app.engine.index import get_index
 
 
-def get_chat_engine():
+def get_chat_engine(filters=None):
     system_prompt = os.getenv("SYSTEM_PROMPT")
     top_k = os.getenv("TOP_K", "3")
     tools = []
 
     # Add query tool if index exists
     index = get_index()
     if index is not None:
-        query_engine = index.as_query_engine(similarity_top_k=int(top_k))
+        query_engine = index.as_query_engine(
+            similarity_top_k=int(top_k), filters=filters
+        )
         query_engine_tool = QueryEngineTool.from_defaults(query_engine=query_engine)
         tools.append(query_engine_tool)
 
2 changes: 1 addition & 1 deletion templates/components/engines/python/agent/tools/img_gen.py
@@ -26,7 +26,7 @@ class ImageGeneratorToolOutput(BaseModel):
 
 class ImageGeneratorTool:
     _IMG_OUTPUT_FORMAT = "webp"
-    _IMG_OUTPUT_DIR = "tool-output"
+    _IMG_OUTPUT_DIR = "output/tool"
     _IMG_GEN_API = "https://api.stability.ai/v2beta/stable-image/generate/core"
 
     def __init__(self, api_key: str = None):
@@ -27,7 +27,7 @@ class E2BToolOutput(BaseModel):
 
 class E2BCodeInterpreter:
 
-    output_dir = "tool-output"
+    output_dir = "output/tool"
 
     def __init__(self, api_key: str = None):
         if api_key is None:
3 changes: 2 additions & 1 deletion templates/components/engines/python/chat/__init__.py
@@ -3,7 +3,7 @@
 from fastapi import HTTPException
 
 
-def get_chat_engine():
+def get_chat_engine(filters=None):
     system_prompt = os.getenv("SYSTEM_PROMPT")
     top_k = os.getenv("TOP_K", 3)
 
@@ -20,4 +20,5 @@ def get_chat_engine():
         similarity_top_k=int(top_k),
         system_prompt=system_prompt,
         chat_mode="condense_plus_context",
+        filters=filters,
     )
6 changes: 4 additions & 2 deletions templates/components/engines/typescript/agent/chat.ts
@@ -4,7 +4,7 @@ import path from "node:path";
 import { getDataSource } from "./index";
 import { createTools } from "./tools";
 
-export async function createChatEngine() {
+export async function createChatEngine(documentIds?: string[]) {
   const tools: BaseToolWithCall[] = [];
 
   // Add a query engine tool if we have a data source
@@ -13,7 +13,9 @@ export async function createChatEngine() {
   if (index) {
     tools.push(
       new QueryEngineTool({
-        queryEngine: index.asQueryEngine(),
+        queryEngine: index.asQueryEngine({
+          preFilters: undefined, // TODO: Add filters once LITS supports it (getQueryFilters)
+        }),
         metadata: {
           name: "data_query_engine",
           description: `A query engine for documents from your data source.`,
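
The TODO in this hunk names a `getQueryFilters` helper that does not appear in this part of the diff. The sketch below shows one way such a helper might turn the new `documentIds` argument into metadata filters: always allow nodes that are not flagged private, and additionally allow nodes that belong to the uploaded documents. The filter types are local stand-ins rather than the llamaindex types, and the `doc_id` key and the string value `"true"` are assumptions made for illustration.

```typescript
// Local stand-in types (hypothetical): they only mirror the general shape
// of a metadata filter and are not imported from llamaindex.
type FilterOperator = "in" | "nin";

interface MetadataFilter {
  key: string;
  value: string[];
  operator: FilterOperator;
}

interface MetadataFilters {
  filters: MetadataFilter[];
  condition?: "and" | "or";
}

// Hypothetical helper: builds filters from the IDs of uploaded documents.
function getQueryFilters(documentIds: string[] = []): MetadataFilters {
  // Nodes from the regular data source are assumed to lack `private: true`.
  const publicDocuments: MetadataFilter = {
    key: "private",
    value: ["true"],
    operator: "nin",
  };
  if (documentIds.length === 0) {
    return { filters: [publicDocuments] };
  }

  // Assumption: each node stores its source document ID under `doc_id`.
  const uploadedDocuments: MetadataFilter = {
    key: "doc_id",
    value: documentIds,
    operator: "in",
  };
  return { filters: [publicDocuments, uploadedDocuments], condition: "or" };
}
```

With no uploaded documents the result contains only the public filter, so private uploads from other chats stay out of the retrieval results.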
@@ -37,7 +37,7 @@ const DEFAULT_META_DATA: ToolMetadata<JSONSchemaType<ImgGeneratorParameter>> = {
 
 export class ImgGeneratorTool implements BaseTool<ImgGeneratorParameter> {
   readonly IMG_OUTPUT_FORMAT = "webp";
-  readonly IMG_OUTPUT_DIR = "tool-output";
+  readonly IMG_OUTPUT_DIR = "output/tool";
   readonly IMG_GEN_API =
     "https://api.stability.ai/v2beta/stable-image/generate/core";
 
@@ -56,7 +56,7 @@ const DEFAULT_META_DATA: ToolMetadata<JSONSchemaType<InterpreterParameter>> = {
 };
 
 export class InterpreterTool implements BaseTool<InterpreterParameter> {
-  private readonly outputDir = "tool-output";
+  private readonly outputDir = "output/tool";
   private apiKey?: string;
   private fileServerURLPrefix?: string;
   metadata: ToolMetadata<JSONSchemaType<InterpreterParameter>>;
2 changes: 1 addition & 1 deletion templates/components/engines/typescript/chat/chat.ts
@@ -1,7 +1,7 @@
 import { ContextChatEngine, Settings } from "llamaindex";
 import { getDataSource } from "./index";
 
-export async function createChatEngine() {
+export async function createChatEngine(documentIds?: string[]) {
   const index = await getDataSource();
   if (!index) {
     throw new Error(
115 changes: 115 additions & 0 deletions templates/components/llamaindex/typescript/documents/documents.ts
@@ -0,0 +1,115 @@
+import fs from "fs";
+import {
+  BaseNode,
+  Document,
+  IngestionPipeline,
+  Metadata,
+  Settings,
+  SimpleNodeParser,
+  storageContextFromDefaults,
+  VectorStoreIndex,
+} from "llamaindex";
+import { DocxReader } from "llamaindex/readers/DocxReader";
+import { PDFReader } from "llamaindex/readers/PDFReader";
+import { TextFileReader } from "llamaindex/readers/TextFileReader";
+import crypto from "node:crypto";
+import { getDataSource } from "../../engine";
+
+const MIME_TYPE_TO_EXT: Record<string, string> = {
+  "application/pdf": "pdf",
+  "text/plain": "txt",
+  "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
+    "docx",
+};
+
+export async function uploadDocument(raw: string): Promise<string[]> {
+  const [header, content] = raw.split(",");
+  const mimeType = header.replace("data:", "").replace(";base64", "");
+  const fileBuffer = Buffer.from(content, "base64");
+  const documents = await loadDocuments(fileBuffer, mimeType);
+  const { filename } = await saveDocument(fileBuffer, mimeType);
+  return await runPipeline(documents, filename);
+}
+
+async function runPipeline(
+  documents: Document[],
+  filename: string,
+): Promise<string[]> {
+  // mark documents to add to the vector store as private
+  for (const document of documents) {
+    document.metadata = {
+      ...document.metadata,
+      file_name: filename,
+      private: true,
+    };
+  }
+  const pipeline = new IngestionPipeline({
+    transformations: [
+      new SimpleNodeParser({
+        chunkSize: Settings.chunkSize,
+        chunkOverlap: Settings.chunkOverlap,
+      }),
+      Settings.embedModel,
+    ],
+  });
+  const nodes = await pipeline.run({ documents });
+  await addNodesToVectorStore(nodes);
+  return documents.map((document) => document.id_);
+}
+
+async function loadDocuments(fileBuffer: Buffer, mimeType: string) {
+  console.log(`Processing uploaded document of type: ${mimeType}`);
+  switch (mimeType) {
+    case "application/pdf": {
+      const pdfReader = new PDFReader();
+      return await pdfReader.loadDataAsContent(new Uint8Array(fileBuffer));
+    }
+    case "text/plain": {
+      const textReader = new TextFileReader();
+      return await textReader.loadDataAsContent(fileBuffer);
+    }
+    case "application/vnd.openxmlformats-officedocument.wordprocessingml.document": {
+      const docxReader = new DocxReader();
+      return await docxReader.loadDataAsContent(fileBuffer);
+    }
+    default:
+      throw new Error(`Unsupported document type: ${mimeType}`);
+  }
+}
+
+async function saveDocument(fileBuffer: Buffer, mimeType: string) {
+  const fileExt = MIME_TYPE_TO_EXT[mimeType];
+  if (!fileExt) throw new Error(`Unsupported document type: ${mimeType}`);
+
+  const folder = "output/uploaded";
+  const filename = `${crypto.randomUUID()}.${fileExt}`;
+  const filepath = `${folder}/${filename}`;
+  const fileurl = `${process.env.FILESERVER_URL_PREFIX}/${filepath}`;
+
+  if (!fs.existsSync(folder)) {
+    fs.mkdirSync(folder, { recursive: true });
+  }
+  await fs.promises.writeFile(filepath, fileBuffer);
+
+  console.log(`Saved document file to ${filepath}.\nURL: ${fileurl}`);
+  return {
+    filename,
+    filepath,
+    fileurl,
+  };
+}
+
+async function addNodesToVectorStore(nodes: BaseNode<Metadata>[]) {
+  let currentIndex = await getDataSource(); // always not null with an vectordb
+  if (currentIndex) {
+    await currentIndex.insertNodes(nodes);
+  } else {
+    // Not using vectordb and haven't generated local index yet
+    const storageContext = await storageContextFromDefaults({
+      persistDir: "./cache",
+    });
+    currentIndex = await VectorStoreIndex.init({ nodes, storageContext });
+  }
+  currentIndex.storageContext.docStore.persist();
+  console.log("Added nodes to the vector store.");
+}
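
`uploadDocument` receives the file as a base64 data URL and derives the MIME type by stripping `data:` and `;base64` from the header. The sketch below illustrates the same decoding step as a standalone function, with two small differences that are my own assumptions rather than part of the PR: it splits on the first comma only, and it tolerates extra header parameters such as `;charset=utf-8`. `parseDataUrl` and `ParsedUpload` are hypothetical names; only the MIME types and extensions are taken from the diff.

```typescript
// MIME types and extensions as listed in documents.ts.
const SUPPORTED_MIME_TYPES: Record<string, string> = {
  "application/pdf": "pdf",
  "text/plain": "txt",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
    "docx",
};

// Hypothetical result shape for illustration.
interface ParsedUpload {
  mimeType: string;
  extension: string;
  content: Buffer;
}

// Hypothetical helper: decodes "data:<mime>[;param]*;base64,<payload>".
function parseDataUrl(raw: string): ParsedUpload {
  const commaIndex = raw.indexOf(",");
  if (!raw.startsWith("data:") || commaIndex === -1) {
    throw new Error("Expected a data URL: data:<mime>;base64,<payload>");
  }

  // Header parts are separated by ";", the first one being the MIME type.
  const header = raw.slice("data:".length, commaIndex);
  const [mimeType, ...params] = header.split(";");
  if (!params.includes("base64")) {
    throw new Error("Only base64-encoded data URLs are supported");
  }

  const extension = SUPPORTED_MIME_TYPES[mimeType];
  if (!extension) {
    throw new Error(`Unsupported document type: ${mimeType}`);
  }

  const content = Buffer.from(raw.slice(commaIndex + 1), "base64");
  return { mimeType, extension, content };
}
```

Rejecting unsupported types before decoding keeps the failure mode the same as in `saveDocument` and `loadDocuments`, which both throw `Unsupported document type` for anything other than pdf, txt, and docx.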