feat: support uploading pdf, docx, txt #140

Merged on Jul 12, 2024 (78 commits)
Changes from 16 commits
Commits
8db5405
feat: upload pdf and send content to LLM
thucpn Jun 24, 2024
0d140f4
Merge branch 'main' into feat/upload-pdf
thucpn Jun 24, 2024
5a06cf9
fix: lint
thucpn Jun 26, 2024
97af04e
refactor: move embed to chat api folder
thucpn Jun 26, 2024
312b6d3
refactor: file content preview component
thucpn Jun 26, 2024
d73ee43
refactor: use content file type for all text files
thucpn Jun 26, 2024
d458783
Merge branch 'main' into feat/upload-pdf
thucpn Jun 26, 2024
035e96e
fix: lint
thucpn Jun 26, 2024
f452027
use pipeline transformation & upgrade llamaindex latest
thucpn Jun 26, 2024
f707c6f
use svg image for pdf, docx, txt
thucpn Jun 26, 2024
2d19560
fix: body collapsed when open dialog
thucpn Jun 26, 2024
04a4dc9
return backend for useClientConfig only
thucpn Jun 26, 2024
4ba4c64
fix: lint
thucpn Jun 26, 2024
fa991f4
Create late-weeks-sneeze.md
thucpn Jun 26, 2024
9c8192d
refactor: rename ContentFile to DocumentFile
thucpn Jun 26, 2024
7ff4843
refactor: rename annotation type
thucpn Jun 26, 2024
3ca2f65
refactor: document preview component
thucpn Jun 26, 2024
689ad9b
fix: lint
thucpn Jun 26, 2024
ee8cb00
refactor: move upload logic to useFile
thucpn Jun 26, 2024
28696d5
refactor: use PDFReader and TextNode
marcusschiesser Jun 26, 2024
cca49c5
fix: next.config
marcusschiesser Jun 26, 2024
afb5405
feat: support uploading docx, pdf, txt
thucpn Jun 27, 2024
4b66d29
move embeddng to chat folder
thucpn Jun 27, 2024
d07ffe9
feat: add embed api for express
thucpn Jun 27, 2024
948b1b6
fix: lint
thucpn Jun 27, 2024
0a195f8
feat: use local index
marcusschiesser Jun 27, 2024
d22310d
Merge branch 'main' into feat/upload-pdf
marcusschiesser Jul 3, 2024
aff87bb
add todos for using doc ids
marcusschiesser Jul 3, 2024
38f231c
add support for fastapi
leehuwuj Jul 5, 2024
2629c88
add filters
leehuwuj Jul 8, 2024
c21e843
fix chat filering python
leehuwuj Jul 8, 2024
34ab445
fix instantiate reader
leehuwuj Jul 8, 2024
321d77d
add save file and fix issues
leehuwuj Jul 8, 2024
259c3ec
change to /chat/upload route
leehuwuj Jul 8, 2024
d6afe28
refactor(frontend): support ref type for document content
thucpn Jul 8, 2024
6298e4b
refactor(nextjs): rename route /embed to /upload
thucpn Jul 8, 2024
608a338
feat: save document when uploading document
thucpn Jul 8, 2024
a9fa5cd
feat: add nodes to vectorstore and query
thucpn Jul 8, 2024
a00cb3d
feat: update document metadata from uploaded file infor
thucpn Jul 8, 2024
425580d
feat: get query filters from document ids
thucpn Jul 8, 2024
3b1c743
refactor(express): use new llamaindex for sharing chat logic between …
thucpn Jul 8, 2024
7a0ce3f
docs: remove useless log
thucpn Jul 8, 2024
d5f4395
fix: wrong import embedding path
thucpn Jul 8, 2024
cc15059
fix: persist vectordb
thucpn Jul 8, 2024
efdd43f
feat: don't open preview dialog for ref content
thucpn Jul 8, 2024
07e0821
use in filter operator
leehuwuj Jul 9, 2024
709ef1f
use mimetypes lib and change private file folder
leehuwuj Jul 9, 2024
43e8035
change tool-output to tool/output
leehuwuj Jul 9, 2024
fdb32b7
add back csv handler
leehuwuj Jul 9, 2024
445b4cc
add a prefix message if user uploaded a file
leehuwuj Jul 10, 2024
a2787ae
add txt reader and fix typo
leehuwuj Jul 10, 2024
d48bcb8
improve code
leehuwuj Jul 10, 2024
93fde20
fix: construct file url from private and file_name
thucpn Jul 10, 2024
884bc6d
refactor: rename embeddings to documents
thucpn Jul 10, 2024
2cc21ac
refactor: split llamaindex folder
thucpn Jul 10, 2024
ce9ce5e
fix: import path to llamaindex ts folder
thucpn Jul 10, 2024
27332e6
fix: wrong engine path
thucpn Jul 10, 2024
c3f70a1
remove redundant log
leehuwuj Jul 10, 2024
20a58c1
remove wrong log
leehuwuj Jul 10, 2024
ab279c6
update file upload to only send text content instead of list
leehuwuj Jul 10, 2024
b638eae
add missing fe and use flatreader
leehuwuj Jul 10, 2024
6aa7d57
improve code
leehuwuj Jul 10, 2024
1f85358
remove adding message prefix
leehuwuj Jul 10, 2024
498723a
update milvus package
leehuwuj Jul 11, 2024
be45b3f
improve log
leehuwuj Jul 11, 2024
5c8e79c
Merge remote-tracking branch 'origin/main' into feat/upload-pdf
leehuwuj Jul 11, 2024
695923c
update code comments
leehuwuj Jul 11, 2024
0e8786b
fix: use makeDir function with default recursive option
thucpn Jul 11, 2024
3404554
fix: lint
thucpn Jul 11, 2024
0ccb51e
fix: set request body size
thucpn Jul 11, 2024
55684b2
fix: lint
thucpn Jul 11, 2024
197cc90
cleanup PR
marcusschiesser Jul 11, 2024
e9ad3ed
fix: DocumentFileContent value can be string in backend ts
thucpn Jul 11, 2024
503141f
Update templates/components/vectordbs/python/none/generate.py
marcusschiesser Jul 11, 2024
30ebbe3
improve code
leehuwuj Jul 11, 2024
f670b1a
fix: while testing fastapi contextengine
marcusschiesser Jul 11, 2024
43149bf
refactor: clean streaming
marcusschiesser Jul 11, 2024
8068ad5
fix: use all annotations in TS code
marcusschiesser Jul 11, 2024
5 changes: 5 additions & 0 deletions .changeset/late-weeks-sneeze.md
@@ -0,0 +1,5 @@
+---
+"create-llama": patch
+---
+
+Support upload document files: pdf, docx, txt
21 changes: 11 additions & 10 deletions templates/components/ui/html/chat/hooks/use-config.ts
@@ -3,28 +3,29 @@
 import { useEffect, useMemo, useState } from "react";

 export interface ChatConfig {
-  chatAPI?: string;
+  backend?: string;
   starterQuestions?: string[];
 }

-export function useClientConfig() {
-  const API_ROUTE = "/api/chat/config";
+export function useClientConfig(): ChatConfig {
   const chatAPI = process.env.NEXT_PUBLIC_CHAT_API;
-  const [config, setConfig] = useState<ChatConfig>({
-    chatAPI,
-  });
+  const [config, setConfig] = useState<ChatConfig>();

-  const configAPI = useMemo(() => {
-    const backendOrigin = chatAPI ? new URL(chatAPI).origin : "";
-    return `${backendOrigin}${API_ROUTE}`;
+  const backendOrigin = useMemo(() => {
+    return chatAPI ? new URL(chatAPI).origin : "";
   }, [chatAPI]);

+  const configAPI = `${backendOrigin}/api/chat/config`;
+
   useEffect(() => {
     fetch(configAPI)
       .then((response) => response.json())
       .then((data) => setConfig({ ...data, chatAPI }))
       .catch((error) => console.error("Error fetching config", error));
   }, [chatAPI, configAPI]);

-  return config;
+  return {
+    backend: backendOrigin,
+    starterQuestions: config?.starterQuestions,
+  };
 }
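The hook change above reduces the client config to a backend origin derived from `NEXT_PUBLIC_CHAT_API`, with the config endpoint built from that origin. A minimal sketch of the same derivation outside React follows; `resolveBackend` is a hypothetical helper name used only for illustration and is not part of this PR.

```typescript
// Hypothetical helper (not in the PR): mirrors the origin logic of useClientConfig.
function resolveBackend(chatAPI?: string): { backend: string; configAPI: string } {
  // When no chat API is configured, the empty origin yields same-origin (relative) URLs.
  const backend = chatAPI ? new URL(chatAPI).origin : "";
  return { backend, configAPI: `${backend}/api/chat/config` };
}

const resolved = resolveBackend("http://localhost:8000/api/chat");
console.log(resolved.backend); // "http://localhost:8000"
console.log(resolved.configAPI); // "http://localhost:8000/api/chat/config"

const fallback = resolveBackend(undefined);
console.log(fallback.configAPI); // "/api/chat/config"
```

Returning only the origin lets callers append their own route (for example `/api/chat`) instead of passing a full chat URL around.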
2 changes: 1 addition & 1 deletion templates/types/streaming/express/package.json
@@ -20,7 +20,7 @@
     "dotenv": "^16.3.1",
     "duck-duck-scrape": "^2.2.5",
     "express": "^4.18.2",
-    "llamaindex": "0.4.3",
+    "llamaindex": "0.4.4",
     "pdf2json": "3.0.5",
     "ajv": "^8.12.0",
     "@e2b/code-interpreter": "^0.0.5",
@@ -11,8 +11,7 @@ import {
   MessageContent,
   MessageContentDetail,
 } from "llamaindex";
-
-import { CsvFile } from "./stream-helper";
+import { DocumentFile } from "./stream-helper";

 export const convertMessageContent = (
   content: string,
@@ -60,17 +59,21 @@ const convertAnnotations = (
         },
       });
     }
-    // convert CSV files to text
-    if (type === "csv" && "csvFiles" in data && Array.isArray(data.csvFiles)) {
-      const rawContents = data.csvFiles.map((csv) => {
-        return "```csv\n" + (csv as CsvFile).content + "\n```";
+    // convert files to text
+    if (
+      type === "document_file" &&
+      "files" in data &&
+      Array.isArray(data.files)
+    ) {
+      const rawContents = data.files.map((file) => {
+        const { filetype, content } = file as DocumentFile;
+        return "```" + `${filetype}\n${content}\n` + "```";
       });
-      const csvContent =
-        "Use data from following CSV raw contents:\n" +
-        rawContents.join("\n\n");
+      const fileContent =
+        `Use data from following raw contents:\n` + rawContents.join("\n\n");
       content.push({
         type: "text",
-        text: csvContent,
+        text: fileContent,
       });
     }
   });
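The hunk above generalizes the CSV-only conversion: each uploaded file becomes a fenced block tagged with its file type, and the blocks are joined into a single text message for the LLM. A standalone sketch of that transformation is shown here; `DocumentFileLike` and `filesToPrompt` are hypothetical names, and the type is a reduced stand-in for the PR's `DocumentFile`.

```typescript
// Reduced stand-in for the PR's DocumentFile type (only the fields used here).
type DocumentFileLike = { filetype: string; content: string };

// Three backticks, built programmatically so this example stays easy to embed.
const FENCE = "`".repeat(3);

// Hypothetical helper: wraps each file in a fenced block tagged with its type,
// then joins the blocks under the same instruction line the PR uses.
function filesToPrompt(files: DocumentFileLike[]): string {
  const rawContents = files.map(
    ({ filetype, content }) => `${FENCE}${filetype}\n${content}\n${FENCE}`,
  );
  return "Use data from following raw contents:\n" + rawContents.join("\n\n");
}

const prompt = filesToPrompt([
  { filetype: "csv", content: "a,b\n1,2" },
  { filetype: "txt", content: "hello" },
]);
console.log(prompt);
```

Tagging each fence with the file type keeps the previous CSV behavior as a special case while giving the model a hint about how to read the other formats.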
17 changes: 15 additions & 2 deletions templates/types/streaming/express/src/controllers/stream-helper.ts
@@ -113,9 +113,22 @@ export function createCallbackManager(stream: StreamData) {
   return callbackManager;
 }

-export type CsvFile = {
+export type TextEmbedding = {
+  text: string;
+  embedding: number[];
+};
+
+export type ContentFileType = "csv" | "pdf" | "txt" | "docx";
+
+export type DocumentFile = {
+  id: string;
   content: string;
   filename: string;
   filesize: number;
-  id: string;
+  filetype: ContentFileType;
+  embeddings?: TextEmbedding[];
 };
+
+export type DocumentFileData = {
+  files: DocumentFile[];
+};
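The new types describe the payload of a `document_file` annotation. As an illustration of how such a payload could be narrowed before use, here is a small sketch; the type definitions are copied from the hunk above, while `isDocumentFileData` is a hypothetical guard that is not part of this PR and only checks for a `files` array, not for the shape of each entry.

```typescript
type TextEmbedding = { text: string; embedding: number[] };
type ContentFileType = "csv" | "pdf" | "txt" | "docx";
type DocumentFile = {
  id: string;
  content: string;
  filename: string;
  filesize: number;
  filetype: ContentFileType;
  embeddings?: TextEmbedding[];
};
type DocumentFileData = { files: DocumentFile[] };

// Hypothetical type guard (an assumption, not the PR's code): shallow check only.
function isDocumentFileData(data: unknown): data is DocumentFileData {
  return (
    typeof data === "object" &&
    data !== null &&
    "files" in data &&
    Array.isArray((data as { files: unknown }).files)
  );
}

const payload: unknown = {
  files: [
    { id: "1", content: "hello", filename: "a.txt", filesize: 5, filetype: "txt" },
  ],
};

if (isDocumentFileData(payload)) {
  // Inside this branch, payload.files is typed as DocumentFile[].
  console.log(payload.files.map((file) => file.filename));
}
```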
36 changes: 36 additions & 0 deletions templates/types/streaming/nextjs/app/api/chat/embed/embeddings.ts
@@ -0,0 +1,36 @@
+import {
+  Document,
+  IngestionPipeline,
+  MetadataMode,
+  Settings,
+  SimpleNodeParser,
+} from "llamaindex";
+import pdf from "pdf-parse";
+
+export async function splitAndEmbed(content: string) {
+  const document = new Document({ text: content });
+  const pipeline = new IngestionPipeline({
+    transformations: [
+      new SimpleNodeParser({
+        chunkSize: Settings.chunkSize,
+        chunkOverlap: Settings.chunkOverlap,
+      }),
+      Settings.embedModel,
+    ],
+  });
+  const nodes = await pipeline.run({ documents: [document] });
+  return nodes.map((node, i) => ({
+    text: node.getContent(MetadataMode.NONE),
+    embedding: node.embedding,
+  }));
+}
+
+export async function getPdfDetail(rawPdf: string) {
+  const pdfBuffer = Buffer.from(rawPdf.split(",")[1], "base64");
+  const content = (await pdf(pdfBuffer)).text;
+  const embeddings = await splitAndEmbed(content);
+  return {
+    content,
+    embeddings,
+  };
+}
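`getPdfDetail` above takes everything after the first comma of the uploaded string and decodes it as base64, which suggests the client sends the PDF as a data URL. The sketch below isolates that decoding step with an explicit error for malformed input; `dataUrlToBuffer` is a hypothetical helper and the data-URL assumption is inferred from the code, not stated in the PR.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical helper (not in the PR): decodes "data:<mime>;base64,<payload>".
function dataUrlToBuffer(dataUrl: string): Buffer {
  const commaIndex = dataUrl.indexOf(",");
  if (commaIndex === -1) {
    // Without a comma there is no payload section to decode.
    throw new Error("Expected a data URL containing a comma");
  }
  return Buffer.from(dataUrl.slice(commaIndex + 1), "base64");
}

// Round trip with a plain-text payload instead of a real PDF.
const sample = "data:text/plain;base64," + Buffer.from("hello").toString("base64");
const decoded = dataUrlToBuffer(sample).toString("utf8");
console.log(decoded); // "hello"
```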
28 changes: 28 additions & 0 deletions templates/types/streaming/nextjs/app/api/chat/embed/route.ts
@@ -0,0 +1,28 @@
+import { NextRequest, NextResponse } from "next/server";
+import { initSettings } from "../engine/settings";
+import { getPdfDetail } from "./embeddings";
+
+initSettings();
+
+export async function POST(request: NextRequest) {
+  try {
+    const { pdf }: { pdf: string } = await request.json();
+    if (!pdf) {
+      return NextResponse.json(
+        { error: "pdf is required in the request body" },
+        { status: 400 },
+      );
+    }
+    const pdfDetail = await getPdfDetail(pdf);
+    return NextResponse.json(pdfDetail);
+  } catch (error) {
+    console.error("[Embed API]", error);
+    return NextResponse.json(
+      { error: (error as Error).message },
+      { status: 500 },
+    );
+  }
+}
+
+export const runtime = "nodejs";
+export const dynamic = "force-dynamic";
23 changes: 13 additions & 10 deletions templates/types/streaming/nextjs/app/api/chat/llamaindex-stream.ts
@@ -11,8 +11,7 @@ import {
   MessageContent,
   MessageContentDetail,
 } from "llamaindex";
-
-import { CsvFile } from "./stream-helper";
+import { DocumentFile } from "./stream-helper";

 export const convertMessageContent = (
   content: string,
@@ -60,17 +59,21 @@ const convertAnnotations = (
         },
       });
     }
-    // convert CSV files to text
-    if (type === "csv" && "csvFiles" in data && Array.isArray(data.csvFiles)) {
-      const rawContents = data.csvFiles.map((csv) => {
-        return "```csv\n" + (csv as CsvFile).content + "\n```";
+    // convert files to text
+    if (
+      type === "document_file" &&
+      "files" in data &&
+      Array.isArray(data.files)
+    ) {
+      const rawContents = data.files.map((file) => {
+        const { filetype, content } = file as DocumentFile;
+        return "```" + `${filetype}\n${content}\n` + "```";
       });
-      const csvContent =
-        "Use data from following CSV raw contents:\n" +
-        rawContents.join("\n\n");
+      const fileContent =
+        `Use data from following raw contents:\n` + rawContents.join("\n\n");
       content.push({
         type: "text",
-        text: csvContent,
+        text: fileContent,
       });
     }
   });
17 changes: 15 additions & 2 deletions templates/types/streaming/nextjs/app/api/chat/stream-helper.ts
@@ -113,9 +113,22 @@ export function createCallbackManager(stream: StreamData) {
   return callbackManager;
 }

-export type CsvFile = {
+export type TextEmbedding = {
+  text: string;
+  embedding: number[];
+};
+
+export type ContentFileType = "csv" | "pdf" | "txt" | "docx";
+
+export type DocumentFile = {
+  id: string;
   content: string;
   filename: string;
   filesize: number;
-  id: string;
+  filetype: ContentFileType;
+  embeddings?: TextEmbedding[];
 };
+
+export type DocumentFileData = {
+  files: DocumentFile[];
+};
@@ -5,7 +5,7 @@ import { ChatInput, ChatMessages } from "./ui/chat";
 import { useClientConfig } from "./ui/chat/hooks/use-config";

 export default function ChatSection() {
-  const { chatAPI } = useClientConfig();
+  const { backend } = useClientConfig();
   const {
     messages,
     input,
@@ -17,7 +17,7 @@ export default function ChatSection() {
     append,
     setInput,
   } = useChat({
-    api: chatAPI,
+    api: `${backend}/api/chat`,
     headers: {
       "Content-Type": "application/json", // using JSON because of vercel/ai 2.2.26
     },