
Commit 8733bc1

Merge branch 'main' into qdrant
2 parents d814974 + c3a7e50


52 files changed (+1372, -549 lines)

Cargo.toml (+3)

@@ -87,4 +87,7 @@ hyper-rustls = { version = "0.27.5" }
 yup-oauth2 = "12.1.0"
 rustls = { version = "0.23.25" }
 http-body-util = "0.1.3"
+yaml-rust2 = "0.10.0"
+urlencoding = "2.1.3"
 qdrant-client = "1.13.0"
+

docs/docs/about/community.md (+1, -1)

@@ -7,7 +7,7 @@ description: Join the CocoIndex community
 
 Welcome with a huge coconut hug 🥥⋆。˚🤗.
 
-We are super excited for community contributions of all kinds - whether it's code improvements, documentation updates, issue reports, feature requests on [GitHub](https://github.com/cocoIndex/cocoindex), and discussions in our [Discord](https://discord.com/invite/zpA9S2DR7s).
+We are super excited for community contributions of all kinds - whether it's code improvements, documentation updates, issue reports, feature requests on [GitHub](https://github.com/cocoindex-io/cocoindex), and discussions in our [Discord](https://discord.com/invite/zpA9S2DR7s).
 
 We would love to fostering an inclusive, welcoming, and supportive environment. Contributing to CocoIndex should feel collaborative, friendly and enjoyable for everyone. Together, we can build better AI applications through robust data infrastructure.

docs/docs/about/contributing.md (+1, -1)

@@ -36,7 +36,7 @@ We love contributions from our community! This guide explains how to get involve
 
 To submit your code:
 
-1. Fork the [CocoIndex repository](https://github.com/cocoIndex/cocoindex)
+1. Fork the [CocoIndex repository](https://github.com/cocoindex-io/cocoindex)
 2. [Create a new branch](https://docs.github.com/en/desktop/making-changes-in-a-branch/managing-branches-in-github-desktop) on your fork
 3. Make your changes
 4. [Open a Pull Request (PR)](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork) when your work is ready for review

docs/docs/core/cli.mdx (+1)

@@ -65,6 +65,7 @@ The following subcommands are available:
 | `setup` | Check and apply setup changes for flows, including the internal and target storage (to export). |
 | `show` | Show the spec for a specific flow. |
 | `update` | Update the index defined by the flow. |
+| `evaluate` | Evaluate the flow and dump flow outputs to files. Instead of updating the index, it dumps what should be indexed to files. Mainly used for evaluation purpose. |
 
 Use `--help` to see the full list of subcommands, and `subcommand --help` to see the usage of a specific one.

docs/docs/core/flow_methods.mdx (+21, -1)

@@ -12,7 +12,7 @@ After a flow is defined as discussed in [Flow Definition](/docs/core/flow_def),
 
 ## update
 
-The `update()` method will update will update the index defined by the flow.
+The `update()` method will update the index defined by the flow.
 
 Once the function returns, the indice is fresh up to the moment when the function is called.
 
@@ -23,5 +23,25 @@ Once the function returns, the indice is fresh up to the moment when the functio
 flow.update()
 ```
 
+</TabItem>
+</Tabs>
+
+## evaluate_and_dump
+
+The `evaluate_and_dump()` method evaluates the flow and dump flow outputs to files.
+
+It takes a `EvaluateAndDumpOptions` dataclass as input to configure, with the following fields:
+
+* `output_dir` (type: `str`, required): The directory to dump the result to.
+* `use_cache` (type: `bool`, default: `True`): Use already-cached intermediate data if available.
+  Note that we only reuse existing cached data without updating the cache even if it's turned on.
+
+<Tabs>
+<TabItem value="python" label="Python" default>
+
+```python
+flow.evaluate_and_dump(EvaluateAndDumpOptions(output_dir="./eval_output"))
+```
+
 </TabItem>
 </Tabs>
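
A quick usage sketch of the options documented above: `my_flow` stands in for any flow already defined via `@cocoindex.flow_def`, and `EvaluateAndDumpOptions` is re-exported from the top-level `cocoindex` package per the `__init__.py` change further down in this commit.

```python
import cocoindex

# my_flow: a cocoindex flow defined elsewhere via @cocoindex.flow_def.
# Dump what would be indexed into ./eval_output, recomputing everything
# instead of reusing cached intermediate data.
my_flow.evaluate_and_dump(
    cocoindex.EvaluateAndDumpOptions(output_dir="./eval_output", use_cache=False))
```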

docs/docs/getting_started/quickstart.md (+1, -1)

@@ -217,6 +217,6 @@ It will ask you to enter a query and it will return the top 10 results.
 Next, you may want to:
 
 * Learn about [CocoIndex Basics](../core/basics.md).
-* Learn about other examples in the [examples](https://github.com/cocoIndex/cocoindex/tree/main/examples) directory.
+* Learn about other examples in the [examples](https://github.com/cocoindex-io/cocoindex/tree/main/examples) directory.
   * The `text_embedding` example is this quickstart with some polishing (loading environment variables from `.env` file, extract pieces shared by the indexing flow and query handler into a function).
   * Pick other examples to learn upon your interest.

docs/docs/ops/functions.md (+11)

@@ -49,6 +49,17 @@ Return type: `vector[float32; N]`, where `N` is determined by the model
 * `output_type` (type: `type`, required): The type of the output. e.g. a dataclass type name. See [Data Types](/docs/core/data_types) for all supported data types. The LLM will output values that match the schema of the type.
 * `instruction` (type: `str`, optional): Additional instruction for the LLM.
 
+:::tip Clear type definitions
+
+Definitions of the `output_type` is fed into LLM as guidance to generate the output.
+To improve the quality of the extracted information, giving clear definitions for your dataclasses is especially important, e.g.
+
+* Provide readable field names for your dataclasses.
+* Provide reasonable docstrings for your dataclasses.
+* For any optional fields, clearly annotate that they are optional, by `SomeType | None` or `typing.Optional[SomeType]`.
+
+:::
+
 Input data:
 
 * `text` (type: `str`, required): The text to extract information from.
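
To make the tip above concrete, here is a minimal sketch of an `output_type` dataclass following those guidelines; the class and field names are invented for illustration:

```python
import dataclasses
import typing

@dataclasses.dataclass
class PaperMetadata:
    """Metadata extracted from the header of an academic paper."""
    title: str
    first_author: str
    # Optional information: annotate it as such so the LLM may leave it out.
    publication_year: typing.Optional[int] = None
```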

examples/code_embedding/README.md (+11, -1)

@@ -1,4 +1,14 @@
-Simple example for cocoindex: build embedding index based on local files.
+# Build embedding index for codebase
+
+![Build embedding index for codebase](https://cocoindex.io/blogs/assets/images/cover-9bf0a7cff69b66a40918ab2fc1cea0c7.png)
+
+In this example, we will build an embedding index for a codebase using CocoIndex. CocoIndex provides built-in support for code base chunking, with native Tree-sitter support. [Tree-sitter](https://en.wikipedia.org/wiki/Tree-sitter_%28parser_generator%29) is a parser generator tool and an incremental parsing library, it is available in Rust 🦀 - [GitHub](https://github.com/tree-sitter/tree-sitter). CocoIndex has built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages.
+
+
+Please give [Cocoindex on Github](https://github.com/cocoindex-io/cocoindex) a star to support us if you like our work. Thank you so much with a warm coconut hug 🥥🤗. [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)
+
+You can find a detailed blog post with step by step tutorial and explanations [here](https://cocoindex.io/blogs/index-code-base-for-rag).
+
 
 ## Prerequisite
 [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one.
(2 binary files not shown)

examples/gdrive_text_embedding/main.py (+6, -11)

@@ -3,15 +3,6 @@
 import cocoindex
 import os
 
-def text_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
-    """
-    Embed the text using a SentenceTransformer model.
-    This is a shared logic between indexing and querying, so extract it as a function.
-    """
-    return text.transform(
-        cocoindex.functions.SentenceTransformerEmbed(
-            model="sentence-transformers/all-MiniLM-L6-v2"))
-
 @cocoindex.flow_def(name="GoogleDriveTextEmbedding")
 def gdrive_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
     """
@@ -33,7 +24,9 @@ def gdrive_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope:
             language="markdown", chunk_size=2000, chunk_overlap=500)
 
         with doc["chunks"].row() as chunk:
-            chunk["embedding"] = text_to_embedding(chunk["text"])
+            chunk["embedding"] = chunk["text"].transform(
+                cocoindex.functions.SentenceTransformerEmbed(
+                    model="sentence-transformers/all-MiniLM-L6-v2"))
             doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                    text=chunk["text"], embedding=chunk["embedding"])
 
@@ -47,7 +40,9 @@ def gdrive_text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope:
     name="SemanticsSearch",
     flow=gdrive_text_embedding_flow,
     target_name="doc_embeddings",
-    query_transform_flow=text_to_embedding,
+    query_transform_flow=lambda text: text.transform(
+        cocoindex.functions.SentenceTransformerEmbed(
+            model="sentence-transformers/all-MiniLM-L6-v2")),
     default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
 
 @cocoindex.main_fn()

python/cocoindex/__init__.py (+1, -1)

@@ -2,7 +2,7 @@
 Cocoindex is a framework for building and running indexing pipelines.
 """
 from . import flow, functions, query, sources, storages, cli
-from .flow import FlowBuilder, DataScope, DataSlice, Flow, flow_def
+from .flow import FlowBuilder, DataScope, DataSlice, Flow, flow_def, EvaluateAndDumpOptions
 from .llm import LlmSpec, LlmApiType
 from .vector import VectorSimilarityMetric
 from .lib import *

python/cocoindex/cli.py (+24)

@@ -1,4 +1,5 @@
 import click
+import datetime
 
 from . import flow, lib
 from .setup import check_setup_status, CheckSetupStatusOptions, apply_setup_changes
@@ -52,6 +53,29 @@ def update(flow_name: str | None):
     stats = _flow_by_name(flow_name).update()
     print(stats)
 
+@cli.command()
+@click.argument("flow_name", type=str, required=False)
+@click.option(
+    "-o", "--output-dir", type=str, required=False,
+    help="The directory to dump the output to.")
+@click.option(
+    "-c", "--use-cache", is_flag=True, show_default=True, default=True,
+    help="Use already-cached intermediate data if available. "
+         "Note that we only reuse existing cached data without updating the cache "
+         "even if it's turned on.")
+def evaluate(flow_name: str | None, output_dir: str | None, use_cache: bool = True):
+    """
+    Evaluate the flow and dump flow outputs to files.
+
+    Instead of updating the index, it dumps what should be indexed to files.
+    Mainly used for evaluation purpose.
+    """
+    fl = _flow_by_name(flow_name)
+    if output_dir is None:
+        output_dir = f"eval_{fl.name}_{datetime.datetime.now().strftime('%y%m%d_%H%M%S')}"
+    options = flow.EvaluateAndDumpOptions(output_dir=output_dir, use_cache=use_cache)
+    fl.evaluate_and_dump(options)
+
 _default_server_settings = lib.ServerSettings.from_env()
 
 @cli.command()

python/cocoindex/flow.py (+29, -9)

@@ -9,6 +9,7 @@
 from typing import Any, Callable, Sequence, TypeVar, get_origin
 from threading import Lock
 from enum import Enum
+from dataclasses import dataclass
 
 from . import _engine
 from . import vector
@@ -61,18 +62,18 @@ def _create_data_slice(
 def _spec_kind(spec: Any) -> str:
     return spec.__class__.__name__
 
-def _spec_value_dump(v: Any) -> Any:
-    """Recursively dump a spec object and its nested attributes to a dictionary."""
+def _dump_engine_object(v: Any) -> Any:
+    """Recursively dump an object for engine. Engine side uses `Pythonzized` to catch."""
     if isinstance(v, type) or get_origin(v) is not None:
         return encode_enriched_type(v)
     elif isinstance(v, Enum):
         return v.value
     elif hasattr(v, '__dict__'):
-        return {k: _spec_value_dump(v) for k, v in v.__dict__.items()}
+        return {k: _dump_engine_object(v) for k, v in v.__dict__.items()}
     elif isinstance(v, (list, tuple)):
-        return [_spec_value_dump(item) for item in v]
+        return [_dump_engine_object(item) for item in v]
     elif isinstance(v, dict):
-        return {k: _spec_value_dump(v) for k, v in v.items()}
+        return {k: _dump_engine_object(v) for k, v in v.items()}
     return v
 
 T = TypeVar('T')
@@ -177,7 +178,7 @@ def transform(self, fn_spec: op.FunctionSpec, *args, **kwargs) -> DataSlice:
             lambda target_scope, name:
             flow_builder_state.engine_flow_builder.transform(
                 _spec_kind(fn_spec),
-                _spec_value_dump(fn_spec),
+                _dump_engine_object(fn_spec),
                 transform_args,
                 target_scope,
                 flow_builder_state.field_name_builder.build_name(
@@ -267,7 +268,7 @@ def export(self, name: str, target_spec: op.StorageSpec, /, *,
             {"field_name": field_name, "metric": metric.value}
             for field_name, metric in vector_index]
         self._flow_builder_state.engine_flow_builder.export(
-            name, _spec_kind(target_spec), _spec_value_dump(target_spec),
+            name, _spec_kind(target_spec), _dump_engine_object(target_spec),
             index_options, self._engine_data_collector)
 
 
@@ -316,13 +317,20 @@ def add_source(self, spec: op.SourceSpec, /, name: str | None = None) -> DataSli
             self._state,
             lambda target_scope, name: self._state.engine_flow_builder.add_source(
                 _spec_kind(spec),
-                _spec_value_dump(spec),
+                _dump_engine_object(spec),
                 target_scope,
                 self._state.field_name_builder.build_name(
                     name, prefix=_to_snake_case(_spec_kind(spec))+'_'),
             ),
             name
         )
+@dataclass
+class EvaluateAndDumpOptions:
+    """
+    Options for evaluating and dumping a flow.
+    """
+    output_dir: str
+    use_cache: bool = True
 
 class Flow:
     """
@@ -348,20 +356,32 @@ def __str__(self):
     def __repr__(self):
         return repr(self._lazy_engine_flow())
 
+    @property
+    def name(self) -> str:
+        """
+        Get the name of the flow.
+        """
+        return self._lazy_engine_flow().name()
+
     def update(self):
         """
         Update the index defined by the flow.
         Once the function returns, the indice is fresh up to the moment when the function is called.
         """
         return self._lazy_engine_flow().update()
 
+    def evaluate_and_dump(self, options: EvaluateAndDumpOptions):
+        """
+        Evaluate the flow and dump flow outputs to files.
+        """
+        return self._lazy_engine_flow().evaluate_and_dump(_dump_engine_object(options))
+
     def internal_flow(self) -> _engine.Flow:
         """
         Get the engine flow.
         """
         return self._lazy_engine_flow()
 
-
 def _create_lazy_flow(name: str | None, fl_def: Callable[[FlowBuilder, DataScope], None]) -> Flow:
     """
     Create a flow without really building it yet.

python/cocoindex/typing.py (+15, -4)

@@ -2,6 +2,7 @@
 import collections
 import dataclasses
 import types
+import inspect
 from typing import Annotated, NamedTuple, Any, TypeVar, TYPE_CHECKING, overload
 
 class Vector(NamedTuple):
@@ -130,15 +131,23 @@ def analyze_type_info(t) -> AnalyzedTypeInfo:
     elif t is float:
         kind = 'Float64'
     else:
-        raise ValueError(f"type unsupported yet: {base_type}")
+        raise ValueError(f"type unsupported yet: {t}")
 
     return AnalyzedTypeInfo(kind=kind, vector_info=vector_info, elem_type=elem_type,
                             dataclass_type=dataclass_type, attrs=attrs, nullable=nullable)
 
 def _encode_fields_schema(dataclass_type: type) -> list[dict[str, Any]]:
-    return [{ 'name': field.name,
-              **encode_enriched_type_info(analyze_type_info(field.type))
-            } for field in dataclasses.fields(dataclass_type)]
+    result = []
+    for field in dataclasses.fields(dataclass_type):
+        try:
+            type_info = encode_enriched_type_info(analyze_type_info(field.type))
+        except ValueError as e:
+            e.add_note(f"Failed to encode annotation for field - "
+                       f"{dataclass_type.__name__}.{field.name}: {field.type}")
+            raise
+        type_info['name'] = field.name
+        result.append(type_info)
+    return result
 
 def _encode_type(type_info: AnalyzedTypeInfo) -> dict[str, Any]:
     encoded_type: dict[str, Any] = { 'kind': type_info.kind }
@@ -147,6 +156,8 @@ def _encode_type(type_info: AnalyzedTypeInfo) -> dict[str, Any]:
         if type_info.dataclass_type is None:
             raise ValueError("Struct type must have a dataclass type")
         encoded_type['fields'] = _encode_fields_schema(type_info.dataclass_type)
+        if doc := inspect.getdoc(type_info.dataclass_type):
+            encoded_type['description'] = doc
 
     elif type_info.kind == 'Vector':
         if type_info.vector_info is None:
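
The new `description` handling above means a dataclass docstring now travels with the encoded struct schema. A small illustration using `inspect.getdoc` directly; the `Invoice` dataclass here is purely hypothetical:

```python
import dataclasses
import inspect

@dataclasses.dataclass
class Invoice:
    """A single invoice extracted from a scanned document."""
    vendor: str
    total_amount: float

# inspect.getdoc returns the cleaned docstring, which _encode_type now
# attaches to the encoded struct as encoded_type['description'].
print(inspect.getdoc(Invoice))  # A single invoice extracted from a scanned document.
```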

src/base/field_attrs.rs (+9, -3)

@@ -2,10 +2,16 @@ use const_format::concatcp;
 
 pub static COCOINDEX_PREFIX: &str = "cocoindex.io/";
 
-/// Expected mime types for bytes and str.
-pub static _MIME_TYPE: &str = concatcp!(COCOINDEX_PREFIX, "mime_type");
+/// Present for bytes and str. It points to fields that represents the original file name for the data.
+/// Type: AnalyzedValueMapping
+pub static CONTENT_FILENAME: &str = concatcp!(COCOINDEX_PREFIX, "content_filename");
 
-/// Base text for chunks.
+/// Present for bytes and str. It points to fields that represents mime types for the data.
+/// Type: AnalyzedValueMapping
+pub static CONTENT_MIME_TYPE: &str = concatcp!(COCOINDEX_PREFIX, "content_mime_type");
+
+/// Present for chunks. It points to fields that the chunks are for.
+/// Type: AnalyzedValueMapping
 pub static CHUNK_BASE_TEXT: &str = concatcp!(COCOINDEX_PREFIX, "chunk_base_text");
 
 /// Base text for an embedding vector.
