feat(types): simplify/clarify data types (KTable, LTable, Vector, remove .typing.) #405

Merged
merged 7 commits into from
Apr 29, 2025
521 changes: 307 additions & 214 deletions Cargo.lock

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/docs/core/basics.md
@@ -21,9 +21,9 @@ An indexing flow involves source data and transformed data (either as an interme

Each piece of data has a **data type**, falling into one of the following categories:

* Basic type.
* Struct type: a collection of **fields**, each with a name and a type.
* Collection type: a collection of **rows**, each of which is a struct with specified schema. A collection type can be a table (which has a key field) or a list (ordered but without key field).
* *Basic type*.
* *Struct type*: a collection of **fields**, each with a name and a type.
* *Table type*: a collection of **rows**, each of which is a struct with a specified schema. A table type can be a *KTable* (which has a key field) or an *LTable* (ordered but without a key field).

An indexing flow always has a top-level struct, containing all data within and managed by the flow.

2 changes: 1 addition & 1 deletion docs/docs/core/custom_function.mdx
@@ -33,7 +33,7 @@ Notes:

* The `cocoindex.op.function()` function decorator also takes optional parameters.
See [Parameters for custom functions](#parameters-for-custom-functions) for details.
* Types of arugments and the return value must be annotated, so that CocoIndex will have information about data types of the operation's output fields.
* Types of arguments and the return value must be annotated, so that CocoIndex will have information about data types of the operation's output fields.
See [Data Types](/docs/core/data_types) for supported types.

</TabItem>
116 changes: 69 additions & 47 deletions docs/docs/core/data_types.mdx
@@ -9,91 +9,113 @@ In CocoIndex, all data processed by the flow have a type determined when the flo

This makes the schema of data processed by CocoIndex clear, and makes it easy to determine the schema of your index.

## Data Types
## Data Types

You don't need to spell out data types in CocoIndex, when you define the flow using existing operations (source, function, etc).
These operations decide data types of fields produced by them based on the spec and input data types.
All you need to do is to make sure the data passed to functions and storage targets are accepted by them.

When you define [custom functions](/docs/core/custom_function), you need to specify the data types of arguments and return values.

### Basic Types

This is the list of all basic types supported by CocoIndex:

| Type | Description |Type in Python | Original Type in Python |
| Type | Description | Specific Python Type | Native Python Type |
|------|-------------|---------------|-------------------------|
| Bytes | | `bytes` | `bytes` |
| Str | | `str` | `str` |
| Bool | | `bool` | `bool` |
| Int64 | | `int` | `int` |
| Float32 | | `cocoindex.typing.Float32` |`float` |
| Float64 | | `cocoindex.typing.Float64` |`float` |
| Range | | `cocoindex.typing.Range` | `tuple[int, int]` |
| Float32 | | `cocoindex.Float32` |`float` |
| Float64 | | `cocoindex.Float64` |`float` |
| Range | | `cocoindex.Range` | `tuple[int, int]` |
| Uuid | | `uuid.UUID` | `uuid.UUID` |
| Date | | `datetime.date` | `datetime.date` |
| Time | | `datetime.time` | `datetime.time` |
| LocalDatetime | Date and time without timezone | `cocoindex.typing.LocalDateTime` | `datetime.datetime` |
| OffsetDatetime | Date and time with a timezone offset | `cocoindex.typing.OffsetDateTime` | `datetime.datetime` |
| Vector[*type*, *N*?] | |`Annotated[list[type], cocoindex.typing.Vector(dim=N)]` | `list[type]` |
| Json | | `cocoindex.typing.Json` | Any type convertible to JSON by `json` package |
| LocalDatetime | Date and time without timezone | `cocoindex.LocalDateTime` | `datetime.datetime` |
| OffsetDatetime | Date and time with a timezone offset | `cocoindex.OffsetDateTime` | `datetime.datetime` |
| Vector[*T*, *Dim*?] | *T* must be a basic type. *Dim* is an optional positive integer. |`cocoindex.Vector[T]` or `cocoindex.Vector[T, Dim]` | `list[T]` |
| Json | | `cocoindex.Json` | Any data convertible to JSON by `json` package |

Values of all data types can be represented by values in Python's native types (as described under the Native Python Type column).
However, the underlying execution engine and some storage systems (like Postgres) make finer distinctions for some types, specifically:

For some types, CocoIndex Python SDK provides annotated types with finer granularity than Python's original type, e.g.
* *Float32* and *Float64* for `float`, with different precision.
* *LocalDateTime* and *OffsetDateTime* for `datetime.datetime`, with different timezone awareness.
* *Vector* has dimension information.
* *Vector* has optional dimension information.
* *Range* and *Json* provide a clear tag for the type, to clearly distinguish the type in CocoIndex.

When defining [custom functions](/docs/core/custom_function), use the specific types as type annotations for arguments and return values.
So CocoIndex will have information about the specific type.
The native Python type is always more permissive and can represent a superset of possible values.
* You only need to use the specific type when annotating the return type of a custom function,
so that CocoIndex has information about the precise type to be used in the execution engine and storage system.
* For all other purposes, e.g. annotating argument types of a custom function, or internal use within your custom function,
you can use either the specific type or the native Python type.
The native Python type is usually simpler.
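
The finer distinctions described here can be observed with the standard library alone. The following is a minimal sketch, not CocoIndex code: `to_float32` is a hypothetical helper that rounds a Python `float` to 32-bit precision, and the two `datetime` values illustrate the difference in timezone awareness that a single native type hides.

```python
import datetime
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# A Python float is 64-bit; the same value stored with 32 bits loses precision.
print(to_float32(0.1) == 0.1)  # False
print(to_float32(0.5) == 0.5)  # True: 0.5 is exactly representable in both widths

# One native type (datetime.datetime), two levels of timezone awareness.
local_dt = datetime.datetime(2025, 4, 29, 12, 0)
offset_dt = datetime.datetime(2025, 4, 29, 12, 0, tzinfo=datetime.timezone.utc)
print(local_dt.tzinfo is None)   # True
print(offset_dt.tzinfo is None)  # False
```

This is why the native type alone is not enough for a return annotation: a plain `float` or `datetime.datetime` does not say which of the two representations is intended.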

### Struct Type

A struct has a bunch of fields, each with a name and a type.
A Struct has a bunch of fields, each with a name and a type.

In Python, a struct type is represented by a [dataclass](https://docs.python.org/3/library/dataclasses.html),
In Python, a Struct type is represented by a [dataclass](https://docs.python.org/3/library/dataclasses.html),
and all fields must be annotated with a specific type. For example:

```python
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    name: str
    price: float
class Person:
    first_name: str
    last_name: str
    dob: datetime.date
```

### Collection Types
### Table Types

A collection type models a collection of rows, each of which is a struct with specific schema.
A Table type models a collection of rows, each with multiple columns.
Each column of a table has a specific type.

We have two specific types of collection:
There are two kinds of Table types: KTable and LTable.

| Type | Description |Type in Python | Original Type in Python |
|------|-------------|---------------|-------------------------|
| Table[*type*] | The first field is the key, and CocoIndex enforces its uniqueness | `cocoindex.typing.Table[type]` | `list[type]` |
| List[*type*] | No key field; row order is preserved | `cocoindex.typing.List[type]` | `list[type]` |
#### KTable

For example, we can use `cocoindex.typing.Table[Order]` to represent a table of orders, and the first field `order_id` will be taken as the key field.
KTable is a Table type whose first column serves as the key.
The row order of a KTable is not preserved.
Type of the first column (key column) must be a [key type](#key-types).

## Types to Create Indexes
In Python, a KTable type is represented by `dict[K, V]`.
The `V` should be a dataclass, representing the value fields of each row.
For example, you can use `dict[str, Person]` to represent a KTable, with 4 columns: key (Str), `first_name` (Str), `last_name` (Str), `dob` (Date).

### Key Types
Note that if you want to use a struct as the key, you need to annotate the struct with `@dataclass(frozen=True)`, so the values are immutable.
For example:

Currently, the following types are supported as types for key fields:
```python
@dataclass(frozen=True)
class PersonKey:
    id_kind: str
    id: str
```

- `bytes`
- `str`
- `bool`
- `int64`
- `range`
- `uuid`
- `date`
- Struct with all fields being key types
Then you can use `dict[PersonKey, Person]` to represent a KTable keyed by `PersonKey`.
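
As a minimal sketch using only the standard library (the sample values are made up for illustration), the following shows why the key struct has to be frozen: a frozen dataclass is hashable and so can serve as a `dict` key, while a regular dataclass is not.

```python
import datetime
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonKey:
    id_kind: str
    id: str


@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date


# Illustrative sample data: one row, keyed by a struct.
people: dict[PersonKey, Person] = {
    PersonKey("passport", "X123"): Person("Ada", "Lovelace", datetime.date(1815, 12, 10)),
}

# Equal keys hash equally, so a freshly built key finds the row.
print(people[PersonKey("passport", "X123")].first_name)  # Ada

# A non-frozen dataclass (with the default eq=True) is unhashable.
try:
    hash(Person("Ada", "Lovelace", datetime.date(1815, 12, 10)))
except TypeError:
    print("Person is not hashable")
```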


#### LTable

### Vector Type
LTable is a Table type whose row order is preserved. LTable has no key column.

Users can create vector index on fields with `vector` types.
A vector index also needs to be configured with a similarity metric, and the index is only effective when this metric is used during retrieval.
In Python, an LTable type is represented by `list[R]`, where `R` is a dataclass representing a row.
For example, you can use `list[Person]` to represent an LTable with 3 columns: `first_name` (Str), `last_name` (Str), `dob` (Date).
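
A short standard-library sketch of the same idea, with made-up sample rows: the columns come from the dataclass fields, and the list keeps the rows in insertion order.

```python
import dataclasses
import datetime


@dataclasses.dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date


# Illustrative sample data.
rows: list[Person] = [
    Person("Grace", "Hopper", datetime.date(1906, 12, 9)),
    Person("Alan", "Turing", datetime.date(1912, 6, 23)),
]

# The columns are the fields of the row dataclass, in declaration order.
columns = [f.name for f in dataclasses.fields(Person)]
print(columns)  # ['first_name', 'last_name', 'dob']

# Row order is the list order.
print([p.first_name for p in rows])  # ['Grace', 'Alan']
```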

Following metrics are supported:
## Key Types

| Metric Name | Description | Similarity Order |
|-------------|-------------|------------------|
| `CosineSimilarity` | [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | Larger is more similar |
| `L2Distance` | [L2 distance (a.k.a. Euclidean distance)](https://en.wikipedia.org/wiki/Euclidean_distance) | Smaller is more similar |
| `InnerProduct` | [Inner product](https://en.wikipedia.org/wiki/Inner_product_space) | Larger is more similar |
Currently, the following types are key types:

- Bytes
- Str
- Bool
- Int64
- Range
- Uuid
- Date
- Struct with all fields being key types
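
The recursive rule in the last item can be sketched as a small check over native Python types. This is purely illustrative and not how CocoIndex validates keys: `is_key_type` and `BASIC_KEY_TYPES` are hypothetical names, Range is left out because its native form (`tuple[int, int]`) is not a plain class, and hashability is used as a stand-in for `frozen=True`.

```python
import dataclasses
import datetime
import uuid

# Native Python types for Bytes, Str, Bool, Int64, Uuid and Date.
BASIC_KEY_TYPES = {bytes, str, bool, int, uuid.UUID, datetime.date}


def is_key_type(tp: object) -> bool:
    """Hypothetical check: a basic key type, or a hashable dataclass of key types."""
    if isinstance(tp, type) and dataclasses.is_dataclass(tp):
        # A dataclass with eq=True and frozen=False has __hash__ set to None.
        hashable = tp.__hash__ is not None
        return hashable and all(is_key_type(f.type) for f in dataclasses.fields(tp))
    return tp in BASIC_KEY_TYPES


@dataclasses.dataclass(frozen=True)
class PersonKey:
    id_kind: str
    id: str


@dataclasses.dataclass(frozen=True)
class Measurement:
    sensor: str
    value: float  # float is not a key type


print(is_key_type(str))          # True
print(is_key_type(float))        # False
print(is_key_type(PersonKey))    # True
print(is_key_type(Measurement))  # False
```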
27 changes: 22 additions & 5 deletions docs/docs/core/flow_def.mdx
@@ -1,6 +1,7 @@
---
title: Flow Definition
description: Define a CocoIndex flow, by specifying source, transformations and storages, and connect input/output data of them.
toc_max_heading_level: 4
---

import Tabs from '@theme/Tabs';
@@ -178,7 +179,7 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco

### For each row

If the data slice has `Table` type, you can call `row()` method to obtain a child scope representing each row, to apply operations on each row.
If the data slice has a [table type](/docs/core/data_types#table-types), you can call the `row()` method to obtain a child scope representing each row, and apply operations on each row.

<Tabs>
<TabItem value="python" label="Python" default>
@@ -281,16 +282,32 @@ The target storage is managed by CocoIndex, i.e. it'll be created by [CocoIndex
The `name` for the same storage should remain stable across different runs.
If it changes, CocoIndex will treat it as an old storage removed and a new one created, and perform setup changes and reindexing accordingly.

#### Storage Indexes
## Storage Indexes

Many storages support indexes, to boost efficiency in retrieving data.
CocoIndex provides a common way to configure indexes for various storages.

* *Primary key*. `primary_key_fields` (`Sequence[str]`): the fields to be used as primary key. Types of the fields must be supported as key fields. See [Key Types](data_types#key-types) for more details.
* *Vector index*. `vector_indexes` (`Sequence[VectorIndexDef]`): the fields to create vector index. `VectorIndexDef` has the following fields:
### Primary Key

*Primary key* is specified by `primary_key_fields` (`Sequence[str]`).
Types of the fields must be key types. See [Key Types](data_types#key-types) for more details.

### Vector Index

*Vector index* is specified by `vector_indexes` (`Sequence[VectorIndexDef]`). `VectorIndexDef` has the following fields:

* `field_name`: the field to create vector index.
* `metric`: the similarity metric to use. See [Vector Type](data_types#vector-type) for more details about supported similarity metrics.
* `metric`: the similarity metric to use.

#### Similarity Metrics

The following metrics are supported:

| Metric Name | Description | Similarity Order |
|-------------|-------------|------------------|
| CosineSimilarity | [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | Larger is more similar |
| L2Distance | [L2 distance (a.k.a. Euclidean distance)](https://en.wikipedia.org/wiki/Euclidean_distance) | Smaller is more similar |
| InnerProduct | [Inner product](https://en.wikipedia.org/wiki/Inner_product_space) | Larger is more similar |
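
To make the "Similarity Order" column concrete, here is a minimal pure-Python sketch of the three metrics. The function names and sample vectors are illustrative only; a real storage backend computes these itself.

```python
import math
from collections.abc import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norms


query = [1.0, 0.0]
near = [2.0, 0.0]  # same direction as the query
far = [0.0, 3.0]   # orthogonal to the query

# Larger is more similar.
print(cosine_similarity(query, near), cosine_similarity(query, far))  # 1.0 0.0
print(inner_product(query, near), inner_product(query, far))          # 2.0 0.0

# Smaller is more similar.
print(l2_distance(query, near) < l2_distance(query, far))  # True
```

Because the metrics rank candidates differently in general, an index built for one metric is only useful when retrieval uses that same metric.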

## Miscellaneous

7 changes: 4 additions & 3 deletions docs/docs/getting_started/quickstart.md
@@ -112,11 +112,12 @@ Notes:
* `doc`, representing each row of `documents`.
* `chunk`, representing each row of `chunks`.

3. A *data source* extracts data from an external source. In this example, the `LocalFile` data source defines a table, each row has `"filename"` and `"content"` fields.
3. A *data source* extracts data from an external source.
In this example, the `LocalFile` data source imports local files as a KTable (table with a key field, see [KTable](../core/data_types#ktable) for details), in which each row has `"filename"` and `"content"` fields.

4. After defining the table, we extended a new field `"chunks"` to each row by *transforming* the `"content"` field using `SplitRecursively`. The output of the `SplitRecursively` is also a table representing each chunk of the document, with `"location"` and `"text"` fields.
4. After defining the KTable, we add a new field `"chunks"` to each row by *transforming* the `"content"` field using `SplitRecursively`. The output of `SplitRecursively` is also a KTable representing each chunk of the document, with `"location"` and `"text"` fields.

5. After defining the table, we extended a new field `"embedding"` to each row by *transforming* the `"text"` field using `SentenceTransformerEmbed`.
5. After defining the KTable, we add a new field `"embedding"` to each row by *transforming* the `"text"` field using `SentenceTransformerEmbed`.

6. In CocoIndex, a *collector* collects multiple entries of data together. In this example, the `doc_embeddings` collector collects data from all `chunk`s across all `doc`s, and uses the collected data to build a vector index `"doc_embeddings"`, using `Postgres`.

2 changes: 1 addition & 1 deletion docs/docs/ops/functions.md
@@ -32,7 +32,7 @@ Input data:
To see all supported language names and extensions, see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
If it's unspecified or the specified language is not supported, it will be treated as plain text.

Return type: `Table`, each row represents a chunk, with the following sub fields:
Return type: [KTable](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:

* `location` (type: `range`): The location of the chunk.
* `text` (type: `str`): The text of the chunk.
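
To show the shape of such a result in native Python types, here is a rough sketch. It assumes that `location` is the key and is a range of character offsets (start inclusive, end exclusive); neither is stated above. `split_fixed` is a hypothetical, naive fixed-size splitter and not the algorithm this function uses.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str


def split_fixed(text: str, size: int) -> dict[tuple[int, int], Chunk]:
    """Naive fixed-size splitter (hypothetical), keyed by (start, end) offsets."""
    return {
        (start, min(start + size, len(text))): Chunk(text[start:start + size])
        for start in range(0, len(text), size)
    }


chunks = split_fixed("abcdefghij", 4)
for location, chunk in chunks.items():
    print(location, chunk.text)
# (0, 4) abcd
# (4, 8) efgh
# (8, 10) ij
```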
4 changes: 2 additions & 2 deletions docs/docs/ops/sources.md
@@ -28,7 +28,7 @@ The spec takes the following fields:

### Schema

The output is a table with the following sub fields:
The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
* `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`
* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
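
As a rough illustration of this schema in native Python types (and not a description of how the source is implemented), the sketch below builds a `dict` keyed by relative filename. `read_local_files` and `FileRow` are hypothetical names, and only the text (non-binary) case is covered.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileRow:
    content: str


def read_local_files(root: Path) -> dict[str, FileRow]:
    """Hypothetical helper: map each file's path relative to root to its text content."""
    return {
        path.relative_to(root).as_posix(): FileRow(path.read_text(encoding="utf-8"))
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


# Build a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dir1").mkdir()
    (root / "dir1" / "file1.md").write_text("# Hello", encoding="utf-8")
    documents = read_local_files(root)

print(documents)  # {'dir1/file1.md': FileRow(content='# Hello')}
```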

@@ -78,7 +78,7 @@ The spec takes the following fields:

### Schema

The output is a table with the following sub fields:
The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:

* `file_id` (key, type: `str`): the ID of the file in Google Drive.
* `filename` (type: `str`): the filename of the file, without the path, e.g. `"file1.md"`
8 changes: 4 additions & 4 deletions examples/manuals_llm_extraction/main.py
@@ -40,23 +40,23 @@ class ArgInfo:
class MethodInfo:
    """Information about a method."""
    name: str
    args: cocoindex.typing.List[ArgInfo]
    args: list[ArgInfo]
    description: str

@dataclasses.dataclass
class ClassInfo:
    """Information about a class."""
    name: str
    description: str
    methods: cocoindex.typing.List[MethodInfo]
    methods: list[MethodInfo]

@dataclasses.dataclass
class ModuleInfo:
    """Information about a Python module."""
    title: str
    description: str
    classes: cocoindex.typing.Table[ClassInfo]
    methods: cocoindex.typing.Table[MethodInfo]
    classes: list[ClassInfo]
    methods: list[MethodInfo]

@dataclasses.dataclass
class ModuleSummary:
3 changes: 2 additions & 1 deletion python/cocoindex/__init__.py
@@ -9,4 +9,5 @@
from .index import VectorSimilarityMetric, VectorIndexDef, IndexOptions
from .auth_registry import AuthEntryReference, add_auth_entry, ref_auth_entry
from .lib import *
from ._engine import OpArgSchema
from ._engine import OpArgSchema
from .typing import Float32, Float64, LocalDateTime, OffsetDateTime, Range, Vector, Json