Commit 11268e0

feat(types): simplify/clarify data types (KTable, LTable, Vector, remove .typing.) (#405)
* feat(types): rename table types List->LTable, Table->KTable, etc.
* feat(types): update Python SDK to support using `dict` for `KTable`
* test: add a test case for struct-typed KTable key
* docs(types): revise documents regarding `KTable` and `LTable`
* feat(types): revise representation for `Vector` type
* style(types): export types to the `cocoindex.` level
* docs(types): update docs for data type
1 parent 08b428d commit 11268e0

30 files changed

+779
-562
lines changed

Cargo.lock

+307-214

docs/docs/core/basics.md

+3-3
Original file line numberDiff line numberDiff line change
@@ -21,9 +21,9 @@ An indexing flow involves source data and transformed data (either as an interme
2121

2222
Each piece of data has a **data type**, falling into one of the following categories:
2323

24-
* Basic type.
25-
* Struct type: a collection of **fields**, each with a name and a type.
26-
* Collection type: a collection of **rows**, each of which is a struct with specified schema. A collection type can be a table (which has a key field) or a list (ordered but without key field).
24+
* *Basic type*.
25+
* *Struct type*: a collection of **fields**, each with a name and a type.
26+
* *Table type*: a collection of **rows**, each of which is a struct with a specified schema. A table type can be a *KTable* (which has a key field) or an *LTable* (ordered, but without a key field).
2727

2828
An indexing flow always has a top-level struct, containing all data within and managed by the flow.
2929

docs/docs/core/custom_function.mdx

+1-1
@@ -33,7 +33,7 @@ Notes:
3333

3434
* The `cocoindex.op.function()` function decorator also takes optional parameters.
3535
See [Parameters for custom functions](#parameters-for-custom-functions) for details.
36-
* Types of arugments and the return value must be annotated, so that CocoIndex will have information about data types of the operation's output fields.
36+
* Types of arguments and the return value must be annotated, so that CocoIndex will have information about data types of the operation's output fields.
3737
See [Data Types](/docs/core/data_types) for supported types.
3838

3939
</TabItem>

docs/docs/core/data_types.mdx

+69-47
@@ -9,91 +9,113 @@ In CocoIndex, all data processed by the flow have a type determined when the flo
99

1010
This makes the schema of data processed by CocoIndex clear, and makes it easy to determine the schema of your index.
1111

12-
## Data Types
12+
## Data Types
13+
14+
You don't need to spell out data types in CocoIndex when you define a flow using existing operations (sources, functions, etc.).
15+
These operations determine the data types of the fields they produce, based on their spec and input data types.
16+
All you need to do is make sure that the data passed to functions and storage targets is accepted by them.
17+
18+
When you define [custom functions](/docs/core/custom_function), you need to specify the data types of arguments and return values.
1319

1420
### Basic Types
1521

1622
This is the list of all basic types supported by CocoIndex:
1723

18-
| Type | Description |Type in Python | Original Type in Python |
24+
| Type | Description | Specific Python Type | Native Python Type |
1925
|------|-------------|---------------|-------------------------|
2026
| Bytes | | `bytes` | `bytes` |
2127
| Str | | `str` | `str` |
2228
| Bool | | `bool` | `bool` |
2329
| Int64 | | `int` | `int` |
24-
| Float32 | | `cocoindex.typing.Float32` |`float` |
25-
| Float64 | | `cocoindex.typing.Float64` |`float` |
26-
| Range | | `cocoindex.typing.Range` | `tuple[int, int]` |
30+
| Float32 | | `cocoindex.Float32` |`float` |
31+
| Float64 | | `cocoindex.Float64` |`float` |
32+
| Range | | `cocoindex.Range` | `tuple[int, int]` |
2733
| Uuid | | `uuid.UUID` | `uuid.UUID` |
2834
| Date | | `datetime.date` | `datetime.date` |
2935
| Time | | `datetime.time` | `datetime.time` |
30-
| LocalDatetime | Date and time without timezone | `cocoindex.typing.LocalDateTime` | `datetime.datetime` |
31-
| OffsetDatetime | Date and time with a timezone offset | `cocoindex.typing.OffsetDateTime` | `datetime.datetime` |
32-
| Vector[*type*, *N*?] | |`Annotated[list[type], cocoindex.typing.Vector(dim=N)]` | `list[type]` |
33-
| Json | | `cocoindex.typing.Json` | Any type convertible to JSON by `json` package |
36+
| LocalDatetime | Date and time without timezone | `cocoindex.LocalDateTime` | `datetime.datetime` |
37+
| OffsetDatetime | Date and time with a timezone offset | `cocoindex.OffsetDateTime` | `datetime.datetime` |
38+
| Vector[*T*, *Dim*?] | *T* must be a basic type. *Dim* is a positive integer and is optional. | `cocoindex.Vector[T]` or `cocoindex.Vector[T, Dim]` | `list[T]` |
39+
| Json | | `cocoindex.Json` | Any data convertible to JSON by `json` package |
40+
41+
Values of all data types can be represented by values in Python's native types (as described under the Native Python Type column).
42+
However, the underlying execution engine and some storage systems (like Postgres) make finer distinctions for some types, specifically:
3443

35-
For some types, CocoIndex Python SDK provides annotated types with finer granularity than Python's original type, e.g.
3644
* *Float32* and *Float64* for `float`, with different precision.
3745
* *LocalDateTime* and *OffsetDateTime* for `datetime.datetime`, with different timezone awareness.
38-
* *Vector* has dimension information.
46+
* *Vector* has optional dimension information.
47+
* *Range* and *Json* carry an explicit tag, so that CocoIndex can clearly distinguish them from other data with the same native Python type.
3948

40-
When defining [custom functions](/docs/core/custom_function), use the specific types as type annotations for arguments and return values.
41-
So CocoIndex will have information about the specific type.
49+
The native Python type is always more permissive and can represent a superset of possible values.
50+
* Use the specific type only when you annotate the return type of a custom function,
51+
so that CocoIndex will have information about the precise type to be used in the execution engine and storage system.
52+
* For all other purposes, e.g. annotating argument types of a custom function, or values used internally in your custom function,
53+
you can use either one.
54+
The native Python type is usually simpler.
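
To make the distinction between specific and native types concrete, here is a minimal sketch using only the standard library. The `Float32` alias below is a hypothetical stand-in built with `typing.Annotated`; it is not the actual definition of `cocoindex.Float32`, and `l2_norm` is an illustrative function rather than part of CocoIndex. It only shows how a tagged return annotation can carry more information than plain `float` while the runtime value stays a native `float`.

```python
import math
from typing import Annotated, get_type_hints

# Hypothetical stand-in for a "specific" type: a float tagged with its precision.
# This is NOT how cocoindex.Float32 is necessarily defined.
Float32 = Annotated[float, "Float32"]


def l2_norm(values: list[float]) -> Float32:
    """Arguments use the permissive native type; the return type uses the specific one."""
    return math.sqrt(sum(v * v for v in values))


# A framework can read the tag from the return annotation...
tagged = get_type_hints(l2_norm, include_extras=True)["return"]
# ...while ordinary tooling still sees a plain float.
plain = get_type_hints(l2_norm)["return"]
```

Here `tagged.__metadata__` holds the tag and `plain` is just `float`, which mirrors the guidance above: the precise type matters where the output schema is decided, and the native type is enough elsewhere.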
4255

4356
### Struct Type
4457

45-
A struct has a bunch of fields, each with a name and a type.
58+
A Struct has a bunch of fields, each with a name and a type.
4659

47-
In Python, a struct type is represented by a [dataclass](https://docs.python.org/3/library/dataclasses.html),
60+
In Python, a Struct type is represented by a [dataclass](https://docs.python.org/3/library/dataclasses.html),
4861
and all fields must be annotated with a specific type. For example:
4962

5063
```python
5164
from dataclasses import dataclass
import datetime
5265

5366
@dataclass
54-
class Order:
55-
order_id: str
56-
name: str
57-
price: float
67+
class Person:
68+
    first_name: str
69+
    last_name: str
70+
    dob: datetime.date
5871
```
5972

60-
### Collection Types
73+
### Table Types
6174

62-
A collection type models a collection of rows, each of which is a struct with specific schema.
75+
A Table type models a collection of rows, each with multiple columns.
76+
Each column of a table has a specific type.
6377

64-
We have two specific types of collection:
78+
There are two kinds of Table types: KTable and LTable.
6579

66-
| Type | Description |Type in Python | Original Type in Python |
67-
|------|-------------|---------------|-------------------------|
68-
| Table[*type*] | The first field is the key, and CocoIndex enforces its uniqueness | `cocoindex.typing.Table[type]` | `list[type]` |
69-
| List[*type*] | No key field; row order is preserved | `cocoindex.typing.List[type]` | `list[type]` |
80+
#### KTable
7081

71-
For example, we can use `cocoindex.typing.Table[Order]` to represent a table of orders, and the first field `order_id` will be taken as the key field.
82+
KTable is a Table type whose first column serves as the key.
83+
The row order of a KTable is not preserved.
84+
Type of the first column (key column) must be a [key type](#key-types).
7285

73-
## Types to Create Indexes
86+
In Python, a KTable type is represented by `dict[K, V]`.
87+
The `V` should be a dataclass, representing the value fields of each row.
88+
For example, you can use `dict[str, Person]` to represent a KTable, with 4 columns: key (Str), `first_name` (Str), `last_name` (Str), `dob` (Date).
7489

75-
### Key Types
90+
Note that if you want to use a struct as the key, you need to annotate the struct with `@dataclass(frozen=True)`, so the values are immutable.
91+
For example:
7692

77-
Currently, the following types are supported as types for key fields:
93+
```python
94+
@dataclass(frozen=True)
95+
class PersonKey:
96+
    id_kind: str
97+
    id: str
98+
```
7899

79-
- `bytes`
80-
- `str`
81-
- `bool`
82-
- `int64`
83-
- `range`
84-
- `uuid`
85-
- `date`
86-
- Struct with all fields being key types
100+
Then you can use `dict[PersonKey, Person]` to represent a KTable keyed by `PersonKey`.
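
As a small illustration of why the key struct needs `frozen=True`, the sketch below builds a `dict[PersonKey, Person]` with plain Python and looks a row up by key. It uses only the standard library; the sample values and the `MutableKey` class are made up for the example, and nothing here calls into CocoIndex itself.

```python
import datetime
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonKey:
    id_kind: str
    id: str


@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date


# A KTable-shaped value: the dict key is the key column, the dataclass holds the rest.
people: dict[PersonKey, Person] = {
    PersonKey("passport", "X123"): Person("Ada", "Lovelace", datetime.date(1815, 12, 10)),
}

# Frozen dataclasses are hashable and compare by value, so an equal key finds the row.
row = people[PersonKey("passport", "X123")]


@dataclass  # not frozen: Python sets __hash__ to None, so it cannot be a dict key
class MutableKey:
    id: str


try:
    hash(MutableKey("X123"))
    mutable_key_is_hashable = True
except TypeError:
    mutable_key_is_hashable = False
```

A non-frozen dataclass with the default `eq=True` is unhashable, which is why it cannot serve as a `dict` key.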
101+
102+
103+
#### LTable
87104

88-
### Vector Type
105+
LTable is a Table type whose row order is preserved. LTable has no key column.
89106

90-
Users can create vector index on fields with `vector` types.
91-
A vector index also needs to be configured with a similarity metric, and the index is only effective when this metric is used during retrieval.
107+
In Python, an LTable type is represented by `list[R]`, where `R` is a dataclass representing a row.
108+
For example, you can use `list[Person]` to represent an LTable with 3 columns: `first_name` (Str), `last_name` (Str), `dob` (Date).
92109

93-
Following metrics are supported:
110+
## Key Types
94111

95-
| Metric Name | Description | Similarity Order |
96-
|-------------|-------------|------------------|
97-
| `CosineSimilarity` | [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | Larger is more similar |
98-
| `L2Distance` | [L2 distance (a.k.a. Euclidean distance)](https://en.wikipedia.org/wiki/Euclidean_distance) | Smaller is more similar |
99-
| `InnerProduct` | [Inner product](https://en.wikipedia.org/wiki/Inner_product_space) | Larger is more similar |
112+
Currently, the following types are key types:
113+
114+
- Bytes
115+
- Str
116+
- Bool
117+
- Int64
118+
- Range
119+
- Uuid
120+
- Date
121+
- Struct with all fields being key types
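
The recursive rule in the list above (a struct is a key type when all of its fields are key types) can be sketched as a small checker over native Python types. This is an illustrative helper only: `is_key_type` and `BASIC_KEY_TYPES` are hypothetical names, the mapping to native types follows the basic types table, and this is not CocoIndex's own validation logic.

```python
import dataclasses
import datetime
import uuid

# Native Python types standing in for Bytes, Str, Bool, Int64, Uuid and Date.
BASIC_KEY_TYPES = (bytes, str, bool, int, uuid.UUID, datetime.date)


def is_key_type(tp) -> bool:
    """Return True if `tp` matches the key type rules listed above (sketch only)."""
    if tp in BASIC_KEY_TYPES:
        return True
    if tp == tuple[int, int]:  # native representation of Range
        return True
    if isinstance(tp, type) and dataclasses.is_dataclass(tp):
        # Struct: every field must itself be a key type.
        return all(is_key_type(f.type) for f in dataclasses.fields(tp))
    return False


@dataclasses.dataclass(frozen=True)
class PersonKey:
    id_kind: str
    id: str


@dataclasses.dataclass(frozen=True)
class Measurement:
    sensor: str
    value: float  # float is not in the key type list
```

With these definitions, `is_key_type(PersonKey)` holds while `is_key_type(Measurement)` does not, because of the `float` field. The sketch assumes field annotations are real types rather than strings.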

docs/docs/core/flow_def.mdx

+22-5
@@ -1,6 +1,7 @@
11
---
22
title: Flow Definition
33
description: Define a CocoIndex flow, by specifying source, transformations and storages, and connect input/output data of them.
4+
toc_max_heading_level: 4
45
---
56

67
import Tabs from '@theme/Tabs';
@@ -178,7 +179,7 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
178179

179180
### For each row
180181

181-
If the data slice has `Table` type, you can call `row()` method to obtain a child scope representing each row, to apply operations on each row.
182+
If the data slice has a [table type](/docs/core/data_types#table-types), you can call the `row()` method to obtain a child scope representing each row, to apply operations on each row.
182183

183184
<Tabs>
184185
<TabItem value="python" label="Python" default>
@@ -281,16 +282,32 @@ The target storage is managed by CocoIndex, i.e. it'll be created by [CocoIndex
281282
The `name` for the same storage should remain stable across different runs.
282283
If it changes, CocoIndex will treat it as an old storage removed and a new one created, and perform setup changes and reindexing accordingly.
283284

284-
#### Storage Indexes
285+
## Storage Indexes
285286

286287
Many storages support indexes, to boost efficiency in retrieving data.
287288
CocoIndex provides a common way to configure indexes for various storages.
288289

289-
* *Primary key*. `primary_key_fields` (`Sequence[str]`): the fields to be used as primary key. Types of the fields must be supported as key fields. See [Key Types](data_types#key-types) for more details.
290-
* *Vector index*. `vector_indexes` (`Sequence[VectorIndexDef]`): the fields to create vector index. `VectorIndexDef` has the following fields:
290+
### Primary Key
291+
292+
*Primary key* is specified by `primary_key_fields` (`Sequence[str]`).
293+
Types of the fields must be key types. See [Key Types](data_types#key-types) for more details.
294+
295+
### Vector Index
296+
297+
*Vector index* is specified by `vector_indexes` (`Sequence[VectorIndexDef]`). `VectorIndexDef` has the following fields:
298+
291299
* `field_name`: the field to create vector index.
292-
* `metric`: the similarity metric to use. See [Vector Type](data_types#vector-type) for more details about supported similarity metrics.
300+
* `metric`: the similarity metric to use.
301+
302+
#### Similarity Metrics
303+
304+
The following metrics are supported:
293305

306+
| Metric Name | Description | Similarity Order |
307+
|-------------|-------------|------------------|
308+
| CosineSimilarity | [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | Larger is more similar |
309+
| L2Distance | [L2 distance (a.k.a. Euclidean distance)](https://en.wikipedia.org/wiki/Euclidean_distance) | Smaller is more similar |
310+
| InnerProduct | [Inner product](https://en.wikipedia.org/wiki/Inner_product_space) | Larger is more similar |
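
For reference, the three metrics in the table can be written out in a few lines of plain Python. This is a textbook sketch of the formulas, not the code CocoIndex or any storage backend uses; the function names are chosen for the example.

```python
import math


def inner_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Inner product of the normalized vectors; larger means more similar."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 1.0]
```

Note that `cosine_similarity` ignores vector length (`same_direction` scores as high as `query` itself), while `l2_distance` and `inner_product` do not, which is one reason the metric used at retrieval time should match the one the index was built with.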
294311

295312
## Miscellaneous
296313

docs/docs/getting_started/quickstart.md

+4-3
@@ -112,11 +112,12 @@ Notes:
112112
* `doc`, representing each row of `documents`.
113113
* `chunk`, representing each row of `chunks`.
114114

115-
3. A *data source* extracts data from an external source. In this example, the `LocalFile` data source defines a table, each row has `"filename"` and `"content"` fields.
115+
3. A *data source* extracts data from an external source.
116+
In this example, the `LocalFile` data source imports local files as a KTable (a table with a key field; see [KTable](../core/data_types#ktable) for details). Each row has `"filename"` and `"content"` fields.
116117

117-
4. After defining the table, we extended a new field `"chunks"` to each row by *transforming* the `"content"` field using `SplitRecursively`. The output of the `SplitRecursively` is also a table representing each chunk of the document, with `"location"` and `"text"` fields.
118+
4. After defining the KTable, we extended a new field `"chunks"` to each row by *transforming* the `"content"` field using `SplitRecursively`. The output of the `SplitRecursively` is also a KTable representing each chunk of the document, with `"location"` and `"text"` fields.
118119

119-
5. After defining the table, we extended a new field `"embedding"` to each row by *transforming* the `"text"` field using `SentenceTransformerEmbed`.
120+
5. After defining the KTable, we extended a new field `"embedding"` to each row by *transforming* the `"text"` field using `SentenceTransformerEmbed`.
120121

121122
6. In CocoIndex, a *collector* collects multiple entries of data together. In this example, the `doc_embeddings` collector collects data from all `chunk`s across all `doc`s, and uses the collected data to build a vector index `"doc_embeddings"`, using `Postgres`.
122123

docs/docs/ops/functions.md

+1-1
@@ -32,7 +32,7 @@ Input data:
3232
To see all supported language names and extensions, see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
3333
If it's unspecified or the specified language is not supported, it will be treated as plain text.
3434

35-
Return type: `Table`, each row represents a chunk, with the following sub fields:
35+
Return type: [KTable](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:
3636

3737
* `location` (type: `range`): The location of the chunk.
3838
* `text` (type: `str`): The text of the chunk.

docs/docs/ops/sources.md

+2-2
@@ -28,7 +28,7 @@ The spec takes the following fields:
2828

2929
### Schema
3030

31-
The output is a table with the following sub fields:
31+
The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
3232
* `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`
3333
* `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
3434

@@ -78,7 +78,7 @@ The spec takes the following fields:
7878

7979
### Schema
8080

81-
The output is a table with the following sub fields:
81+
The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
8282

8383
* `file_id` (key, type: `str`): the ID of the file in Google Drive.
8484
* `filename` (type: `str`): the filename of the file, without the path, e.g. `"file1.md"`

examples/manuals_llm_extraction/main.py

+4-4
@@ -40,23 +40,23 @@ class ArgInfo:
4040
class MethodInfo:
4141
"""Information about a method."""
4242
name: str
43-
args: cocoindex.typing.List[ArgInfo]
43+
args: list[ArgInfo]
4444
description: str
4545

4646
@dataclasses.dataclass
4747
class ClassInfo:
4848
"""Information about a class."""
4949
name: str
5050
description: str
51-
methods: cocoindex.typing.List[MethodInfo]
51+
methods: list[MethodInfo]
5252

5353
@dataclasses.dataclass
5454
class ModuleInfo:
5555
"""Information about a Python module."""
5656
title: str
5757
description: str
58-
classes: cocoindex.typing.Table[ClassInfo]
59-
methods: cocoindex.typing.Table[MethodInfo]
58+
classes: list[ClassInfo]
59+
methods: list[MethodInfo]
6060

6161
@dataclasses.dataclass
6262
class ModuleSummary:

python/cocoindex/__init__.py

+2-1
@@ -9,4 +9,5 @@
99
from .index import VectorSimilarityMetric, VectorIndexDef, IndexOptions
1010
from .auth_registry import AuthEntryReference, add_auth_entry, ref_auth_entry
1111
from .lib import *
12-
from ._engine import OpArgSchema
12+
from ._engine import OpArgSchema
13+
from .typing import Float32, Float64, LocalDateTime, OffsetDateTime, Range, Vector, Json
