feat(types): simplify/clarify data types (KTable, LTable, Vector, remove .typing.) (#405)
* feat(types): rename table types List->LTable, Table->KTable, etc.
* feat(types): update Python SDK to support using `dict` for `KTable`
* test: add a test case for struct-typed KTable key
* docs(types): revise documents regarding `KTable` and `LTable`
* feat(types): revise representation for `Vector` type
* style(types): export types to the `cocoindex.` level
* docs(types): update docs for data type
docs/docs/core/basics.md (+3 -3)
@@ -21,9 +21,9 @@ An indexing flow involves source data and transformed data (either as an interme
 Each piece of data has a **data type**, falling into one of the following categories:

-* Basic type.
-* Struct type: a collection of **fields**, each with a name and a type.
-* Collection type: a collection of **rows**, each of which is a struct with specified schema. A collection type can be a table (which has a key field) or a list (ordered but without key field).
+* *Basic type*.
+* *Struct type*: a collection of **fields**, each with a name and a type.
+* *Table type*: a collection of **rows**, each of which is a struct with specified schema. A table type can be a *KTable* (which has a key field) or an *LTable* (ordered but without key field).

 An indexing flow always has a top-level struct, containing all data within and managed by the flow.
docs/docs/core/custom_function.mdx (+1 -1)
@@ -33,7 +33,7 @@ Notes:
 * The `cocoindex.op.function()` function decorator also takes optional parameters.
   See [Parameters for custom functions](#parameters-for-custom-functions) for details.
-* Types of arugments and the return value must be annotated, so that CocoIndex will have information about data types of the operation's output fields.
+* Types of arguments and the return value must be annotated, so that CocoIndex will have information about data types of the operation's output fields.
   See [Data Types](/docs/core/data_types) for supported types.
-| Json ||`cocoindex.typing.Json`| Any type convertible to JSON by `json` package |
+| LocalDateTime | Date and time without timezone | `cocoindex.LocalDateTime` | `datetime.datetime` |
+| OffsetDateTime | Date and time with a timezone offset | `cocoindex.OffsetDateTime` | `datetime.datetime` |
+| Vector[*T*, *Dim*?] | *T* must be a basic type. *Dim* is a positive integer and optional. | `cocoindex.Vector[T]` or `cocoindex.Vector[T, Dim]` | `list[T]` |
+| Json | | `cocoindex.Json` | Any data convertible to JSON by `json` package |
+
+Values of all data types can be represented by values in Python's native types (as described under the Native Python Type column).
+However, the underlying execution engine and some storage systems (like Postgres) have finer distinctions for some types, specifically:

-For some types, CocoIndex Python SDK provides annotated types with finer granularity than Python's original type, e.g.
 * *Float32* and *Float64* for `float`, with different precision.
 * *LocalDateTime* and *OffsetDateTime* for `datetime.datetime`, with different timezone awareness.
-* *Vector* has dimension information.
+* *Vector* has optional dimension information.
+* *Range* and *Json* provide a clear tag for the type, to clearly distinguish the type in CocoIndex.

-When defining [custom functions](/docs/core/custom_function), use the specific types as type annotations for arguments and return values.
-So CocoIndex will have information about the specific type.
+The native Python type is always more permissive and can represent a superset of possible values.
+* You only need to use the specific type when annotating the return type of a custom function,
+  so that CocoIndex will have information about the precise type to be used in the execution engine and storage system.
+* For all other purposes, e.g. annotating argument types of a custom function or handling values inside your custom function,
+  you can use either.
+  The native Python type is usually simpler.
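The guidance above (specific type on the return annotation, native type elsewhere) can be illustrated with a small standalone sketch. It does not import CocoIndex: `Float32` and `Float64` below are hypothetical stand-ins built with `typing.Annotated`, and `return_tag` is an illustrative helper, not part of any SDK. The sketch only shows how a return annotation can carry a finer-grained tag while the value stays a plain `float`.

```python
from typing import Annotated, Optional, get_type_hints

# Hypothetical stand-ins, NOT the real cocoindex definitions: each alias is
# still a plain `float` at runtime but carries a tag naming the finer type.
Float32 = Annotated[float, "Float32"]
Float64 = Annotated[float, "Float64"]


def half(x: float) -> Float32:
    """Argument uses the permissive native type; the return type is specific."""
    return x / 2


def return_tag(fn) -> Optional[str]:
    """Read the tag that an engine could use to pick the precise output type."""
    hint = get_type_hints(fn, include_extras=True)["return"]
    metadata = getattr(hint, "__metadata__", ())
    return metadata[0] if metadata else None


print(return_tag(half))  # Float32
print(half(3.0))         # 1.5
```

A function annotated with plain `float` as its return type would yield `None` from `return_tag`, which is the situation the specific type is meant to avoid.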
 ### Struct Type

-A struct has a bunch of fields, each with a name and a type.
+A Struct has a bunch of fields, each with a name and a type.

-In Python, a struct type is represented by a [dataclass](https://docs.python.org/3/library/dataclasses.html),
+In Python, a Struct type is represented by a [dataclass](https://docs.python.org/3/library/dataclasses.html),
 and all fields must be annotated with a specific type. For example:

 ```python
 from dataclasses import dataclass

 @dataclass
-class Order:
-    order_id: str
-    name: str
-    price: float
+class Person:
+    first_name: str
+    last_name: str
+    dob: datetime.date
 ```

-### Collection Types
+### Table Types

-A collection type models a collection of rows, each of which is a struct with specific schema.
+A Table type models a collection of rows, each with multiple columns.
+Each column of a table has a specific type.

-We have two specific types of collection:
+There are two Table types: KTable and LTable.

-| Type | Description |Type in Python | Original Type in Python |
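As a rough illustration of the KTable and LTable distinction, the sketch below models both with native Python containers. The commit message states that the Python SDK accepts `dict` for KTable; using `list` for LTable is an assumption here, and the `Chunk` dataclass and the string keys are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Row schema shared by both tables (hypothetical, for illustration)."""
    text: str


# KTable sketch: rows are addressed by a key, so a `dict` is a natural fit.
chunks_by_location: dict[str, Chunk] = {
    "0-5": Chunk(text="hello"),
    "6-11": Chunk(text="world"),
}

# LTable sketch (assumed `list`): rows have no key; only order identifies them.
chunks_in_order: list[Chunk] = [Chunk(text="hello"), Chunk(text="world")]

# A keyed row can be replaced by its key; an unkeyed row is appended instead.
chunks_by_location["0-5"] = Chunk(text="HELLO")
chunks_in_order.append(Chunk(text="!"))

print(len(chunks_by_location))  # 2
print(len(chunks_in_order))     # 3
```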
-If the data slice has `Table` type, you can call `row()` method to obtain a child scope representing each row, to apply operations on each row.
+If the data slice has [table type](/docs/core/data_types#table-types), you can call `row()` method to obtain a child scope representing each row, to apply operations on each row.

 <Tabs>
 <TabItem value="python" label="Python" default>
@@ -281,16 +282,32 @@ The target storage is managed by CocoIndex, i.e. it'll be created by [CocoIndex
 The `name` for the same storage should remain stable across different runs.
 If it changes, CocoIndex will treat it as an old storage removed and a new one created, and perform setup changes and reindexing accordingly.

-#### Storage Indexes
+## Storage Indexes

 Many storages support indexes, to boost efficiency in retrieving data.
 CocoIndex provides a common way to configure indexes for various storages.

-* *Primary key*. `primary_key_fields` (`Sequence[str]`): the fields to be used as primary key. Types of the fields must be supported as key fields. See [Key Types](data_types#key-types) for more details.
-* *Vector index*. `vector_indexes` (`Sequence[VectorIndexDef]`): the fields to create vector index. `VectorIndexDef` has the following fields:
+### Primary Key
+
+*Primary key* is specified by `primary_key_fields` (`Sequence[str]`).
+Types of the fields must be key types. See [Key Types](data_types#key-types) for more details.
+
+### Vector Index
+
+*Vector index* is specified by `vector_indexes` (`Sequence[VectorIndexDef]`). `VectorIndexDef` has the following fields:
+
 * `field_name`: the field on which to create the vector index.
-* `metric`: the similarity metric to use. See [Vector Type](data_types#vector-type) for more details about supported similarity metrics.
+* `metric`: the similarity metric to use.
+
+#### Similarity Metrics
+
+The following metrics are supported:

+| Metric Name | Description | Similarity Order |
+|-------------|-------------|------------------|
+| CosineSimilarity | [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | Larger is more similar |
+| L2Distance | [L2 distance (a.k.a. Euclidean distance)](https://en.wikipedia.org/wiki/Euclidean_distance) | Smaller is more similar |
+| InnerProduct | [Inner product](https://en.wikipedia.org/wiki/Inner_product_space) | Larger is more similar |
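To make the "Similarity Order" column concrete, here is a small pure-Python sketch of the three metrics. It is not how CocoIndex or any storage backend computes them; the function names and the sample vectors are chosen for the example only.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Inner product of the normalized vectors; larger means more similar."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
near = [0.9, 0.1]   # points almost the same way as `query`
far = [0.0, 1.0]    # orthogonal to `query`

# Each comparison follows the ordering stated in the table.
print(cosine_similarity(query, near) > cosine_similarity(query, far))  # True
print(l2_distance(query, near) < l2_distance(query, far))              # True
print(inner_product(query, near) > inner_product(query, far))          # True
```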
docs/docs/getting_started/quickstart.md (+4 -3)
@@ -112,11 +112,12 @@ Notes:
 * `doc`, representing each row of `documents`.
 * `chunk`, representing each row of `chunks`.

-3. A *data source* extracts data from an external source. In this example, the `LocalFile` data source defines a table, each row has `"filename"` and `"content"` fields.
+3. A *data source* extracts data from an external source.
+   In this example, the `LocalFile` data source imports local files as a KTable (a table with a key field, see [KTable](../core/data_types#ktable) for details), in which each row has `"filename"` and `"content"` fields.

-4. After defining the table, we extended a new field `"chunks"` to each row by *transforming* the `"content"` field using `SplitRecursively`. The output of the `SplitRecursively` is also a table representing each chunk of the document, with `"location"` and `"text"` fields.
+4. After defining the KTable, we added a new field `"chunks"` to each row by *transforming* the `"content"` field using `SplitRecursively`. The output of `SplitRecursively` is also a KTable representing each chunk of the document, with `"location"` and `"text"` fields.

-5. After defining the table, we extended a new field `"embedding"` to each row by *transforming* the `"text"` field using `SentenceTransformerEmbed`.
+5. After defining the KTable, we added a new field `"embedding"` to each row by *transforming* the `"text"` field using `SentenceTransformerEmbed`.

 6. In CocoIndex, a *collector* collects multiple entries of data together. In this example, the `doc_embeddings` collector collects data from all `chunk`s across all `doc`s, and uses the collected data to build a vector index `"doc_embeddings"`, using `Postgres`.
docs/docs/ops/functions.md (+1 -1)
@@ -32,7 +32,7 @@ Input data:
 To see all supported language names and extensions, see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
 If it's unspecified or the specified language is not supported, it will be treated as plain text.

-Return type: `Table`, each row represents a chunk, with the following sub fields:
+Return type: [KTable](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:

 * `location` (type: `range`): The location of the chunk.