cocoindex-io · badmonster0 · Apr 29, 2025 · Apr 28, 2025 · Apr 28, 2025 · Apr 28, 2025
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/docs/docs/core/basics.md b/docs/docs/core/basics.md
@@ -21,9 +21,9 @@ An indexing flow involves source data and transformed data (either as an interme
 
 Each piece of data has a **data type**, falling into one of the following categories:
 
-*   Basic type.
-*   Struct type: a collection of **fields**, each with a name and a type.
-*   Collection type: a collection of **rows**, each of which is a struct with specified schema. A collection type can be a table (which has a key field) or a list (ordered but without key field).
+*   *Basic type*.
+*   *Struct type*: a collection of **fields**, each with a name and a type.
+*   *Table type*: a collection of **rows**, each of which is a struct with a specified schema. A table type can be a *KTable* (which has a key field) or an *LTable* (ordered, but without a key field).
 
 An indexing flow always has a top-level struct, containing all data within and managed by the flow.
 

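Reviewer sketch for the KTable/LTable wording above: a minimal, standard-library-only illustration of the two table shapes, using the `dict` and `list` representations this PR introduces in `data_types.mdx`. The `Person` rows and the variable names are illustrative assumptions, and nothing here calls CocoIndex itself.

```python
import datetime
from dataclasses import dataclass


@dataclass
class Person:
    first_name: str
    last_name: str
    dob: datetime.date


# KTable sketch: the dict key plays the role of the key column.
people_by_id: dict[str, Person] = {
    "p1": Person("Ada", "Lovelace", datetime.date(1815, 12, 10)),
    "p2": Person("Alan", "Turing", datetime.date(1912, 6, 23)),
}

# LTable sketch: no key column, but row order is preserved.
people_in_order: list[Person] = [people_by_id["p2"], people_by_id["p1"]]

# Assigning to an existing key replaces the row, so keys stay unique.
people_by_id["p1"] = Person("Augusta", "King", datetime.date(1815, 12, 10))
```

The dict form makes key uniqueness structural, while the list form only promises ordering.
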
diff --git a/docs/docs/core/custom_function.mdx b/docs/docs/core/custom_function.mdx
@@ -33,7 +33,7 @@ Notes:
 
 *   The `cocoindex.op.function()` function decorator also takes optional parameters.
     See [Parameters for custom functions](#parameters-for-custom-functions) for details.
-*   Types of arugments and the return value must be annotated, so that CocoIndex will have information about data types of the operation's output fields.
+*   Types of arguments and the return value must be annotated, so that CocoIndex will have information about data types of the operation's output fields.
     See [Data Types](/docs/core/data_types) for supported types.
 
 </TabItem>

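Reviewer sketch for the annotation requirement above: a minimal illustration of why annotations matter, showing that Python exposes them at runtime through `typing.get_type_hints`. The function `text_length` is hypothetical, and this is not a claim about how CocoIndex reads annotations internally.

```python
import typing


def text_length(text: str) -> int:
    """Hypothetical custom function; only the annotations matter here."""
    return len(text)


# get_type_hints() resolves the annotations into real type objects.
hints = typing.get_type_hints(text_length)

# The "return" entry describes the output; the rest describe the arguments.
return_type = hints.pop("return")
```

Without the annotations, `hints` would be empty and there would be nothing to derive an output schema from.
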
diff --git a/docs/docs/core/data_types.mdx b/docs/docs/core/data_types.mdx
@@ -9,91 +9,113 @@ In CocoIndex, all data processed by the flow have a type determined when the flo
 
 This makes schema of data processed by CocoIndex clear, and easily determine the schema of your index.
 
-## Data Types 
+## Data Types
+
+You don't need to spell out data types in CocoIndex when you define the flow using existing operations (source, function, etc.).
+These operations determine the data types of the fields they produce, based on the spec and input data types.
+All you need to do is make sure that the data passed to functions and storage targets is accepted by them.
+
+When you define [custom functions](/docs/core/custom_function), you need to specify the data types of arguments and return values.
 
 ### Basic Types
 
 This is the list of all basic types supported by CocoIndex:
 
-| Type | Description |Type in Python | Original Type in Python |
+| Type | Description | Specific Python Type | Native Python Type |
 |------|-------------|---------------|-------------------------|
 | Bytes | | `bytes` | `bytes` |
 | Str | | `str` | `str` |
 | Bool | | `bool` | `bool` |
 | Int64 | | `int` | `int` |
-| Float32 | | `cocoindex.typing.Float32` |`float` | 
-| Float64 | |  `cocoindex.typing.Float64` |`float` |
-| Range | | `cocoindex.typing.Range`  | `tuple[int, int]` |
+| Float32 | | `cocoindex.Float32` |`float` | 
+| Float64 | |  `cocoindex.Float64` |`float` |
+| Range | | `cocoindex.Range`  | `tuple[int, int]` |
 | Uuid | | `uuid.UUID` | `uuid.UUID` |
 | Date | | `datetime.date` | `datetime.date` |
 | Time | | `datetime.time` | `datetime.time` |
-| LocalDatetime | Date and time without timezone | `cocoindex.typing.LocalDateTime` | `datetime.datetime` |
-| OffsetDatetime | Date and time with a timezone offset | `cocoindex.typing.OffsetDateTime` | `datetime.datetime` |
-| Vector[*type*, *N*?] | |`Annotated[list[type], cocoindex.typing.Vector(dim=N)]` | `list[type]` | 
-| Json | | `cocoindex.typing.Json` | Any type convertible to JSON by `json` package | 
+| LocalDatetime | Date and time without timezone | `cocoindex.LocalDateTime` | `datetime.datetime` |
+| OffsetDatetime | Date and time with a timezone offset | `cocoindex.OffsetDateTime` | `datetime.datetime` |
+| Vector[*T*, *Dim*?] | *T* must be basic type. *Dim* is a positive integer and optional. |`cocoindex.Vector[T]` or `cocoindex.Vector[T, Dim]` | `list[T]` | 
+| Json | | `cocoindex.Json` | Any data convertible to JSON by `json` package | 
+
+Values of all data types can be represented by values in Python's native types (as described under the Native Python Type column).
+However, the underlying execution engine and some storage systems (like Postgres) have finer distinctions for some types, specifically:
 
-For some types, CocoIndex Python SDK provides annotated types with finer granularity than Python's original type, e.g.
 *   *Float32* and *Float64* for `float`, with different precision.
 *   *LocalDateTime* and *OffsetDateTime* for `datetime.datetime`, with different timezone awareness.
-*   *Vector* has dimension information.
+*   *Vector* has optional dimension information.
+*   *Range* and *Json* carry an explicit tag, so the type is unambiguous in CocoIndex.
 
-When defining [custom functions](/docs/core/custom_function), use the specific types as type annotations for arguments and return values.
-So CocoIndex will have information about the specific type.
+The native Python type is always more permissive and can represent a superset of possible values.
+*   You need the specific type only when you annotate the return type of a custom function,
+    so that CocoIndex has the precise type to use in the execution engine and storage system.
+*   For all other purposes, e.g. annotating argument types of a custom function, or values used internally in your custom function,
+    you can use either one.
+    The native Python type is usually simpler.
 
 ### Struct Type
 
-A struct has a bunch of fields, each with a name and a type.
+A Struct has a bunch of fields, each with a name and a type.
 
-In Python, a struct type is represented by a [dataclass](https://docs.python.org/3/library/dataclasses.html),
+In Python, a Struct type is represented by a [dataclass](https://docs.python.org/3/library/dataclasses.html),
 and all fields must be annotated with a specific type. For example:
 
 ```python
 from dataclasses import dataclass
+import datetime
 
 @dataclass
-class Order:
-    order_id: str
-    name: str
-    price: float
+class Person:
+    first_name: str
+    last_name: str
+    dob: datetime.date
 ```
 
-### Collection Types
+### Table Types
 
-A collection type models a collection of rows, each of which is a struct with specific schema.
+A Table type models a collection of rows, each with multiple columns.
+Each column of a table has a specific type.
 
-We have two specific types of collection:
+There are two kinds of Table types: KTable and LTable.
 
-| Type | Description |Type in Python | Original Type in Python |
-|------|-------------|---------------|-------------------------|
-| Table[*type*] | The first field is the key, and CocoIndex enforces its uniqueness | `cocoindex.typing.Table[type]` | `list[type]` |
-| List[*type*] | No key field; row order is preserved | `cocoindex.typing.List[type]` | `list[type]` |
+#### KTable
 
-For example, we can use `cocoindex.typing.Table[Order]` to represent a table of orders, and the first field `order_id` will be taken as the key field.
+KTable is a Table type whose first column serves as the key.
+The row order of a KTable is not preserved.
+Type of the first column (key column) must be a [key type](#key-types).
 
-## Types to Create Indexes
+In Python, a KTable type is represented by `dict[K, V]`. 
+The `V` should be a dataclass, representing the value fields of each row.
+For example, you can use `dict[str, Person]` to represent a KTable, with 4 columns: key (Str), `first_name` (Str), `last_name` (Str), `dob` (Date).
 
-### Key Types
+Note that if you want to use a struct as the key, you need to annotate the struct with `@dataclass(frozen=True)`, so the values are immutable.
+For example:
 
-Currently, the following types are supported as types for key fields:
+```python
+@dataclass(frozen=True)
+class PersonKey:
+    id_kind: str
+    id: str
+```
 
-- `bytes`
-- `str`
-- `bool`
-- `int64`
-- `range`
-- `uuid`
-- `date`
-- Struct with all fields being key types
+Then you can use `dict[PersonKey, Person]` to represent a KTable keyed by `PersonKey`.
+
+
+#### LTable
 
-### Vector Type
+LTable is a Table type whose row order is preserved. LTable has no key column.
 
-Users can create vector index on fields with `vector` types.
-A vector index also needs to be configured with a similarity metric, and the index is only effective when this metric is used during retrieval.
+In Python, an LTable type is represented by `list[R]`, where `R` is a dataclass representing a row.
+For example, you can use `list[Person]` to represent an LTable with 3 columns: `first_name` (Str), `last_name` (Str), `dob` (Date).
 
-Following metrics are supported:
+## Key Types
 
-| Metric Name | Description | Similarity Order |
-|-------------|-------------|------------------|
-| `CosineSimilarity` | [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | Larger is more similar |
-| `L2Distance` | [L2 distance (a.k.a. Euclidean distance)](https://en.wikipedia.org/wiki/Euclidean_distance) | Smaller is more similar |
-| `InnerProduct` | [Inner product](https://en.wikipedia.org/wiki/Inner_product_space) | Larger is more similar |
+Currently, the following types are key types:
+
+- Bytes
+- Str
+- Bool
+- Int64
+- Range
+- Uuid
+- Date
+- Struct with all fields being key types
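
Reviewer sketch for the key-type list above: a minimal, hypothetical checker that applies the recursive rule ("Struct with all fields being key types") to native Python types. `is_key_type` and `BASIC_KEY_TYPES` are illustrative names, not CocoIndex APIs, and `Range` is represented by its native form `tuple[int, int]` as an assumption.

```python
import dataclasses
import datetime
import uuid

# Native Python stand-ins for the basic key types listed above.
BASIC_KEY_TYPES = {bytes, str, bool, int, uuid.UUID, datetime.date}


def is_key_type(tp) -> bool:
    """Hypothetical helper: True if `tp` matches the key-type list."""
    # Assumption: Range is checked via its native form, tuple[int, int].
    if tp in BASIC_KEY_TYPES or tp == tuple[int, int]:
        return True
    # A struct qualifies only if every one of its fields is a key type.
    if isinstance(tp, type) and dataclasses.is_dataclass(tp):
        return all(is_key_type(f.type) for f in dataclasses.fields(tp))
    return False


@dataclasses.dataclass(frozen=True)
class PersonKey:
    id_kind: str
    id: str


@dataclasses.dataclass
class Measurement:
    label: str
    value: float  # float is not in the key-type list


# A frozen dataclass is hashable, so it can be used as a dict key.
lookup = {PersonKey("passport", "X123"): "row for X123"}
```

The helper does not check `frozen=True`; that requirement comes from Python needing hashable `dict` keys, which the `lookup` line exercises.
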
diff --git a/docs/docs/core/flow_def.mdx b/docs/docs/core/flow_def.mdx
@@ -1,6 +1,7 @@
 ---
 title: Flow Definition
 description: Define a CocoIndex flow, by specifying source, transformations and storages, and connect input/output data of them.
+toc_max_heading_level: 4
 ---
 
 import Tabs from '@theme/Tabs';
@@ -178,7 +179,7 @@ def demo_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataSco
 
 ### For each row
 
-If the data slice has `Table` type, you can call `row()` method to obtain a child scope representing each row, to apply operations on each row.
+If the data slice has [table type](/docs/core/data_types#table-types), you can call `row()` method to obtain a child scope representing each row, to apply operations on each row.
 
 <Tabs>
 <TabItem value="python" label="Python" default>
@@ -281,16 +282,32 @@ The target storage is managed by CocoIndex, i.e. it'll be created by [CocoIndex
 The `name` for the same storage should remain stable across different runs.
 If it changes, CocoIndex will treat it as an old storage removed and a new one created, and perform setup changes and reindexing accordingly.
 
-#### Storage Indexes
+## Storage Indexes
 
 Many storages support indexes, to boost efficiency in retrieving data.
 CocoIndex provides a common way to configure indexes for various storages.
 
-*   *Primary key*. `primary_key_fields` (`Sequence[str]`): the fields to be used as primary key. Types of the fields must be supported as key fields. See [Key Types](data_types#key-types) for more details.
-*   *Vector index*. `vector_indexes` (`Sequence[VectorIndexDef]`): the fields to create vector index. `VectorIndexDef` has the following fields:
+### Primary Key
+
+*Primary key* is specified by `primary_key_fields` (`Sequence[str]`).
+Types of the fields must be key types. See [Key Types](data_types#key-types) for more details.
+
+### Vector Index
+
+*Vector index* is specified by `vector_indexes` (`Sequence[VectorIndexDef]`). `VectorIndexDef` has the following fields:
+
     *   `field_name`: the field to create vector index.
-    *   `metric`: the similarity metric to use. See [Vector Type](data_types#vector-type) for more details about supported similarity metrics.
+    *   `metric`: the similarity metric to use.
+
+#### Similarity Metrics
+
+The following metrics are supported:
 
+| Metric Name | Description | Similarity Order |
+|-------------|-------------|------------------|
+| CosineSimilarity | [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | Larger is more similar |
+| L2Distance | [L2 distance (a.k.a. Euclidean distance)](https://en.wikipedia.org/wiki/Euclidean_distance) | Smaller is more similar |
+| InnerProduct | [Inner product](https://en.wikipedia.org/wiki/Inner_product_space) | Larger is more similar |
 
 ## Miscellaneous
 

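Reviewer sketch for the Similarity Metrics table above: a minimal pure-Python illustration of the three metrics and of the "Similarity Order" column. The function names and sample vectors are illustrative assumptions, and this does not reflect how any storage backend computes them.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Inner product of the normalized vectors; larger means more similar."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
near = [2.0, 0.0]  # same direction as the query
far = [0.0, 3.0]   # orthogonal to the query
```

For these vectors all three metrics rank `near` ahead of `far`, but the direction of the comparison flips for `l2_distance`, which is why an index is only effective with the metric it was built for.
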
diff --git a/docs/docs/getting_started/quickstart.md b/docs/docs/getting_started/quickstart.md
@@ -112,11 +112,12 @@ Notes:
     *   `doc`, representing each row of `documents`.
     *   `chunk`, representing each row of `chunks`.
 
-3.  A *data source* extracts data from an external source. In this example, the `LocalFile` data source defines a table, each row has `"filename"` and `"content"` fields.
+3.  A *data source* extracts data from an external source.
+    In this example, the `LocalFile` data source imports local files as a KTable (a table with a key field; see [KTable](../core/data_types#ktable) for details). Each row has `"filename"` and `"content"` fields.
 
-4. After defining the table, we extended a new field `"chunks"` to each row by *transforming* the `"content"` field using `SplitRecursively`. The output of the `SplitRecursively` is also a table representing each chunk of the document, with `"location"` and `"text"` fields.
+4. After defining the KTable, we added a new field `"chunks"` to each row by *transforming* the `"content"` field using `SplitRecursively`. The output of `SplitRecursively` is also a KTable representing each chunk of the document, with `"location"` and `"text"` fields.
 
-5. After defining the table, we extended a new field `"embedding"` to each row by *transforming* the `"text"` field using `SentenceTransformerEmbed`.
+5. After defining the KTable, we added a new field `"embedding"` to each row by *transforming* the `"text"` field using `SentenceTransformerEmbed`.
 
 6. In CocoIndex, a *collector* collects multiple entries of data together. In this example, the `doc_embeddings` collector collects data from all `chunk`s across all `doc`s, and using the collected data to build a vector index `"doc_embeddings"`, using `Postgres`.
 

diff --git a/docs/docs/ops/functions.md b/docs/docs/ops/functions.md
@@ -32,7 +32,7 @@ Input data:
     To see all supported language names and extensions, see [the code](https://github.com/search?q=org%3Acocoindex-io+lang%3Arust++%22static+TREE_SITTER_LANGUAGE_BY_LANG%22&type=code).
     If it's unspecified or the specified language is not supported, it will be treated as plain text.
 
-Return type: `Table`, each row represents a chunk, with the following sub fields:
+Return type: [KTable](/docs/core/data_types#ktable), each row represents a chunk, with the following sub fields:
 
 *   `location` (type: `range`): The location of the chunk.
 *   `text` (type: `str`): The text of the chunk.

diff --git a/docs/docs/ops/sources.md b/docs/docs/ops/sources.md
@@ -28,7 +28,7 @@ The spec takes the following fields:
 
 ### Schema
 
-The output is a table with the following sub fields:
+The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
 *   `filename` (key, type: `str`): the filename of the file, including the path, relative to the root directory, e.g. `"dir1/file1.md"`
 *   `content` (type: `str` if `binary` is `False`, otherwise `bytes`): the content of the file
 
@@ -78,7 +78,7 @@ The spec takes the following fields:
 
 ### Schema
 
-The output is a table with the following sub fields:
+The output is a [KTable](/docs/core/data_types#ktable) with the following sub fields:
 
 *   `file_id` (key, type: `str`): the ID of the file in Google Drive.
 *   `filename` (type: `str`): the filename of the file, without the path, e.g. `"file1.md"`

diff --git a/examples/manuals_llm_extraction/main.py b/examples/manuals_llm_extraction/main.py
@@ -40,23 +40,23 @@ class ArgInfo:
 class MethodInfo:
     """Information about a method."""
     name: str
-    args: cocoindex.typing.List[ArgInfo]
+    args: list[ArgInfo]
     description: str
 
 @dataclasses.dataclass
 class ClassInfo:
     """Information about a class."""
     name: str
     description: str
-    methods: cocoindex.typing.List[MethodInfo]
+    methods: list[MethodInfo]
 
 @dataclasses.dataclass
 class ModuleInfo:
     """Information about a Python module."""
     title: str
     description: str
-    classes: cocoindex.typing.Table[ClassInfo]
-    methods: cocoindex.typing.Table[MethodInfo]
+    classes: list[ClassInfo]
+    methods: list[MethodInfo]
 
 @dataclasses.dataclass
 class ModuleSummary:

diff --git a/python/cocoindex/__init__.py b/python/cocoindex/__init__.py
@@ -9,4 +9,5 @@
 from .index import VectorSimilarityMetric, VectorIndexDef, IndexOptions
 from .auth_registry import AuthEntryReference, add_auth_entry, ref_auth_entry
 from .lib import *
-from ._engine import OpArgSchema
+from ._engine import OpArgSchema
+from .typing import Float32, Float64, LocalDateTime, OffsetDateTime, Range, Vector, Json