Awesome project! Really nice to see support for semi-structured data here. I'm evaluating this project for a platform to analyse security data, where logs often follow a loose, dynamic schema (e.g. AWS CloudTrail). Right now, I believe storing this in a JSON column for a schemaless approach will kill any performance benefit (the whole column must be loaded and parsed as text). ClickHouse recently added native support for an Object('JSON') type that abstracts away the details and allows ingesting arbitrary JSON values at full performance by automatically evolving the schema:
https://clickhouse.com/docs/en/guides/developer/working-with-json/json-semi-structured/#overview
The JSON Object type is advantageous when dealing with complex nested structures, which are subject to change. The type automatically infers the columns from the structure during insertion and merges these into the existing table schema. By storing JSON keys and their values as columns and dynamic subcolumns, ClickHouse can exploit the same optimizations used for structured data and thus provide comparable performance. The user is also provided with an intuitive path syntax for column selection. Furthermore, a table can contain a JSON object column with a flexible schema and more strict conventional columns with predefined types.
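For concreteness, here is a minimal sketch of the ClickHouse feature as described in the linked docs (the table and column names are hypothetical; the type was experimental at the time of writing):

```sql
-- Enable the experimental Object('JSON') type.
SET allow_experimental_object_type = 1;

CREATE TABLE cloudtrail (event JSON) ENGINE = MergeTree ORDER BY tuple();

-- Arbitrary JSON can be inserted; ClickHouse infers columns from the
-- structure and merges them into the table's dynamic subcolumns.
INSERT INTO cloudtrail FORMAT JSONAsObject {"eventName": "AssumeRole", "userIdentity": {"type": "IAMUser"}}

-- Path syntax selects the inferred subcolumns with columnar performance.
SELECT event.eventName, event.userIdentity.type FROM cloudtrail;
```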
Would be a killer feature to have this incorporated into Databend.
Note: ClickHouse's approach runs into the issue of incompatible schemas (int -> String is supported, but int -> Array is not, etc.). I think a better approach would be the "dynamic typing" used by Redshift SUPER and Rockset.
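To illustrate the dynamic-typing alternative, a minimal Redshift SUPER sketch (table and key names are hypothetical): each value carries its own type at runtime, so mixed types under the same path can coexist instead of forcing a column-level cast:

```sql
CREATE TABLE events (payload SUPER);

-- The same path can hold an int in one row and an array in another,
-- because SUPER types each value dynamically.
INSERT INTO events VALUES (JSON_PARSE('{"requestParameters": 42}'));
INSERT INTO events VALUES (JSON_PARSE('{"requestParameters": ["a", "b"]}'));

-- PartiQL navigation; type-incompatible operations yield NULL per row
-- rather than failing the whole column.
SELECT payload.requestParameters FROM events;
```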
What do you think?
I believe storing this in a JSON column for a schemaless approach will kill any performance benefits (need to load whole column, parse text).
That's definitely right, thanks for the advice.
The current JSON type is a first, simple implementation of this feature; we knew this approach would perform poorly. Besides Redshift SUPER and Rockset, we also came across sneller, which is based on ion.
The main problem in Databend is that the table schema is fixed (due to Parquet), so we can't store different schemas in different data parts if we just infer the schema from the input data.
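A hypothetical example of the conflict: two inserts into the same table would infer incompatible schemas for the same path, and with a single fixed Parquet schema per table, the second data part has nowhere to go:

```sql
-- Hypothetical table with a schemaless JSON column.
CREATE TABLE logs (raw JSON);

INSERT INTO logs VALUES ('{"code": 404}');         -- would infer code: Int64
INSERT INTO logs VALUES ('{"code": ["a", "b"]}');  -- would infer code: Array(String)
-- With one fixed Parquet schema per table, the second part cannot be
-- written without rewriting or widening the first.
```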
We will keep investigating these approaches to develop a new version of the JSON format and optimize its performance. New ideas are welcome.
Understood. The Iceberg table format brings nested schema evolution (structs) to Parquet data lakes by tracking column identifiers in metadata files. I guess we first need to add similar support to Databend for Nested and other datatypes, and then we can explore automatic schema evolution for JSON objects? Trying to understand if this is something I could take a stab at myself. Thanks
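For reference, a sketch of what Iceberg's ID-based column tracking enables (Spark SQL syntax from the Iceberg DDL docs; the table and field names are hypothetical):

```sql
-- Because every field, including nested struct fields, has a stable ID in
-- the table metadata, nested columns can be added or renamed safely:
ALTER TABLE cloudtrail ADD COLUMN userIdentity.sessionContext string;
ALTER TABLE cloudtrail RENAME COLUMN userIdentity.type TO identityType;
-- Old Parquet files stay readable: readers resolve columns by ID,
-- not by name or position.
```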
Pretty nice idea! We have discussed several implementations for further optimization of Nested datatypes in https://github.com/datafuselabs/databend/pull/4320/files. I think you could first take a look through that research and then propose some new designs and solutions.