Summary
While collecting the min/max values of columns, we keep the exact values. For string-like column types, the min/max values may be large (say, a column of type CHAR(4096)), which makes the meta files that contain the statistics large.
It would be better to trim the strings to some moderate length, say 8 chars, in a way that preserves the property of min/max statistics: the trimmed max should be no smaller than the untrimmed one, and the trimmed min should be no larger than the untrimmed one.
Thus, with some loss of accuracy (range pruning becomes slightly more likely to produce false positives, which IMO we can afford), the size of fuse table meta files could be reduced.
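To make the idea concrete, here is a rough sketch of one way the trimming could work. It is not the existing Databend code: `trim_min`, `trim_max`, and `next_char` are hypothetical names, and it assumes the statistics are compared as UTF-8 strings (Rust compares `str` byte-wise, which matches code point order). The min is simply truncated, since a prefix never compares greater than the full string. The max is truncated and its last character bumped to the next code point, so the result still compares greater than the original.

```rust
/// Hypothetical helper: keep at most `max_chars` characters of `s`.
/// A prefix always compares <= the full string, so it is a valid lower bound.
fn trim_min(s: &str, max_chars: usize) -> String {
    s.chars().take(max_chars).collect()
}

/// Hypothetical helper: return a string of at most `max_chars` characters
/// that compares >= `s`, or `None` if no such shorter bound exists
/// (the caller would then keep the original value or skip the stat).
fn trim_max(s: &str, max_chars: usize) -> Option<String> {
    if s.chars().count() <= max_chars {
        // Nothing to trim; the exact value is already short enough.
        return Some(s.to_string());
    }
    let mut prefix: Vec<char> = s.chars().take(max_chars).collect();
    // Bump the last character that can still be incremented, dropping any
    // trailing characters that are already the largest code point.
    while let Some(last) = prefix.pop() {
        if let Some(next) = next_char(last) {
            prefix.push(next);
            return Some(prefix.into_iter().collect());
        }
    }
    None
}

/// Next valid Unicode scalar value after `c`, skipping the surrogate gap.
fn next_char(c: char) -> Option<char> {
    ((c as u32 + 1)..=(char::MAX as u32)).find_map(char::from_u32)
}
```

With a limit of 8 chars, a max of `abcdefghijkl` would be stored as `abcdefgi` and a min of `abcdefghijkl` as `abcdefgh`. The `None` case only happens when every kept character is already the largest code point (or the limit is 0), which should be rare in practice.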
Where the min/max values are gathered:
https://github.com/datafuselabs/databend/blob/dae90d856e380ea29716e87148cc69d07ccff8ff/src/query/storages/fuse/src/statistics/column_statistic.rs#L45-L53
Update: 2022-09-29
Also, for column types like variants, ~~we should not keep the min-max stats for them~~ we have not generated min-max stats for them...