Summary

Now, Databend caches these items in memory: the table snapshot files and the segment files. All of these items are small and fit in memory well.
To understand how the Databend cache works, consider an example.
Q1:
select id, name, age, city from t1 where age > 20 and age < 30;
The query goes through these steps:
1. Read the latest snapshot file -- cached in memory
2. Prune the segment files and read them from S3 -- cached in memory
3. Read the id, name, age, and city columns (in Databend these are called blocks and are stored in Parquet format) from S3 -- not cached
So if we run another SQL, Q2:
select id, name, age, city from t1 where age > 25 and age < 30;
This time the steps are:
1. Read the latest snapshot file from memory
2. Prune the segment files and read them from memory
3. Read the id, name, age, and city columns (blocks in Parquet format) from S3 -- not cached
If we cache the column Parquet files from step 3, Q2 avoids the file reads from S3 and runs faster.
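As a rough sketch of that idea (not Databend's actual code: the names and the plain HashMap without eviction are simplifications for illustration), caching blocks in memory in front of the S3 reads could look like this:

```rust
use std::collections::HashMap;

/// A block here is one column Parquet file fetched from object storage.
type Block = Vec<u8>;

/// Hypothetical block reader with an in-memory cache in front of S3.
/// A real cache would bound its memory usage with LRU eviction.
struct BlockReader {
    memory_cache: HashMap<String, Block>,
}

impl BlockReader {
    fn new() -> Self {
        Self { memory_cache: HashMap::new() }
    }

    fn read_block(&mut self, path: &str) -> Block {
        // Q2 hits this branch for the id/name/age/city blocks Q1 already read.
        if let Some(block) = self.memory_cache.get(path) {
            return block.clone();
        }
        // Cache miss: fetch the block from S3 and remember it for later queries.
        let block = fetch_from_s3(path);
        self.memory_cache.insert(path.to_string(), block.clone());
        block
    }
}

// Placeholder for the real object-storage read.
fn fetch_from_s3(_path: &str) -> Block {
    Vec::new()
}

fn main() {
    let mut reader = BlockReader::new();
    let _q1 = reader.read_block("t1/block_0001_id.parquet"); // miss: read from S3
    let _q2 = reader.read_block("t1/block_0001_id.parquet"); // hit: served from memory
}
```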
Update:
Each column file (also known as a block file) in Databend is a Parquet file with a single RowGroup, and the range read covers the RowGroup data, which is the entire Parquet file except the footer.
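As a small sketch (assuming only the standard Parquet layout: a 4-byte "PAR1" magic at both ends, with the footer metadata and a 4-byte footer length just before the trailing magic), the byte range for such a read could be computed like this:

```rust
use std::ops::Range;

/// Byte range of the RowGroup data in a single-RowGroup Parquet file,
/// i.e. everything except the trailing footer.
fn row_group_range(file_len: u64, footer_meta_len: u64) -> Range<u64> {
    // The trailer is the footer metadata plus the 4-byte footer length
    // field and the 4-byte "PAR1" magic.
    let trailer_len = footer_meta_len + 8;
    // Skip the 4-byte leading magic and stop before the trailer.
    4..file_len - trailer_len
}

fn main() {
    // Example: a 1 MiB block file with a 1 KiB footer.
    let range = row_group_range(1 << 20, 1024);
    println!("range read: {}..{}", range.start, range.end);
}
```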
Question
To avoid writing blocks through to disk synchronously (which would slow down reads), we should use a memory + disk LRU cache, for example 1GB of memory and 10GB of disk, and make the block write-back to disk asynchronous.
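Here is a minimal sketch of such a two-tier cache (hypothetical, not Databend's implementation): capacities and LRU eviction are left out, and the asynchronous write-back is modeled with a background thread so the query path is never blocked on local disk I/O.

```rust
use std::collections::HashMap;
use std::fs;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use std::thread;

type Block = Vec<u8>;

/// Two-tier cache: a small in-memory map backed by a larger on-disk
/// directory. A real version would bound both tiers (e.g. 1GB / 10GB)
/// with LRU eviction.
struct TieredCache {
    memory: Arc<Mutex<HashMap<String, Block>>>,
    disk_dir: PathBuf,
}

impl TieredCache {
    fn new(disk_dir: impl Into<PathBuf>) -> Self {
        let disk_dir = disk_dir.into();
        fs::create_dir_all(&disk_dir).expect("create cache dir");
        Self {
            memory: Arc::new(Mutex::new(HashMap::new())),
            disk_dir,
        }
    }

    fn get(&self, key: &str) -> Option<Block> {
        // 1. Memory tier.
        if let Some(block) = self.memory.lock().unwrap().get(key) {
            return Some(block.clone());
        }
        // 2. Disk tier; promote the block back into memory on a hit.
        if let Ok(block) = fs::read(self.disk_dir.join(key)) {
            self.memory
                .lock()
                .unwrap()
                .insert(key.to_string(), block.clone());
            return Some(block);
        }
        None
    }

    fn put(&self, key: &str, block: Block) {
        // Populate the memory tier synchronously so the query sees it at once.
        self.memory
            .lock()
            .unwrap()
            .insert(key.to_string(), block.clone());

        // Write back to the disk tier asynchronously so reads are not blocked
        // on local disk I/O (the "make the block write-back async" point above).
        let path = self.disk_dir.join(key);
        thread::spawn(move || {
            let _ = fs::write(path, block);
        });
    }
}

fn main() {
    let cache = TieredCache::new("/tmp/block_cache_sketch");
    cache.put("block_0001.parquet", vec![1, 2, 3]);
    assert!(cache.get("block_0001.parquet").is_some());
}
```

On a disk hit the block is promoted back into memory; with real LRU eviction in place, a block dropped from the memory tier would still be served from local disk instead of S3.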
Finally, I'd like to show the performance gains Snowflake gets from caching on the hits dataset test:
Run Q1: SELECT COUNT(*) FROM hits.public.hits2;
Then Q2: SELECT COUNT(*) FROM hits.public.hits2 WHERE AdvEngineID <> 0;