Skip to content

feat: save copy into table stage file meta, avoid duplicate copy file #7531

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

feat: save copy into table stage file meta, avoid duplicate copy file #7531

wants to merge 3 commits into from

Conversation

lichuang
Copy link
Contributor

@lichuang lichuang commented Sep 8, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

  1. introduce TableStageFileInfo in meta to save copy into table stage file info, it will be expired after 64 days.
  2. avoid duplicate copy file operation with compare TableStageFileInfo.md5.
  3. add proc compatible test

Fixes #6338

@vercel
Copy link

vercel bot commented Sep 8, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
databend ✅ Ready (Inspect) Visit Preview Sep 8, 2022 at 10:39AM (UTC)

@lichuang lichuang requested review from Xuanwo and BohuTANG September 8, 2022 10:38
@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label Sep 8, 2022
Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is a bit complex that mixed with meta, catalog and query.

Can you give an RFC first so we can understand how and why we address the issue in this way?


if let Some(file_info) = resp.file_info.get(file) {
// No need to copy the file again only if md5 match.
if file_info.md5 == stage_file.md5 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to check the following things:

  • name
  • length
  • last modified
  • etag (use etag instead of md5)

Copy link
Member

@BohuTANG BohuTANG Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the location is https(not s3), the etag is?

Copy link
Member

@Xuanwo Xuanwo Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the location is https(not s3), the etag is?

ETag in OpenDAL is the same ETag defined in HTTP Standard: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag

They share the same semantics, which is reliable in detecting content changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type StageFile, which save stage file meta info, has no etag field:

#[derive(Default, Clone)]
pub struct StageFile {
    pub path: String,
    pub size: u64,
    pub md5: Option<String>,
    pub last_modified: DateTime<Utc>,
    pub creator: Option<UserIdentity>,
}

Copy link
Member

@drmingdrmer drmingdrmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not finish reading all of the code.
The first of my advices is to keep UpsertKVReq::udpate() as it is.
And use UpsertKVReq::udpate().with(KVMeta {expire_at: Some(now + 2)} to update the expire time if needed.

This way it does not have to modify all the usages of update().

Ref: https://github.com/datafuselabs/databend/blob/18f90a3ac665ff0e8b9f960731fbcfe6ed11c47f/src/meta/api/src/kv_api_test_suite.rs#L278-L285

Comment on lines +3182 to +3199
let cur_db = mt.get_database(Self::req_get_db(tenant, db_name)).await?;
assert!(old_db.ident.seq < cur_db.ident.seq);
assert!(res.table_id >= 1, "table id >= 1");
let tb_id = res.table_id;

let got = mt.get_table((tenant, db_name, tbl_name).into()).await?;
let seq = got.ident.seq;

let ident = TableIdent::new(tb_id, seq);

let want = TableInfo {
ident: ident.clone(),
desc: format!("'{}'.'{}'.'{}'", tenant, db_name, tbl_name),
name: tbl_name.into(),
meta: table_meta(created_on),
};
assert_meta_eq_without_updated!(want, got.as_ref().clone(), "get created table");
ident
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not have to test create-table. Creating a table is already guaranteed to work as expected in other cases.

@lichuang
Copy link
Contributor Author

lichuang commented Sep 9, 2022

@drmingdrmer @BohuTANG @Xuanwo

I split the original issue into sub-issues, cause I think this pr is too large, including meta and query changes, and need some help of @Xuanwo to add the etag field of StageFile, so I close this issue, but your review advice will bring into the sub-issue pr, thank you.

@lichuang lichuang closed this Sep 9, 2022
@lichuang lichuang deleted the table_stage_file_duplicate branch September 20, 2022 02:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature this PR introduces a new feature to the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stream COPY from a directory
4 participants