Skip to content

Do not fetch /preupload if already done in upload-large-folder #3100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 22, 2025

Conversation

Wauplin
Copy link
Contributor

@Wauplin Wauplin commented May 21, 2025

Originally from @Kakulukian on slack (private):

From the end of last weekend until yesterday, we experienced increased latency on the hub. This appears to be connected to how the upload_large_folder method works. I've identified that one user uploaded approximately ~200,000 files using this method.
According to my understanding of the code, uploading via this method requires a call to the preupload endpoint for each file (right?). The issue is that as a repository grows more significant (commits or files), this method becomes progressively slower, particularly when the calls aren't batched
Would it be possible to also implement file batching for this endpoint with a minimum threshold?

@Wauplin Wauplin requested a review from hanouticelina May 21, 2025 14:35
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Contributor

@hanouticelina hanouticelina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly the changes I was going to commit 😄
thank you!

@Wauplin Wauplin marked this pull request as draft May 21, 2025 14:59
@Wauplin Wauplin requested a review from hanouticelina May 21, 2025 15:25
@Wauplin Wauplin marked this pull request as ready for review May 21, 2025 15:25
@Wauplin Wauplin changed the title Batch preupload calls in upload-large-folder Do not fetch /preupload if already done in upload-large-folder May 21, 2025
Copy link
Contributor

@hanouticelina hanouticelina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cood catch! I completely missed that we weren’t storing the upload modes too. This is a much more complete fix, thank you!
I tested it locally as well, it works as expected ✅

status.nb_workers_get_upload_mode += 1
logger.debug("Job: get upload mode (>10 files ready)")
return (WorkerJob.GET_UPLOAD_MODE, _get_n(status.queue_get_upload_mode, status.target_chunk()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need target_chunk() function anymore, right? I'm a bit confused, I don't recall exactly why we had it in the first place in #3016, the dynamic sizing was only meant for the commit job and it was not strictly necessary for the get upload mode job

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, we reviewed #3016 a bit too quickly 🙈 target_chunk must be used for commit chunks (but not done at the moment). Let's fix this in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in ce1e35d

@hanouticelina
Copy link
Contributor

I'm merging this, thanks again @Wauplin for the fix!

@hanouticelina hanouticelina merged commit 0f27bdf into main May 22, 2025
25 checks passed
@hanouticelina hanouticelina deleted the batch-preupload-calls-in-upload-large-folder branch May 22, 2025 08:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants