
Dynamic commit sizes in upload-large-folder #3010


Closed
Wauplin opened this issue Apr 16, 2025 · 2 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments


Wauplin commented Apr 16, 2025

Originally on slack (private link) by @Wauplin :

they seem to be using upload-large-folder but still it's a lot of commits, not sure if they can optimize more?

Yes, we definitely can (and should!) optimize this. We currently make commits in chunks of 50 files. For a repo with this structure (60k+ small LFS files), that leads to hundreds of commits. Since LFS uploads are nearly instant (seems like ~20MB per file on average?), we hit the 128 commits/hour limit quite quickly. The 50 files/commit value is a purely empirical rule that I set to avoid timeouts during commits. Moon-landing can handle more than that, and for most repos 50/commit is not a problem. Empirically, the maximum number of files per commit seems to depend on the git history (many files + many commits => more timeouts).
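To make the scale of the problem concrete, here is a back-of-the-envelope calculation using the numbers quoted above (60k files, 50 files/commit, 128 commits/hour); the exact figures are assumptions from this issue, not measured values:

```python
# Rough arithmetic based on the numbers mentioned above (assumed, not measured):
# 60,000 small LFS files, 50 files per commit, 128 commits/hour rate limit.
n_files = 60_000
files_per_commit = 50
rate_limit_per_hour = 128

n_commits = -(-n_files // files_per_commit)  # ceiling division
hours_throttled = n_commits / rate_limit_per_hour

print(n_commits)        # 1200 commits
print(hours_throttled)  # 9.375 hours at minimum, just from the rate limit
```

So even before any timeout concerns, a fixed 50 files/commit turns a 60k-file upload into a multi-hour job purely due to the commit rate limit.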

(...) What we can do is:
Make the number of files per commit dynamic => if the queue is very long (like in this case), try 100 files. If that succeeds, try 150. If it fails, try 70. No need for super-complex rules; a scale like [20, 50, 75, 100, 125, 150, 200, 250, 400, 600, 1000] should be enough (start at 100 and move up/down depending on success/failure). We could also measure the commit time and, if it's above 45s, not increase the number of files, since timeouts happen at 60s IIUC.

So basically, the suggestion is to update the upload_large_folder logic here

return (WorkerJob.GET_UPLOAD_MODE, _get_n(status.queue_get_upload_mode, 50))

so that instead of a hardcoded 50, we use a dynamic value based on the scale [20, 50, 75, 100, 125, 150, 200, 250, 400, 600, 1000]. We should start at 50. If a commit fails with a timeout, decrease the number of files; if the call succeeds and takes less than 40s, increase it; otherwise, keep the same number. The rule is currently hardcoded (hardcoded also here).
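The rule above can be sketched as a small step function over the proposed scale. This is an illustrative sketch, not the actual huggingface_hub implementation; the names `COMMIT_SIZE_SCALE` and `adjust_commit_size` are hypothetical:

```python
# Hypothetical sketch of the dynamic commit-size rule described in this issue.
# Not the real huggingface_hub code; names are illustrative.
COMMIT_SIZE_SCALE = [20, 50, 75, 100, 125, 150, 200, 250, 400, 600, 1000]


def adjust_commit_size(index: int, success: bool, duration_s: float) -> int:
    """Return the new index into COMMIT_SIZE_SCALE after a commit attempt.

    - On failure (e.g. a timeout): step down the scale.
    - On success in under 40s: step up the scale.
    - Otherwise: keep the current size.
    """
    if not success:
        return max(index - 1, 0)
    if duration_s < 40:
        return min(index + 1, len(COMMIT_SIZE_SCALE) - 1)
    return index


# Start at 50 files per commit, as suggested above.
index = COMMIT_SIZE_SCALE.index(50)
index = adjust_commit_size(index, success=True, duration_s=12.0)   # next commit: 75 files
index = adjust_commit_size(index, success=False, duration_s=61.0)  # back to 50 files
```

Stepping along a fixed scale (rather than, say, doubling/halving) keeps the behavior predictable and bounded, which matches the "no need for super-complex rules" intent of the original suggestion.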

@Wauplin Wauplin added enhancement New feature or request good first issue Good for newcomers labels Apr 16, 2025
@maximizemaxwell
Contributor

I'd like to work on this.

@hanouticelina
Contributor

Closed as completed by #3016.
