Fix dynamic commit size #3016

Merged: 14 commits into huggingface:main on Apr 29, 2025

Conversation

maximizemaxwell
What does this PR do?

  • Changed logic in _upload_large_folder.py:
  • Add commit size scale list
COMMIT_SIZE_SCALE: list[int] = [20, 50, 75, 100, 125, 150, 200, 250, 400, 600, 1000]
  • Add methods in LargeUploadStatus for dynamic scaling
class LargeUploadStatus:
    """Contains information, queues and tasks for a large upload process."""

    def __init__(self, items: List[JOB_ITEM_T]):
        self.items = items
        self.queue_sha256: "queue.Queue[JOB_ITEM_T]" = queue.Queue()
        self.queue_get_upload_mode: "queue.Queue[JOB_ITEM_T]" = queue.Queue()
        self.queue_preupload_lfs: "queue.Queue[JOB_ITEM_T]" = queue.Queue()
        self.queue_commit: "queue.Queue[JOB_ITEM_T]" = queue.Queue()
        self.lock = Lock()

        self.nb_workers_sha256: int = 0
        self.nb_workers_get_upload_mode: int = 0
        self.nb_workers_preupload_lfs: int = 0
        self.nb_workers_commit: int = 0
        self.nb_workers_waiting: int = 0
        self.last_commit_attempt: Optional[float] = None

        self._started_at = datetime.now()
        self._chunk_idx: int = 1
        self._chunk_lock: Lock = Lock()

        # Setup queues
        for item in self.items:
            paths, metadata = item
            if metadata.sha256 is None:
                self.queue_sha256.put(item)
            elif metadata.upload_mode is None:
                self.queue_get_upload_mode.put(item)
            elif metadata.upload_mode == "lfs" and not metadata.is_uploaded:
                self.queue_preupload_lfs.put(item)
            elif not metadata.is_committed:
                self.queue_commit.put(item)
            else:
                logger.debug(f"Skipping file {paths.path_in_repo} (already uploaded and committed)")

    def target_chunk(self) -> int:
        with self._chunk_lock:
            return COMMIT_SIZE_SCALE[self._chunk_idx]

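    def update_chunk(self, success: bool, duration: float) -> None:
        with self._chunk_lock:
            if not success:
                # Failed commit: step down one size, but never below index 0.
                self._chunk_idx = max(self._chunk_idx - 1, 0)
            elif duration < 40 and self._chunk_idx < len(COMMIT_SIZE_SCALE) - 1:
                # Fast successful commit (under 40 seconds): step up one size.
                self._chunk_idx += 1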
  • Replaced hard-coded values with status.target_chunk(), for example:
return (WorkerJob.GET_UPLOAD_MODE, _get_n(status.queue_get_upload_mode, status.target_chunk()))
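
The scaling behaviour described above can be tried out in isolation. The following is a minimal sketch, not the code from the PR: `ChunkScaler`, `take_up_to` and `simulate` are hypothetical names introduced for illustration, and the sketch assumes that a failed commit never increases the size and that failed items are simply re-queued.

```python
import queue
from threading import Lock
from typing import List, Tuple

COMMIT_SIZE_SCALE: List[int] = [20, 50, 75, 100, 125, 150, 200, 250, 400, 600, 1000]


class ChunkScaler:
    """Hypothetical stand-in for the chunk-scaling part of LargeUploadStatus."""

    def __init__(self) -> None:
        self._chunk_idx = 1  # start at COMMIT_SIZE_SCALE[1] == 50, as in the PR
        self._chunk_lock = Lock()

    def target_chunk(self) -> int:
        with self._chunk_lock:
            return COMMIT_SIZE_SCALE[self._chunk_idx]

    def update_chunk(self, success: bool, duration: float) -> None:
        with self._chunk_lock:
            if not success:
                # Assumption: a failure only ever shrinks (or keeps) the size.
                self._chunk_idx = max(self._chunk_idx - 1, 0)
            elif duration < 40 and self._chunk_idx < len(COMMIT_SIZE_SCALE) - 1:
                self._chunk_idx += 1


def take_up_to(q: "queue.Queue[int]", n: int) -> List[int]:
    """Hypothetical helper: pop at most n items from the queue without blocking."""
    items: List[int] = []
    while len(items) < n:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            break
    return items


def simulate(nb_files: int, outcomes: List[Tuple[bool, float]]) -> List[int]:
    """Replay (success, duration) outcomes and return the size of each batch."""
    q: "queue.Queue[int]" = queue.Queue()
    for i in range(nb_files):
        q.put(i)
    scaler = ChunkScaler()
    sizes: List[int] = []
    for success, duration in outcomes:
        batch = take_up_to(q, scaler.target_chunk())
        if not batch:
            break
        sizes.append(len(batch))
        if not success:
            for item in batch:
                q.put(item)  # re-queue failed items for a later attempt
        scaler.update_chunk(success, duration)
    return sizes


outcomes = [(True, 10.0), (True, 10.0), (False, 5.0), (True, 60.0), (True, 10.0)]
print(simulate(300, outcomes))  # → [50, 75, 100, 75, 75]
```

In this run the batch grows after two fast commits, shrinks after the failure, and stays put after the slow (60 second) commit.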

Related Issue

Dynamic commit sizes in upload-large-folder #3010

Review

@Wauplin could you review this code?

maximizemaxwell marked this pull request as draft April 21, 2025 16:53
maximizemaxwell marked this pull request as ready for review April 21, 2025 16:54
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Wauplin commented Apr 23, 2025

@bot /style

Style fixes have been applied. View the workflow run here.

Wauplin left a comment

Very clean PR @maximizemaxwell, thanks for opening it! Using a lock and the LargeUploadStatus object is an elegant solution :)

I've left a comment regarding the increase/decrease policy itself, but otherwise it looks good to me. Before getting this merged, could you create a new module tests/test_upload_large_folder.py and add a few tests for LargeUploadStatus.target_chunk/update_chunk? There is no need to upload or create any data; unit tests on these two methods are enough. You can check a few cases (never below 0 or above the scale, update only if the commit was full, etc.). No need to use anything from unittest in the test module; plain pytest is great. You can check tests/test_serialization.py as an example. Let me know if you have any questions!
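
For illustration, tests along these lines might look like the sketch below. It is not the test module added in this PR: `FakeStatus` is a hypothetical stand-in for `LargeUploadStatus`, and the `nb_items` argument is an assumed extra parameter used to express the "update only if commit was full" case, since the final `update_chunk` signature is not shown in this conversation.

```python
from threading import Lock
from typing import List

COMMIT_SIZE_SCALE: List[int] = [20, 50, 75, 100, 125, 150, 200, 250, 400, 600, 1000]


class FakeStatus:
    """Hypothetical stand-in exposing only target_chunk/update_chunk."""

    def __init__(self) -> None:
        self._chunk_idx = 1
        self._chunk_lock = Lock()

    def target_chunk(self) -> int:
        with self._chunk_lock:
            return COMMIT_SIZE_SCALE[self._chunk_idx]

    def update_chunk(self, success: bool, nb_items: int, duration: float) -> None:
        # Assumed policy: shrink on failure, grow only after a full and fast commit.
        with self._chunk_lock:
            if not success:
                self._chunk_idx -= 1
            elif nb_items >= COMMIT_SIZE_SCALE[self._chunk_idx] and duration < 40:
                self._chunk_idx += 1
            # Clamp the index to the valid range of the scale.
            self._chunk_idx = max(0, min(self._chunk_idx, len(COMMIT_SIZE_SCALE) - 1))


def test_never_below_smallest_size():
    status = FakeStatus()
    for _ in range(5):
        status.update_chunk(success=False, nb_items=status.target_chunk(), duration=1.0)
    assert status.target_chunk() == COMMIT_SIZE_SCALE[0]


def test_never_above_largest_size():
    status = FakeStatus()
    for _ in range(len(COMMIT_SIZE_SCALE) + 5):
        status.update_chunk(success=True, nb_items=status.target_chunk(), duration=1.0)
    assert status.target_chunk() == COMMIT_SIZE_SCALE[-1]


def test_partial_commit_does_not_grow():
    status = FakeStatus()
    before = status.target_chunk()
    status.update_chunk(success=True, nb_items=before - 1, duration=1.0)
    assert status.target_chunk() == before


def test_slow_commit_does_not_grow():
    status = FakeStatus()
    before = status.target_chunk()
    status.update_chunk(success=True, nb_items=before, duration=40.0)
    assert status.target_chunk() == before
```

These are plain functions with bare `assert` statements, so pytest can collect them without any `unittest` scaffolding.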

@maximizemaxwell
Added some tests, could you review those? @Wauplin

hanouticelina left a comment

Thanks @maximizemaxwell for the contribution! I left a small comment, but everything else looks great 🤗

hanouticelina left a comment

thank you @maximizemaxwell for the clean PR 🤗
let's wait for a final review from @Wauplin and then we will be able to merge

Wauplin left a comment

Thank you very much! Let's get this merged 🎉

Wauplin merged commit 0709088 into huggingface:main Apr 29, 2025
21 checks passed
Wauplin added a commit that referenced this pull request May 21, 2025
hanouticelina pushed a commit that referenced this pull request May 22, 2025
* Batch preupload calls in upload-large-folder

* Revert "Batch preupload calls in upload-large-folder"

This reverts commit 73754c4.

* upload mode was not fetched

* type hints

* make quality again

* let fix #3016 as well