Fully exhaust datapipes that are needed to construct a dataset #6076


Merged: 15 commits merged into pytorch:main from the datasets-pre-iter branch on Sep 13, 2022

Conversation

@pmeier (Collaborator) commented May 24, 2022

Failures that will be fixed by this PR are only visible on Python >= 3.8. See #6065 for details.


Some datasets store a file inside an archive that needs to be read completely in order to construct the final datapipe. While it is tempting to do this with a single Demultiplexer, this has the downside that, depending on how the archive is structured, the Demultiplexer may already have items in its buffer before we actually start iterating over the final datapipe.

To avoid that, we should always exhaust a datapipe completely if we need it while constructing another. This means two changes need to be made:

  1. Remove the file we need to read from the classify_fn of the Demultiplexer and use a Filter to extract it separately.
  2. Replace next(iter(dp)) with list(dp) to make sure dp is exhausted.
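A minimal sketch of both changes, assuming torchdata's datapipe API; the file names and the is_meta_file helper are made up for illustration:

from torchdata.datapipes.iter import IterableWrapper

def is_meta_file(path):
    # Hypothetical check standing in for the dataset-specific logic that
    # previously lived in the classify_fn of the Demultiplexer.
    return path.endswith(".csv")

archive_dp = IterableWrapper(["meta.csv", "img0.jpg", "img1.jpg"])

# Change 1: extract the eagerly needed file with a Filter instead of demux,
# so no Demultiplexer buffer is involved.
meta_dp = archive_dp.filter(is_meta_file)
image_dp = archive_dp.filter(lambda path: not is_meta_file(path))

# Change 2: exhaust the datapipe with list(...) rather than next(iter(...)),
# so nothing is left mid-iteration when the final datapipe is constructed.
meta = list(meta_dp)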

@pmeier (Collaborator, Author) commented May 24, 2022

After some offline discussion with @NicolasHug, it became clear that the situation is not as straightforward as I thought. For example, the child datapipes of the Demultiplexer seem to be aware of whether an iteration is ongoing.

@ejguan @NivekT @VitalyFedyunin what is the correct way here?

@ejguan (Contributor) commented May 24, 2022

IMHO, it depends on your Dataset:

  1. If you want to pre-load labels or all metadata before iteration, I think it might be better to use filter for label_dp and image_dp and deplete label_dp before iteration starts:
label_dp = resource_dp.filter(is_meta_file)
labels = list(label_dp)

image_dp = resource_dp.filter(is_image_file)
  2. If the metadata or labels can be loaded lazily, IterToMap might be a better solution here: the metadata is loaded lazily, and image_dp is then zipped with label_dp.
label_dp, image_dp = resource_dp.demux(2, classify_fn)

label_dict = label_dp.to_map_datapipe()
image_dp.zip_with_map(label_dict)
...
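For reference, a runnable toy version of option 2 (the records, the classify_fn, and the key_fn are made up for illustration):

from torchdata.datapipes.iter import IterableWrapper

# Hypothetical stream mixing label records and image records.
resource_dp = IterableWrapper(
    [("label", "a", 0), ("label", "b", 1), ("image", "a"), ("image", "b")]
)

label_dp, image_dp = resource_dp.demux(2, lambda item: 0 if item[0] == "label" else 1)

# to_map_datapipe expects (key, value) pairs, so strip the record tag first.
label_map = label_dp.map(lambda item: (item[1], item[2])).to_map_datapipe()

# zip_with_map looks up each image's label lazily, at iteration time.
sample_dp = image_dp.zip_with_map(label_map, key_fn=lambda item: item[1])
print(list(sample_dp))  # [(('image', 'a'), 0), (('image', 'b'), 1)]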

@NivekT (Contributor) commented May 24, 2022

> After some offline discussion with @NicolasHug, it became clear that the situation is not as straightforward as I thought. For example, the child datapipes of the Demultiplexer seem to be aware of whether an iteration is ongoing.

That only matters if you have read one of your child DataPipes eagerly during construction (it seems that is no longer the case after your code change in this PR, since it no longer calls demux in cases where labels are eagerly needed).

@NivekT (Contributor) left a comment

I do agree with @ejguan's points. Is it possible to use MapDataPipe in place of the dict?

@pmeier (Collaborator, Author) commented May 26, 2022

So for my own understanding: the result of IterToMapConverter is something similar to what I proposed in #5219, right? Basically we get a dictionary that will only be filled at runtime, correct?

@ejguan's proposal was to use the map together with a MapKeyZipper. While playing with that, I found it quite inconvenient, since we cannot use a lambda or a local function for the merge_fn. I preferred to use the map directly:

dp = ...
map = IterToMapConverter(dp)

other_dp = ...
other_dp = Mapper(other_dp, map.__getitem__, input_col=...)

Do you see any downside with that?
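For reference, a runnable toy version of that pattern (the data and keys are made up; to_map_datapipe is the functional form of IterToMapConverter):

from torchdata.datapipes.iter import IterableWrapper

# Hypothetical (key, label) pairs; the map is only filled on first lookup.
label_map = IterableWrapper([("a", 0), ("b", 1)]).to_map_datapipe()

image_dp = IterableWrapper([("a", "img_a.jpg"), ("b", "img_b.jpg")])
# Replace the key in column 0 by the looked-up label, mimicking zip_with_map
# without needing a merge_fn (which cannot be a lambda or local function).
sample_dp = image_dp.map(label_map.__getitem__, input_col=0)
print(list(sample_dp))  # [(0, 'img_a.jpg'), (1, 'img_b.jpg')]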

Regardless of the approach chosen, I found that using IterToMapConverter introduces a performance degradation. Take the following snippet:

from time import perf_counter

import torch
from torchvision.prototype import datasets

for name, config in [
    ("imagenet", dict(split="val")),
    ("cub200", dict(year="2011")),
]:
    warmup_times = []
    for _ in range(5):
        tic = perf_counter()
        for sample in datasets.load(name, **config):
            break
        tac = perf_counter()

        warmup_times.append(tac - tic)

    print(f"Warmup for {name} took on average {float(torch.tensor(warmup_times).mean()):.2f} seconds")

Running this on main prints

Warmup for imagenet took on average 2.53 seconds
Warmup for cub200 took on average 2.97 seconds

Running it on this PR in the current state prints

Warmup for imagenet took on average 2.77 seconds
Warmup for cub200 took on average 3.64 seconds

Any idea what could cause this? I'm aware that pytorch/data#454 proposes a speed-up, but the implementation on main currently also loads the full dictionary at once. Thus, this is not the cause here.

@pmeier requested review from @NivekT and @ejguan on May 26, 2022

bounding_boxes_dp = CSVParser(bounding_boxes_dp, dialect="cub200")
- bounding_boxes_dp = Mapper(bounding_boxes_dp, image_files_map.get, input_col=0)
+ bounding_boxes_dp = Mapper(bounding_boxes_dp, image_files_map.__getitem__, input_col=0)
@ejguan (Contributor) commented May 26, 2022

This is a good trick to get behavior similar to zip_with_map.
cc: @NivekT

@NicolasHug (Member) commented

@pmeier you might be interested in #6128, which IIRC attempts to fix this issue as well.

@pmeier (Collaborator, Author) left a comment

This PR indeed solves the issue reported in #6515. The CI run https://github.com/pytorch/vision/runs/8089441510 happened before the fresh nightly landed. There are still errors, but they are unrelated and reported in pytorch/pytorch#80267 (comment).

Maybe we can have another go at this PR given that it has a much narrower scope than #6128? cc @NivekT @ejguan

@NivekT (Contributor) left a comment

I agree with your PR description:

  1. I agree that we don't want the buffer of demux to have anything in it prior to the start of the iteration of the final DataPipe.
  2. Fully exhausting the DataPipe (so that it resets the next time it starts) or avoiding demux should prevent the issue stated in 1.

If that is the goal, then LGTM. We can check that the buffer is empty before the next iteration starts if demux is used in both iterations.
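A toy illustration of the buffering behavior in question, using torchdata's demux with a made-up even/odd split:

from torchdata.datapipes.iter import IterableWrapper

source_dp = IterableWrapper(range(4))
even_dp, odd_dp = source_dp.demux(2, lambda x: x % 2)

# Reading just one item from even_dp would leave odd values sitting in the
# shared demux buffer. Fully exhausting both children drains the buffer,
# so the demux can start cleanly on the next iteration.
evens = list(even_dp)  # [0, 2]
odds = list(odd_dp)    # [1, 3]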

I do have a question:
Is the ResourceWarning: unclosed issue (which only happens on Python 3.9 on Windows) caused by what you described above, or is that separate? I think #6128 is trying to fix that.

@pmeier (Collaborator, Author) commented Aug 30, 2022

> Is the ResourceWarning: unclosed issue (which only happens on Python 3.9 on Windows) caused by what you described above, or is that separate? I think #6128 is trying to fix that.

Yes, this is related. Although #6128 patches more things, I think this PR is sufficient to get rid of the warnings, given that they were always related to items left in a Demultiplexer buffer.

@ejguan (Contributor) left a comment

Overall LGTM!

@pmeier pmeier merged commit b4686f2 into pytorch:main Sep 13, 2022
@pmeier pmeier deleted the datasets-pre-iter branch September 13, 2022 14:27
facebook-github-bot pushed a commit that referenced this pull request Sep 15, 2022
Fully exhaust datapipes that are needed to construct a dataset (#6076)

Reviewed By: jdsgomes

Differential Revision: D39543282

fbshipit-source-id: c43b9bc0acde33e9b2aa56402dae69a47ccd22d2