Fully exhaust datapipes that are needed to construct a dataset #6076


Merged: 15 commits merged into pytorch:main from the datasets-pre-iter branch on Sep 13, 2022

Conversation

@pmeier (Collaborator) commented May 24, 2022

Failures that will be fixed by this PR are only visible on Python >= 3.8. See #6065 for details.


Some datasets store a file inside an archive that needs to be read completely in order to construct the final datapipe. While it is tempting to do this with a single Demultiplexer, this has the downside that, depending on how the archive is structured, the Demultiplexer may already have items in its buffer before we actually start iterating over the final datapipe.

To avoid that, we should always exhaust a datapipe completely if we need it while constructing another. This means two changes need to be made:

  1. Remove the file we need to read from the classify_fn of the Demultiplexer and use a Filter to extract it separately.
  2. Replace next(iter(dp)) with list(dp) to make sure dp is exhausted.
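A minimal sketch of both changes, assuming torchdata's datapipe API; the file names and the is_meta_file helper are made up for illustration:

from torchdata.datapipes.iter import IterableWrapper

def is_meta_file(path):
    # Hypothetical check standing in for the dataset-specific logic that
    # previously lived in the classify_fn of the Demultiplexer.
    return path.endswith(".csv")

archive_dp = IterableWrapper(["meta.csv", "img0.jpg", "img1.jpg"])

# Change 1: extract the eagerly needed file with a Filter instead of demux,
# so no Demultiplexer buffer is involved.
meta_dp = archive_dp.filter(is_meta_file)
image_dp = archive_dp.filter(lambda path: not is_meta_file(path))

# Change 2: exhaust the datapipe with list(...) rather than next(iter(...)),
# so nothing is left mid-iteration when the final datapipe is constructed.
meta = list(meta_dp)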

@pmeier (Collaborator, Author) commented May 24, 2022

After some offline discussion with @NicolasHug, it became clear that the situation is not as straightforward as I thought. For example, the child datapipes of the Demultiplexer seem to be aware of whether an iteration is ongoing.

@ejguan @NivekT @VitalyFedyunin what is the correct way here?

@ejguan (Contributor) commented May 24, 2022

IMHO, it depends on your Dataset:

  1. If you want to pre-load labels or all metadata before iteration, I think it might be better to use filter for label_dp and image_dp and deplete label_dp before iteration starts:
label_dp = resource_dp.filter(is_meta_file)
labels = list(label_dp)

image_dp = resource_dp.filter(is_image_file)
  2. If the metadata or labels can be loaded lazily, IterToMap might be a better solution here: the metadata is loaded lazily, and image_dp is then zipped with label_dp.
label_dp, image_dp = resource_dp.demux(2, classify_fn)

label_dict = label_dp.to_map_datapipe()
image_dp.zip_with_map(label_dict)
...
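For reference, a runnable toy version of option 2 (the records, the classify_fn, and the key_fn are made up for illustration):

from torchdata.datapipes.iter import IterableWrapper

# Hypothetical stream mixing label records and image records.
resource_dp = IterableWrapper(
    [("label", "a", 0), ("label", "b", 1), ("image", "a"), ("image", "b")]
)

label_dp, image_dp = resource_dp.demux(2, lambda item: 0 if item[0] == "label" else 1)

# to_map_datapipe expects (key, value) pairs, so strip the record tag first.
label_map = label_dp.map(lambda item: (item[1], item[2])).to_map_datapipe()

# zip_with_map looks up each image's label lazily, at iteration time.
sample_dp = image_dp.zip_with_map(label_map, key_fn=lambda item: item[1])
print(list(sample_dp))  # [(('image', 'a'), 0), (('image', 'b'), 1)]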

@NivekT (Contributor) commented May 24, 2022

> After some offline discussion with @NicolasHug, it became clear that the situation is not as straightforward as I thought. For example, the child datapipes of the Demultiplexer seem to be aware of whether an iteration is ongoing.

That only matters if you have read one of your child DataPipes eagerly during construction (it seems that is no longer the case after your code change in this PR, since it no longer calls demux in cases where labels are eagerly needed).

@NivekT (Contributor) left a comment

I do agree with @ejguan's points. Is it possible to use MapDataPipe in place of the dict?

@pmeier (Collaborator, Author) commented May 26, 2022

So for my own understanding: the result of IterToMapConverter is something similar to what I proposed in #5219, right? Basically we get a dictionary that will only be filled at runtime, correct?

@ejguan's proposal was to use the map together with a MapKeyZipper. While playing with that, I found it quite inconvenient, since we cannot use a lambda or a local function for the merge_fn. I preferred to use the map directly:

dp = ...
map = IterToMapConverter(dp)

other_dp = ...
other_dp = Mapper(other_dp, map.__getitem__, input_col=...)

Do you see any downside with that?
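For reference, a runnable toy version of that pattern (the data and keys are made up; to_map_datapipe is the functional form of IterToMapConverter):

from torchdata.datapipes.iter import IterableWrapper

# Hypothetical (key, label) pairs; the map is only filled on first lookup.
label_map = IterableWrapper([("a", 0), ("b", 1)]).to_map_datapipe()

image_dp = IterableWrapper([("a", "img_a.jpg"), ("b", "img_b.jpg")])
# Replace the key in column 0 by the looked-up label, mimicking zip_with_map
# without needing a merge_fn (which cannot be a lambda or local function).
sample_dp = image_dp.map(label_map.__getitem__, input_col=0)
print(list(sample_dp))  # [(0, 'img_a.jpg'), (1, 'img_b.jpg')]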

Regardless of the approach chosen, I found that using IterToMapConverter introduces a performance degradation. Take the following snippet:

from time import perf_counter

import torch
from torchvision.prototype import datasets

for name, config in [
    ("imagenet", dict(split="val")),
    ("cub200", dict(year="2011")),
]:
    warmup_times = []
    for _ in range(5):
        tic = perf_counter()
        for sample in datasets.load(name, **config):
            break
        tac = perf_counter()

        warmup_times.append(tac - tic)

    print(f"Warmup for {name} took on average {float(torch.tensor(warmup_times).mean()):.2f} seconds")

Running this on main prints

Warmup for imagenet took on average 2.53 seconds
Warmup for cub200 took on average 2.97 seconds

Running it on this PR in the current state prints

Warmup for imagenet took on average 2.77 seconds
Warmup for cub200 took on average 3.64 seconds

Any idea what could cause this? I'm aware that pytorch/data#454 proposes a speed-up, but the implementation on main currently also loads the full dictionary at once. Thus, this is not the cause here.

@pmeier requested review from @NivekT and @ejguan on May 26, 2022

bounding_boxes_dp = CSVParser(bounding_boxes_dp, dialect="cub200")
- bounding_boxes_dp = Mapper(bounding_boxes_dp, image_files_map.get, input_col=0)
+ bounding_boxes_dp = Mapper(bounding_boxes_dp, image_files_map.__getitem__, input_col=0)
@ejguan (Contributor) commented May 26, 2022

This is a good trick to get behavior similar to zip_with_map.
cc: @NivekT

@NicolasHug (Member) commented

@pmeier you might be interested in #6128, which IIRC attempts to fix this issue as well.

@pmeier (Collaborator, Author) left a comment

This PR indeed solves the issue reported in #6515. The CI run https://github.com/pytorch/vision/runs/8089441510 happened before the fresh nightly landed. There are still errors, but they are unrelated and reported in pytorch/pytorch#80267 (comment).

Maybe we can have another go at this PR given that it has a much narrower scope than #6128? cc @NivekT @ejguan

@NivekT (Contributor) left a comment

I agree with your PR description:

  1. I agree that we don't want the buffer of demux to have anything in it prior to the start of the iteration of the final DataPipe.
  2. Fully exhausting the DataPipe (so that it resets the next time it starts) or avoiding demux should prevent the issue stated in 1.

If that is the goal, then LGTM. We can check that the buffer is empty before the next iteration starts if demux is used in both iterations.
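A toy illustration of the buffering behavior in question, using torchdata's demux with a made-up even/odd split:

from torchdata.datapipes.iter import IterableWrapper

source_dp = IterableWrapper(range(4))
even_dp, odd_dp = source_dp.demux(2, lambda x: x % 2)

# Reading just one item from even_dp would leave odd values sitting in the
# shared demux buffer. Fully exhausting both children drains the buffer,
# so the demux can start cleanly on the next iteration.
evens = list(even_dp)  # [0, 2]
odds = list(odd_dp)    # [1, 3]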

I do have a question:
Is the ResourceWarning: unclosed issue (which only happens on Python 3.9 on Windows) caused by what you described above, or is that separate? I think #6128 is trying to fix that.

@pmeier (Collaborator, Author) commented Aug 30, 2022

> Is the ResourceWarning: unclosed issue (which only happens on Python 3.9 on Windows) caused by what you described above, or is that separate? I think #6128 is trying to fix that.

Yes, this is related. Although #6128 patches more things, I think this PR is sufficient to get rid of the warnings, given that they were always related to items left in a Demultiplexer buffer.

@ejguan (Contributor) left a comment

Overall LGTM!

@pmeier pmeier merged commit b4686f2 into pytorch:main Sep 13, 2022
@pmeier pmeier deleted the datasets-pre-iter branch September 13, 2022 14:27
facebook-github-bot pushed a commit that referenced this pull request Sep 15, 2022
Fully exhaust datapipes that are needed to construct a dataset (#6076)

Reviewed By: jdsgomes

Differential Revision: D39543282

fbshipit-source-id: c43b9bc0acde33e9b2aa56402dae69a47ccd22d2