Closing streams to avoid testing issues #6128
Conversation
@@ -111,6 +114,8 @@ def _datapipe(self, resource_dps: List[IterDataPipe]) -> IterDataPipe[Dict[str,
             drop_none=True,
         )
         if self._split == "train_noval":
+            for i in split_dp:
+                StreamWrapper.cleanup_structure(i)
Noob question: what is the functionality of cleanup_structure?
As a topic for discussion: when GC is not helping, we have to close streams manually. I have some prototypes/ideas for how we can add debug info to find such leftovers.
Just dropping in to add some context. Note that the spurious errors were not visible on Python 3.7, which is currently the only version our CI tests against. Either merge #6065 first or at least activate the other versions temporarily to see if this PR actually fixes them.
@@ -107,7 +107,9 @@ def _prepare_sample(
         ann_path, ann_buffer = ann_data

         image = EncodedImage.from_file(image_buffer)
+        image_buffer.close()
The errors we have seen in our test suite have never been with these files, but only with archives.
The tests complain that the archive stream is not closed. This happens because the child (the unpacked file stream) also remains open and keeps a reference to its parent. In pytorch/pytorch#78952 and pytorch/data#560 we introduced a mechanism to close parent streams once every child is closed.
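To illustrate the idea, here is a toy sketch of that parent/child bookkeeping (not the actual torchdata code; all names below are made up):

class Stream:
    """Toy stand-in for StreamWrapper's parent/child tracking."""

    def __init__(self, parent=None):
        self.parent = parent
        self.closed = False
        self.open_children = 0
        if parent is not None:
            parent.open_children += 1

    def close(self):
        if self.closed:
            return
        self.closed = True
        if self.parent is not None:
            self.parent.open_children -= 1
            # Once the last unpacked child is closed, the archive
            # stream itself can be released as well.
            if self.parent.open_children == 0:
                self.parent.close()

archive = Stream()               # e.g. the tar/zip handle
member = Stream(parent=archive)  # a file unpacked from the archive
member.close()                   # closing the last child closes the parent
assert archive.closed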
The GC workaround does not work inside test environments, as they keep references to the file handles. So we have to close the file handles manually.
Rebased; it would be nice to land this to clean up torchdata's dependency tests.
Thanks for the effort @VitalyFedyunin! I left some questions and suggestions inline. If we have reached consensus on everything, I can take over and implement it if you want me to.
        image_buffer.close()
        ann = read_mat(ann_buffer)
        ann_buffer.close()
Instead of doing that in every dataset individually, can't we just do it in

def fromfile(

and

def read_mat(buffer: BinaryIO, **kwargs: Any) -> Any:

? I think so far we don't have a case where we need to read from the same file handle twice. Plus, reading twice would only work if the stream is seekable, which I am not sure we can guarantee.
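If we went down that road, a self-closing read_mat would be a small change; a minimal sketch, assuming the helper wraps scipy.io.loadmat (the body below is a guess, not the current implementation):

from typing import Any, BinaryIO

import scipy.io

def read_mat(buffer: BinaryIO, **kwargs: Any) -> Any:
    try:
        # Parse the .mat payload while the handle is still open.
        return scipy.io.loadmat(buffer, **kwargs)
    finally:
        # Close unconditionally so no dataset has to remember to do it.
        buffer.close()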
@@ -29,8 +29,8 @@ def __init__(
         self.fieldnames = fieldnames

     def __iter__(self) -> Iterator[Tuple[str, Dict[str, str]]]:
-        for _, file in self.datapipe:
-            file = (line.decode() for line in file)
+        for _, fh in self.datapipe:
I'm ok with the closing here, but why the rename? Can you revert that?
        for i in scenes_dp:
            janitor(i)
- Can we make the loop variable more expressive?
- Can we use torchdata.janitor instead, to make it clearer where this is coming from?
-for i in scenes_dp:
-    janitor(i)
+for _, file in scenes_dp:
+    janitor(file)
Plus, do we even need to use torchdata.janitor here? Would just .close() be sufficient?
-for i in scenes_dp:
-    janitor(i)
+for _, file in scenes_dp:
+    file.close()
@@ -182,9 +184,11 @@ def _prepare_sample(
         anns, image_meta = ann_data

         sample = self._prepare_image(image_data)
Could you revert the formatting changes?
@@ -169,9 +170,10 @@ def _classify_meta(self, data: Tuple[str, Any]) -> Optional[int]:

     def _prepare_image(self, data: Tuple[str, BinaryIO]) -> Dict[str, Any]:
         path, buffer = data
         image = close_buffer(EncodedImage.from_file, buffer)
If EncodedImage.from_file closes automatically, we also don't need this wrapper.
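To make that concrete, a hypothetical auto-closing from_file could look like this (a sketch only, not torchvision's actual class):

from typing import BinaryIO

class EncodedImage:
    def __init__(self, data: bytes) -> None:
        self.data = data

    @classmethod
    def from_file(cls, buffer: BinaryIO) -> "EncodedImage":
        try:
            # Consume the stream eagerly so the handle is no longer needed.
            return cls(buffer.read())
        finally:
            buffer.close()

With that behavior the call site would shrink to image = EncodedImage.from_file(buffer), and the close_buffer wrapper becomes unnecessary.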
        for i in split_dp:
            janitor(i)
Same as above.
Plus, don't we need to do the same on extra_split_dp in the else branch?
def close_buffer(fn: Callable, buffer: IO) -> Any:
I think this was only used once and can be superseded if our read functions clean up after themselves. So this can probably be removed.
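For reference, given its signature and the call site above, the helper presumably amounts to something like this (the body is an assumption):

from typing import IO, Any, Callable

def close_buffer(fn: Callable, buffer: IO) -> Any:
    # Apply the reader, then release the handle even if reading raises.
    try:
        return fn(buffer)
    finally:
        buffer.close()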
     try:
-        sample = next(iter(dataset))
+        iterator = iter(dataset)
Instead of sticking with the iterator pattern here, can't we simply do
samples = list(dataset)
if not samples:
raise AssertionError(...)
sample = samples[0]
...
        iterator = iter(dataset)
        one_element = next(iterator)
Same as above.
        if len(StreamWrapper.session_streams) > 0:
            raise Exception(StreamWrapper.session_streams)
Could you explain what this does? Is StreamWrapper.session_streams just a counter for open streams? If yes, why are we only testing this here and not in the other tests? If this is something we should check in general, we can use a decorator like
import functools

import pytest

def check_unclosed_streams(test_fn):
    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        if len(StreamWrapper.session_streams) > 0:
            raise pytest.UsageError("Some previous test didn't clean up")
        test_fn(*args, **kwargs)
        if len(StreamWrapper.session_streams) > 0:
            raise AssertionError("This test didn't clean up")
    return wrapper
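Alternatively, if this should run around every test anyway, a pytest autouse fixture would avoid decorating each function individually. A sketch, assuming StreamWrapper.session_streams is a registry of still-open streams (the import path is my guess for the version in use here):

import pytest
from torch.utils.data.datapipes.utils.common import StreamWrapper

@pytest.fixture(autouse=True)
def guard_unclosed_streams():
    # Fail fast if an earlier test leaked streams, run the test,
    # then verify that the test itself cleaned up.
    if StreamWrapper.session_streams:
        raise pytest.UsageError("Some previous test didn't clean up")
    yield
    assert not StreamWrapper.session_streams, "This test didn't clean up"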
@VitalyFedyunin In #6647 I redid this PR with all my suggested changes. We can take the discussion there if you want.
Hi @VitalyFedyunin! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Superseded by #6647. |
Stack from ghstack (oldest at bottom):
The GC workaround does not work inside test environments, as they keep references to the file handles, so we have to close the file handles manually.