The TPU issues in Lightning #13720

Closed
awaelchli opened this issue Jul 18, 2022 · 1 comment · Fixed by #11098
Labels: accelerator: tpu (Tensor Processing Unit), ci (Continuous Integration), tests

awaelchli commented Jul 18, 2022

🐛 Bug

Recent observations have made it clear that there are many problems with either the TPU implementation in Lightning or the test environment:

  1. Not all TPU tests written in Lightning are executed: only a hand-maintained list of tests ever runs (Fix TPU testing and collect all tests #11098).
  2. Attempting to address 1) reveals that many of the tests that do run are decorated with the wrapper @pl_multi_process_test, which suppresses assertion errors and exceptions from broken tests.

The result is that we have many broken tests whose failures never surface in the CI.
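To illustrate how a wrapper like this can mask failures, here is a toy sketch. The name and implementation are hypothetical (this is not the actual pl_multi_process_test code); it only demonstrates the failure mode: if a decorator catches everything the test raises, pytest records a pass.

```python
import functools

def swallow_errors(fn):
    """Toy stand-in for a test wrapper that hides failures.

    Hypothetical implementation for illustration -- NOT the actual
    pl_multi_process_test code. If the wrapper catches all exceptions,
    pytest never sees the assertion error and the test "passes".
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except BaseException:
            return None  # the failure is suppressed; pytest sees a pass
    return wrapper

@swallow_errors
def test_obviously_broken():
    assert 1 == 2  # should fail loudly, but never surfaces

test_obviously_broken()  # runs without raising
```

Removing the decorator (or calling the undecorated function via `__wrapped__`) makes the assertion surface again, which is exactly what happens in #11098.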

To Reproduce

A simple way to reproduce this is to remove all the decorators, which is what I have done in #11098, and then let the tests run and fail. Attached is the full log file of such a CI run: tpu-logs-without-pl-multi.txt

In summary:
17 failed, 48 passed

FAILED tests/tests_pytorch/callbacks/test_device_stats_monitor.py::test_device_stats_monitor_tpu
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_tpu_index[1] - Runt...
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_tpu_index[5] - Runt...
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_tpu_devices_8 - tor...
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_16bit_tpu_index[1]
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_16bit_tpu_index[5]
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_16bit_tpu_devices_8
FAILED tests/tests_pytorch/models/test_tpu.py::test_model_tpu_early_stop - to...
FAILED tests/tests_pytorch/models/test_tpu.py::test_dataloaders_passed_to_fit
FAILED tests/tests_pytorch/models/test_tpu.py::test_broadcast_on_tpu - torch....
FAILED tests/tests_pytorch/models/test_tpu.py::test_tpu_reduce - torch.multip...
FAILED tests/tests_pytorch/models/test_tpu.py::test_if_test_works_with_checkpoint_false
FAILED tests/tests_pytorch/models/test_tpu.py::test_tpu_sync_dist - torch.mul...
FAILED tests/tests_pytorch/models/test_tpu.py::test_tpu_debug_mode - torch.mu...
FAILED tests/tests_pytorch/models/test_tpu.py::test_tpu_host_world_size - tor...
FAILED tests/tests_pytorch/profilers/test_xla_profiler.py::test_xla_profiler_instance
FAILED tests/tests_pytorch/trainer/properties/test_estimated_stepping_batches.py::test_num_stepping_batches_with_tpu[8-8]
ERROR tests/tests_pytorch/models/test_tpu.py::test_model_16bit_tpu_index[1]
ERROR tests/tests_pytorch/models/test_tpu.py::test_model_16bit_tpu_index[5]

There is, of course, the infamous cryptic error message that appears in several test cases:
Exception in device=TPU:2: Cannot replicate if number of devices (1) is different from 8

This sometimes hints at the possibility that we are accessing xm.xla_device before spawning processes.
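The ordering constraint can be sketched with plain multiprocessing (names and structure here are assumptions for illustration, not Lightning code): any per-process handle must be created inside the child after spawning, which with torch_xla would be the `xm.xla_device()` call.

```python
import multiprocessing as mp
import os

# Illustrative sketch only. The pid stands in for a per-process device
# handle (with torch_xla, the result of xm.xla_device()).
def worker(rank, queue):
    # Correct ordering: acquire the handle *inside* the child, after
    # spawning. Touching the device in the parent first is the suspected
    # cause of "Cannot replicate if number of devices (1) is different
    # from 8".
    queue.put((rank, os.getpid()))

def run(world_size=2):
    ctx = mp.get_context("fork")  # torch_xla's xmp.spawn manages this step
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(r, queue)) for r in range(world_size)]
    for p in procs:
        p.start()
    results = sorted(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    # Each rank gets its own process-local handle (distinct pids).
    print(run())
```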
Other examples:

self = <pytorch_lightning.trainer.connectors.accelerator_connector.AcceleratorConnector object at 0x7f7bbead03d0>

    @property
    def is_distributed(self) -> bool:
        # TODO: deprecate this property
        # Used for custom plugins.
        # Custom plugins should implement is_distributed property.
        if hasattr(self.strategy, "is_distributed") and not isinstance(self.accelerator, TPUAccelerator):
            return self.strategy.is_distributed
        distributed_strategy = (
            DDP2Strategy,
            DDPStrategy,
            DDPSpawnShardedStrategy,
            DDPShardedStrategy,
            DDPFullyShardedNativeStrategy,
            DDPFullyShardedStrategy,
            DDPSpawnStrategy,
            DeepSpeedStrategy,
            TPUSpawnStrategy,
            HorovodStrategy,
            HPUParallelStrategy,
        )
        is_distributed = isinstance(self.strategy, distributed_strategy)
        if isinstance(self.accelerator, TPUAccelerator):
>           is_distributed |= self.strategy.is_distributed
E           TypeError: unsupported operand type(s) for |=: 'bool' and 'NoneType'

    def has_len_all_ranks(
        dataloader: DataLoader,
        training_type: "pl.Strategy",
        model: Union["pl.LightningModule", "pl.LightningDataModule"],
    ) -> bool:
        """Checks if a given Dataloader has ``__len__`` method implemented i.e. if it is a finite dataloader or
        infinite dataloader."""
        try:
            local_length = len(dataloader)
            total_length = training_type.reduce(torch.tensor(local_length).to(model.device), reduce_op="sum")
    
>           if total_length == 0:
E           RuntimeError: Not found: From /job:tpu_worker/replica:0/task:0:
E           2 root error(s) found.
E             (0) Not found: No subgraph found for uid 2894109085761937038
E           	 [[{{node XRTExecute}}]]
E             (1) Not found: No subgraph found for uid 2894109085761937038
E           	 [[{{node XRTExecute}}]]
E           	 [[XRTExecute_G29]]
E           0 successful operations.
E           0 derived errors ignored.

Furthermore, sometimes, non-deterministically, the CI just stops in the middle of execution:

....
profilers/test_xla_profiler.py::test_xla_profiler_instance FAILED        [ 93%]
strategies/test_tpu_spawn.py::test_model_tpu_one_core PASSED   [ 95%]
Done with log retrieval attempt.

Exited with code exit status 2
CircleCI received exit code 2

Expected behavior

It is unclear what the intention was when this test setup was designed. The decorators were introduced way back in #2512 and have changed little since, while strategies and accelerators have undergone major design changes and countless refactors. I propose re-evaluating whether the pl_multi_process_test decorator is still needed and, if so, documenting why it exists and how and when to use it correctly.

Possible Action

My suggestion is to

  1. Remove the decorator
  2. Debug each test on the VM
  3. Run tests that require it in standalone mode
  4. Reduce the verbosity of the mind-boggling thousands of nonsense lines printed in the CI
  5. Upgrade to the latest XLA and PyTorch versions

cc @carmocca @akihironitta @Borda @kaushikb11 @rohitgr7

@awaelchli awaelchli added needs triage Waiting to be triaged by maintainers ci Continuous Integration accelerator: tpu Tensor Processing Unit and removed needs triage Waiting to be triaged by maintainers labels Jul 18, 2022
@carmocca carmocca added the tests label Jul 19, 2022
@carmocca carmocca added this to the future milestone Jul 19, 2022
awaelchli commented:

As a follow-up, after the release I'd like to update the TPU CI to the latest XLA release, but I will likely need help from more talented hands. #13818

@carmocca carmocca modified the milestones: pl:future, pl:1.7 Jul 28, 2022