CPU System Metrics collection #11253


Closed

EricWiener opened this issue Dec 24, 2021 · 9 comments · Fixed by #11795
Labels: accelerator: cpu · callback: device stats · feature · help wanted
Milestone: 1.7

Comments

EricWiener (Contributor) commented Dec 24, 2021

🚀 Feature

Provide CPU profiling similar to the GPU and XLA profiling provided by DeviceStatsMonitor. It would be nice if you could specify which device to profile with DeviceStatsMonitor instead of the profiling defaulting to whatever accelerator you are using.

Motivation

I am running out of CPU memory and need to figure out where this is occurring. It would be nice if I could easily monitor CPU stats (memory usage, percent utilization, etc.).

Pitch

Modify DeviceStatsMonitor to take a device argument that lets you specify which device to profile. You could then pass multiple DeviceStatsMonitor callbacks to Trainer. The CPU monitor could use psutil to track common memory attributes, for example as sketched below.
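
As a rough illustration of what such a monitor could report (a sketch only; the get_cpu_stats helper and the metric names are invented here, not a settled API), psutil already exposes the relevant readings:

import psutil


def get_cpu_stats() -> dict:
    # Hypothetical helper: metric names and selection are illustrative only.
    memory_info = psutil.Process().memory_info()
    return {
        "cpu_percent": psutil.cpu_percent(),  # system-wide CPU utilization (%)
        "memory_percent": psutil.virtual_memory().percent,  # system-wide memory usage (%)
        "process_rss_bytes": memory_info.rss,  # resident set size of this process
        "process_vms_bytes": memory_info.vms,  # virtual memory size of this process
    }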

Alternatives

N/A

Additional context

Also discussed here: #9032 (comment)

cc @Borda @kaushikb11 @awaelchli @justusschock @akihironitta @rohitgr7

EricWiener added the feature label Dec 24, 2021
ananthsub added this to the 1.6 milestone Dec 24, 2021
ananthsub (Contributor) commented Dec 24, 2021

@EricWiener would you prefer to see CPU stats automatically, even when training on GPU? And to confirm, do you want CPU stats tracked at the same cadence/hooks as GPU stats?

I retitled the issue to be about system metrics collection to avoid confusion with profiling/the Lightning profilers.

ananthsub changed the title from CPU Profiling to CPU System Metrics Dec 24, 2021
ananthsub changed the title from CPU System Metrics to CPU System Metrics collection Dec 24, 2021
EricWiener (Contributor, Author) commented Dec 24, 2021

@EricWiener would you prefer to see CPU stats automatically, even when training on GPU?

Automatic stats would be very nice, but it seems a little strange to require callbacks for GPU/XLA stats while tracking CPU stats automatically. Also, if CPU stats were tracked automatically, it would be nice if iteration speed were tracked as well.

And to confirm, do you want CPU stats tracked at the same cadence/hooks as GPU stats?

Ideally one could specify the frequency (maybe by passing a list of the hooks at which they want stats logged; a hypothetical sketch follows). For debugging memory usage it would be nice for stats to be logged at every possible hook. However, for most cases, every n steps/epochs would likely suffice.
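
For illustration only, a hypothetical interface for choosing the hooks (the hooks argument is invented here, not an existing DeviceStatsMonitor parameter):

from typing import Sequence

from pytorch_lightning.callbacks import Callback


class DeviceStatsMonitor(Callback):
    # Hypothetical sketch: the `hooks` argument is invented, not an existing parameter.
    def __init__(self, hooks: Sequence[str] = ("on_train_batch_end",)) -> None:
        self._hooks = set(hooks)

    def on_train_batch_start(self, trainer, pl_module, *args, **kwargs) -> None:
        if "on_train_batch_start" in self._hooks:
            ...  # collect and log device stats here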

ananthsub (Contributor) commented Dec 25, 2021

I should've clarified: by automatic I mean when using the device stats monitor callback, not enabled all the time.

One idea @daniellepintz and I discussed earlier was to do something like this:

from typing import Any, Dict, Optional, Union

import psutil
import torch

from pytorch_lightning.accelerators import Accelerator
from pytorch_lightning.utilities.exceptions import MisconfigurationException


class CPUAccelerator(Accelerator):

    _process: Optional[psutil.Process]  # requires psutil; check availability before use

    def setup_environment(self, root_device: torch.device) -> None:
        """
        Raises:
            MisconfigurationException:
                If the selected device is not CPU.
        """
        if "cpu" not in str(root_device):
            raise MisconfigurationException(f"Device should be CPU, got {root_device} instead.")
        self._process = psutil.Process()

    def teardown(self) -> None:
        self._process = None

    def get_device_stats(self, device: Union[str, torch.device]) -> Dict[str, Any]:
        """Return CPU metrics for the current process."""
        if not self._process:
            return {}
        return get_cpu_process_metrics(self._process)


def get_cpu_process_metrics(process: Optional[psutil.Process] = None) -> Dict[str, float]:
    # Default to the current process if none was given.
    process = process or psutil.Process()
    memory_info = process.memory_info()
    cpu_times = process.cpu_times()
    metrics: Dict[str, float] = {}
    metrics["cpu_rss_memory_bytes"] = memory_info.rss
    metrics["cpu_time_user"] = cpu_times.user
    metrics["cpu_time_system"] = cpu_times.system
    return metrics

get_cpu_process_metrics could also be called from the GPU and TPU accelerators as part of their get_device_stats implementations, for example as sketched below. Anytime someone attaches a device stats monitor callback, it would then report both the CPU stats and the stats of the specific device in use.
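
A minimal sketch of what that reuse could look like on the GPU side (assuming the get_cpu_process_metrics helper above; torch.cuda.memory_stats stands in for whatever GPU stats the accelerator already collects):

from typing import Any, Dict, Union

import torch

from pytorch_lightning.accelerators import Accelerator


class GPUAccelerator(Accelerator):
    def get_device_stats(self, device: Union[str, torch.device]) -> Dict[str, Any]:
        # Stand-in for the GPU stats the accelerator already collects.
        stats: Dict[str, Any] = dict(torch.cuda.memory_stats(device))
        # Fold in host-side process metrics so CPU usage shows up next to the GPU stats.
        stats.update(get_cpu_process_metrics())  # default -> current process
        return stats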

@four4fish @awaelchli this would motivate adding teardown to the accelerator interface. As a rule of thumb, anytime we offer a setup interface, we should also provide a teardown, since they come in pairs.

EricWiener (Contributor, Author) commented

Having CPU stats whenever the device stats monitor is used would be great.

It would also be ideal if the user could specify additional CPU metrics they want (like swap memory percent), for example along the lines sketched below.
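
For instance (illustrative only; extra_cpu_metrics is an invented argument, not an existing API), extra psutil readings could be passed as named callables:

import psutil

# Invented name, shown only to sketch the shape of the idea.
extra_cpu_metrics = {
    "swap_memory_percent": lambda: psutil.swap_memory().percent,
    "num_open_files": lambda: len(psutil.Process().open_files()),
}

# A CPU stats collector could evaluate these alongside its defaults:
stats = {name: fn() for name, fn in extra_cpu_metrics.items()}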

tchaton (Contributor) commented Jan 4, 2022

Hey @ananthsub,

Users might want to track both their CPU and accelerator device (GPU, TPU, ...) usage at the same time, which is a very common use case (done automatically by Wandb, for example).

However, the current design relies on a single accelerator being instantiated.

Do you have any idea how to resolve this?

IMO, as a user, I would prefer an interface like this:

from typing import Any, Optional

import pytorch_lightning as pl
from pytorch_lightning import Trainer
from pytorch_lightning.accelerators import CPUAccelerator
from pytorch_lightning.callbacks import Callback
from pytorch_lightning.utilities.exceptions import MisconfigurationException
from pytorch_lightning.utilities.types import STEP_OUTPUT

# Opt-in CPU stats when the selected accelerator isn't "cpu".
Trainer(accelerator="gpu", devices=2, callbacks=DeviceStatsMonitor(cpu_stats=True))


class DeviceStatsMonitor(Callback):
    ...

    def on_train_batch_end(
        self,
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        outputs: STEP_OUTPUT,
        batch: Any,
        batch_idx: int,
        unused: Optional[int] = 0,
    ) -> None:
        if not trainer.logger:
            raise MisconfigurationException("Cannot use `DeviceStatsMonitor` callback with `Trainer(logger=False)`.")

        if not trainer.logger_connector.should_update_logs:
            return

        device_stats = trainer.accelerator.get_device_stats(pl_module.device)

        # Compare the device *type*; `pl_module.device` is a torch.device, not a string.
        if pl_module.device.type != "cpu" and self.cpu_stats:
            # Sketched as a class-level call; a real implementation would need a
            # CPU stats helper that doesn't require a CPUAccelerator instance.
            device_stats.update(CPUAccelerator.get_device_stats())

        ...
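
One detail worth noting with this merge: if a CPU metric and a device metric ever share a key name, device_stats.update(...) silently overwrites one of them. A simple guard, not part of the proposal above, would be to namespace the CPU keys before merging:

cpu_stats = CPUAccelerator.get_device_stats()
device_stats.update({f"cpu/{name}": value for name, value in cpu_stats.items()})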

ananthsub (Contributor) commented

@tchaton that looks reasonable to me!

tchaton (Contributor) commented Jan 5, 2022

@carmocca @awaelchli Any thoughts on this?

@EricWiener Would you have some interest in contributing this feature?

awaelchli (Contributor) commented

Yes, this looks good. If the stats collection happens on the accelerator, this is the only way I currently see. Lightning always has exactly one accelerator, but uses both the CPU and the extra device together. This topic will come up again in the future.

EricWiener (Contributor, Author) commented

@carmocca @awaelchli Any thoughts on this?

@EricWiener Would you have some interest in contributing this feature?

So sorry, I just saw this when I searched for the issue again. I'll try to work on this this week or next.

@carmocca carmocca modified the milestones: 1.6, future Feb 1, 2022
@carmocca carmocca added the help wanted Open to be worked on label Feb 1, 2022
@carmocca carmocca modified the milestones: future, 1.7 Mar 1, 2022