CPU System Metrics collection #11253
@EricWiener would you prefer to see CPU stats automatically, even when training on GPU? And to confirm: do you want CPU stats tracked at the same cadence/hooks as GPU stats? I retitled the issue to be about system metrics collection to avoid confusion with profiling/Lightning profilers.
Automatic stats would be very nice, but it seems a little strange to require callbacks for GPU/XLA stats but have CPU stats automatically tracked. Also, if it were to be automatically tracked, it would be nice if iteration speed was tracked as well.
Ideally one could specify the frequency (maybe by passing a list of the callback hooks at which they want stats to be logged). For debugging memory usage it would be nice for stats to be logged at every possible hook. However, for most cases, every …
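The frequency idea above can be sketched as a simple step gate (a hypothetical helper, mirroring the spirit of Trainer's `log_every_n_steps`):

```python
def should_log_stats(batch_idx: int, every_n_steps: int) -> bool:
    """Return True when stats should be emitted for this batch.

    every_n_steps=1 logs at every hook invocation (useful when hunting
    a memory leak); larger values keep the logging overhead low.
    """
    return every_n_steps > 0 and batch_idx % every_n_steps == 0
```

The `every_n_steps > 0` guard also makes `0` a natural "disabled" value.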
I should've clarified: by "automatic" I mean when using the device stats monitor callback, not all the time. One idea @daniellepintz and I discussed earlier was to do something like this:

```python
class CPUAccelerator(Accelerator):

    _process: Optional[psutil.Process]  # check if psutil is available

    def setup_environment(self, root_device: torch.device) -> None:
        """
        Raises:
            MisconfigurationException:
                If the selected device is not CPU.
        """
        if "cpu" not in str(root_device):
            raise MisconfigurationException(f"Device should be CPU, got {root_device} instead.")
        self._process = psutil.Process()

    def teardown(self) -> None:
        self._process = None

    def get_device_stats(self, device: Union[str, torch.device]) -> Dict[str, Any]:
        """Return stats for the current CPU process."""
        if self._process is None:
            return {}
        return get_cpu_process_metrics(self._process)


def get_cpu_process_metrics(process: Optional[psutil.Process] = None) -> Dict[str, float]:
    process = process or psutil.Process()
    memory_info = process.memory_info()
    cpu_times = process.cpu_times()
    metrics: Dict[str, float] = {}
    metrics["cpu_rss_memory_bytes"] = memory_info.rss
    metrics["cpu_time_user"] = cpu_times.user
    metrics["cpu_time_system"] = cpu_times.system
    return metrics
```

@four4fish @awaelchli this would motivate …
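For reference, the same three metrics can be approximated without psutil via the stdlib `resource` module (Unix-only; the function name and unit handling here are my own, not Lightning's):

```python
import resource
from typing import Dict


def get_cpu_process_metrics_stdlib() -> Dict[str, float]:
    """Approximate the psutil-based metrics with stdlib getrusage.

    Note: ru_maxrss is the *peak* RSS (KiB on Linux, bytes on macOS),
    whereas psutil's memory_info().rss is the current RSS in bytes.
    """
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "cpu_rss_memory_bytes": float(usage.ru_maxrss * 1024),  # Linux units assumed
        "cpu_time_user": usage.ru_utime,
        "cpu_time_system": usage.ru_stime,
    }


metrics = get_cpu_process_metrics_stdlib()
```

This is only a fallback sketch; psutil remains the portable choice since it also exposes utilization, swap, etc.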
Having CPU stats whenever the device stats monitor is used would be great. It would also be ideal if the user could specify additional CPU metrics they wanted (like swap memory percent).
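One way such user-specified extras could be plumbed (a hypothetical API, not in Lightning): accept a mapping from metric name to a zero-argument collector, e.g. `{"swap_percent": lambda: psutil.swap_memory().percent}`. A minimal sketch with stub collectors so it runs without psutil:

```python
from typing import Callable, Dict


def collect_extra_cpu_metrics(extras: Dict[str, Callable[[], float]]) -> Dict[str, float]:
    """Evaluate each user-provided collector; skip any that raise so a
    single broken metric can't take down the whole logging step."""
    out: Dict[str, float] = {}
    for name, collect in extras.items():
        try:
            out[name] = float(collect())
        except Exception:
            continue
    return out


# Stub collectors; with psutil installed the first could be
# lambda: psutil.swap_memory().percent
stats = collect_extra_cpu_metrics({"swap_percent": lambda: 12.5, "broken": lambda: 1 / 0})
```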
Hey @ananthsub, users might want to track both their CPU and accelerator device (GPU, TPU, ...) usage at the same time, which is a very common use case (done automatically by Wandb, as an example). However, the current design relies on a single accelerator being instantiated. Do you have any idea how to resolve this? IMO, as a user, I would prefer an interface like this:

```python
# opt-in when the selected accelerator isn't 'cpu'.
Trainer(accelerator="gpu", devices=2, callbacks=DeviceStatsMonitor(cpu_stats=True))
```

```python
class DeviceStatsMonitor:
    ...

    def on_train_batch_end(
        self,
        trainer: "pl.Trainer",
        pl_module: "pl.LightningModule",
        outputs: STEP_OUTPUT,
        batch: Any,
        batch_idx: int,
        unused: Optional[int] = 0,
    ) -> None:
        if not trainer.logger:
            raise MisconfigurationException("Cannot use `DeviceStatsMonitor` callback with `Trainer(logger=False)`.")
        if not trainer.logger_connector.should_update_logs:
            return
        device_stats = trainer.accelerator.get_device_stats(pl_module.device)
        if pl_module.device.type != "cpu" and self.cpu_stats:
            device_stats.update(CPUAccelerator.get_device_stats())
        ...
```
@tchaton that looks reasonable to me!
@carmocca @awaelchli Any thoughts on this? @EricWiener Would you have some interest in contributing this feature?
Yes, looks good. If the stats collection happens on the accelerator, this is the only way I currently see. Lightning always has exactly one accelerator, but uses both the CPU and the extra device together. This topic will come up again in the future.
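If both sources are collected in one callback as discussed above, key collisions between CPU and accelerator metrics can be avoided by prefixing each source before logging (a sketch; the prefix scheme is illustrative, not Lightning's):

```python
from typing import Dict


def merge_stats(device_stats: Dict[str, float], cpu_stats: Dict[str, float]) -> Dict[str, float]:
    """Namespace each metric dict so e.g. a 'utilization' key from the
    accelerator can't silently overwrite one from the CPU."""
    merged = {f"device/{k}": v for k, v in device_stats.items()}
    merged.update({f"cpu/{k}": v for k, v in cpu_stats.items()})
    return merged


merged = merge_stats({"utilization": 87.0}, {"utilization": 12.0, "cpu_rss_memory_bytes": 2048.0})
```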
So sorry just saw this when I searched for the issue again. Will try to work on this this week or next |
🚀 Feature
Provide CPU stats collection similar to the GPU and XLA stats provided by DeviceStatsMonitor. It would be nice if you could specify which device you want to monitor with DeviceStatsMonitor, instead of it defaulting to whatever accelerator you are using.
Motivation
I am running out of CPU memory and I need to figure out where this is occurring. It would be nice if I could easily monitor CPU stats (memory usage, percent utilization, etc).
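Independent of Lightning, the stdlib `tracemalloc` module is one way to localize Python-side CPU memory growth like this in the meantime (the list allocation below is a stand-in for a suspect training step):

```python
import tracemalloc

tracemalloc.start()

leaky = [bytes(1024) for _ in range(1000)]  # stand-in for a suspect training step

snapshot = tracemalloc.take_snapshot()
top_stat = snapshot.statistics("lineno")[0]  # biggest allocation site, with file/line
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
```

Note that tracemalloc only sees Python allocations; memory held by C extensions (e.g. tensor storage) needs process-level tools like psutil.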
Pitch
Modify `DeviceStatsMonitor` to take a `device` arg that allows you to specify which device to profile. You can then pass multiple `DeviceStatsMonitor` callbacks to `Trainer`. The CPU monitor can use `psutil` to track common memory attributes.
Alternatives
N/A
N/A
Additional context
Also discussed here: #9032 (comment)
cc @Borda @kaushikb11 @awaelchli @justusschock @akihironitta @rohitgr7