[feature request] Can we decouple metrics publishing from LightningLoggerBase? #11209

wilson100hong · 2021-12-21T21:47:51Z

🚀 Problem

In our production environment for Lightning, we log metrics by using LightningLoggerBases and publish them into multiple sources. For example, we publish QPS to TensorBoard, databases, dashboards and monitors. Each publishing share the same metrics and aggregation logics, but can have different flush frequency settings to avoid DDoS their backend.

What makes things more complex is, we have multiple Loggers to record metrics from difference sources (model, framework, etc) and some Loggers may share publishing destinations (but not identical). For example, we have a model metrics Logger and a system metrics Logger both publishes to TensorBoard, but model metrics Logger does not publish to system monitor as system metrics Logger does.

Motivation

The motivation is in our use case, we need to publish the same metrics to multiple databases, including Tensorboard, dashboards and monitoring system. Since those data sinks share the same metrics data, and the same aggregation logics, so we want to use "single" logger instead of LoggerCollection to simplify the code. Also, decoupling MetricsPublisher out enables us to customize arbitrary combinations of metrics publishing by reusing existing implementations.

Pitch

This decoupling provides isolation from logging and publishing, and simplifies LightningLoggerBase. It increase code reusability without re-implement similar publishing logics and throttling control across Loggers

It does not introduce compatibility issue since it does not change existing APIs; users can still use exiting logger implementation coupled with publishing.

Proposal

Introduce a new interface MetrcsPublisher. Users implement subclass to publish metrics to specific data sinks. MetricsPublisher also provides configurable publishing frequency control to throttle flushing, thus to avoid DDoS the backend. The interface would be like:

class MetricsPublisher:
  def __init__(
          self,
          prefix: Optional[str],
          # The below two args are for publish frequency control. 
          flush_every_n_publish: int,  
          flush_interval_ms: int,
      ) -> None:

  def publish(self, metrics: Dict[str, float], flush: bool = False) -> None:
    # flush is forward to the client of actual data sink to determine whether to flush client side data.

Introduce MetricsLogger, a subclass of LightningLoggerBase. MetricsLogger is a pure Logger but can be attached with MetricsPublishers. When save() or close() is called, it will iterate all attached MetricsPublishers and invoke their publish():

class MetricsLogger(LightningLoggerBase):
 def __init__(
        self,
        publishers: Optional[List[MetricsPublisher]] = None) -> None:

def save(self) -> None:
    for publisher in self._publishers:
        publisher.publish(metrics=metrics, flush=False)

def close(self) -> None:
    for publisher in self._publishers:
        publisher.publish(metrics=metrics, flush=True)

Alternatives

N/A, so far I didn't see alternative for such decoupling. But if so, happy to discuss.

Additional context

N/A

cc @Borda @tchaton @justusschock @awaelchli @edward-io @ananthsub @rohitgr7 @kamil-kaczmarek @Raalsky @Blaizzy

The text was updated successfully, but these errors were encountered:

ananthsub · 2021-12-22T08:45:06Z

@wilson100hong to better understand your request:

It seems like the publisher is the log_metrics subset of the existing logger APIs.

Do you feel like you have to work around the existing logger APIs? If so, is that because the logger API carries bloat, which leads to a lack of clarity as a developer? If so, do you think we could improve this instead by streamlining the existing logger API? Or do you think we truly need a separate publisher interface?

For a "publisher" like TensorBoard or MLFlow, does this mean there'd need to be both publisher and logger implementations to use them? Where do you draw the lines between them? Do you have an example of how you'd configure this as an end user?

Some areas we could streamline:

Remove the aggregation logic from the base logger API.

This exists to check if there are multiple (key, value) pairs logged for the same step index. Looking back at the issue which drove this addition (#1173), I don't think the motivation for having a dedicated agg_and_log_metrics exists anymore, since we now use the global step for determining when to update the logs:
https://github.com/PyTorchLightning/pytorch-lightning/blob/eb5b350f9a6bd27a66dfebcb00b3acb33b7bbb89/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py#L73-L76

@carmocca is my understanding correct?

I personally have never used or even seen others use the agg_key_funcs / agg_default_func arguments on the base logger constructor: https://github.com/PyTorchLightning/pytorch-lightning/blob/eb5b350f9a6bd27a66dfebcb00b3acb33b7bbb89/pytorch_lightning/loggers/base.py#L67-L68

Nor have I seen users update them via the setter: https://github.com/PyTorchLightning/pytorch-lightning/blob/eb5b350f9a6bd27a66dfebcb00b3acb33b7bbb89/pytorch_lightning/loggers/base.py#L83-L102

Moreover, none of the logger implementations even pass through those args to the super().__init__() initialization! I'm sure if these were used, that at least the commonly used logger implementations would support this customization.

Would deprecating this provide clarity around when data is actually sent to the publishing clients inside of the logger?

Do you feel the use of the LoggerCollection abstraction makes it harder to reason about using multiple loggers?

Such a collection is somewhat handy for writing data, but is much less helpful when reading data, especially when individual loggers have different properties. It's not clear at all what LoggerCollection.name should be. The concatenation of all its instance loggers names? The first logger's names?

Importantly, we offer no such LoggerCollection-equivalent for Callbacks. We treat them as a flat list. Why are loggers any different? A simplification here be to formally support a sequence of loggers in the trainer, drop the assumption of a single logger being used (from the trainer's POV), and simply iterate over a sequence of loggers for the write calls.

Do you feel that the logger API is otherwise too open ended? For instance, mandating an experiment (which could be anything) to do custom work? I personally don't think this should be on the logger interface since there is no interface Lightning prescribes for it. If users are relying on the experiment, it's almost certain they know what Logger implementation they're dealing with. In this case, it's much easier having the summary writer being a public property of the tensorboard logger.
The base Logger API makes an implicit assumption that data is being logged to files. As you mentioned that you are logging to a DB, do you think properties like this are overhead? https://github.com/PyTorchLightning/pytorch-lightning/blob/eb5b350f9a6bd27a66dfebcb00b3acb33b7bbb89/pytorch_lightning/loggers/base.py#L342-L346

Some areas we have streamlined:

We deprecated flush_logs_every_n_steps from the Trainer as every logger could handle this differently.

Introduce a new interface MetrcsPublisher. Users implement subclass to publish metrics to specific data sinks. MetricsPublisher also provides configurable publishing frequency control to throttle flushing, thus to avoid DDoS the backend.

As you mentioned throttling, would the same principle apply here? Throttling could also sit at the logger implementation.

edward-io · 2021-12-22T09:09:18Z

Thanks for writing this up!

A few questions:

For the MetricsPublisher API, flush_every_n_publish, flush_interval_ms should these be optional, or do they both have to be defined, or can only one or the other be defined, but not both? Will calling publish(..., flush=True) override the behavior? If so, does it reset the counters?
Does it make sense to use the MetricsPublisher abstraction for existing loggers as well? For example, should we refactor TensorBoardLogger to call a TensorBoardMetricsPublisher? Otherwise, will we have duplicate logic in TensorBoardLogger and TensorBoardMetricsPublisher?

stale · 2022-01-22T14:41:22Z

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

wilson100hong added the feature Is an improvement or enhancement label Dec 21, 2021

wilson100hong changed the title ~~Decouple metrics logging and publishing~~ [feature request] Can we decouple metrics publishing from LightningLoggerBase? Dec 21, 2021

ananthsub added design Includes a design discussion logger Related to the Loggers labels Dec 21, 2021

stale bot added the won't fix This will not be worked on label Jan 22, 2022

stale bot closed this as completed Apr 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[feature request] Can we decouple metrics publishing from LightningLoggerBase? #11209

[feature request] Can we decouple metrics publishing from LightningLoggerBase? #11209

wilson100hong commented Dec 21, 2021 •

edited by github-actions bot

Loading

ananthsub commented Dec 22, 2021 •

edited

Loading

Uh oh!

edward-io commented Dec 22, 2021 •

edited

Loading

Uh oh!

stale bot commented Jan 22, 2022

Uh oh!

[feature request] Can we decouple metrics publishing from LightningLoggerBase? #11209

[feature request] Can we decouple metrics publishing from LightningLoggerBase? #11209

Comments

wilson100hong commented Dec 21, 2021 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🚀 Problem

Motivation

Pitch

Proposal

Alternatives

Additional context

ananthsub commented Dec 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

edward-io commented Dec 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stale bot commented Jan 22, 2022

Uh oh!

wilson100hong commented Dec 21, 2021 •

edited by github-actions bot

Loading

ananthsub commented Dec 22, 2021 •

edited

Loading

edward-io commented Dec 22, 2021 •

edited

Loading