logger_connector has edge case where step can be a float #20692

Open

ryxli opened this issue Apr 2, 2025 · 0 comments

Labels: bug (Something isn't working) · needs triage (Waiting to be triaged by maintainers) · ver: 2.4.x · ver: 2.5.x

Bug description

This causes an issue in MLflow: mlflow/mlflow#15180

There is logic in `logger_connector` where, when `step` is part of the recorded metrics, it gets popped out and passed into `logger.log_metrics`:

ref

        # turn all tensors to scalars
        scalar_metrics = convert_tensors_to_scalars(metrics)  # <-- a 'step' cast to a float tensor becomes a float scalar

        if step is None:
            step = scalar_metrics.pop("step", None)
        ...

        if step is None:
            # added metrics for convenience
            scalar_metrics.setdefault("epoch", self.trainer.current_epoch)
            step = self.trainer.fit_loop.epoch_loop._batches_that_stepped

        # log actual metrics
        for logger in self.trainer.loggers:
            logger.log_metrics(metrics=scalar_metrics, step=step)  # <-- passed down to MLflow here
            logger.save()
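
For illustration, here is a minimal sketch of how a tensor-valued `step` leaks through as a float. The `convert_tensors_to_scalars` below is a simplified stand-in for the Lightning utility, assumed here to reduce tensors via `.item()`:

    import torch

    def convert_tensors_to_scalars(metrics):
        # simplified stand-in: reduce every tensor to a Python scalar via .item()
        return {k: v.item() if isinstance(v, torch.Tensor) else v for k, v in metrics.items()}

    # 'step' was logged as an int but was cast to a float tensor during metric reduction
    metrics = {"loss": torch.tensor(0.25), "step": torch.tensor(100.0)}
    scalar_metrics = convert_tensors_to_scalars(metrics)

    step = scalar_metrics.pop("step", None)
    print(type(step), step)  # <class 'float'> 100.0 -- this float is what reaches log_metrics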

The metric always gets cast to a float here:

        if self.is_tensor:
            value = cast(Tensor, value)
            dtype = _get_default_dtype()
            if not torch.is_floating_point(value):
                warning_cache.warn(
                    # do not include the value to avoid cache misses
                    f"You called `self.log({self.meta.name!r}, ...)` in your `{self.meta.fx}` but the value needs to"
                    f" be floating to be reduced. Converting it to {dtype}."
                    " You can silence this warning by converting the value to floating point yourself."
                    " If you don't intend to reduce the value (for instance when logging the global step or epoch) then"
                    f" you can use `self.logger.log_metrics({{{self.meta.name!r}: ...}})` instead."
                )
                value = value.to(dtype)
            if value.dtype not in (torch.float32, torch.float64):
                value = value.to(dtype)
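
For reference, the cast path above boils down to the following (a sketch using the public `torch.get_default_dtype()` in place of the private `_get_default_dtype()`):

    import torch

    value = torch.tensor(100)              # integer tensor, e.g. the global step
    print(torch.is_floating_point(value))  # False -> the warning above fires
    value = value.to(torch.get_default_dtype())
    print(value, value.dtype)              # tensor(100.) torch.float32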

This can cause issues, for example in NeMo, which relies on this logic to pass the step into `logger.log_metrics(metrics=scalar_metrics, step=step)`:

    @override
    def training_step(self, dataloader_iter, *args: Any, **kwargs: Any) -> STEP_OUTPUT:
        ...
        self.lightning_module.log(
            "step",
            self.trainer.global_step,
        )
        ...

I'm aware that you're not really supposed to use `lightning_module.log(value)` for integers, but this is what NeMo does above (#18739).
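
As the warning text itself suggests, code that only wants to record the step without reducing it could call the logger directly instead of `self.log` (a sketch, inside a `LightningModule` hook):

    # bypasses metric reduction entirely, so no float cast happens
    self.logger.log_metrics({"step": self.trainer.global_step})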

I'm not exactly clear on why this is done in NeMo, but I have seen cases in the past where, in the fallback path, `self.trainer.fit_loop.epoch_loop._batches_that_stepped` does not always reflect `trainer.global_step`, which is why I'm assuming that logic is included:

        if step is None:
            # added metrics for convenience
            scalar_metrics.setdefault("epoch", self.trainer.current_epoch)
            step = self.trainer.fit_loop.epoch_loop._batches_that_stepped

What version are you seeing the problem on?

v2.5, v2.4

How to reproduce the bug

This does end up causing issues in downstream loggers such as MLflow:

    @staticmethod
    def _get_metric_from_line(metric_name, metric_line, exp_id):
        ...
        step = int(metric_parts[2]) if len(metric_parts) == 3 else 0
        return Metric(key=metric_name, value=val, timestamp=ts, step=step)
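
The failure is easy to reproduce in isolation: once a float step has been serialized, the `int(...)` parse on read-back raises (a minimal sketch, not MLflow's actual file handling):

    metric_line = "1743552000000 0.25 100.0"  # timestamp, value, step -- step serialized as a float
    metric_parts = metric_line.split(" ")
    step = int(metric_parts[2])  # ValueError: invalid literal for int() with base 10: '100.0'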


But it seems like this should be fixed in Lightning instead, since the [function signature](https://lightning.ai/docs/pytorch/stable/_modules/lightning/pytorch/loggers/mlflow.html#MLFlowLogger.log_metrics) of `log_metrics` does indeed declare `step` as an `int`.

Error messages and logs

No response

Environment

No response

More info

Seems like an easy fix to add a check or an `int` cast in `logger_connector` (guarded, since `scalar_metrics.pop("step", None)` can return `None`):

        if step is None:
            step = scalar_metrics.pop("step", None)
            if step is not None:
                step = int(step)
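
A quick sanity check of the guarded cast:

    scalar_metrics = {"loss": 0.25, "step": 100.0}
    step = scalar_metrics.pop("step", None)
    if step is not None:
        step = int(step)
    assert isinstance(step, int) and step == 100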