Callback for logging forward, backward and update time #19928
Unanswered
MattMcPartlon asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
-
Thanks for the implementation. I changed it a bit and here is my implementation:

```python
import time
from typing import Any

from lightning.pytorch import Callback, LightningModule, Trainer
from lightning.pytorch.utilities.types import STEP_OUTPUT

# Label value that does not contribute to the loss (assumption: -100, the usual ignore_index).
NO_LOSS_INDEX = -100


class LogPerformanceCallback(Callback):
    def __init__(self):
        super().__init__()

    def on_train_batch_start(self, trainer: Trainer, pl_module: LightningModule, batch, batch_idx):
        self.batch_start = time.perf_counter()

    def on_train_batch_end(
        self,
        trainer: Trainer,
        pl_module: LightningModule,
        outputs: STEP_OUTPUT,
        batch: Any,
        batch_idx: int,
    ):
        # Note: this spans the whole training step for the batch (forward + backward,
        # plus the optimizer step when not accumulating), not just the forward pass.
        batch_time = time.perf_counter() - self.batch_start
        pl_module.log(
            "train/forward_time_seconds",
            batch_time,
            on_step=True,
            on_epoch=False,
            rank_zero_only=True,  # no need to log for all devices
        )
        # Also log the number of tokens processed per second which contribute to the loss
        num_loss_tokens = (batch["labels"] != NO_LOSS_INDEX).sum().item()
        pl_module.log(
            "train/tps_per_device",
            num_loss_tokens / batch_time,
            prog_bar=True,  # show in progress bar
            on_step=True,
            on_epoch=False,
            rank_zero_only=True,  # no need to log for all devices
        )

    def on_before_backward(self, trainer, pl_module, loss):
        self.backward_start = time.perf_counter()

    def on_after_backward(self, trainer, pl_module):
        backward_time = time.perf_counter() - self.backward_start
        pl_module.log(
            "train/backward_time_seconds",
            backward_time,
            on_step=True,
            on_epoch=False,
            rank_zero_only=True,  # no need to log for all devices
        )

    def on_validation_batch_start(self, trainer: Trainer, pl_module: LightningModule, batch, batch_idx, dataloader_idx=0):
        self.val_batch_start = time.perf_counter()

    def on_validation_batch_end(
        self,
        trainer: Trainer,
        pl_module: LightningModule,
        outputs: STEP_OUTPUT,
        batch: Any,
        batch_idx: int,
        dataloader_idx=0,
    ):
        batch_time = time.perf_counter() - self.val_batch_start
        pl_module.log(
            "validation/forward_time_seconds",
            batch_time,
            on_step=True,
            on_epoch=False,
            rank_zero_only=True,  # no need to log for all devices
        )
        # Also log the number of tokens processed per second which contribute to the loss
        num_loss_tokens = (batch["labels"] != NO_LOSS_INDEX).sum().item()
        pl_module.log(
            "validation/tps_per_device",
            num_loss_tokens / batch_time,
            on_step=True,
            on_epoch=False,
            rank_zero_only=True,  # no need to log for all devices
        )

    def on_before_optimizer_step(self, trainer: Trainer, pl_module: LightningModule, optimizer: Any) -> None:
        self.step_start = time.perf_counter()

    def on_before_zero_grad(self, trainer: Trainer, pl_module: LightningModule, optimizer: Any) -> None:
        # This gets called at the beginning of training to clear any gradients from tuning etc.
        # In those cases step_start is not set, so we do nothing.
        if not hasattr(self, "step_start"):
            return
        step_time = time.perf_counter() - self.step_start
        pl_module.log(
            "train/step_time_seconds",
            step_time,
            on_step=True,
            on_epoch=False,
            rank_zero_only=True,  # no need to log for all devices
        )
```

I have not tested it a lot, but it seems to work for me. There might be a bug w.r.t. …
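For completeness, a minimal usage sketch (not from the thread) of attaching the callback to a Trainer; `model` and `dm` are placeholders for your own LightningModule and DataModule.

```python
import lightning.pytorch as pl

# Attach the timing callback; the logged metrics go to whatever logger the Trainer uses.
trainer = pl.Trainer(
    max_epochs=1,
    callbacks=[LogPerformanceCallback()],
)
# trainer.fit(model, datamodule=dm)  # placeholders: substitute your LightningModule / DataModule
```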
-
I'm trying to track forward/backward/update time with a Callback. My current implementation is showing strange behavior.
It seems that the callback order is (at least functionally) different when using gradient accumulation != 1. This is expected, but it's unclear how to handle both cases with a single callback.
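For reference, the setting in question is just the Trainer flag below (value of 4 assumed for illustration); with accumulation, the backward hooks still fire on every batch, while the optimizer-step hooks fire only on accumulation boundaries.

```python
from lightning.pytorch import Trainer

# With accumulate_grad_batches=4, on_before_backward/on_after_backward run on every batch,
# but on_before_optimizer_step and on_before_zero_grad run only on every 4th batch.
trainer = Trainer(accumulate_grad_batches=4)
```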
My ask
I'd really appreciate help coming up with an almost-correct implementation for tracking:
1. forward pass time,
2. backward pass time,
3. total time for an update (forward + backward + optimizer step), which may depend on gradient accumulation, and
4. the amount of time spent waiting on the dataloader to generate the next batch.

Alternatively, for (3) I'm happy to track only optimizer.step time, since this should tell me how long it takes for devices to sync and gradients to update. I'm open to tracking related metrics or other metrics entirely, as long as they're correlated with model throughput/performance.
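For (4), a rough sketch of one way to approximate dataloader wait time; the callback name and metric key are made up for illustration, and the measured gap also includes some training-loop overhead.

```python
import time

from lightning.pytorch import Callback, LightningModule, Trainer


class DataWaitCallback(Callback):
    """Hypothetical helper: logs the gap between finishing one batch and receiving the next."""

    def on_train_batch_end(self, trainer: Trainer, pl_module: LightningModule, outputs, batch, batch_idx):
        # Mark the moment work on this batch finished.
        self._last_batch_end = time.perf_counter()

    def on_train_batch_start(self, trainer: Trainer, pl_module: LightningModule, batch, batch_idx):
        # The gap since the previous batch ended ~= dataloader wait + loop overhead.
        if hasattr(self, "_last_batch_end"):
            wait = time.perf_counter() - self._last_batch_end
            pl_module.log("train/data_wait_seconds", wait, on_step=True, on_epoch=False, rank_zero_only=True)
```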
In addition, I'm wondering how these metrics should be logged. For example, should I set sync_dist=False since I only care about logging these metrics for training? Should I remove the rank_zero_only decorators? Any input is greatly appreciated.
Thank you!
NOTE: I already know that my implementation is not correct :).
How I'm currently implementing this
- last updates per second: measured as one divided by the time between consecutive calls to on_train_batch_end.
- average updates per second: measured as the number of calls to on_train_batch_end in the current epoch divided by the elapsed time since on_train_epoch_start.
- forward time: the difference in time between on_train_batch_start and on_before_backward.
- backward time: the difference in time between on_before_backward and on_after_backward.
- between-step time: the difference in time between on_train_batch_end and the next on_train_batch_start (meant to capture time spent waiting on the dataloader to generate the next example). I realize there is other overhead getting tracked here, but I couldn't figure out a better way.

This is what the metrics look like in WandB:
Note: both runs use gradient accumulation with a value of 4.
Here is the implementation