
Loading checkpoint from CLI using SLURM doesn't use GPU even though it says it does #20689

Open
nathanchenseanwalter opened this issue Apr 1, 2025 · 0 comments
Labels: bug, needs triage, ver: 2.5.x

@nathanchenseanwalter
Bug description

When I load my checkpoint, the log says `LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]`.

But when I check my SLURM jobstats, it reports:

```
GPU utilization per node
stellar-m01g3 (GPU 0): 0% <--- GPU was not used

GPU memory usage per node - maximum used/total
stellar-m01g3 (GPU 0): 12.2GB/40.0GB (30.5%)
```

I even made sure to set `accelerator: gpu` in the `trainer` section of the YAML config file.
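For what it's worth, one way to rule out a silent CPU fallback (a minimal sketch using plain PyTorch, not my actual training code or Lightning internals): the `CUDA_VISIBLE_DEVICES` line only shows which devices are exposed to the process, while the device of the model's parameters shows where compute actually runs.

```python
import torch
import torch.nn as nn

# Hypothetical check, not part of the training script: a freshly built
# module stays on CPU unless something explicitly moves it.
model = nn.Linear(4, 2)
print(next(model.parameters()).device)  # "cpu" until moved

if torch.cuda.is_available():
    model = model.to("cuda")
    print(next(model.parameters()).device)
```

If the parameters still report `cpu` during `fit`, the GPU is allocated but never used, which would match the jobstats above.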

What version are you seeing the problem on?

v2.5

How to reproduce the bug

```python
cli = ModelCLI(
    subclass_mode_model=True,
    subclass_mode_data=True,
    parser_kwargs={"parser_mode": "omegaconf"},
    save_config_callback=None,
)
```


```yaml
trainer:
  max_epochs: 10
  accelerator: gpu
  enable_progress_bar: False
```

```bash
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

module purge
source ...

export SLURM_JOB_ID=$SLURM_JOB_ID

srun python -m specseg.models.train \
    fit \
    --config config/model/config_label.yaml \
    --ckpt_path /path
```
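A quick way to confirm what the `srun` step actually sees (a hedged sketch; these are standard SLURM/CUDA environment variable names, but which ones are set depends on the cluster configuration) is to print the GPU-related environment from inside the job:

```python
import os

# Run inside the srun step: shows whether SLURM actually exposed a GPU
# to this process. With --gres=gpu:1, CUDA_VISIBLE_DEVICES is typically "0".
for var in ("CUDA_VISIBLE_DEVICES", "SLURM_JOB_GPUS", "SLURM_GPUS_ON_NODE"):
    print(var, "=", os.environ.get(var))
```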

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

@nathanchenseanwalter nathanchenseanwalter added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Apr 1, 2025