
Loading checkpoint from CLI using SLURM doesn't use GPU even though it says it does #20689

Open
nathanchenseanwalter opened this issue Apr 1, 2025 · 0 comments
Labels: bug, needs triage, ver: 2.5.x

@nathanchenseanwalter
Bug description

When I load my checkpoint, the log says `LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]`.

But when I check my SLURM jobstats, it reports:

```
GPU utilization per node
stellar-m01g3 (GPU 0): 0% <--- GPU was not used

GPU memory usage per node - maximum used/total
stellar-m01g3 (GPU 0): 12.2GB/40.0GB (30.5%)
```

I even made sure to set `accelerator: gpu` in the `trainer` section of the YAML config file.
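For what it's worth, one way to rule out a silent CPU fallback (a minimal sketch using plain PyTorch, not my actual training code or Lightning internals): the `CUDA_VISIBLE_DEVICES` line only shows which devices are exposed to the process, while the device of the model's parameters shows where compute actually runs.

```python
import torch
import torch.nn as nn

# Hypothetical check, not part of the training script: a freshly built
# module stays on CPU unless something explicitly moves it.
model = nn.Linear(4, 2)
print(next(model.parameters()).device)  # "cpu" until moved

if torch.cuda.is_available():
    model = model.to("cuda")
    print(next(model.parameters()).device)
```

If the parameters still report `cpu` during `fit`, the GPU is allocated but never used, which would match the jobstats above.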

What version are you seeing the problem on?

v2.5

How to reproduce the bug

```python
cli = ModelCLI(
    subclass_mode_model=True,
    subclass_mode_data=True,
    parser_kwargs={"parser_mode": "omegaconf"},
    save_config_callback=None,
)
```


```yaml
trainer:
  max_epochs: 10
  accelerator: gpu
  enable_progress_bar: False
```

```bash
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

module purge
source ...

export SLURM_JOB_ID=$SLURM_JOB_ID

srun python -m specseg.models.train \
    fit \
    --config config/model/config_label.yaml \
    --ckpt_path /path
```
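A quick way to confirm what the `srun` step actually sees (a hedged sketch; these are standard SLURM/CUDA environment variable names, but which ones are set depends on the cluster configuration) is to print the GPU-related environment from inside the job:

```python
import os

# Run inside the srun step: shows whether SLURM actually exposed a GPU
# to this process. With --gres=gpu:1, CUDA_VISIBLE_DEVICES is typically "0".
for var in ("CUDA_VISIBLE_DEVICES", "SLURM_JOB_GPUS", "SLURM_GPUS_ON_NODE"):
    print(var, "=", os.environ.get(var))
```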

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response

@nathanchenseanwalter nathanchenseanwalter added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Apr 1, 2025