Lightning throws "bypassing sigterm" on DDP for unknown reason #10154

Closed

gitabtion opened this issue Oct 26, 2021 · 2 comments · Fixed by #10189
Assignees: awaelchli
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

gitabtion commented Oct 26, 2021

🐛 Bug

To train the model on a large dataset, I split the data into several parts and call the fit function on each part in turn. The "bypassing sigterm" message appears while training the second part.

To Reproduce

import pytorch_lightning as pl
import transformers as tfs
from torch.utils.data import DataLoader

from datasets import load_dataset
from src.data.collators import DataCollatorForMacBert
from src.data.tokenize_funcs import SearchDataTokenizeFunc
from src.modeling.modeling_berts import BertsForPretraining
from src.tools.bases import args_parse
from src.utils import get_abs_path


def get_loader_for_text(tokenize_function, data_collator, data_files):
    extension = 'text'
    raw_datasets = load_dataset(extension, data_files=data_files)
    column_names = raw_datasets["train"].column_names
    text_column_name = "text" if "text" in column_names else column_names[0]

    tokenized_datasets = raw_datasets.map(
        tokenize_function,
        batched=True,
        # num_proc=cfg.DATALOADER.NUM_WORKERS,
        num_proc=4,
        remove_columns=[text_column_name],
        # keep_in_memory=True,
        load_from_cache_file=True
    )
    train_dataset = tokenized_datasets["train"]

    # Log a few random samples from the training set:
    # for index in random.sample(range(len(train_dataset)), 3):
    #     logger.info(f"Sample {index} of the training set: {train_dataset[index]}.")

    # Data collator
    # This one will take care of randomly masking the tokens.
    # DataLoaders creation:
    train_dataloader = DataLoader(
        train_dataset,
        shuffle=True,
        collate_fn=data_collator,
        batch_size=4,
        num_workers=2,
        # persistent_workers=True,
    )

    return train_dataloader, None, None


def run():
    tokenizer = tfs.AutoTokenizer.from_pretrained('bert-base-chinese')
    collator = tfs.DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15,
                                                pad_to_multiple_of=8)
    tokenize_func = SearchDataTokenizeFunc(tokenizer)
    data_files = [get_abs_path('datasets/search/part--100')]
    loader, _, _ = get_loader_for_text(tokenize_func, collator, data_files)
    model = BertsForPretraining('bert-base-chinese')
    trainer = pl.Trainer(precision=16, max_epochs=1, gpus=4, strategy='ddp')
    trainer.fit(model, loader)

    data_files = [get_abs_path('datasets/search/part-00000')]
    loader, _, _ = get_loader_for_text(tokenize_func, collator, data_files)
    trainer.fit(model, loader)

if __name__ == '__main__':
    run()
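
For reference, here is a stripped-down sketch of the same call pattern without the private src.* modules (a hypothetical reduction; the model and data are placeholders, and I have not verified that it triggers the same message):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def make_loader(seed):
    # Stand-in for one "part" of the real dataset.
    torch.manual_seed(seed)
    dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    return DataLoader(dataset, batch_size=4, num_workers=2, shuffle=True)


if __name__ == '__main__':
    model = TinyModel()
    trainer = pl.Trainer(precision=16, max_epochs=1, gpus=4, strategy='ddp')
    trainer.fit(model, make_loader(0))  # first part trains fine
    trainer.fit(model, make_loader(1))  # "bypassing sigterm" appears during the second part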

Expected behavior

The second call to trainer.fit should train normally, without the "bypassing sigterm" message.

Environment

  • PyTorch Lightning Version (e.g., 1.3.0): 1.5.0rc1
  • PyTorch Version (e.g., 1.8): 1.10.0+cu113
  • Python version: 3.8.10
  • OS (e.g., Linux): Ubuntu 20.04
  • CUDA/cuDNN version: 11.4
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source): pip
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information: built from the NGC pytorch:21.09 image

Additional context

Here is the Dockerfile:

FROM nvcr.io/nvidia/pytorch:21.09-py3
RUN pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 \
        -f https://download.pytorch.org/whl/cu113/torch_stable.html && \
    pip install pytorch-lightning==1.5.0rc1 && \
    pip install -U transformers yacs pkuseg pypinyin deepspeed datasets tqdm wandb
RUN pip uninstall -y torchtext
gitabtion added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Oct 26, 2021
awaelchli self-assigned this on Oct 26, 2021
awaelchli (Contributor) commented Oct 26, 2021

@gitabtion thanks for checking out 1.5.0rc1!
The message "bypassing sigterm" should only appear if a SIGTERM is sent to the process. Do you know anything about that? Do you run your script just like a regular Python command, python train.py, or in some different way?
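
One quick way to confirm whether a SIGTERM is actually delivered would be to install a handler at the very top of the script, before the Trainer is created (a diagnostic sketch only, not part of Lightning; the Trainer may later install its own handlers on top of this one):

import os
import signal


def _log_sigterm(signum, frame):
    # Purely for debugging: record which process received the signal.
    print(f"[pid {os.getpid()}] received SIGTERM", flush=True)


signal.signal(signal.SIGTERM, _log_sigterm)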

gitabtion (Author) commented Oct 27, 2021

> @gitabtion thanks for checking out 1.5.0rc1! The message "bypassing sigterm" should only appear if a SIGTERM is sent to the process. Do you know anything about that? Do you run your script just like a regular Python command, python train.py, or in some different way?

I don't know anything about the SIGTERM signal, and I just run the Lightning script with python train.py.
