Lightning throws "bypassing sigterm" on DDP for unknown reason #10154

Closed

gitabtion opened this issue Oct 26, 2021 · 2 comments · Fixed by #10189
Assignees: awaelchli
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

gitabtion commented Oct 26, 2021

🐛 Bug

To train the model on a large dataset, I split the data into several parts and call the fit function on each part in turn. The "bypassing sigterm" message appears while training the second part.

To Reproduce

import pytorch_lightning as pl
import transformers as tfs
from torch.utils.data import DataLoader

from datasets import load_dataset
from src.data.collators import DataCollatorForMacBert
from src.data.tokenize_funcs import SearchDataTokenizeFunc
from src.modeling.modeling_berts import BertsForPretraining
from src.tools.bases import args_parse
from src.utils import get_abs_path


def get_loader_for_text(tokenize_function, data_collator, data_files):
    extension = 'text'
    raw_datasets = load_dataset(extension, data_files=data_files)
    column_names = raw_datasets["train"].column_names
    text_column_name = "text" if "text" in column_names else column_names[0]

    tokenized_datasets = raw_datasets.map(
        tokenize_function,
        batched=True,
        # num_proc=cfg.DATALOADER.NUM_WORKERS,
        num_proc=4,
        remove_columns=[text_column_name],
        # keep_in_memory=True,
        load_from_cache_file=True
    )
    train_dataset = tokenized_datasets["train"]

    # Log a few random samples from the training set:
    # for index in random.sample(range(len(train_dataset)), 3):
    #     logger.info(f"Sample {index} of the training set: {train_dataset[index]}.")

    # Data collator
    # This one will take care of randomly masking the tokens.
    # DataLoaders creation:
    train_dataloader = DataLoader(
        train_dataset,
        shuffle=True,
        collate_fn=data_collator,
        batch_size=4,
        num_workers=2,
        # persistent_workers=True,
    )

    return train_dataloader, None, None


def run():
    tokenizer = tfs.AutoTokenizer.from_pretrained('bert-base-chinese')
    collator = tfs.DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15,
                                                pad_to_multiple_of=8)
    tokenize_func = SearchDataTokenizeFunc(tokenizer)
    data_files = [get_abs_path('datasets/search/part--100')]
    loader, _, _ = get_loader_for_text(tokenize_func, collator, data_files)
    model = BertsForPretraining('bert-base-chinese')
    trainer = pl.Trainer(precision=16, max_epochs=1, gpus=4, strategy='ddp')
    trainer.fit(model, loader)

    data_files = [get_abs_path('datasets/search/part-00000')]
    loader, _, _ = get_loader_for_text(tokenize_func, collator, data_files)
    trainer.fit(model, loader)

if __name__ == '__main__':
    run()
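
For reference, here is a stripped-down sketch of the same call pattern without the private src.* modules (a hypothetical reduction; the model and data are placeholders, and I have not verified that it triggers the same message):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def make_loader(seed):
    # Stand-in for one "part" of the real dataset.
    torch.manual_seed(seed)
    dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    return DataLoader(dataset, batch_size=4, num_workers=2, shuffle=True)


if __name__ == '__main__':
    model = TinyModel()
    trainer = pl.Trainer(precision=16, max_epochs=1, gpus=4, strategy='ddp')
    trainer.fit(model, make_loader(0))  # first part trains fine
    trainer.fit(model, make_loader(1))  # "bypassing sigterm" appears during the second part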

Expected behavior

The second call to trainer.fit should train normally, without the "bypassing sigterm" message.

Environment

  • PyTorch Lightning Version (e.g., 1.3.0): 1.5.0rc1
  • PyTorch Version (e.g., 1.8): 1.10.0+cu113
  • Python version: 3.8.10
  • OS (e.g., Linux): Ubuntu 20.04
  • CUDA/cuDNN version: 11.4
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source): pip
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information: built from the NGC pytorch:21.09 image

Additional context

Here is the Dockerfile:

FROM nvcr.io/nvidia/pytorch:21.09-py3
RUN pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 \
        -f https://download.pytorch.org/whl/cu113/torch_stable.html && \
    pip install pytorch-lightning==1.5.0rc1 && \
    pip install -U transformers yacs pkuseg pypinyin deepspeed datasets tqdm wandb
RUN pip uninstall -y torchtext
gitabtion added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Oct 26, 2021
awaelchli self-assigned this on Oct 26, 2021
awaelchli (Contributor) commented Oct 26, 2021

@gitabtion thanks for checking out 1.5.0rc1!
The message "bypassing sigterm" should only appear if a SIGTERM is sent to the process. Do you know anything about that? Do you run your script just like a regular Python command, python train.py, or in some different way?
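
One quick way to confirm whether a SIGTERM is actually delivered would be to install a handler at the very top of the script, before the Trainer is created (a diagnostic sketch only, not part of Lightning; the Trainer may later install its own handlers on top of this one):

import os
import signal


def _log_sigterm(signum, frame):
    # Purely for debugging: record which process received the signal.
    print(f"[pid {os.getpid()}] received SIGTERM", flush=True)


signal.signal(signal.SIGTERM, _log_sigterm)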

gitabtion (Author) commented Oct 27, 2021

> @gitabtion thanks for checking out 1.5.0rc1! The message "bypassing sigterm" should only appear if a SIGTERM is sent to the process. Do you know anything about that? Do you run your script just like a regular Python command, python train.py, or in some different way?

I don't know anything about the SIGTERM signal, and I just run the Lightning script with python train.py.
