batch_size selected by auto_scale_batch_size triggers out of memory error #9625
Comments
I read that this functionality is based on https://github.com/BlackHC/toma and noticed that they invoke https://github.com/BlackHC/toma/blob/master/toma/torch_cuda_memory.py#L10 after an out-of-memory error occurs. I don't know if you do the same, but I tried adding the same cleanup immediately after the OOM as well. |
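For reference, the toma helper linked above essentially runs a garbage collection and then empties the CUDA cache once an OOM is caught. A minimal sketch of that cleanup, using standard Python/PyTorch calls and a hypothetical helper name, looks like this:

```python
import gc

import torch


def free_cuda_memory() -> None:
    """Collect dangling Python references, then release cached CUDA blocks.

    Roughly what toma's gc_cuda() does after catching an OOM.
    """
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```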
@SkafteNicki Mind having a look into this? |
I believe I am experiencing the same issue. I managed to evade it by manually calling the tuner on a throwaway trainer and deleting it before training:

```python
trainer = pl.Trainer(...)
trainer.tuner.scale_batch_size(model, ...)
del trainer
# then train the model
```
|
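A slightly fuller sketch of this workaround, assuming the PL 1.x Tuner API and a hypothetical MyLightningModule that exposes a batch_size attribute:

```python
import pytorch_lightning as pl

model = MyLightningModule()  # hypothetical module whose batch_size the tuner can adjust

# Run the tuner on a throwaway Trainer so whatever memory it holds on to
# goes away together with that Trainer object.
tuner_trainer = pl.Trainer(gpus=1)
new_batch_size = tuner_trainer.tuner.scale_batch_size(model, mode="binsearch")
del tuner_trainer

# Train with a fresh Trainer at the batch size the tuner reported.
model.batch_size = new_batch_size
trainer = pl.Trainer(gpus=1)
trainer.fit(model)
```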
I experienced similar behaviour to @cowwoc's, especially in combination with |
I am having the same issue. Also, the batch size found for me with "binsearch" is smaller than what I can use when selecting it manually: |
I have a similar issue when using binsearch batch scaling. I found that in the batch size scaling code, `_run_binsearch_scaling` does not call `reset_train_dataloader` when you run into an OOM error, so the dataloader keeps the `batch_size` that just failed. |
I tried "power" scaling too and the issue remains, so it appears it's not just that, but this is also a bug to note. (Binsearch can return a different value from Power mode, as it searches further after dropping the size down, so it may be that your final Power-mode size was lower than what Binsearch found and happens to work?) When inspecting the GPU memory after the tuner returns, 10GB was still reserved somehow, convincing me this is in fact a leak of some sort. (GC and cache-clearing do nothing for this) |
Any news on this? I tried to debug this a bit, but I haven't found any additional hints yet. #10243 makes sense, but doesn't seem to solve the problem. |
When encountering a CUDA OOM error, the train loaders are not reloaded, so the tuner basically keeps using the highest batch size, regardless of whether it triggered an OOM error or not. One workaround is to force the loader to reload at the beginning of each loop, i.e. in

```python
def _run_power_scaling(trainer: "pl.Trainer", model: "pl.LightningModule",
                       new_size: int, batch_arg_name: str,
                       max_trials: int) -> int:
    """Batch scaling mode where the size is doubled at each iteration until an OOM error is encountered."""
    for _ in range(max_trials):
        trainer.reset_train_dataloader(model)  # FORCE LOADER RESET
        ...
```

and

```python
def _run_binsearch_scaling(trainer: "pl.Trainer", model: "pl.LightningModule",
                           new_size: int, batch_arg_name: str,
                           max_trials: int) -> int:
    """Batch scaling mode where the size is initially doubled at each iteration
    until an OOM error is encountered. Hereafter, the batch size is further
    refined using a binary search."""
    low = 1
    high = None
    count = 0
    while True:
        trainer.reset_train_dataloader(model)  # FORCE LOADER RESET
        ...
```
|
I tested simply calling the tuner.lr_find function without calling auto_scale_batch_size first... and it reports an OOM error as well. |
I have a similar issue (using a data module). As far as I can see, the tuner only sends the data to the GPU in the first iteration. Then the batch size is increased, and during the next call of self.fit_loop.run() the skip property of the loop is True, which skips processing the model entirely (including sending data to the GPU), so the higher batch size is considered OK and the iteration continues. |
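If that is the mechanism, it should be visible from the loop state. A rough, untested check against the PL 1.x loop API could look like:

```python
# Assumes an existing `trainer` from the tuning run. A fit loop that
# reports skip=True never actually ran a forward/backward pass at the
# new batch size, so the trial "succeeds" without touching GPU memory.
if trainer.fit_loop.skip:
    print("fit loop skipped -- this batch size was never really exercised")
```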
Do we have any updates on this? I'm still running into the same error, but only when both lr_finder and auto_scale_batch_size are set to true! |
Still running into this issue.
pytorch-lightning version 1.8.6. CUDA out of memory error. |
@noamsgl There won't be a solution in PL soon, as it is caused by the way PyTorch reserves and allocates memory; see pytorch/pytorch#72117. |
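As a hedged aside, PyTorch's caching allocator can be tuned through the PYTORCH_CUDA_ALLOC_CONF environment variable (PyTorch >= 1.10); whether that mitigates this particular leak is untested here:

```python
import os

# Must be set before CUDA is first initialised; max_split_size_mb reduces
# fragmentation but does not fix the dataloader-reset bug discussed above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # noqa: E402  (imported after setting the env var on purpose)
```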
Note: still happening as of this date. |
I encounter the same issue, why is this closed? |
I am seeing this still, too. |
🐛 Bug
To Reproduce
When I run:
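The original snippet is not preserved in this excerpt; a minimal sketch of the kind of invocation described, assuming the PL 1.x API and a hypothetical MyLightningModule, would be:

```python
import pytorch_lightning as pl

model = MyLightningModule()  # hypothetical LightningModule with a batch_size hyperparameter
trainer = pl.Trainer(gpus=1, auto_scale_batch_size="binsearch")
trainer.tune(model)  # selects a batch_size ...
trainer.fit(model)   # ... which then runs out of GPU memory
```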
I get the following output:
Expected behavior
I am expecting the `batch_size` selected by the tuner to fit in the GPU memory, but it does not. This is 100% reproducible on my machine.
Environment