Tensorboard server freezing #1915

Closed

evictor opened this issue Feb 11, 2019 · 10 comments

Comments

@evictor

evictor commented Feb 11, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.4 LTS (Xenial Xerus) Linux a1006d7babbc 4.15.0-1037-azure #39-Ubuntu SMP Tue Jan 15 10:24:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • TensorFlow installed from (source or binary): Docker base image tensorflow/tensorflow:1.9.0-py3
  • TensorFlow version (use command below): 1.9.0
  • Python version: 3.5.2
  • CUDA/cuDNN version: N/A
  • GPU model and memory: none

Describe the current behavior

After several hours of running, the TensorBoard server stops responding to requests. It is running in a Docker container and has been put on its own VM after the problem was narrowed down to TensorBoard. It eventually causes the VM itself to seize up, and the VM then has to be restarted.

WORTH NOTING: The TensorBoard logs are stored in Azure File Storage and mounted via CIFS, and that mounted directory is in turn mounted into the TensorBoard Docker container via a Docker volume mount. The TensorBoard log dir has ~460 subdirs and a total of ~45,000 event files amounting to about 100 MB. There is also a subdir of archived logs (several dozen *.tgz files) amounting to 2 GB, but TensorBoard should in theory not be reading or messing with those huge files (right?).
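For reference, those counts can be audited with a minimal sketch like the one below. The path is hypothetical (not the actual mount point), and it counts only files whose names contain "tfevents", which is the naming convention TensorBoard scans for, so the archived *.tgz files are excluded.

```python
# Minimal sketch: audit a TensorBoard log directory. Counts run subdirs,
# event files, and total size; the LOGDIR path is hypothetical.
import os

LOGDIR = "/mnt/azurefiles/tensorboard-logs"  # hypothetical CIFS mount point

run_dirs = set()
event_files = 0
total_bytes = 0
for root, _dirs, files in os.walk(LOGDIR):
    for name in files:
        # Event files conventionally contain "tfevents" in their names;
        # archived *.tgz files are therefore excluded from this count.
        if "tfevents" in name:
            run_dirs.add(root)
            event_files += 1
            total_bytes += os.path.getsize(os.path.join(root, name))

print("runs: {}, event files: {}, total: {:.1f} MB".format(
    len(run_dirs), event_files, total_bytes / 1e6))
```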

Describe the expected behavior

Tensorboard should be able to run ad infinitum without seizing up its host.

Other info / logs

I know this is a nebulous ticket with many variables. Please let me know what information I can provide, especially if you can direct me to install special debug versions, etc., so I can collect logs from TensorBoard itself.

@jvishnuvardhan jvishnuvardhan self-assigned this Feb 12, 2019
@jvishnuvardhan
Contributor

jvishnuvardhan commented Feb 12, 2019

@evictor Correct. TensorBoard doesn't mess with those huge files; it only collects data from your TF model runs. There could be several reasons why TensorBoard is freezing.

  1. Did you try the same model with a smaller dataset? Does it work well without any freezing?
  2. Is it possible for you to use the current TensorBoard and TensorFlow versions instead of the old ones? Newer versions handle many of the issues reported against old versions much better.
  3. Is it possible for you to create code that reproduces the bug, just to make sure there are no errors in the parameters chosen for the TensorBoard ops? (A minimal sketch of such a repro follows after this list.)

Please let us know what you think. Thanks!
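As a starting point for point 3, a repro can be as small as the sketch below. It assumes only the TF 1.x summary API matching the TensorFlow 1.9.0 reported above; the logdir path and tag name are illustrative.

```python
# Minimal sketch of a repro script: write scalar summaries into a logdir,
# then point TensorBoard at it. Uses the TF 1.x API (tf.summary.FileWriter)
# matching TensorFlow 1.9.0; the path and tag are illustrative.
import tensorflow as tf

LOGDIR = "/tmp/tb_repro/run_0"  # hypothetical; point at the CIFS mount to reproduce

writer = tf.summary.FileWriter(LOGDIR)
for step in range(1000):
    summary = tf.Summary(value=[
        tf.Summary.Value(tag="loss", simple_value=1.0 / (step + 1)),
    ])
    writer.add_summary(summary, global_step=step)
writer.close()

# Then launch: tensorboard --logdir /tmp/tb_repro
```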

@jvishnuvardhan
Contributor

Closing this out since I understand it to be resolved, but please let me know if I'm mistaken. Thanks!

@evictor
Author

evictor commented Feb 25, 2019

I have performed the upgrades to the latest versions re: your point #2. I will let you know if anything new comes up. Thanks!

@jvishnuvardhan
Contributor

Thanks @evictor. If the issue persists with the newer versions, we will reopen the issue and solve it. If you do see an issue, please provide code to reproduce the bug so that we can solve it faster. Thanks again!

@evictor
Author

evictor commented Feb 27, 2019

I have found evidence of OOM, including with the updated versions... Memory usage eventually reaches ~8 GB.
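One way to correlate that growth with TensorBoard itself is to log its resident memory over time, e.g. with the sketch below. It assumes the third-party psutil package; the process lookup by command line is only an illustration.

```python
# Minimal sketch: log TensorBoard's resident memory once a minute so the
# growth toward ~8 GB can be correlated with the OOM-killer messages.
# Assumes the third-party psutil package is installed.
import time
import psutil

def find_tensorboard():
    for proc in psutil.process_iter(attrs=["cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "tensorboard" in cmdline:
            return proc
    return None

proc = find_tensorboard()
while proc is not None and proc.is_running():
    rss_mb = proc.memory_info().rss / (1024 * 1024)
    print("{} rss={:.0f} MB".format(time.strftime("%H:%M:%S"), rss_mb))
    time.sleep(60)
```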

@nfelt
Contributor

nfelt commented Feb 28, 2019

@jvishnuvardhan this sounds like it belongs in the TensorBoard issue tracker. Can we get a repo admin to transfer this to https://github.com/tensorflow/tensorboard?

@evictor is there any particular evidence you've found of the OOM, or anything you've been able to do to narrow down the problem? We get reports of this sometimes but they've been hard to reproduce. For recent versions of TensorBoard you should be able to pass --verbosity 1 to get log output.

@evictor
Author

evictor commented Feb 28, 2019

Thanks, I am now running with --verbosity 1. I found several of these in the system serial log:

[246558.494090] Out of memory: Kill process 5165 (tensorboard) score 918 or sacrifice child
[246558.538030] Killed process 5165 (tensorboard) total-vm:9122176kB, anon-rss:7482400kB, file-rss:0kB, shmem-rss:0kB

I would like to reiterate that I have the TB log dir set to an SMB (CIFS) mounted disk. I wonder if that has anything to do with a potential leak? Pure conjecture, but maybe TB is attempting to read, acquire a file lock, etc., and hitting exceptions when the CIFS mount is sporadically unreadable or lagging?

I also noticed that it sometimes holds persistent locks on files/folders, as I am unable to "archive" them (I zip them and store them elsewhere, then try to remove the folder of TB logs from the mounted disk, but it turns out to be locked).
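For context, the archive-then-remove step described above amounts to roughly the sketch below (paths are hypothetical); the except branch is where the lingering lock surfaces on the CIFS mount.

```python
# Minimal sketch of the archive-then-remove workflow described above.
# Paths are hypothetical; on a CIFS mount, a lock still held by TensorBoard
# typically surfaces as an OSError/PermissionError from rmtree.
import shutil

RUN_DIR = "/mnt/azurefiles/tensorboard-logs/old_run"          # hypothetical
ARCHIVE = "/mnt/azurefiles/tensorboard-logs/archive/old_run"  # hypothetical

shutil.make_archive(ARCHIVE, "gztar", RUN_DIR)  # writes old_run.tar.gz
try:
    shutil.rmtree(RUN_DIR)
except OSError as err:
    # This is where the lingering lock shows up.
    print("could not remove {}: {}".format(RUN_DIR, err))
```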

@martinwicke martinwicke transferred this issue from tensorflow/tensorflow Feb 28, 2019
@nfelt nfelt added the type:bug label Feb 28, 2019
@nfelt
Contributor

nfelt commented Feb 28, 2019

Thanks for the update, we'll have to investigate. May be related to #766.

@jhagege

jhagege commented Aug 22, 2019

> Thanks, I am now running with --verbosity 1. I found several of these in the system serial log:
>
> [246558.494090] Out of memory: Kill process 5165 (tensorboard) score 918 or sacrifice child
> [246558.538030] Killed process 5165 (tensorboard) total-vm:9122176kB, anon-rss:7482400kB, file-rss:0kB, shmem-rss:0kB
>
> I would like to reiterate that I have the TB log dir set to an SMB (CIFS) mounted disk. I wonder if that has anything to do with a potential leak? Pure conjecture, but maybe TB is attempting to read, acquire a file lock, etc., and hitting exceptions when the CIFS mount is sporadically unreadable or lagging?
>
> I also noticed that it sometimes holds persistent locks on files/folders, as I am unable to "archive" them (I zip them and store them elsewhere, then try to remove the folder of TB logs from the mounted disk, but it turns out to be locked).

@evictor did you find a solution to this issue?
We encountered a "Bad file descriptor" error when SummaryWriter writes directly to a NAS (NFS) mounted folder, which seems to happen only when we are working on a 4xGPU or 8xGPU machine.
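For context, the failing setup is roughly the sketch below (this assumes the PyTorch SummaryWriter, since that class name was mentioned; the NFS path and the RANK variable are hypothetical). Restricting the writer to a single worker is only a commonly tried precaution on multi-GPU machines, not a confirmed fix for the "Bad file descriptor" error.

```python
# Minimal sketch of the setup described above: a SummaryWriter pointed
# directly at an NFS-mounted directory. Assumes the PyTorch SummaryWriter;
# the path and the RANK environment variable are hypothetical.
import os
from torch.utils.tensorboard import SummaryWriter

NFS_LOGDIR = "/mnt/nas/tensorboard-logs/run_0"  # hypothetical NFS mount

# Creating the writer in only one worker per job is a common precaution on
# multi-GPU machines (an assumption here, not a confirmed fix for the error).
if int(os.environ.get("RANK", "0")) == 0:
    writer = SummaryWriter(log_dir=NFS_LOGDIR)
    for step in range(100):
        writer.add_scalar("loss", 1.0 / (step + 1), step)
    writer.close()
```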

@evictor
Author

evictor commented Sep 9, 2019

Running TensorBoard from the latest release of TensorFlow seemed to help! In fact, I will close this now, as I am no longer having that issue...

@evictor evictor closed this as completed Sep 9, 2019