Tensorboard server freezing #1915

Closed

evictor opened this issue Feb 11, 2019 · 10 comments

Comments

@evictor

evictor commented Feb 11, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.4 LTS (Xenial Xerus) Linux a1006d7babbc 4.15.0-1037-azure #39-Ubuntu SMP Tue Jan 15 10:24:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • TensorFlow installed from (source or binary): Docker base image tensorflow/tensorflow:1.9.0-py3
  • TensorFlow version (use command below): 1.9.0
  • Python version: 3.5.2
  • CUDA/cuDNN version: N/A
  • GPU model and memory: none

Describe the current behavior

After several hours of running, the TensorBoard server stops responding to requests. It is running in a Docker container and has been put on its own VM after the problem was narrowed down to TensorBoard. It eventually causes the VM itself to seize up, and the VM then has to be restarted.

WORTH NOTING: The TensorBoard logs are stored in Azure File Storage and mounted via CIFS, and that mounted directory is in turn mounted into the TensorBoard Docker container via a Docker volume mount. The TensorBoard log dir has ~460 subdirs and a total of ~45,000 event files amounting to about 100 MB. There is also a subdir of archived logs (several dozen *.tgz files) amounting to 2 GB, but TensorBoard should in theory not be reading or messing with those huge files (right?).
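For reference, those counts can be audited with a minimal sketch like the one below. The path is hypothetical (not the actual mount point), and it counts only files whose names contain "tfevents", which is the naming convention TensorBoard scans for, so the archived *.tgz files are excluded.

```python
# Minimal sketch: audit a TensorBoard log directory. Counts run subdirs,
# event files, and total size; the LOGDIR path is hypothetical.
import os

LOGDIR = "/mnt/azurefiles/tensorboard-logs"  # hypothetical CIFS mount point

run_dirs = set()
event_files = 0
total_bytes = 0
for root, _dirs, files in os.walk(LOGDIR):
    for name in files:
        # Event files conventionally contain "tfevents" in their names;
        # archived *.tgz files are therefore excluded from this count.
        if "tfevents" in name:
            run_dirs.add(root)
            event_files += 1
            total_bytes += os.path.getsize(os.path.join(root, name))

print("runs: {}, event files: {}, total: {:.1f} MB".format(
    len(run_dirs), event_files, total_bytes / 1e6))
```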

Describe the expected behavior

Tensorboard should be able to run ad infinitum without seizing up its host.

Other info / logs

I know this is a nebulous ticket with many variables. Please let me know what information I can provide, especially if you can direct me to install special debug versions, etc., so I can collect logs from TensorBoard itself.

@jvishnuvardhan jvishnuvardhan self-assigned this Feb 12, 2019
@jvishnuvardhan
Contributor

jvishnuvardhan commented Feb 12, 2019

@evictor Correct. TensorBoard doesn't mess with those huge files; it only collects data from your TF model runs. There could be several reasons why TensorBoard is freezing.

  1. Did you try the same model with a smaller dataset? Does it work well without any freezing?
  2. Is it possible for you to use the current TensorBoard and TensorFlow versions instead of the old ones? Newer versions handle many of the issues reported against old versions much better.
  3. Is it possible for you to create code that reproduces the bug, just to make sure there are no errors in the parameters chosen for the TensorBoard ops? (A minimal sketch of such a repro follows after this list.)

Please let us know what you think. Thanks!
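As a starting point for point 3, a repro can be as small as the sketch below. It assumes only the TF 1.x summary API matching the TensorFlow 1.9.0 reported above; the logdir path and tag name are illustrative.

```python
# Minimal sketch of a repro script: write scalar summaries into a logdir,
# then point TensorBoard at it. Uses the TF 1.x API (tf.summary.FileWriter)
# matching TensorFlow 1.9.0; the path and tag are illustrative.
import tensorflow as tf

LOGDIR = "/tmp/tb_repro/run_0"  # hypothetical; point at the CIFS mount to reproduce

writer = tf.summary.FileWriter(LOGDIR)
for step in range(1000):
    summary = tf.Summary(value=[
        tf.Summary.Value(tag="loss", simple_value=1.0 / (step + 1)),
    ])
    writer.add_summary(summary, global_step=step)
writer.close()

# Then launch: tensorboard --logdir /tmp/tb_repro
```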

@jvishnuvardhan
Contributor

Closing this out since I understand it to be resolved, but please let me know if I'm mistaken. Thanks!

@evictor
Author

evictor commented Feb 25, 2019

I have performed the upgrades to the latest versions re: your point #2. I will let you know if anything new comes up. Thanks!

@jvishnuvardhan
Contributor

Thanks @evictor. If the issue persists with the newer versions, we will reopen the issue and solve it. If you do see an issue, please provide code to reproduce the bug so that we can solve it faster. Thanks again!

@evictor
Author

evictor commented Feb 27, 2019

I have found evidence of OOM, including with the updated versions... Memory usage eventually reaches ~8 GB.
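One way to correlate that growth with TensorBoard itself is to log its resident memory over time, e.g. with the sketch below. It assumes the third-party psutil package; the process lookup by command line is only an illustration.

```python
# Minimal sketch: log TensorBoard's resident memory once a minute so the
# growth toward ~8 GB can be correlated with the OOM-killer messages.
# Assumes the third-party psutil package is installed.
import time
import psutil

def find_tensorboard():
    for proc in psutil.process_iter(attrs=["cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "tensorboard" in cmdline:
            return proc
    return None

proc = find_tensorboard()
while proc is not None and proc.is_running():
    rss_mb = proc.memory_info().rss / (1024 * 1024)
    print("{} rss={:.0f} MB".format(time.strftime("%H:%M:%S"), rss_mb))
    time.sleep(60)
```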

@nfelt
Contributor

nfelt commented Feb 28, 2019

@jvishnuvardhan this sounds like it belongs in the TensorBoard issue tracker. Can we get a repo admin to transfer this to https://github.com/tensorflow/tensorboard?

@evictor is there any particular evidence you've found of the OOM, or anything you've been able to do to narrow down the problem? We get reports of this sometimes but they've been hard to reproduce. For recent versions of TensorBoard you should be able to pass --verbosity 1 to get log output.

@evictor
Author

evictor commented Feb 28, 2019

Thanks, I am now running with --verbosity 1. I found several of these in the system serial log:

[246558.494090] Out of memory: Kill process 5165 (tensorboard) score 918 or sacrifice child
[246558.538030] Killed process 5165 (tensorboard) total-vm:9122176kB, anon-rss:7482400kB, file-rss:0kB, shmem-rss:0kB

I would like to reiterate that I have the TB log dir set to an SMB (CIFS) mounted disk. I wonder if that has anything to do with a potential leak? Pure conjecture, but maybe TB is attempting to read, acquire a file lock, etc., and hitting exceptions when the CIFS mount is sporadically unreadable or lagging?

I also noticed that it sometimes holds persistent locks on files/folders, as I am unable to "archive" them (I zip them and store them elsewhere, then try to remove the folder of TB logs from the mounted disk, but it turns out to be locked).
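For context, the archive-then-remove step described above amounts to roughly the sketch below (paths are hypothetical); the except branch is where the lingering lock surfaces on the CIFS mount.

```python
# Minimal sketch of the archive-then-remove workflow described above.
# Paths are hypothetical; on a CIFS mount, a lock still held by TensorBoard
# typically surfaces as an OSError/PermissionError from rmtree.
import shutil

RUN_DIR = "/mnt/azurefiles/tensorboard-logs/old_run"          # hypothetical
ARCHIVE = "/mnt/azurefiles/tensorboard-logs/archive/old_run"  # hypothetical

shutil.make_archive(ARCHIVE, "gztar", RUN_DIR)  # writes old_run.tar.gz
try:
    shutil.rmtree(RUN_DIR)
except OSError as err:
    # This is where the lingering lock shows up.
    print("could not remove {}: {}".format(RUN_DIR, err))
```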

@martinwicke martinwicke transferred this issue from tensorflow/tensorflow Feb 28, 2019
@nfelt nfelt added the type:bug label Feb 28, 2019
@nfelt
Contributor

nfelt commented Feb 28, 2019

Thanks for the update, we'll have to investigate. May be related to #766.

@jhagege

jhagege commented Aug 22, 2019

> Thanks, I am now running with --verbosity 1. I found several of these in the system serial log:
>
> [246558.494090] Out of memory: Kill process 5165 (tensorboard) score 918 or sacrifice child
> [246558.538030] Killed process 5165 (tensorboard) total-vm:9122176kB, anon-rss:7482400kB, file-rss:0kB, shmem-rss:0kB
>
> I would like to reiterate that I have the TB log dir set to an SMB (CIFS) mounted disk. I wonder if that has anything to do with a potential leak? Pure conjecture, but maybe TB is attempting to read, acquire a file lock, etc., and hitting exceptions when the CIFS mount is sporadically unreadable or lagging?
>
> I also noticed that it sometimes holds persistent locks on files/folders, as I am unable to "archive" them (I zip them and store them elsewhere, then try to remove the folder of TB logs from the mounted disk, but it turns out to be locked).

@evictor did you find a solution to this issue?
We encountered a "Bad file descriptor" error when SummaryWriter writes directly to a NAS (NFS) mounted folder, which seems to happen only when we are working on a 4xGPU or 8xGPU machine.
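For context, the failing setup is roughly the sketch below (this assumes the PyTorch SummaryWriter, since that class name was mentioned; the NFS path and the RANK variable are hypothetical). Restricting the writer to a single worker is only a commonly tried precaution on multi-GPU machines, not a confirmed fix for the "Bad file descriptor" error.

```python
# Minimal sketch of the setup described above: a SummaryWriter pointed
# directly at an NFS-mounted directory. Assumes the PyTorch SummaryWriter;
# the path and the RANK environment variable are hypothetical.
import os
from torch.utils.tensorboard import SummaryWriter

NFS_LOGDIR = "/mnt/nas/tensorboard-logs/run_0"  # hypothetical NFS mount

# Creating the writer in only one worker per job is a common precaution on
# multi-GPU machines (an assumption here, not a confirmed fix for the error).
if int(os.environ.get("RANK", "0")) == 0:
    writer = SummaryWriter(log_dir=NFS_LOGDIR)
    for step in range(100):
        writer.add_scalar("loss", 1.0 / (step + 1), step)
    writer.close()
```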

@evictor
Author

evictor commented Sep 9, 2019

Running TensorBoard from the latest release of TensorFlow seemed to help! In fact, I will close this now, as I am no longer having that issue...

@evictor evictor closed this as completed Sep 9, 2019