Tensorboard server freezing #1915
Comments
@evictor Correct. Tensorboard doesn't mess with those huge files; it only collects data from your TF model run. There could be several reasons why Tensorboard is freezing.
|
Closing this out since I understand it to be resolved, but please let me know if I'm mistaken. Thanks! |
I have performed the upgrades to latest versions re: your point #2. I will let you know if anything new comes up, thanks |
Thanks @evictor. If the issue persists with the newer versions, we will reopen it and work on a fix. If you do see it again, please provide code to reproduce the bug so that we can resolve it faster. Thanks again! |
I have found evidence of OOM including in the updated versions... It eventually reaches ~8 GB. |
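One way to document memory growth like this is to sample the resident set size of the TensorBoard process at intervals. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` is available; the function names `parse_vm_rss_kb` and `read_rss_kb` are hypothetical helpers, not part of TensorBoard.

```python
def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line has the form "VmRSS:    123456 kB".
            return int(line.split()[1])
    return None


def read_rss_kb(pid):
    """Read the current resident set size (kB) of a running process on Linux."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kb(f.read())
```

Calling `read_rss_kb(pid)` every few minutes and logging the result with a timestamp would show whether memory climbs steadily or jumps at particular moments.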
@jvishnuvardhan this sounds like it belongs in the TensorBoard issue tracker. Can we get a repo admin to transfer this to https://github.com/tensorflow/tensorboard? @evictor is there any particular evidence you've found of the OOM, or anything you've been able to do to narrow down the problem? We get reports of this sometimes but they've been hard to reproduce. For recent versions of TensorBoard you should be able to pass |
Thanks, I am now running with
I would like to reiterate that I have TB log dir set to an SMB (CIFS) mounted disk. I wonder if that has anything to do with a potential leak? Pure conjecture but maybe TB is attempting to read, acquire file lock, etc. and failing exceptionally if the CIFS mount is sporadically unreadable or lagging? I also noticed that it sometimes has persistent locks on files/folders as I am unable to "archive" them (I zip them and store them elsewhere, then try to remove the folder with TB logs from mounted disk but it reveals a lock). |
Thanks for the update, we'll have to investigate. May be related to #766. |
@evictor did you find a solution to this issue? |
Running Tensorboard from latest release of Tensorflow seemed to help! In fact I will close this now as I am no longer having that issue... |
System information
Describe the current behavior
After several hours of running, the Tensorboard server stops responding to requests. It runs in a Docker container and was moved to its own VM after I narrowed the problem down to Tensorboard. Eventually it causes the VM itself to lock up, and the VM then has to be restarted.
WORTH NOTING: The Tensorboard logs are stored in Azure File Storage and mounted via CIFS. Then that mounted dir is mounted in Tensorboard Docker container using Docker volume mount. The Tensorboard log dir has ~460 subdirs and a total of ~45,000 event files amounting to about 100 MB. There is also a subdir of archived logs (i.e. in several dozen *.tgz files) amounting to 2 GB, but Tensorboard should in theory not be reading or messing with those huge files (right?).
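For anyone trying to reproduce a setup of this size, it can help to inventory the log directory first. The sketch below walks a directory tree and reports, per top-level subdirectory, how many event files it holds and their total size. It assumes event files can be recognized by `tfevents` appearing in the file name (the usual `events.out.tfevents.*` pattern); `summarize_logdir` is a hypothetical helper name.

```python
import os
from collections import Counter


def summarize_logdir(logdir):
    """Count event files and sum their sizes per top-level subdirectory."""
    counts = Counter()
    sizes = Counter()
    for root, _dirs, files in os.walk(logdir):
        # Attribute each file to the first path component under logdir.
        rel = os.path.relpath(root, logdir)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in files:
            if "tfevents" not in name:
                continue  # skip archives (*.tgz) and any other files
            counts[top] += 1
            try:
                sizes[top] += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file vanished or the mount was briefly unreadable;
                # it is still counted, but its size is left out.
                pass
    return counts, sizes
```

Running this against the mounted directory and again against a local copy would also give a rough sense of how slow the CIFS mount is for a full scan.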
Describe the expected behavior
Tensorboard should be able to run ad infinitum without seizing up its host.
Other info / logs
I know this is a nebulous ticket with many variables. Please let me know what information I can possibly provide, especially if you can direct me to install special debug versions, etc. as needed so I can collect logs from Tensorboard itself.