Large memory consumption 0.4 #766
I observe the same behaviour, especially when there's a lot of data, be it many small experiments or a few long-running ones. I've had 64 GB systems start to swap after a while when opening 2-3 such TensorBoard instances. |
Same observation here. |
What's the progress of this issue now? @jart can you elaborate on the reason for this problem? |
Any chance you guys could post |
Hi @jart , here's the shell output of |
Checking in, I'm also getting this in spades and I have to kill tensorboard at least once a day to keep it from grinding everything to a halt. |
Here's one, this only got to about 2GB RAM before I shut it down. Other instances have gotten to 10GB as others have reported. |
Same here (currently at 6GB). Is there a flag to disable loading the graph for example? |
Hi, I am also observing this behavior. Is this fixed in the 1.5 version? |
Any updates? Or anybody found a workaround for this problem? |
In tensorboard 1.5 the issue is still there. Memory consumption is increasing steadily at ~10 MB per second. Here is the output of `tensorboard --inspect --logdir mylogdir`: |
I am having this same issue. The model is a simple LSTM that uses a pre-trained 600k x 300-dimension word embedding. I have 16 model versions, and TensorBoard quickly consumes all 64 GB of memory on my machine. I am running TensorBoard 1.5. Here is the inspection log. |
What helped in my case was never saving the graph. Make sure you do not add the graph anywhere, and also pass `graph=None` to the `FileWriter`. Not a real solution, but maybe it helps. |
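For anyone who wants to try that workaround, here is a minimal sketch using the TF 1.x summary API (the logdir name and summary values are just illustrative):

```python
import tensorflow as tf  # TF 1.x-style API; on TF 2.x use tf.compat.v1

# Create the writer without attaching a GraphDef, so the (potentially huge)
# graph is never serialized into the event file.
writer = tf.summary.FileWriter("logs/run1", graph=None)

# Scalar summaries still work as usual; only the graph is omitted.
summary = tf.Summary(value=[tf.Summary.Value(tag="loss", simple_value=0.42)])
writer.add_summary(summary, global_step=1)
writer.close()
```

The trade-off is that the Graphs dashboard will be empty for these runs, which is exactly what the comment above accepts in exchange for lower memory use.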
+1 |
any news on this? |
We're currently working on having a DB storage layer that puts information like the graphdef on disk rather than in memory. We'd be happy to accept a contribution that, for example, adds a flag to not load the GraphDef into memory, or perhaps saves a pointer to its file in memory to load it on demand, since the GraphDef is usually the very first thing inside an event log file. |
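As a rough illustration of the "keep a pointer, load on demand" idea (a sketch only, not TensorBoard's actual loader; the function names are made up):

```python
import tensorflow as tf

def find_graph_event_file(event_paths):
    """Remember which file holds the GraphDef instead of keeping its bytes in RAM."""
    for path in event_paths:
        for i, event in enumerate(tf.compat.v1.train.summary_iterator(path)):
            if event.graph_def:      # non-empty serialized GraphDef
                return path          # store only the path; drop the bytes
            if i >= 2:               # the graph, if present, appears near the start of the file
                break
    return None

def load_graph_on_demand(path):
    """Re-read and parse the GraphDef only when the graph dashboard asks for it."""
    for event in tf.compat.v1.train.summary_iterator(path):
        if event.graph_def:
            graph_def = tf.compat.v1.GraphDef()
            graph_def.ParseFromString(event.graph_def)
            return graph_def
    return None
```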
Unfortunately |
I'm also experiencing this issue with TensorBoard 1.9. Evicting GraphDef from memory might be an okay short-term solution but it's a fixed size, so it should only save a constant amount of memory. The problem for me is memory growth over time. @jart is someone actively looking into this issue? It's fine if the answer is no, just want to understand where things are. Also, is there any additional information the community can provide to help diagnose what's going on? |
I'm having the same thing from tensorboard 1.10 |
I also see the same thing with tensorboard 1.12. As a stopgap, I use an alternative measure:
Kill and restart the |
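(For anyone automating that kind of restart, here is a minimal watchdog sketch. The memory limit, port, logdir, and the use of `psutil` are my own assumptions, not anything recommended in this thread.)

```python
import subprocess
import time

import psutil  # assumed to be installed in the environment

LOGDIR = "logs"              # hypothetical log directory
RSS_LIMIT = 8 * 1024 ** 3    # restart once TensorBoard exceeds ~8 GB resident memory

def start():
    return subprocess.Popen(["tensorboard", "--logdir", LOGDIR, "--port", "6006"])

proc = start()
try:
    while True:
        time.sleep(60)
        rss = psutil.Process(proc.pid).memory_info().rss
        if rss > RSS_LIMIT:
            proc.terminate()
            proc.wait()
            proc = start()   # a fresh process starts with a small footprint again
finally:
    proc.terminate()
```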
I also hit this problem, with 70+ GB used :( |
Guess what? I encountered the same issue; the only difference here is that I ran |
Yeah, I'm confused why nobody cares about this issue.
A memory leak of this magnitude makes the tool basically useless.
|
Hi folks - we're trying to get to the bottom of this, and we're sorry it's been such a longstanding problem. For those of you on the thread who have experienced this, it would really help if you could comment with the following information:
|
Hi, I was about to open a new issue for this but found you're already working on it. In my case, Tensorboard used 12GB of RAM and 20% of my CPU resources. I'll provide the details you asked for.
Additional: Diagnostics
Diagnostics output
Next steps
No action items identified. Please copy ALL of the above output, |
Assigning this to @nfelt who is actively looking into this. Please reassign or unassign as appropriate. |
Quick update everyone - we think we've narrowed this down to a memory leak in

In terms of a fix, it appears that by pure coincidence a change landed in TensorFlow yesterday that replaces the leaking code, so in our testing we're seeing at least a much lower rate of memory leakage when running TensorBoard against today's

If you're still seeing the issue, please try running TensorBoard in an environment with that version of TensorFlow (the actual version of TensorFlow you use for generating the log data should not affect this) and let us know if it seems to resolve the issue or not.

We will see what we can do to try to work around the issue so that we can get a fix to you sooner than the next TF release that would include yesterday's change (2.2) - if possible we'll see if we can fix this on the TB side so that those who can't easily update TF to the most recent version have access to a fix. |
@nfelt hi this is good news, thanks. Curious though: are you planning for an independent tensorboard build with this issue fixed? |
Hi all, I'm running tensorboard without tensorflow, and I no longer experience the huge memory consumption. |
Tried to use the
I noticed that if I add a lot of files inside of the logdir folder, TensorBoard throws an exception:
The memory issue happens without adding a lot of small files inside the logdir as well, but since there is this recursive process opening a lot of files, it might be one of the root causes of this quick memory growth that happens upon starting it (as related to |
Just to add another comment, if I run:
As suggested by @adizhol, TensorBoard works fine and takes only 310MB of resident memory, which really seems to solve the issue. So it seems that this is definitely caused by tensorflow code. It gives the warning:
Which seems to limit the features available in TensorBoard. |
Just adding more info, I think I found the culprit. If you just use (on
To force it to use the It changes the memory usage from 16 GB to around 500 MB. |
I have 16 GB RAM and also suffer from this memory leak problem. After executing tensorboard through command prompt (Windows 10), it shows:
W11-LSTM64-FC16L0D0-Run_0 is the name I assigned to my architecture, and I ran it approximately one month ago - roughly 200 models ago. I did shut down my PC and restart the whole process, so this is not because I keep the PC running. There are lots of lines that show the same "Unable to get first event timestamp" message. After I moved all the old logs out, those lines stopped appearing and the memory leak problem seems to have disappeared as well. I don't really know what happened, but I guess @ismael-elatifi's guess is correct. |
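(If it helps anyone, here is a small sketch of that kind of cleanup; the logdir path, archive location, and age cutoff are all made up for illustration.)

```python
import shutil
import time
from pathlib import Path

LOGDIR = Path("logs")           # hypothetical TensorBoard log directory
ARCHIVE = Path("logs_archive")  # where stale runs get parked
MAX_AGE_DAYS = 30               # runs untouched for this long are moved out

ARCHIVE.mkdir(exist_ok=True)
cutoff = time.time() - MAX_AGE_DAYS * 24 * 3600

for run_dir in LOGDIR.iterdir():
    if run_dir.is_dir() and run_dir.stat().st_mtime < cutoff:
        # Moving stale runs out of the logdir keeps TensorBoard from
        # re-reading (and holding on to) their event files on every reload.
        shutil.move(str(run_dir), str(ARCHIVE / run_dir.name))
```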
One practical, easy step I tried last night: right-click the C: drive and choose Properties, then run the cleanup and select Temporary files to be deleted. |
This reads a single event file from start to end, parsing the frame of each TFRecord, but not inspecting the payload at all. Unfortunately, the TFRecord reading is already ~3× slower than RustBoard’s entire pipeline, and ~13× slower than RustBoard’s TFRecord reading. :-(

Discussion here is on a 248 MiB event file, the `egraph_edge_cgan_003` run from a user-supplied log directory: <tensorflow/tensorboard#766 (comment)>

The effect of the buffer size is a bit strange. As expected, buffering definitely helps (~3× improvement with default buffer size), and the improvements taper off as the buffer size increases: 4 KiB and 1 MiB are about the same. But then in the 4 MiB to 8 MiB range we start to see sharp improvements: 1 MiB to 4 MiB is no change, but 4 MiB to 8 MiB is 25% faster. The improvements continue even up to 128 or 256 MiB on a file that’s 248 MiB long. Compare to RustBoard, which sees similar effects at low buffer sizes but no extra improvements for very large buffers. (This is all running with a hot filesystem cache.)

Buffer size sweep: <https://gist.github.com/wchargin/b73b5af3ef36b88e4e1aacf9a2453ea6>

CPU profiling shows that about 35% of time is spent in `make([]byte)` in `ExtendTo`, which seems unfortunate, since due to the actual access patterns we barely overallocate there (overallocating only 12 bytes per record), so it’s not obvious how to avoid that cost. Another 50% of total time is spent in `runtime.mallocgc`. And 20% of total time (not necessarily disjoint) is spent in the `result := TFRecord{...}` final allocation in `ReadRecord`, which is surprising to me since it just has two small fields (a slice header and a `uint32`) and they’ve already been computed above. (Inlining effects?)

Checksum validation is fast when enabled; runtime increases by ~10%.
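(For readers unfamiliar with the format being benchmarked above: a TFRecord frame is an 8-byte little-endian length, a 4-byte masked CRC of the length, the payload, and a 4-byte masked CRC of the payload. Below is a rough Python sketch of walking those frames without touching the payload; it is only an illustration of the frame layout, not the Go or Rust code discussed above, and it skips checksum validation. The file name is hypothetical.)

```python
import struct

def iter_tfrecord_frames(path):
    """Yield (payload_offset, payload_length) for each frame without reading the payload."""
    with open(path, "rb") as f:
        while True:
            header = f.read(12)              # 8-byte length + 4-byte length CRC
            if len(header) < 12:
                return                       # clean EOF (or truncated final record)
            (length,) = struct.unpack("<Q", header[:8])
            offset = f.tell()
            f.seek(length + 4, 1)            # skip payload and its 4-byte CRC
            yield offset, length

if __name__ == "__main__":
    # Example: count records and total payload bytes in one event file.
    n = total = 0
    for _, length in iter_tfrecord_frames("events.out.tfevents.example"):
        n += 1
        total += length
    print(n, "records,", total, "payload bytes")
```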
The next version of TensorBoard loads much faster (~100× throughput) and TL;DR: Update to latest |
I've been experiencing OOMs and SIGKILLs when using PyTorch and TensorBoard. Unfortunately I cannot guarantee that it is TensorBoard causing the error, but I thought it would be good to mention it and give a reference: |
@brando90 Sorry, but that Stack Overflow question is not related to this repository at all. For the PyTorch summary writer, please seek help from https://github.com/lanpa/tensorboardX. |
I'm also seeing a steady increase in memory usage of about 10 MiB per hour which seems to go on forever. I'm using the tensorflow 2.5.0 docker image (which uses the new data loader), and logs are stored on an s3 minio service. However, memory keeps increasing even when no extra logs are added. I guess something is not freed properly in this reload loop: https://github.com/tensorflow/tensorboard/blob/56be365/tensorboard/backend/event_processing/data_ingester.py#L93 |
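(To make the shape of that loop concrete, here is a simplified sketch of a periodic-reload pattern like the one linked in `data_ingester.py`; the names below are placeholders, not TensorBoard's actual code. If anything reachable from the multiplexer keeps growing across `Reload()` calls, memory climbs even with no new logs.)

```python
import threading
import time

def start_reload_loop(multiplexer, interval_secs=60):
    """Re-scan the logdir forever on a background thread (simplified placeholder)."""
    def _reload_forever():
        while True:
            start = time.time()
            multiplexer.Reload()   # re-reads event files; anything retained here accumulates
            elapsed = time.time() - start
            time.sleep(max(interval_secs - elapsed, 0))

    thread = threading.Thread(target=_reload_forever, daemon=True, name="Reloader")
    thread.start()
    return thread
```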
Is there a solution to this problem now? |
For people who can use it, we are recommending using https://pypi.org/project/tensorboard-data-server/ which should make log ingestion faster and memory not blow up. |
Hi, TensorBoard is eating up a lot of RAM (0.5 GB/s on startup, and the system becomes unusable after a few minutes) when the log_dir contains event files with training batch images saved per epoch... we think it's related to this bug. Has this issue been resolved? |
@ArfaSaif are you able to use |
@stephanwlee Is there documentation for the tensorboard data server? I don't see any on the PyPI page, nor in the project subdirectory. |
Probably DEVELOPMENT.md in that same directory and rustboard.md are your best bets. |
Here TensorBoard can still eat 10-20 GB of RAM using the latest version. |
I have just upgraded to TensorFlow 1.4 and TensorBoard 0.4. I had TensorBoard running for 20 hours and it was consuming 10 GB of memory. I shut it down and restarted it. Its memory consumption is increasing steadily at ~10 MB per second.