[feature-request] interactively add runs without stopping tensorboard #1708

Open
Hafplo opened this issue Dec 20, 2018 · 6 comments
Labels
core:backend theme:performance Performance, scalability, large data sizes, slowness, etc. type:feature

Comments

@Hafplo

Hafplo commented Dec 20, 2018

We are using tensorboard all the time to monitor our training sessions and to compare between our models.
Since it takes a lot of time to load and read large events files, it would be very helpful to be able to interactively load and remove runs.
Currently the only way we know of is to terminate TensorBoard and run it again with a new "--logdir" argument.
Loading all the runs in advance is not an option since:

  1. We have large (~20GB) events files and many runs
  2. Sometimes we want to add a new run that was not created when tensorboard started

Thank you.

@wchargin
Contributor

Hi @Hafplo! Thanks for reaching out to us.

TensorBoard finds runs in any descendant directory of the logdir. If you
add a new run under the logdir at runtime, TensorBoard will discover it
and load its data.
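For concreteness, a minimal sketch of that layout, using a throwaway temp directory as the logdir (the run names here are made up):

```shell
# Each run is just a subdirectory of the logdir (run names are
# hypothetical). A run added after startup is found on a later scan.
logdir="$(mktemp -d)"
mkdir -p "$logdir/run-1" "$logdir/run-2"
# tensorboard --logdir "$logdir" &   # would show run-1 and run-2

mkdir -p "$logdir/run-3"             # added later; discovered automatically
ls "$logdir"                         # lists run-1, run-2, run-3
rm -rf "$logdir"                     # clean up the demo
```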

It sounds like you’re saying that pointing TensorBoard at your
“main/root logdir” is infeasible because you have tons of data and only
want to view a subset of it at once, for performance reasons. We’re
actively planning the best way to address situations like these (say, a
whole lab group using a shared TensorBoard instance without having to
load all the data into memory ahead of time, instead only loading runs
that you ask for), but TensorBoard doesn’t currently support this.

As a workaround in the meantime, you could create a symlink directory
structure and link the logs that you’re interested in into it:

$ cd "$(mktemp -d)"
$ ln -s /data/mnist .
$ ln -s /data/cifar10 .
$ tensorboard --logdir . &

If you want to add a new logdir at runtime, just create a new symlink:

$ ln -s /data/imagenet .

When you’re done, you can just delete the directory (rm -rf should not
delete through symlinks):

$ rm -rvf "$PWD"
removed '/tmp/tmp.E2NQO5y1i2/mnist'
removed '/tmp/tmp.E2NQO5y1i2/cifar10'
removed '/tmp/tmp.E2NQO5y1i2/imagenet'
removed directory '/tmp/tmp.E2NQO5y1i2'
$ kill $!
[1]+  Terminated              tensorboard --logdir .
$ cd -

Does this help?

@wchargin
Contributor

rm -rf should not delete through symlinks

I want to clarify this, just because it’s a bit tricky and intrinsically
dangerous: if you rm -rf a symlink with a trailing slash (rm -rf link/),
it will traverse through the symlink and delete the target’s contents;
if you merely rm -rf a directory containing symlinks, it will only
delete the links themselves. So rm -rf "$PWD" is safe, but rm -rf
"$PWD"/*/ is not.
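A quick self-contained demonstration of the safe case (it only touches temp directories, so it’s harmless to run):

```shell
# Demonstrate that removing a directory *containing* a symlink does not
# delete through the link. Everything here lives in temp directories.
target="$(mktemp -d)"            # real data the symlink points at
touch "$target/data.txt"

work="$(mktemp -d)"              # holds only the symlink
ln -s "$target" "$work/link"

rm -rf "$work"                   # deletes the link, not the target
test -f "$target/data.txt" && echo "target survived"
# prints: target survived

rm -rf "$target"                 # clean up
```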

@Hafplo
Author

Hafplo commented Dec 23, 2018

@wchargin
Thank you for your quick reply and suggestion!
We will try it.

On the same note, is there a way to reduce the size of our events files (perhaps by splitting them and taking only the latest steps)?

@wchargin
Contributor

is there a way to reduce the size of our events files (perhaps by
splitting them and taking only the latest steps)?

We don’t currently offer this functionality. As you may know, when
TensorBoard loads your event files into memory, we perform reservoir
sampling to downsample them (to save memory). But we don’t overwrite
your original event files with these downsampled versions or provide an
option to export them.
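For reference, the downsampling idea itself is simple. Here is a hedged sketch of reservoir sampling (Algorithm R) over a stream of lines, written in awk; this illustrates the technique in general and is not TensorBoard’s actual implementation:

```shell
# Reservoir sampling (Algorithm R): keep a uniform random sample of
# k items from a stream without knowing its length in advance.
# Illustrative only -- not TensorBoard's code.
seq 1 10000 | awk -v k=100 '
  BEGIN { srand() }
  NR <= k { r[NR] = $0; next }      # fill the reservoir first
  {
    j = int(rand() * NR) + 1        # j uniform in 1..NR
    if (j <= k) r[j] = $0           # replace with probability k/NR
  }
  END { for (i = 1; i <= k; i++) print r[i] }
' | wc -l
# the count is 100 regardless of how long the input stream is
```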

In the long term, we want to provide more flexibility in this area,
perhaps by allowing you to store summaries in a database instead of flat
files, and exposing utilities for downsampling in place.

@Hafplo
Author

Hafplo commented Dec 24, 2018

@wchargin Thank you for your quick responses and insights.

In the long term, we want to provide more flexibility in this area,
perhaps by allowing you to store summaries in a database instead of flat
files, and exposing utilities for downsampling in place.

We'll be looking out for these updates.

Regarding your workaround with symlinks: We are using GCS (Google Cloud Storage) to store our models and event files. According to this answer, symlinks are unavailable for cloud objects.

Following your logic, I thought of creating a dedicated bucket for "current runs" and using gsutil to copy runs in and out of it. This solution might work if we implement a small tool to handle it. But it sounds cumbersome to go to all this effort just to make use of TensorBoard's discovery of runs in any "descendant directory of the logdir".
Also, we prefer to use a logdir with multiple named paths, as shown here. I'm not sure how it will all mix together.
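For what it’s worth, the copy-based workaround could look something like the sketch below. The bucket and run names are hypothetical, and the final command uses the name:path multi-logdir form mentioned above:

```shell
# Hypothetical "current runs" bucket; bucket names are made up.
# gsutil -m runs the copies in parallel.
gsutil -m cp -r gs://my-archive/runs/mnist   gs://my-current-runs/mnist
gsutil -m cp -r gs://my-archive/runs/cifar10 gs://my-current-runs/cifar10
tensorboard --logdir gs://my-current-runs

# Drop a run from view by deleting its copy (the archive is untouched):
gsutil -m rm -r gs://my-current-runs/cifar10

# The name:path form can also mix several locations directly:
tensorboard --logdir mnist:gs://my-archive/runs/mnist,cifar:gs://other-bucket/cifar10
```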

@wchargin
Contributor

We are using GCS (Google Cloud Storage) to store our models and event
files

Understood. Yes, it’s also my understanding that GCS does not support
symlinks.

This solution might work if we implement a small tool to handle it.
But it sounds cumbersome to go to all this effort just to make use
of TensorBoard's discovery of runs in any "descendant directory of
the logdir".

Agreed. It looks like we don’t really have a good solution for you at
this time.

I’ll keep this feature request open. Thanks again.

@nfelt nfelt added the theme:performance Performance, scalability, large data sizes, slowness, etc. label Dec 17, 2019