[feature-request] interactively add runs without stopping tensorboard #1708

Open
Hafplo opened this issue Dec 20, 2018 · 6 comments
Labels
core:backend theme:performance Performance, scalability, large data sizes, slowness, etc. type:feature

Comments

@Hafplo

Hafplo commented Dec 20, 2018

We are using tensorboard all the time to monitor our training sessions and to compare between our models.
Since it takes a lot of time to load and read large events files, it would be very helpful to be able to interactively load and remove runs.
Currently the only way we know of is to terminate TensorBoard and run it again with a new "--logdir" argument.
Loading all the runs in advance is not an option since:

  1. We have large (~20GB) events files and many runs
  2. Sometimes we want to add a new run that was not created when tensorboard started

Thank you.

@wchargin
Contributor

Hi @Hafplo! Thanks for reaching out to us.

TensorBoard finds runs in any descendant directory of the logdir. If you
add a new run under the logdir at runtime, TensorBoard will discover it
and load its data.
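For concreteness, a minimal sketch of that layout, using a throwaway temp directory as the logdir (the run names here are made up):

```shell
# Each run is just a subdirectory of the logdir (run names are
# hypothetical). A run added after startup is found on a later scan.
logdir="$(mktemp -d)"
mkdir -p "$logdir/run-1" "$logdir/run-2"
# tensorboard --logdir "$logdir" &   # would show run-1 and run-2

mkdir -p "$logdir/run-3"             # added later; discovered automatically
ls "$logdir"                         # lists run-1, run-2, run-3
rm -rf "$logdir"                     # clean up the demo
```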

It sounds like you’re saying that pointing TensorBoard at your
“main/root logdir” is infeasible because you have tons of data and only
want to view a subset of it at once, for performance reasons. We’re
actively planning the best way to address situations like these (say, a
whole lab group using a shared TensorBoard instance without having to
load all the data into memory ahead of time, instead only loading runs
that you ask for), but TensorBoard doesn’t currently support this.

As a workaround in the meantime, you could create a symlink directory
structure and link the logs that you’re interested in into it:

$ cd "$(mktemp -d)"
$ ln -s /data/mnist .
$ ln -s /data/cifar10 .
$ tensorboard --logdir . &

If you want to add a new logdir at runtime, just create a new symlink:

$ ln -s /data/imagenet .

When you’re done, you can just delete the directory (rm -rf should not
delete through symlinks):

$ rm -rvf "$PWD"
removed '/tmp/tmp.E2NQO5y1i2/mnist'
removed '/tmp/tmp.E2NQO5y1i2/cifar10'
removed '/tmp/tmp.E2NQO5y1i2/imagenet'
removed directory '/tmp/tmp.E2NQO5y1i2'
$ kill $!
[1]+  Terminated              tensorboard --logdir .
$ cd -

Does this help?

@wchargin
Contributor

rm -rf should not delete through symlinks

I want to clarify this, just because it’s a bit tricky and intrinsically
dangerous: if you rm -rf a symlink with a trailing slash (rm -rf link/),
it will traverse through the symlink and delete the target’s contents;
if you merely rm -rf a directory containing symlinks, it will only
delete the links themselves. So rm -rf "$PWD" is safe, but rm -rf
"$PWD"/*/ is not.
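A quick self-contained demonstration of the safe case (it only touches temp directories, so it’s harmless to run):

```shell
# Demonstrate that removing a directory *containing* a symlink does not
# delete through the link. Everything here lives in temp directories.
target="$(mktemp -d)"            # real data the symlink points at
touch "$target/data.txt"

work="$(mktemp -d)"              # holds only the symlink
ln -s "$target" "$work/link"

rm -rf "$work"                   # deletes the link, not the target
test -f "$target/data.txt" && echo "target survived"
# prints: target survived

rm -rf "$target"                 # clean up
```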

@Hafplo
Author

Hafplo commented Dec 23, 2018

@wchargin
Thank you for your quick reply and suggestion!
We will try it.

On the same note, is there a way to reduce the size of our events files (perhaps by splitting them and taking only the latest steps)?

@wchargin
Contributor

is there a way to reduce the size of our events files (perhaps by
splitting them and taking only the latest steps)?

We don’t currently offer this functionality. As you may know, when
TensorBoard loads your event files into memory, we perform reservoir
sampling to downsample them (to save memory). But we don’t overwrite
your original event files with these downsampled versions or provide an
option to export them.
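For reference, the downsampling idea itself is simple. Here is a hedged sketch of reservoir sampling (Algorithm R) over a stream of lines, written in awk; this illustrates the technique in general and is not TensorBoard’s actual implementation:

```shell
# Reservoir sampling (Algorithm R): keep a uniform random sample of
# k items from a stream without knowing its length in advance.
# Illustrative only -- not TensorBoard's code.
seq 1 10000 | awk -v k=100 '
  BEGIN { srand() }
  NR <= k { r[NR] = $0; next }      # fill the reservoir first
  {
    j = int(rand() * NR) + 1        # j uniform in 1..NR
    if (j <= k) r[j] = $0           # replace with probability k/NR
  }
  END { for (i = 1; i <= k; i++) print r[i] }
' | wc -l
# the count is 100 regardless of how long the input stream is
```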

In the long term, we want to provide more flexibility in this area,
perhaps by allowing you to store summaries in a database instead of flat
files, and exposing utilities for downsampling in place.

@Hafplo
Author

Hafplo commented Dec 24, 2018

@wchargin Thank you for your quick responses and insights.

In the long term, we want to provide more flexibility in this area,
perhaps by allowing you to store summaries in a database instead of flat
files, and exposing utilities for downsampling in place.

We'll be looking out for these updates.

Regarding your workaround with symlinks: We are using GCS (Google Cloud Storage) to store our models and event files. According to this answer, symlinks are unavailable for cloud objects.

Following your logic, I thought of creating a dedicated bucket for "current runs" and using gsutil to copy runs in and out of it. This solution might work if we implement a small tool to handle it. But it sounds cumbersome to go to all this effort just to make use of TensorBoard's discovery of runs in any "descendant directory of the logdir".
Also, we prefer to use a logdir with multiple named paths, as shown here. I'm not sure how it will all mix together.
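For what it’s worth, the copy-based workaround could look something like the sketch below. The bucket and run names are hypothetical, and the final command uses the name:path multi-logdir form mentioned above:

```shell
# Hypothetical "current runs" bucket; bucket names are made up.
# gsutil -m runs the copies in parallel.
gsutil -m cp -r gs://my-archive/runs/mnist   gs://my-current-runs/mnist
gsutil -m cp -r gs://my-archive/runs/cifar10 gs://my-current-runs/cifar10
tensorboard --logdir gs://my-current-runs

# Drop a run from view by deleting its copy (the archive is untouched):
gsutil -m rm -r gs://my-current-runs/cifar10

# The name:path form can also mix several locations directly:
tensorboard --logdir mnist:gs://my-archive/runs/mnist,cifar:gs://other-bucket/cifar10
```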

@wchargin
Contributor

We are using GCS (Google Cloud Storage) to store our models and event
files

Understood. Yes, it’s also my understanding that GCS does not support
symlinks.

This solution might work if we implement a small tool to handle it.
But it sounds cumbersome to go to all this effort just to make use
of TensorBoard's discovery of runs in any "descendant directory of
the logdir".

Agreed. It looks like we don’t really have a good solution for you at
this time.

I’ll keep this feature request open. Thanks again.

@nfelt nfelt added the theme:performance Performance, scalability, large data sizes, slowness, etc. label Dec 17, 2019