Feature Request: Support for 1000s of logs #1013
Thanks for the feedback!

Re: support for 1000s of runs: this is an area where we could improve, though it's a little challenging given how things are currently structured, and it's a bit of an outlier among current usage. Are there particular actions or interactions you noticed being sluggish? E.g. reload vs. hovering over the chart vs. changing which runs are selected, etc.? Also, when you say that you have a single time step per run, does that mean you're really trying to create something more like a scatter plot? We could open a feature request for that, and having a scatter plot might be a quicker route to good performance than reworking the scalar dashboard to handle this case well.

Re: the hover chart: I agree that it would be useful to have a way to "pin" it so that you can copy-paste the data (and view it if it's too large to fit on screen). Would you mind opening a separate FR for that? I think it's useful functionality even with just a few runs, so I'd like to track that separately.

Re: hyperparameters: that's definitely a feature request we've gotten before (#46), and we'll hopefully have better support for that soon.
I agree it is an outlier at the moment, but even without hyperopt the number of runs often adds up quickly. To my knowledge, hyperparameter optimization is also considered best practice, so I may just be an early adopter of a growing population. I also know of others at my university who moved away from TensorBoard due to the same limitations.

There is no major problem with the plot as it is now; I found the relative and wall-time plots worked well enough.
Okay, thanks for the details and for filing that separate feature request!

Re: # 2, what do you mean exactly by "struggle"? If you mean it doesn't render right, a screenshot would be great.

Re: # 3, we have gotten some reports of memory leaks or high memory consumption, e.g. #766, though so far I believe we haven't had enough time to thoroughly investigate and drill down on how that's happening. As you saw, we are working on providing better support for hyperparameter optimization experiments, and I agree that part of that should be supporting ~1000 runs with much less of a performance hit than we have today.
It prints lots of stack traces about uncaught exceptions and missing data. Sorry, I don't have one on hand at the moment; I'll try to add one here when I come across it.
I don't know that this is much of an outlier anymore, especially now that the hparams plugin is natively integrated. I myself am running into difficulties with large numbers of runs (a few hundred in my case) while performing hyperparameter searches. My experience is that the hparams plugin seems to just freak out and stop showing anything after 150 runs or so, and the scalars screen seems to break after around 200 runs. While TensorBoard does suck down a ton of memory (~25 GB), I've got plenty to spare (the workstation has 64 GB); the main issue seems to be that it starts pegging the CPU at 100% (just one core). It smells like it can't keep up with the incoming logs.
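Not mentioned in the thread, but as a possible stopgap for the reload-load problem: TensorBoard's CLI exposes flags that reduce per-run work during reloads. A hedged sketch (flag availability and defaults vary by TensorBoard version; `./runs` is a placeholder logdir, and the values shown are arbitrary examples, not tuned recommendations):

```shell
# Sketch of flags that may reduce reload load with many runs.
# Check `tensorboard --help` for your installed version before relying on these.

tensorboard \
  --logdir ./runs \
  --samples_per_plugin "scalars=500" \
  --reload_interval 60
```

Here `--samples_per_plugin` caps how many scalar points per run are kept in memory (downsampling the rest), and `--reload_interval` makes TensorBoard poll the logdir less frequently, so a single reload thread spends less time pegged at 100%.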
I'm really enjoying TensorBoard; it makes my life a lot easier and gives me a good idea of what's happening as training progresses.
Unfortunately, I've found one area of weakness that shows up when many runs accumulate: TensorBoard starts to become very sluggish. Here is a video from a 22-core Xeon workstation with 48 GB of RAM and about 50% of resources free, trying to display roughly 250 runs in TensorBoard, each with a single time step (epoch):
What would help:
Thanks for your consideration!