HParam table view sort not working #3041


Open
patzm opened this issue Dec 17, 2019 · 22 comments

@patzm

patzm commented Dec 17, 2019

Environment information

Diagnostics

Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version 66d35fe98ca66dc3a5ae600631a8aa6bce785bc5

--- check: general                                                                                     
INFO: sys.version_info: sys.version_info(major=3, minor=6, micro=9, releaselevel='final', serial=0)    
INFO: os.name: posix                                                                                   
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='tensorboard', release='5.0.0-1026-gcp', version='#27~18.04.1-Ubuntu SMP Fri Nov 15 07:40:39 UTC 2019', machine='x86_64')
INFO: sys.getwindowsversion(): N/A                                                                     
                                                                                                       
--- check: package_management                                                                          
INFO: has conda-meta: False                                                                            
INFO: $VIRTUAL_ENV: None                                                                               
                                                                                                       
--- check: installed_packages
INFO: installed: tensorboard==2.0.0
INFO: installed: tensorflow==1.13.1
INFO: installed: tensorflow-estimator==1.13.0

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.0.0'

--- check: tensorflow_python_version
/home/patzm/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/patzm/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/patzm/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/patzm/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/patzm/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/patzm/.local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
INFO: tensorflow.__version__: '1.13.1'
INFO: tensorflow.__git_version__: "b'v1.13.1-0-g6612da8951'"

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/home/patzm/.local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::1', 0, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>                                                            
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'tensorboard.c.tensorflow-training-176320.internal'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info                                                                
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=145168, st_dev=2049, st_nlink=2, st_uid=1005, st_gid=1006, st_size=4096, st_atime=1576537836, st_mtime=1576576519, st_ctime=1576576519)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/home/patzm/.local/lib/python3.6/site-packages']; bad_roots (0): []

--- check: full_pip_freeze

netifaces==0.10.4
numpy==1.17.2
oauthlib==2.0.6
PAM==0.4.2
parso==0.5.1
pbr==5.1.1
pip==9.0.1
pluggy==0.12.0
protobuf==3.10.0
pyasn1==0.4.2
pyasn1-modules==0.2.1
pycrypto==2.6.1
pygobject==3.26.1
PyJWT==1.5.3
pynvim==0.3.2
pyOpenSSL==17.5.0
pyserial==3.4
python-apt==1.6.4
python-debian==0.1.32
python-jsonrpc-server==0.2.0
python-language-server==0.28.1
pyxdg==0.25
PyYAML==3.12
requests==2.18.4
requests-unixsocket==0.1.5
SecretStorage==2.3.1
service-identity==16.0.0
setuptools==41.4.0
six==1.12.0
ssh-import-id==5.7
stevedore==1.30.0
systemd-python==234
tensorboard==2.0.0
tensorflow==1.13.1
tensorflow-estimator==1.13.0
termcolor==1.1.0
Twisted==17.9.0
ufw==0.36
unattended-upgrades==0.1
urllib3==1.22
virtualenv==16.1.0
virtualenv-clone==0.4.0
virtualenvwrapper==4.8.2
Werkzeug==0.16.0
wheel==0.33.6
zipp==0.5.2
zope.interface==4.3.2

For browser-related issues, please additionally specify:

  • Firefox 71.0 on Manjaro Linux 64 bit
  • Google Chrome Version 79.0.3945.79 (Official Build) (64-bit)

Issue description

In the Table View
[screenshot of the table view]
sorting only works sporadically. Most of the time it doesn't work at all; other times, sorting is applied to the previously selected column. This seems to be independent of whether you first select the sorting direction and then the "sort by" drop-down, or the other way round. It also seems to be independent of the number of trials; currently I observe this with around 50 runs.
On the server side that hosts the TensorBoard, I don't observe high CPU core utilization or RAM saturation.

@rmothukuru

@patzm,
Can you please let us know whether you have set the Direction (Ascending or Descending) as well? If the issue still persists, could you please share a screenshot that demonstrates the issue? Thanks!

@patzm
Author

patzm commented Dec 18, 2019

@rmothukuru, yes, I am using the Direction drop-down menu to select one of the two sorting directions.

Selection of sorting column and direction:
[screenshot]

The very same column in the table view:
[screenshot]

As you can see, these values are clearly not sorted. Sadly, I can't share the whole screenshot. If this is sufficient, I would like to avoid the effort of writing a reproducible script. Hopefully, this already helps you to narrow down the issue.

I refreshed the site with both the browser refresh button and the TensorBoard refresh trigger (upper right); neither helped. I also tried various versions of TensorBoard: 1.15.0, 2.0.0, 2.0.2, 2.1.0. It is also noteworthy that the model_dirs (of each trial) are quite large: TensorBoard uses ~20 GB of RAM for those ~50 trials.

@wchargin
Contributor

wchargin commented Dec 18, 2019

Hi @patzm! Thanks for the report and detailed background information.
I’m trying to reproduce this on a dataset with 50 runs, and having
trouble; the sorting seems to be working fine for me in both Chrome and
Firefox, on 64-bit Debian-based Linux.

I’m assuming from your screenshot that for each trial you have runs
named eval and train (or something), each with accuracy/top_1
summaries. Here’s the script that I’m using to generate test data:

# Context: <https://github.com/tensorflow/tensorboard/issues/3041>

import os
import random

import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

HP_LR = hp.HParam("learning_rate", hp.RealInterval(0.01, 0.1))
HP_OPTIMIZER = hp.HParam("optimizer", hp.Discrete(["adam", "sgd"]))

SESSIONS = ("train", "eval")

METRIC_LOSS = "loss"
METRIC_ACC1 = "accuracy/top_1"
METRIC_ACC5 = "accuracy/top_5"

NUM_RUNS = 50
NUM_STEPS = 10
BASE_LOGDIR = "logs"


def main():
    rng = random.Random(0)

    for i in range(NUM_RUNS):
        session_dir = os.path.join(BASE_LOGDIR, "%03d" % i)
        with tf.summary.create_file_writer(session_dir).as_default():
            hp.hparams(
                {h: h.domain.sample_uniform(rng) for h in (HP_LR, HP_OPTIMIZER)}
            )
        for session in SESSIONS:
            logdir = os.path.join(session_dir, session)
            with tf.summary.create_file_writer(logdir).as_default():
                for step in range(NUM_STEPS):
                    tf.summary.scalar(METRIC_LOSS, rng.random(), step=step)
                    tf.summary.scalar(METRIC_ACC1, rng.random(), step=step)
                    tf.summary.scalar(METRIC_ACC5, rng.random(), step=step)


if __name__ == "__main__":
    main()

Changing the sort key and direction always seems to work fine for me.

Can you check and see if the data generated by that script works in your
environment? If not, we can try to track down the environmental
difference. If so, we’ll probably want to get your example down to a
minimal repro that you can share.

FWIW, I notice from your diagnostics that you’re running TensorBoard 2.0
with TensorFlow 1.13. This isn’t an officially supported configuration
(TensorBoard and TensorFlow need to have the same minor version), so you
could try upgrading TensorFlow (1.15 should be compatible with 1.13, and
then you can test in TensorBoard 1.15), but I would be a bit surprised
if that were the root cause here.

@wchargin
Contributor

(NB: I just edited the script in the above comment.)

@patzm
Author

patzm commented Dec 20, 2019

you’re running TensorBoard 2.0 with TensorFlow 1.13

I deleted the whole TensorBoard setup after submitting the issue and started from scratch. I created 3 virtual environments:

  • TensorFlow 1.15.0 and TensorBoard 1.15.0
  • TensorFlow 2.0.0 and TensorBoard 2.0.2
  • TensorFlow 2.0.0 with TensorBoard 2.1.0

This didn't solve the sorting problem, so your intuition was right 😉.

@patzm
Author

patzm commented Dec 20, 2019

I ran your script with a minor change. I inserted the following and replaced all tf.* usages with tf_v2.*:

import tensorflow as tf

tf.compat.v1.enable_eager_execution()
tf_v2 = tf.compat.v2

This allowed me to run your script under both TensorFlow 1.15 and TensorFlow 2.0. In both cases, TensorBoard (HParams) with the matching version worked as expected, i.e. I can reproduce your results.

Then I ran the same script with BASE_LOGDIR set to a Google Cloud Storage bucket, e.g. gs://bucket-name/logs. Here, I tested with

  • TensorFlow 1.15.0 and TensorBoard 1.15.0
  • TensorFlow 2.0.0 with TensorBoard 2.0.2

and it also worked in both cases.

@patzm
Author

patzm commented Dec 20, 2019

I just created a debug repo and pushed the slightly modified demo script in patzm/tensorboard-3041@14e5a74. I will write a minimal example for my use case there. I am still working with TensorFlow 1.15 and relying on the non-eager, graph-based APIs. I wrote a custom HParamWriter that uses TensorFlow 1.x summary writers. I will post here again once it is done.

@patzm
Author

patzm commented Dec 20, 2019

for each trial you have runs named eval and train (or something)

Yes, almost: the eval runs are in a sub-folder of the train runs, and the train runs are stored in the model_dir. Could that be a problem?
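For reference, the nested layout described here would look roughly like this (a hedged sketch; the directory and trial names are made up for illustration, not taken from the actual setup):

```python
import os

def trial_layout(base_logdir, trial_name):
    """Sketch of the described layout: train events live directly in the
    trial's model_dir, and eval events live in a subdirectory of that
    same model_dir, rather than in a sibling directory."""
    model_dir = os.path.join(base_logdir, trial_name)  # train events here
    eval_dir = os.path.join(model_dir, "eval")         # eval events nested inside
    return model_dir, eval_dir

model_dir, eval_dir = trial_layout("logs", "trial_000")
```

With this layout the train run's directory is also an ancestor of the eval run's directory, which is the condition under which the multiple-graph warnings below were observed.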

@patzm
Author

patzm commented Dec 20, 2019

Ok @wchargin, can you try running the two scripts in my repo? I added small # TODO(wchargin) markers for you:

  1. tb_debug.py is essentially your example from above
  2. custom_debug.py contains my implementation of the TensorFlow v1 compatible HParamWriter. Note that I am not using the _write_v2 implementation that it provides.

I would suggest that you do 4 runs. Run your script twice, once for each of the two MODES alternatives. Do the same for my script. I tested this with TensorFlow 1.15.

I think I have tracked down the issue to two things:

  1. I am using TensorFlow v1 summary writers / API
  2. My train model dir is the parent of the eval model dir.

If those two things are in place, I observe the following:

  • sometimes none, or only a few, of the runs appear in the HParams tab.
  • TensorBoard is complaining that it found multiple graphs and meta graphs:
    W1220 14:42:34.325282 140550538139392 plugin_event_accumulator.py:294] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events.  Overwriting the graph with the newest event.
    W1220 14:42:34.325443 140550538139392 plugin_event_accumulator.py:302] Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
    

I am curious to see if you can reproduce this. Sadly, this isn't exactly what this issue reported initially, but I have a strong feeling that this is related.

@patzm
Author

patzm commented Dec 20, 2019

I just ran the same thing with the BASE_LOGDIR as a Google bucket. Now, all combinations work and TensorBoard did not complain about multiple (meta) graphs. 🤔 How would you continue debugging this?

@phemmer

phemmer commented Jan 13, 2020

So I'm experiencing this behavior as well. However, for me the sorting is applied to the field that sequentially follows the one I selected.
For example, let's say the sorting drop-down shows 3 options: "A", "B", and "C". If I tell TensorBoard to sort by "A", it'll sort by "B". If I tell it to sort by "B", it'll sort by "C". If I tell it to sort by "C", it actually doesn't do anything; the table stays sorted by whatever was previously selected.
Note that I have many more than 3 columns (55 if I counted correctly), but the behavior is consistent with the pattern described.
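The shift described above is consistent with a simple off-by-one in how the selected sort column is mapped to the underlying data. A minimal illustration of that shift for interior columns (hypothetical code, not TensorBoard's actual implementation):

```python
def sort_rows(rows, requested_col, offset=1):
    # Hypothetical bug: the sort key is built from the column *after*
    # the one the user requested. With offset=0 this would be correct.
    return sorted(rows, key=lambda row: row[requested_col + offset])

rows = [(3, "b", 0.2), (1, "a", 0.9), (2, "c", 0.5)]
buggy = sort_rows(rows, 0)              # user asked for column 0, gets column 1
correct = sort_rows(rows, 0, offset=0)  # what the user expected
```

Here `buggy` comes back ordered by the strings in column 1 rather than the numbers in column 0, matching the "sort by A, get B" symptom.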

This is using TensorBoard 2.1.0 and TensorFlow 2.1.0.

@patzm
Author

patzm commented Jan 13, 2020

I experienced that as well sometimes. My observation was that it seemed to select the previously selected column. Maybe it is a different issue though.

@tqfjo

tqfjo commented Jan 15, 2020

I'm experiencing similar behavior. So far, sorting never changes no matter what I do; the table always remains sorted by Trial ID, ascending. The issue remains even with only e.g. 5 trials, regardless of the sorting direction I choose, the sorting column I choose, and the browser (Firefox 71.0 or Chrome 68.0.3440).

I am using torch.utils.tensorboard.

Diagnostics

Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version d515ab103e2b1cfcea2b096187741a0eeb8822ef

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=7, micro=6, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename=<private>, release='4.15.0-72-generic', version='#81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: True
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.1.0
INFO: installed: tensorflow==2.1.0
INFO: installed: tensorflow-estimator==2.1.0

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.1.0'

--- check: tensorflow_python_version
2020-01-15 13:57:37.249308: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-01-15 13:57:37.249453: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-9.0/lib64:/usr/local/cuda/extras/CUPTI/lib64
2020-01-15 13:57:37.249463: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
INFO: tensorflow.__version__: '2.1.0'
INFO: tensorflow.__git_version__: 'v2.1.0-rc2-17-ge5bf8de'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/home/<private>/miniconda3/envs/<private>/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::1', 0, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'localhost.localdomain'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=15212576, st_dev=66305, st_nlink=2, st_uid=1000, st_gid=1000, st_size=4096, st_atime=1576874874, st_mtime=1579116229, st_ctime=1579116229)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/home/<private>/miniconda3/envs/<private>/lib/python3.7/site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==0.8.1
ansiwrap==0.8.4
appdirs==1.4.3
astor==0.8.0
attrs==19.3.0
ax-platform==0.1.8
backcall==0.1.0
black==19.10b0
bleach==3.1.0
botorch==0.2.0
Bottleneck==1.2.1
cachetools==3.1.1
-e git+https://github.com/pytorch/captum.git@b46536b2f92dc609972e5a5cf125aec5c7f81e79#egg=captum
certifi==2019.11.28
cffi==1.13.2
chardet==3.0.4
Click==7.0
cycler==0.10.0
decorator==4.4.1
defusedxml==0.6.0
docutils==0.15.2
entrypoints==0.3
Flask==1.1.1
future==0.18.2
gast==0.2.2
google-auth==1.7.0
google-auth-oauthlib==0.4.1
google-pasta==0.1.8
gpytorch==1.0.0
grpcio==1.25.0
h5py==2.10.0
humanize==0.5.1
idna==2.8
imageio==2.6.1
importlib-metadata==1.3.0
ipdb==0.12.3
ipykernel==5.1.3
ipython==7.10.1
ipython-genutils==0.2.0
ipywidgets==7.5.1
itsdangerous==1.1.0
jedi==0.15.2
Jinja2==2.10.3
joblib==0.14.1
json5==0.8.5
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==5.3.3
jupyter-console==6.0.0
jupyter-core==4.6.1
jupyterlab==1.2.1
jupyterlab-black==0.2.1
jupyterlab-code-formatter==0.7.0
jupyterlab-server==1.0.6
jupytext==1.2.4
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
line-profiler==2.1.2
Markdown==3.1.1
MarkupSafe==1.1.1
matplotlib==3.1.2
mistune==0.8.4
more-itertools==8.0.2
mypy-extensions==0.4.3
nbconvert==5.6.1
nbformat==4.4.0
notebook==6.0.1
numpy==1.17.3
oauthlib==3.1.0
olefile==0.46
opt-einsum==3.1.0
packaging==19.2
pandas==0.25.3
pandocfilters==1.4.2
parso==0.5.2
pathspec==0.7.0
patsy==0.5.1
-e git+https://github.com/SauceCat/PDPbox/@73c69665f1663b53984e187c7bc8996e25fea18e#egg=PDPbox
pexpect==4.7.0
pickleshare==0.7.5
Pillow==6.2.1
pip==19.3.1
pkginfo==1.5.0.1
plotly==4.4.1
prometheus-client==0.7.1
prompt-toolkit==3.0.2
protobuf==3.10.0
psutil==5.6.7
ptyprocess==0.6.0
pyasn1==0.4.7
pyasn1-modules==0.2.7
pycparser==2.19
Pygments==2.5.2
pynvml==8.0.3
pyparsing==2.4.6
PyQt5==5.12.3
PyQt5-sip==4.19.18
PyQtWebEngine==5.12.1
pyrsistent==0.15.6
python-dateutil==2.8.1
-e git+https://github.com/pytorch/ignite.git@f85c6483c00e4e3b125c7f21d953c0bf4f34de7e#egg=pytorch_ignite
pytz==2019.3
PyYAML==5.2
pyzmq==18.1.1
qtconsole==4.6.0
readme-renderer==24.0
regex==2019.12.20
requests==2.22.0
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
retrying==1.3.3
rsa==4.0
ruamel.yaml==0.16.5
ruamel.yaml.clib==0.2.0
scikit-learn==0.21.0
scipy==1.4.1
seaborn==0.9.0
Send2Trash==1.5.0
setuptools==44.0.0.post20200102
six==1.13.0
sklearn==0.0
skorch==0.7.0
statsmodels==0.10.2
tabulate==0.8.6
tenacity==6.0.0
tensorboard==2.1.0
tensorflow==2.1.0
tensorflow-estimator==2.1.0
termcolor==1.1.0
terminado==0.8.3
test-tube==0.7.3
testpath==0.4.4
textwrap3==0.9.2
toml==0.10.0
torch==1.3.1
torchvision==0.4.2
tornado==6.0.3
tqdm==4.35.0
traitlets==4.3.3
twine==1.13.0
typed-ast==1.4.0
typing-extensions==3.7.4.1
urllib3==1.25.6
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.16.0
wheel==0.33.6
widgetsnbextension==3.5.1
wrapt==1.11.2
xarray==0.14.0
xlrd==1.2.0
zipp==0.6.0

@patzm
Author

patzm commented Feb 10, 2020

@wchargin, how do you want to proceed on this? The issue still persists in all deployment cases I am dealing with. I would love to solve this / see it solved.

@wchargin
Contributor

Hi @patzm—if you’re still experiencing this issue, could you check
whether there are any pending or failed network requests when you change
the sort order?

Changing the sort order actually triggers a network request (because the
frontend only shows the top k runs, so changing the ordering criterion
can cause the set of runs displayed to change). I didn’t realize this
before because I generally run TensorBoard as a web server on my local
machine, but if you’re connecting to your TensorBoard instance over a
remote network then it’s possible that that factors into the problem.
This could also be consistent with your observation that sometimes it
seems to select the previously selected column: perhaps network
requests are resolving out of order.

Could you share a few details about your network topology? Where your
TensorBoard instance is located with respect to your browser, how many
other people are accessing it, etc. If you don’t generally connect to
localhost, could you please try seeing if you can reproduce the issue
when connected to localhost?
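One common frontend pattern that guards against exactly this failure mode is to tag each outgoing request with a token and drop any response that does not belong to the latest request. A rough sketch (illustrative Python, not TensorBoard's actual frontend code):

```python
class LatestOnly:
    """Accept only the response belonging to the most recent request."""

    def __init__(self):
        self._latest_token = 0
        self.state = None  # last accepted payload

    def issue_request(self):
        # Each outgoing request gets a strictly increasing token.
        self._latest_token += 1
        return self._latest_token

    def handle_response(self, token, payload):
        # Responses carrying a stale token are discarded, so an earlier
        # request that resolves late cannot clobber a newer sort order.
        if token != self._latest_token:
            return False
        self.state = payload
        return True

guard = LatestOnly()
t1 = guard.issue_request()                # user sorts by column A
t2 = guard.issue_request()                # user quickly switches to column B
guard.handle_response(t2, "sorted by B")  # accepted
guard.handle_response(t1, "sorted by A")  # rejected: stale token
```

Without such a guard, the response for "A" arriving after the response for "B" would leave the table showing the previously selected sort, which matches the out-of-order behavior hypothesized above.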

@JulianFerry

JulianFerry commented May 6, 2020

I'm having the same issue: sorting mostly doesn't work at all, or else behaves very randomly. I tested this on runs with 2 trials and 1 epoch (stored locally) as well as 8 trials and 5 epochs (on Google Cloud).

I'm using Chrome on MacOS 10.15.4 with package versions:

tensorboard==2.1.1
tensorflow==2.1.0

Running TensorBoard on localhost.

@wchargin
Contributor

wchargin commented May 6, 2020

Running TensorBoard on localhost.

Hmm, interesting. If you’re up for a bit of debugging, would you mind
performing the following steps?

  1. Launch TensorBoard pointing at the “2 trials, 1 epoch, stored
    locally” logdir.
  2. Navigate to the new TensorBoard instance in Firefox or Chrome.
  3. Open your browser’s dev tools and select the “Network” panel.
  4. Use the hparams controls pane to change the sort column and sort
    direction.
  5. Observe in the network panel whether a new request is fired to the
    session_groups endpoint, as in this screenshot.
  6. Observe whether that request finishes quickly, finishes slowly, or
    fails.
  7. Confirm whether the UI properly updates the sort state as you expect
    or not.

My suspicion is that it’s this session_groups request that’s slow or
flaky and causing the problem. Sorry to have to ask you to poke around,
but we just haven’t been able to get a consistent repro for this. If
it’s easy for you to take a video/screencast, that’d be excellent; if
not, any screenshots or descriptions would be helpful. Thank you!

@georgeadam

georgeadam commented May 11, 2020

I am having a very similar problem, and it seems to affect only some columns, namely the metrics columns. However, this depends on the total number of columns.

[screenshot]

The requests to the server are fine in terms of completion.

I will note that this problem also occurs when trying to filter values based on a min-max range for a column whose sorting doesn't work. Interestingly enough, if you filter values before trying to sort, it works; if you try afterwards, it no longer works, likely due to the client-side bugs that are messing with the table.

@tshadley

tshadley commented Aug 14, 2021

I observe this issue, but only when I have at least one hparam that is a string. That string creates an off-by-one miscalculation in the session_groups request payload, judging by the position of the "order:" key.

@PavelBezzub

I also observe a similar problem when the number of string hparams becomes greater than a certain count. When adding additional parameters, the sorting is shifted two columns to the right: when sorting by the third column, the fifth is sorted.

@TangJiakai

I still have this problem...

@bmd3k
Contributor

bmd3k commented Dec 19, 2022

Hi @TangJiakai ,
Thanks for bringing this back to our attention. I hadn't personally noticed this report before.

TensorBoard 2.11.0 should contain #5971 which may address some of the issues here.

Anybody who is still following along or comes here in the future, could you:

  1. Try TensorBoard 2.11.0 or later.
  2. If your problem persists, please post some more details about your problem or consider opening an entirely new issue.
