text plugin incorrectly deserializes scalar strings when running python 3 #647

Closed
nrhinehart opened this issue Oct 14, 2017 · 6 comments · Fixed by #685

Comments

@nrhinehart

Strings with underscores in them aren't being rendered correctly by Tensorboard.

See the SO question here

Package versions: Tensorflow 1.3.0, TensorBoard 0.1.8

import tensorflow as tf
sess = tf.InteractiveSession()
text0 = """/a/b/c_d/f_g_h_2017"""
text1 = """/a/b/c\_d/f\_g\_h\_2017"""
text2 = """/a/b/c\\_d/f\\_g\\_h\\_2017"""

summary_op0 = tf.summary.text('text', tf.convert_to_tensor(text0))
summary_op1 = tf.summary.text('text', tf.convert_to_tensor(text1))
summary_op2 = tf.summary.text('text', tf.convert_to_tensor(text2))
summary_op = tf.summary.merge([summary_op0, summary_op1, summary_op2])
summary_writer = tf.summary.FileWriter('/tmp/tensorboard', sess.graph)
summary = sess.run(summary_op)
summary_writer.add_summary(summary, 0)
summary_writer.flush()
summary_writer.close()

[Screenshot of the text dashboard showing the underscored path segments rendered as italics]

@wchargin
Contributor

This is working as intended. The docs for tf.summary.text and also for tensorboard.summary.text state that the text will be rendered using Markdown formatting—just like the text in this issue—and in Markdown, underscores create italics.

If you don't want this to be the case, you can consider formatting these strings as code, using either

text0 = """`/a/b/c_d/f_g_h_2017`"""  # backticks: inline code formatting
text1 = """    /a/b/c\_d/f\_g\_h\_2017"""  # four-space indent: block code formatting

This yields the following result:

Screenshot of the resulting text dashboard

I'll cross-post to SO.
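(For reference, the effect of the two suggested escapes can be checked directly with the python-markdown package; a quick sketch — exact markup may vary slightly across markdown versions:)

```python
import markdown  # the python-markdown package TensorBoard uses for rendering

# Inline code via backticks: underscores are preserved, no emphasis.
inline = markdown.markdown("`/a/b/c_d/f_g_h_2017`")
print(inline)  # e.g. <p><code>/a/b/c_d/f_g_h_2017</code></p>

# Block code via a four-space indent: rendered inside <pre><code>.
block = markdown.markdown("    /a/b/c_d/f_g_h_2017")
print(block)   # e.g. <pre><code>/a/b/c_d/f_g_h_2017 ...</code></pre>
```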

@nfelt
Contributor

nfelt commented Oct 16, 2017

(Hey, Nick! Fancy running into you here...)

I'm not sure this is actually WAI. I can't repro this behavior, either with a fresh install of tensorflow 1.3.0 (bundled with tensorboard 0.1.8) from PyPI or with tensorboard built independently at HEAD and run against a sample events file from tensorflow nightly. In both cases, all three text samples produce the same expected rendering, with underscores shown:

[Screenshot of the text dashboard showing all three samples rendered with underscores intact]

@nrhine1 - could you check what version of the markdown pip package you have installed? I'm a little suspicious that you might have a stale version - prior to #161 our pip requirements file specified markdown==2.2.0 even though we were using 2.6.8 when building from source, and that's what fresh installs should be picking up now.

The python markdown package (like many others, including GitHub markdown) now defaults to not italicizing mid-word, because doing so interferes so often with filenames that use underscores. This behavior is behind an option (smart_emphasis), but it defaults to True, at least currently. And when it does italicize, backslash-escaping the italics should work. See e.g. using pip install -I markdown==2.6.8:

>>> import markdown
>>> markdown.markdown("""/a/b/c_d/f_g_h_2017""")
u'<p>/a/b/c_d/f_g_h_2017</p>'
>>> markdown.markdown("""/a/b/c\_d/f\_g\_h\_2017""")
u'<p>/a/b/c_d/f_g_h_2017</p>'
>>> markdown.markdown("""/a/b/c\\_d/f\\_g\\_h\\_2017""")
u'<p>/a/b/c_d/f_g_h_2017</p>'
>>> markdown.markdown("""_a/b_""")  # italics shown
u'<p><em>a/b</em></p>'
>>> markdown.markdown("""\_a/b\_""")  # italics escaped
u'<p>_a/b_</p>'

So I'm at a bit of a loss why you're getting the "non-smart" behavior.

@nrhinehart
Author

nrhinehart commented Oct 16, 2017

Hi @nfelt (indeed funny running into you here :)

I was running with 2.6.9:

$ python -c 'import markdown; print(markdown.__version__.version); print(markdown.markdown("""/a/b/c_d/f_g_h_2017"""))'
2.6.9
<p>/a/b/c_d/f_g_h_2017</p>

I installed 2.6.8 as suggested, and also printed the markdown version inside the tensorboard python executable to ensure it was 2.6.8 as expected. The resulting parsed Markdown in tensorboard still treats the underscores as emphasis.

However, using backticks to escape the string as previously suggested does result in the expected rendering.

I've run pip freeze on my virtual environment and attached the output, in case it helps debug.
requirements.txt

I've also tried explicitly passing smart_emphasis=True inside the markdown_to_safe_html() function, but still observed the same result (without "smart emphasis"):
[Screenshot of the text dashboard still showing the underscores rendered as italics]

@nfelt
Contributor

nfelt commented Oct 16, 2017

Hmm, strange. Are you able to reproduce it in a fresh virtual environment with only tensorflow/tensorboard and deps installed? I'm also wondering a little if the presence of the DMLC tensorboard package could be interfering somehow.

@nrhinehart
Author

nrhinehart commented Oct 16, 2017

I am able to reproduce the error in a fresh virtualenv. I created a new python 3.4.3 virtualenv, activated it, ran pip install tensorflow-gpu, cleaned up the old /tmp/tensorboard/* events, and re-ran the script. I visualized the new result with the same emphasis error.

I doubt it's relevant (and hope it's not), but I'm running the script and tensorboard remotely on a server, and visualizing the result in my local browser.

New result of pip freeze:

bleach==1.5.0
html5lib==0.9999999
Markdown==2.6.9
numpy==1.13.3
protobuf==3.4.0
six==1.11.0
tensorflow-gpu==1.3.0
tensorflow-tensorboard==0.1.8
Werkzeug==0.12.2

@nfelt
Contributor

nfelt commented Oct 17, 2017

Okay, thanks for the additional information - I could also repro this in a python 3 environment and it turns out to be a python 2-3 compat issue stemming from some TensorBoard code that works only in python 2. I'll revise the issue title accordingly and this should be a pretty simple fix.

Essentially, the issue is the following line of code:
https://github.com/tensorflow/tensorboard/blob/0.1.8/tensorboard/plugins/text/text_plugin.py#L170

The use of np.dtype(str) is unsafe because it means "null-terminated bytes" in python 2 and "little-endian UTF-32" in python 3. This becomes an actual problem because of the use of tostring(), which is itself a further source of confusion - it's been renamed tobytes() in recent numpy because it actually returns "the raw data bytes in the array". The result is the following behavior:

python2> np.array('foo', np.dtype(str)).tostring()
'foo'
python3> np.array('foo', np.dtype(str)).tostring()
b'f\x00\x00\x00o\x00\x00\x00o\x00\x00\x00'

In python 2, tostring() == the raw bytes which works because numpy's storage format matches python's. In python 3, the raw bytes don't match python - every character is followed by 3 null bytes due to the UTF-32 encoding. This in turn causes the markdown library to see non-word characters surrounding every letter in the string, which means that intra-word underscores like f_g_h_2017 are no longer detected as "word-surrounded underscores" by the smart_emphasis feature and thus do get treated as italics. The null bytes evidently get dropped before HTML rendering, so nothing else broke.
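The width-4 byte pattern can be reproduced (and correctly decoded) with the standard library alone; a sketch, independent of numpy:

```python
# numpy's unicode dtype stores strings as little-endian UTF-32, so dumping
# the raw bytes of 'foo' interleaves three NUL bytes after each ASCII char.
raw = 'foo'.encode('utf-32-le')
assert raw == b'f\x00\x00\x00o\x00\x00\x00o\x00\x00\x00'

# Decoding with the matching codec recovers the text; the bytes only cause
# trouble when they are treated as if they were ASCII/UTF-8.
assert raw.decode('utf-32-le') == 'foo'
```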

@nfelt nfelt reopened this Oct 17, 2017
@nfelt nfelt changed the title tf.summary.text underscores displayed incorrectly text plugin incorrectly deserializes scalar strings when running python 3 Oct 17, 2017
@nfelt nfelt self-assigned this Oct 17, 2017
nfelt added a commit that referenced this issue Oct 31, 2017
This fixes #647 in which TextPlugin incorrectly decodes scalar string values as their raw numpy byte representation before passing them to `markdown_to_safe_html()`.  When running python 2 this happens to work because numpy and Python use the same raw byte representation for string data, but in python 3 this is not the case, and the result is a string with spurious null bytes, leading the markdown processor to interpret certain patterns incorrectly.  See the issue for details.

This both fixes the TextPlugin decoding logic itself, and also fixes `markdown_to_safe_html()` so that it will discard null bytes and log a warning about them before doing markdown interpretation.

Tested under python 2 and 3 via `bazel test --python_path=...`.
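The null-byte guard described in the commit message might look roughly like the following (a sketch; the helper name strip_null_bytes is hypothetical, not the actual function in the patch):

```python
import logging

def strip_null_bytes(text):
    # Hypothetical helper mirroring the described fix: discard NUL bytes
    # and log a warning before handing the text to markdown interpretation.
    if '\x00' in text:
        logging.warning('Markdown input contained null bytes; stripping them.')
        text = text.replace('\x00', '')
    return text

# Text recovered from UTF-32 raw bytes is cleaned up instead of confusing
# the markdown parser's word-boundary detection.
print(strip_null_bytes('f\x00_\x00g\x00_\x00h'))  # f_g_h
```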
jart pushed a commit to jart/tensorboard that referenced this issue Oct 31, 2017
jart pushed a commit that referenced this issue Nov 1, 2017