text plugin incorrectly deserializes scalar strings when running python 3 #647

Closed
nrhinehart opened this issue Oct 14, 2017 · 6 comments · Fixed by #685

Comments

@nrhinehart

Strings with underscores in them aren't being rendered correctly by Tensorboard.

See the SO question here

Package versions: Tensorflow 1.3.0, TensorBoard 0.1.8

import tensorflow as tf
sess = tf.InteractiveSession()
text0 = """/a/b/c_d/f_g_h_2017"""
text1 = """/a/b/c\_d/f\_g\_h\_2017"""
text2 = """/a/b/c\\_d/f\\_g\\_h\\_2017"""

summary_op0 = tf.summary.text('text', tf.convert_to_tensor(text0))
summary_op1 = tf.summary.text('text', tf.convert_to_tensor(text1))
summary_op2 = tf.summary.text('text', tf.convert_to_tensor(text2))
summary_op = tf.summary.merge([summary_op0, summary_op1, summary_op2])
summary_writer = tf.summary.FileWriter('/tmp/tensorboard', sess.graph)
summary = sess.run(summary_op)
summary_writer.add_summary(summary, 0)
summary_writer.flush()
summary_writer.close()

[Screenshot of the text dashboard showing the underscored path segments rendered as italics]

@wchargin
Contributor

This is working as intended. The docs for tf.summary.text and also for tensorboard.summary.text state that the text will be rendered using Markdown formatting—just like the text in this issue—and in Markdown, underscores create italics.

If you don't want this to be the case, you can consider formatting these strings as code, using either

text0 = """`/a/b/c_d/f_g_h_2017`"""  # backticks: inline code formatting
text1 = """    /a/b/c\_d/f\_g\_h\_2017"""  # four-space indent: block code formatting

This yields the following result:

Screenshot of the resulting text dashboard

I'll cross-post to SO.
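(For reference, the effect of the two suggested escapes can be checked directly with the python-markdown package; a quick sketch — exact markup may vary slightly across markdown versions:)

```python
import markdown  # the python-markdown package TensorBoard uses for rendering

# Inline code via backticks: underscores are preserved, no emphasis.
inline = markdown.markdown("`/a/b/c_d/f_g_h_2017`")
print(inline)  # e.g. <p><code>/a/b/c_d/f_g_h_2017</code></p>

# Block code via a four-space indent: rendered inside <pre><code>.
block = markdown.markdown("    /a/b/c_d/f_g_h_2017")
print(block)   # e.g. <pre><code>/a/b/c_d/f_g_h_2017 ...</code></pre>
```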

@nfelt
Contributor

nfelt commented Oct 16, 2017

(Hey, Nick! Fancy running into you here...)

I'm not sure this is actually WAI. I can't repro this behavior, either with a fresh install of tensorflow 1.3.0 (bundled with tensorboard 0.1.8) from PyPI or with tensorboard built independently at HEAD and run against a sample events file from tensorflow nightly. In both cases, all three text samples produce the same expected rendering, with underscores shown:

[Screenshot of the text dashboard showing all three samples rendered with underscores intact]

@nrhine1 - could you check what version of the markdown pip package you have installed? I'm a little suspicious that you might have a stale version - prior to #161 our pip requirements file specified markdown==2.2.0 even though we were using 2.6.8 when building from source, and that's what fresh installs should be picking up now.

The python markdown package (like many others, including GitHub markdown) now defaults to not italicizing mid-word, because doing so interferes so often with filenames that use underscores. This behavior is behind an option (smart_emphasis), but it defaults to True, at least currently. And when it does italicize, backslash-escaping the italics should work. See e.g. using pip install -I markdown==2.6.8:

>>> import markdown
>>> markdown.markdown("""/a/b/c_d/f_g_h_2017""")
u'<p>/a/b/c_d/f_g_h_2017</p>'
>>> markdown.markdown("""/a/b/c\_d/f\_g\_h\_2017""")
u'<p>/a/b/c_d/f_g_h_2017</p>'
>>> markdown.markdown("""/a/b/c\\_d/f\\_g\\_h\\_2017""")
u'<p>/a/b/c_d/f_g_h_2017</p>'
>>> markdown.markdown("""_a/b_""")  # italics shown
u'<p><em>a/b</em></p>'
>>> markdown.markdown("""\_a/b\_""")  # italics escaped
u'<p>_a/b_</p>'

So I'm at a bit of a loss why you're getting the "non-smart" behavior.

@nrhinehart
Author

nrhinehart commented Oct 16, 2017

Hi @nfelt (indeed funny running into you here :)

I was running with 2.6.9:

$ python -c 'import markdown; print(markdown.__version__.version); print(markdown.markdown("""/a/b/c_d/f_g_h_2017"""))'
2.6.9
<p>/a/b/c_d/f_g_h_2017</p>

I installed 2.6.8 as suggested, and also printed the markdown version inside the tensorboard python executable to ensure it was 2.6.8 as expected. The resulting parsed Markdown in tensorboard still treats the underscores as emphasis.

However, using backticks to escape the string as previously suggested does result in the expected rendering.

I've run pip freeze on my virtual environment and attached the output, in case it helps debug.
requirements.txt

I've also tried explicitly passing smart_emphasis=True inside the markdown_to_safe_html() function, but still observed the same result (without "smart emphasis"):
[Screenshot of the text dashboard still showing the underscores rendered as italics]

@nfelt
Contributor

nfelt commented Oct 16, 2017

Hmm, strange. Are you able to reproduce it in a fresh virtual environment with only tensorflow/tensorboard and deps installed? I'm also wondering a little if the presence of the DMLC tensorboard package could be interfering somehow.

@nrhinehart
Author

nrhinehart commented Oct 16, 2017

I am able to reproduce the error in a fresh virtualenv. I created a new python 3.4.3 virtualenv, activated it, ran pip install tensorflow-gpu, cleaned up the old /tmp/tensorboard/* events, and re-ran the script. I visualized the new result with the same emphasis error.

I doubt it's relevant (and hope it's not), but I'm running the script and tensorboard remotely on a server, and visualizing the result in my local browser.

New result of pip freeze:

bleach==1.5.0
html5lib==0.9999999
Markdown==2.6.9
numpy==1.13.3
protobuf==3.4.0
six==1.11.0
tensorflow-gpu==1.3.0
tensorflow-tensorboard==0.1.8
Werkzeug==0.12.2

@nfelt
Contributor

nfelt commented Oct 17, 2017

Okay, thanks for the additional information - I could also repro this in a python 3 environment and it turns out to be a python 2-3 compat issue stemming from some TensorBoard code that works only in python 2. I'll revise the issue title accordingly and this should be a pretty simple fix.

Essentially, the issue is the following line of code:
https://github.com/tensorflow/tensorboard/blob/0.1.8/tensorboard/plugins/text/text_plugin.py#L170

The use of np.dtype(str) is unsafe because it means "null-terminated bytes" in python 2 and "little-endian UTF-32" in python 3. This becomes an actual problem because of the use of tostring(), which is itself a further source of confusion - it's been renamed tobytes() in recent numpy because it actually returns "the raw data bytes in the array". The result is the following behavior:

python2> np.array('foo', np.dtype(str)).tostring()
'foo'
python3> np.array('foo', np.dtype(str)).tostring()
b'f\x00\x00\x00o\x00\x00\x00o\x00\x00\x00'

In python 2, tostring() == the raw bytes which works because numpy's storage format matches python's. In python 3, the raw bytes don't match python - every character is followed by 3 null bytes due to the UTF-32 encoding. This in turn causes the markdown library to see non-word characters surrounding every letter in the string, which means that intra-word underscores like f_g_h_2017 are no longer detected as "word-surrounded underscores" by the smart_emphasis feature and thus do get treated as italics. The null bytes evidently get dropped before HTML rendering, so nothing else broke.
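The width-4 byte pattern can be reproduced (and correctly decoded) with the standard library alone; a sketch, independent of numpy:

```python
# numpy's unicode dtype stores strings as little-endian UTF-32, so dumping
# the raw bytes of 'foo' interleaves three NUL bytes after each ASCII char.
raw = 'foo'.encode('utf-32-le')
assert raw == b'f\x00\x00\x00o\x00\x00\x00o\x00\x00\x00'

# Decoding with the matching codec recovers the text; the bytes only cause
# trouble when they are treated as if they were ASCII/UTF-8.
assert raw.decode('utf-32-le') == 'foo'
```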

@nfelt nfelt reopened this Oct 17, 2017
@nfelt nfelt changed the title tf.summary.text underscores displayed incorrectly text plugin incorrectly deserializes scalar strings when running python 3 Oct 17, 2017
@nfelt nfelt self-assigned this Oct 17, 2017
nfelt added a commit that referenced this issue Oct 31, 2017
This fixes #647 in which TextPlugin incorrectly decodes scalar string values as their raw numpy byte representation before passing them to `markdown_to_safe_html()`.  When running python 2 this happens to work because numpy and Python use the same raw byte representation for string data, but in python 3 this is not the case, and the result is a string with spurious null bytes, leading the markdown processor to interpret certain patterns incorrectly.  See the issue for details.

This both fixes the TextPlugin decoding logic itself, and also fixes `markdown_to_safe_html()` so that it will discard null bytes and log a warning about them before doing markdown interpretation.

Tested under python 2 and 3 via `bazel test --python_path=...`.
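The null-byte guard described in the commit message might look roughly like the following (a sketch; the helper name strip_null_bytes is hypothetical, not the actual function in the patch):

```python
import logging

def strip_null_bytes(text):
    # Hypothetical helper mirroring the described fix: discard NUL bytes
    # and log a warning before handing the text to markdown interpretation.
    if '\x00' in text:
        logging.warning('Markdown input contained null bytes; stripping them.')
        text = text.replace('\x00', '')
    return text

# Text recovered from UTF-32 raw bytes is cleaned up instead of confusing
# the markdown parser's word-boundary detection.
print(strip_null_bytes('f\x00_\x00g\x00_\x00h'))  # f_g_h
```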
jart pushed a commit to jart/tensorboard that referenced this issue Oct 31, 2017
jart pushed a commit that referenced this issue Nov 1, 2017