-
Notifications
You must be signed in to change notification settings - Fork 3.4k
fix_string_meminit_windows #3854
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix_string_meminit_windows #3854
Conversation
This fixes Windows bot tests |
f.write(open(out_file).read()) | ||
f.write('\n') | ||
f.write(open(out_file, 'rb').read()) | ||
nl = '\r\n' if WINDOWS else '\n' # Writing in binary mode, so make sure to generate correct newline output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can just use os.linesep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check.
I wonder if after removing those special characters, this compression is still efficient? cc @gagern |
b3e0074
to
c8b9d6d
Compare
That's a good question. Also, I think we need to research to some specs (which?) to figure out what are the guaranteed safe bytes in .js files. My impression is that it is outright incorrect to have the One thing I notice is that blindly replacing text mode file reads and writes to binary mode reads and writes will lead to trouble on Windows. On Windows we have had the convention of generating |
return re.sub('[\x80-\xff]', escape, s) | ||
s = re.sub('\x00(?![0-9])', '\\0', s) # Don't allow a null byte in the generated output file. | ||
s = re.sub('[\x00-\x1f\x80-\xff]', escape, s) # Don't allow ASCII chars 0-31 or 128-255 in the generated output file. | ||
return s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to apply similar changes to the uglify to_ascii
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that is likely true, but I'm a bit wary to make that in blind. It could be done "for good measure" but I think it's better to defer until there is a problem/test to demonstrate the need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that in #3326 (comment) I detected someone postprocessing the strings, and in #3326 (comment) I identified uglify as the culprit. So there was some setup where I could enable that processing step without even trying. Likely tied to some optimization level. Not sure whether that applies anymore; I've not been updating my emscripten toolbox in some time. If the code still gets run, it should affect Windows builds the same way as the original problem here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, reading that, I notice that we also have that asm2m
tests harness that's commented out, I'm running through that now to make sure it's all good. Hopefully I'll see that same issue.
Regarding the question whether control bytes in general and null bytes in particular are permissible in JavaScript, see #3326 (comment) referencing section 7.8.4 of the ECMA 262 spec. But just because the spec allows something doesn't mean all the tools along the way can actually deal with it. In particular since the spec talks about characters, without regard for encoding, while some file format interpretations might treat control bytes as non-characters. Regarding the efficiency of the compression, testing beats philosophizing. But it depends a lot on what kind of code you're dealing with, in particular the split between text and numbers, and whether those numbers tend to be small or random. For comparison, we could reactivate the idea from #3326 (comment) of XORing every byte with 32, hopefully turning more control bytes into digits and other printable characters than the other way round. On the whole, I very much like the gist of this change. It's something I intended to do for quite a while but never find the time to. Thanks a lot! |
I suppose we should just merge this, and do further measurements/testing later? |
What's the status here? |
c8b9d6d
to
df96b45
Compare
Rebased this on top of latest incoming to avoid very long branches in git history, since there's been a lot of commits in between. No functional changes. Retested that This PR does make the escaping a bit more conservative that raw bytes |
@@ -2943,7 +2943,8 @@ def test_meminit_pairs(self): | |||
args = ["-O2", "--memory-init-file", "0", "-s", "MEM_INIT_METHOD=2", "-s", "ASSERTIONS=1"] | |||
self.btest(d, expected='0', args=args + ["--closure", "0"]) | |||
self.btest(d, expected='0', args=args + ["--closure", "0", "-g"]) | |||
self.btest(d, expected='0', args=args + ["--closure", "1"]) | |||
if not WINDOWS: # Closure compiler fails on Windows when it receives the unexpected characters in memory string initializer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this and the similar thing below still needed, even after this pr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, I'll test and reconfirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it still is, that seems like either a bug we should fix here, or a bug in closure if it behaves differently on different OSes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suspect at the time was that it would be a Closure bug, but I did not dig deep enough to understand and confirm. I'm now running through the asm2m
suite to see what comes up, so I'll see after that if this is still an issue.
…t files on Windows.
…hed line endings.
df96b45
to
a761768
Compare
Looking closer at the The PR also now piggybacks a slightly related helper commit that improves error message prints when our line endings mismatch. @gagern, @kripken: how does this look now? |
Oh, I see the last commit does that, nevermind that comment comment then. |
lgtm |
I can't merge to master today anyhow, so let's merge this now. |
…e#3326" This reverts commit b256af6. Since emscripten-core#3854 started escaping \x1a, this should no longer be needed. Conflicts due to code reformatting had to be resolved manually.
Fix memory string initializer to generate format that is safe for text files. Perform Emscripten optimizer cleanup pass in binary file read and write mode to not get confused by the special characters in the string initializer. Skip closure compiler when memory string initializer is used on Windows, since closure trips over when presented with such input. Related to #3326. cc @gagern .