-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Escape control chars even if emitting UTF8 #1178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Escape control chars even if emitting UTF8 #1178
Conversation
src/lib_json/json_writer.cpp
Outdated
@@ -309,31 +317,25 @@ static String valueToQuotedStringN(const char* value, unsigned length, | |||
// Should add a flag to allow this compatibility mode and prevent this | |||
// sequence from occurring. | |||
default: { | |||
unsigned codepoint; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know, uninitialized variables are bad practice, even if in the current code a value will always be assigned, this might not be the case in future revisions. Even the compilers are issuing warnings for such variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not worried about it.
I know they're issuing warnings if the variable is read before initialization, but here the compiler should be able to prove codepoint to be safe. I don't want to have to invent some value like =0 here, that's IMO worse because it shows the reader a value that looks like it has a meaning but it doesn't. It also opens us up to the same class of maintenance bugs in it would be trying to avoid. If the variable somehow escapes without being overwritten, it will survive with the invalid sentinel value. I could wrap the codepoint initializer in an immediately-invoked-lambda, but I think that's just too clever for this kind of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
= 0
is not needed, you can just use braced initialization {}
which will not give wrong impression to the reader.
What you say does have a point, but in case the variable gets used without an initialization, you will be having buggy random behavior(based on what random value it have), while when initialized you will have buggy, but fixed behavior which is easier to track. :)
src/lib_json/json_writer.cpp
Outdated
|
||
if (codepoint < 0x20) { | ||
appendHex(result, codepoint); | ||
} else if (codepoint < 0x80 || emitUTF8) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have double checking for emitUTF8
. First on initialization and second here. My PR had only one such check, Not a big deal, just mentioning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't like it either. It's clearer to separate the emitUTF8 path. Thanks.
src/test_lib_json/main.cpp
Outdated
}; | ||
|
||
Json::StreamWriterBuilder b; | ||
b.settings_["emitUTF8"] = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also add a validation with that setting disabled, just to make sure the encoding is proper and prevent future regressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/lib_json/json_writer.cpp
Outdated
} else { | ||
// Extended Unicode. Encode 20 bits as a surrogate pair. | ||
codepoint -= 0x10000; | ||
appendHex(result, 0xd800 + ((codepoint >> 10) & 0x3ff)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using & 0x3ff
is not really needed for the first code point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defense in depth, here. It's an interesting question. We can only write 20 bits of the 32 bit codepoint, as a limitation of json's encoding. We would need to know that codepoint has no bits above bit 20 set, or the UTF-8 output would be broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. For some reason I'm unable to resolve my comments, but anyway, I think this can be merged.
P.S. When can I expect a new release containing these changes? I don't want to be pushy but I really don't like to wait several months for simple fixes. :)
See #1176
Fixes #1175