Skip to content

Escape control chars even if emitting UTF8 #1178

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

BillyDonahue
Copy link
Contributor

See #1176
Fixes #1175

@coveralls
Copy link

coveralls commented May 21, 2020

Coverage Status

Coverage increased (+0.01%) to 93.834% when pulling 0b6a3a5 on BillyDonahue:continue1176_minimal into 75b360a on open-source-parsers:master.

@@ -309,31 +317,25 @@ static String valueToQuotedStringN(const char* value, unsigned length,
// Should add a flag to allow this compatibility mode and prevent this
// sequence from occurring.
default: {
unsigned codepoint;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know, uninitialized variables are bad practice, even if in the current code a value will always be assigned, this might not be the case in future revisions. Even the compilers are issuing warnings for such variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not worried about it.
I know they're issuing warnings if the variable is read before initialization, but here the compiler should be able to prove codepoint to be safe. I don't want to have to invent some value like =0 here, that's IMO worse because it shows the reader a value that looks like it has a meaning but it doesn't. It also opens us up to the same class of maintenance bugs in it would be trying to avoid. If the variable somehow escapes without being overwritten, it will survive with the invalid sentinel value. I could wrap the codepoint initializer in an immediately-invoked-lambda, but I think that's just too clever for this kind of code.

Copy link
Contributor

@TheStormN TheStormN May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

= 0 is not needed, you can just use braced initialization {} which will not give wrong impression to the reader.
What you say does have a point, but in case the variable gets used without an initialization, you will be having buggy random behavior(based on what random value it have), while when initialized you will have buggy, but fixed behavior which is easier to track. :)


if (codepoint < 0x20) {
appendHex(result, codepoint);
} else if (codepoint < 0x80 || emitUTF8) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have double checking for emitUTF8. First on initialization and second here. My PR had only one such check, Not a big deal, just mentioning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't like it either. It's clearer to separate the emitUTF8 path. Thanks.

};

Json::StreamWriterBuilder b;
b.settings_["emitUTF8"] = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also add a validation with that setting disabled, just to make sure the encoding is proper and prevent future regressions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

} else {
// Extended Unicode. Encode 20 bits as a surrogate pair.
codepoint -= 0x10000;
appendHex(result, 0xd800 + ((codepoint >> 10) & 0x3ff));
Copy link
Contributor

@TheStormN TheStormN May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using & 0x3ff is not really needed for the first code point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defense in depth, here. It's an interesting question. We can only write 20 bits of the 32 bit codepoint, as a limitation of json's encoding. We would need to know that codepoint has no bits above bit 20 set, or the UTF-8 output would be broken.

Copy link
Contributor

@TheStormN TheStormN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. For some reason I'm unable to resolve my comments, but anyway, I think this can be merged.

P.S. When can I expect a new release containing these changes? I don't want to be pushy but I really don't like to wait several months for simple fixes. :)

@BillyDonahue BillyDonahue merged commit c161f4a into open-source-parsers:master May 21, 2020
@BillyDonahue BillyDonahue deleted the continue1176_minimal branch May 21, 2020 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Control characters are not escaped when emitUTF8 = true
4 participants