Escape control chars even if emitting UTF8 #1178

BillyDonahue · 2020-05-21T08:43:51Z

See open-source-parsers#1176 Fixes open-source-parsers#1175

coveralls · 2020-05-21T08:47:01Z

Coverage increased (+0.01%) to 93.834% when pulling 0b6a3a5 on BillyDonahue:continue1176_minimal into 75b360a on open-source-parsers:master.

TheStormN · 2020-05-21T09:16:07Z

src/lib_json/json_writer.cpp

@@ -309,31 +317,25 @@ static String valueToQuotedStringN(const char* value, unsigned length,
    // Should add a flag to allow this compatibility mode and prevent this
    // sequence from occurring.
    default: {
+      unsigned codepoint;


Uninitialized variables are bad practice. Even if the current code always assigns a value, that might not hold in future revisions. Compilers even issue warnings for such variables.

I'm not worried about it.
I know they issue warnings if the variable is read before initialization, but here the compiler should be able to prove that codepoint is safe. I don't want to invent a value like =0 here; that's IMO worse, because it shows the reader a value that looks meaningful but isn't. It also opens us up to the same class of maintenance bugs it would be trying to avoid: if the variable somehow escapes without being overwritten, it will survive with the invalid sentinel value. I could wrap the codepoint initializer in an immediately-invoked lambda, but I think that's too clever for this kind of code.

= 0 is not needed; you can use braced initialization {}, which will not give the reader the wrong impression.
You do have a point, but if the variable gets used without initialization, you get buggy, random behavior (depending on whatever value it happens to hold), whereas an initialized variable gives buggy but deterministic behavior, which is easier to track down. :)
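
For reference, the three options discussed in this thread can be sketched side by side. This is only an illustration: the function names and the `raw` / `decoded` parameters are hypothetical stand-ins and are not taken from json_writer.cpp.

```cpp
// Option A: no initializer. Every branch assigns before the first read,
// so the value is never observed uninitialized.
unsigned pickUninitialized(bool emitUTF8, unsigned raw, unsigned decoded) {
  unsigned codepoint;
  if (emitUTF8) {
    codepoint = raw;
  } else {
    codepoint = decoded;
  }
  return codepoint;
}

// Option B: braced value-initialization. codepoint starts as 0, so a
// forgotten assignment yields a deterministic (but meaningless) value.
unsigned pickBraced(bool emitUTF8, unsigned raw, unsigned decoded) {
  unsigned codepoint{};
  if (emitUTF8) {
    codepoint = raw;
  } else {
    codepoint = decoded;
  }
  return codepoint;
}

// Option C: immediately-invoked lambda. The variable has exactly one
// initializer and can be const, at the cost of denser syntax.
unsigned pickLambda(bool emitUTF8, unsigned raw, unsigned decoded) {
  const unsigned codepoint = [&] { return emitUTF8 ? raw : decoded; }();
  return codepoint;
}
```

All three return the same result for the same arguments; they differ only in what a reader or a future maintainer sees if one branch stops assigning.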

TheStormN · 2020-05-21T09:17:18Z

src/lib_json/json_writer.cpp

+
+      if (codepoint < 0x20) {
+        appendHex(result, codepoint);
+      } else if (codepoint < 0x80 || emitUTF8) {


emitUTF8 is checked twice: once at initialization and again here. My PR had only one such check. Not a big deal, just mentioning it.

Yeah I don't like it either. It's clearer to separate the emitUTF8 path. Thanks.
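
A rough sketch of what separating the two paths could look like, so that the flag is tested once and each path has its own rules. The helper names (`appendHexSketch`, `appendUtf8Mode`, `appendEscapedMode`) are hypothetical and not the functions in json_writer.cpp, and the escaping path assumes the code point was already decoded elsewhere and lies in the Basic Multilingual Plane.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: append cp as a \uXXXX escape (low 16 bits only).
void appendHexSketch(std::string& out, unsigned cp) {
  char buf[7];  // backslash, 'u', four hex digits, terminating NUL
  std::snprintf(buf, sizeof buf, "\\u%04x", cp & 0xffffu);
  out += buf;
}

// emitUTF8 path: bytes pass through unchanged, except control characters,
// which are still escaped.
void appendUtf8Mode(std::string& out, unsigned char byte) {
  if (byte < 0x20) {
    appendHexSketch(out, byte);
  } else {
    out += static_cast<char>(byte);
  }
}

// Escaping path: takes an already-decoded code point. Control characters
// and non-ASCII are escaped; printable ASCII is copied as-is.
void appendEscapedMode(std::string& out, unsigned cp) {
  assert(cp < 0x10000);  // sketch covers the BMP only; no surrogate pairs
  if (cp < 0x20 || cp >= 0x80) {
    appendHexSketch(out, cp);
  } else {
    out += static_cast<char>(cp);
  }
}
```

A caller would pick one of the two functions based on emitUTF8 before touching the character, rather than re-testing the flag inside a shared chain of conditions.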

TheStormN · 2020-05-21T09:19:01Z

src/test_lib_json/main.cpp

+  };
+
+  Json::StreamWriterBuilder b;
+  b.settings_["emitUTF8"] = true;


You could also add a check with that setting disabled, to verify the escaped encoding and prevent future regressions.

TheStormN · 2020-05-21T09:25:40Z

src/lib_json/json_writer.cpp

+      } else {
+        // Extended Unicode. Encode 20 bits as a surrogate pair.
+        codepoint -= 0x10000;
+        appendHex(result, 0xd800 + ((codepoint >> 10) & 0x3ff));


The & 0x3ff mask is not really needed for the first (high) surrogate.

Defense in depth, here. It's an interesting question. We can only write 20 bits of the 32-bit codepoint, a limitation of JSON's \u escape encoding. We would need to know that codepoint has no bits above bit 20 set, or the escaped output would be broken.
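
To make the masking discussion concrete, here is a small standalone sketch of the surrogate-pair arithmetic from the hunk above. The function names are hypothetical; only the constants and shifts come from the diff.

```cpp
#include <cassert>
#include <utility>

// High surrogate computed from the 20-bit value (code point - 0x10000),
// with and without the mask under discussion.
unsigned highMasked(unsigned v) { return 0xd800 + ((v >> 10) & 0x3ff); }
unsigned highUnmasked(unsigned v) { return 0xd800 + (v >> 10); }

// Split a supplementary-plane code point (U+10000..U+10FFFF) into a
// UTF-16 surrogate pair, as needed for JSON \uXXXX\uXXXX escapes.
std::pair<unsigned, unsigned> toSurrogatePair(unsigned cp) {
  assert(cp >= 0x10000 && cp <= 0x10FFFF);
  cp -= 0x10000;  // now fits in 20 bits
  unsigned high = highMasked(cp);
  unsigned low = 0xdc00 + (cp & 0x3ff);
  return {high, low};
}
```

For any valid code point the masked and unmasked forms agree, which is the reviewer's point. They diverge only for out-of-range input: with v = 0x100000 (code point 0x110000), the unmasked form yields 0xdc00, which falls in the low-surrogate range, while the masked form stays at 0xd800. The result is still wrong for such input, but it remains a high surrogate.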

TheStormN

Looks good. For some reason I'm unable to resolve my comments, but anyway, I think this can be merged.

P.S. When can I expect a new release containing these changes? I don't want to be pushy but I really don't like to wait several months for simple fixes. :)

Escape control chars even if emitting UTF8

a632154

See open-source-parsers#1176 Fixes open-source-parsers#1175

BillyDonahue mentioned this pull request May 21, 2020

Fixed control chars escaping with enabled emitUTF8 #1176

Closed

dota17 reviewed May 21, 2020

dota17 approved these changes May 21, 2020

TheStormN reviewed May 21, 2020

BillyDonahue added 2 commits May 21, 2020 05:43

review comments

c158322

fix test by stopping early enough to punt on utf8-input.

0b6a3a5

TheStormN approved these changes May 21, 2020

BillyDonahue merged commit c161f4a into open-source-parsers:master May 21, 2020

BillyDonahue deleted the continue1176_minimal branch May 21, 2020 15:31
