Fixed control chars escaping with enabled emitUTF8 #1176

TheStormN · 2020-05-19T13:23:22Z

coveralls · 2020-05-19T13:26:33Z

Coverage decreased (-0.03%) to 93.791% when pulling 0c6deca on TheStormN:fix-control-chars-encoding-in-utf8emit-mode into 75b360a on open-source-parsers:master.

dota17 · 2020-05-20T11:55:26Z

src/lib_json/json_writer.cpp

@@ -310,7 +310,12 @@ static String valueToQuotedStringN(const char* value, unsigned length,
    // sequence from occurring.
    default: {
      if (emitUTF8) {
-        result += *c;
+        if (isControlCharacter(*c)) {


Good catch.
But I think there is no need to add more method.
You can refer these code : https://github.com/nlohmann/json/blob/a861bdb81f1c5ef0adc270952001096b189dbe9e/include/nlohmann/detail/output/serializer.hpp#L406-#L428

emitUTF maybe not a good name.

From the code you are referring I guess you mean that I should embed that code as part of the current one which is doing the utf8 encoding?

And yeah, encodeUTF8 would be much better name, but regardless of the encoding, control and special characters should be encoded/escaped.

Ok, I've squeezed it a bit, but utf8ToCodepoint should not be called when we don't want to do special processing for utf-8, because it offsets the pointer.

BillyDonahue · 2020-05-20T16:02:28Z

src/lib_json/json_writer.cpp

-        } else if (codepoint <
-                   FIRST_SURROGATE_PAIR_CODEPOINT) { // codepoint is in Basic
-                                                     // Multilingual Plane
+      const unsigned int codepoint = emitUTF8 ? *c : utf8ToCodepoint(c, end);


Bug: *c needs to be replaced with static_cast<unsigned char>(*c).

Please take out all the added 'u' numeric literal suffixes. They aren't doing anything and they're adding diff to this change.

Remove the const here so we can keep the surrogate pair encoding clear.

Condition flip caused code to move around more than we need. I do appreciate the de-Yoda-fication on the lower bound part of the condition though.

It's all so unclear! How about we simplify it.

unsigned codepoint; if (emitUTF8) { codepoint = static_cast<unsigned char>(*c); } else { codepoint = utf8ToCodepoint(c, end); // modifies `c` } if (codepoint >= 0x20 && codepoint <= 0x7f) { result.append(static_cast<char>(cp)); } else { auto appendHexChar = [&result](unsigned ch) { result.append("\\u").append(toHex16Bit(ch)); }; if (codepoint < 0x10000) { // Basic Multilingual Plane appendHexChar(codepoint); } else { // Extended Unicode. Encode 20 bits as a surrogate pair. codepoint -= 0x10000; appendHexChar(0xd800 + (codepoint >> 10) & 0x3ff); appendHexChar(0xdc00 + codepoint & 0x3ff); } }

The code you've proposed does not pass the unit tests. I've tried to fix but it wasn't something obvious. If you can do it, I'll be glad to close this PR and let you fix it.

I did some rework on my own code. Check if you like it more now, or if you can fix your example.

I fixed my example, at your invitation, (PR #1178)
I'd like to go with my version of the control flow here.

I also added a pretty good test we should integrate.

We all have our approaches and opinions on what's best. At first I wanted to fix the bug with minimal diff, as you can see in my first commit, later a several refactoring approaches were pointed to which I tried to comply...

Anyway I mostly care about the result, so as long as your PR is merged, this one can be closed.

BillyDonahue · 2020-05-20T16:37:07Z

src/lib_json/json_writer.cpp

          result += "\\u";
-          result += toHex16Bit((codepoint >> 10) + 0xD800);
+          result += toHex16Bit((codepoint >> 10u) + 0xD7C0u);


I prefer the previous formulation where the codepoint is modified and then encoded. I guess that this was done so codepoint could be declared const, and I don't agree with that tradeoff.

Well, the const wasn't the only reason, it was actually one operation (minus) less. I don't see what's the point to do it in 2 steps, but for the sake of this bugfix I don't care much so I will change it as you want.

src/lib_json/json_writer.cpp

See open-source-parsers#1176 Fixes open-source-parsers#1175

* Escape control chars even if emitting UTF8 See #1176 Fixes #1175 * review comments * fix test by stopping early enough to punt on utf8-input.

TheStormN · 2020-05-21T17:52:13Z

Closed in favor of #1178

Fixed control chars escaping with enabled emitUTF8

6973244

dota17 reviewed May 20, 2020

View reviewed changes

TheStormN requested a review from dota17 May 20, 2020 15:32

BillyDonahue requested changes May 20, 2020

View reviewed changes

TheStormN requested a review from BillyDonahue May 20, 2020 20:32

TheStormN added 2 commits May 20, 2020 23:41

Code-review fixes

d268648

Code-review fixes 2

8a1417e

dota17 reviewed May 21, 2020

View reviewed changes

src/lib_json/json_writer.cpp Show resolved Hide resolved

dota17 reviewed May 21, 2020

View reviewed changes

src/lib_json/json_writer.cpp Show resolved Hide resolved

BillyDonahue added a commit to BillyDonahue/jsoncpp that referenced this pull request May 21, 2020

continue open-source-parsers#1176

d058e37

BillyDonahue added a commit to BillyDonahue/jsoncpp that referenced this pull request May 21, 2020

continue open-source-parsers#1176

a350959

BillyDonahue added a commit to BillyDonahue/jsoncpp that referenced this pull request May 21, 2020

Escape control chars even if emitting UTF8 open-source-parsers#1176

42bd7db

BillyDonahue added a commit to BillyDonahue/jsoncpp that referenced this pull request May 21, 2020

Escape control chars even if emitting UTF8

a632154

See open-source-parsers#1176 Fixes open-source-parsers#1175

TheStormN requested a review from dota17 May 21, 2020 08:42

BillyDonahue mentioned this pull request May 21, 2020

Escape control chars even if emitting UTF8 #1178

Merged

Code-review fixes 3

0c6deca

dota17 approved these changes May 21, 2020

View reviewed changes

BillyDonahue added a commit that referenced this pull request May 21, 2020

Escape control chars even if emitting UTF8 (#1178)

c161f4a

* Escape control chars even if emitting UTF8 See #1176 Fixes #1175 * review comments * fix test by stopping early enough to punt on utf8-input.

TheStormN closed this May 21, 2020

TheStormN deleted the fix-control-chars-encoding-in-utf8emit-mode branch May 21, 2020 17:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed control chars escaping with enabled emitUTF8 #1176

Fixed control chars escaping with enabled emitUTF8 #1176

TheStormN commented May 19, 2020

coveralls commented May 19, 2020 •

edited

Loading

dota17 May 20, 2020

TheStormN May 20, 2020

TheStormN May 20, 2020 •

edited

Loading

BillyDonahue May 20, 2020

TheStormN May 20, 2020

TheStormN May 20, 2020

BillyDonahue May 21, 2020

TheStormN May 21, 2020 •

edited

Loading

BillyDonahue May 20, 2020

TheStormN May 20, 2020 •

edited

Loading

TheStormN commented May 21, 2020

Fixed control chars escaping with enabled emitUTF8 #1176

Fixed control chars escaping with enabled emitUTF8 #1176

Conversation

TheStormN commented May 19, 2020

coveralls commented May 19, 2020 • edited Loading

dota17 May 20, 2020

Choose a reason for hiding this comment

TheStormN May 20, 2020

Choose a reason for hiding this comment

TheStormN May 20, 2020 • edited Loading

Choose a reason for hiding this comment

BillyDonahue May 20, 2020

Choose a reason for hiding this comment

TheStormN May 20, 2020

Choose a reason for hiding this comment

TheStormN May 20, 2020

Choose a reason for hiding this comment

BillyDonahue May 21, 2020

Choose a reason for hiding this comment

TheStormN May 21, 2020 • edited Loading

Choose a reason for hiding this comment

BillyDonahue May 20, 2020

Choose a reason for hiding this comment

TheStormN May 20, 2020 • edited Loading

Choose a reason for hiding this comment

TheStormN commented May 21, 2020

coveralls commented May 19, 2020 •

edited

Loading

TheStormN May 20, 2020 •

edited

Loading

TheStormN May 21, 2020 •

edited

Loading

TheStormN May 20, 2020 •

edited

Loading