Skip to content

Fixed control chars escaping with enabled emitUTF8 #1176

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

Fixed control chars escaping with enabled emitUTF8 #1176

wants to merge 4 commits into from

Conversation

TheStormN
Copy link
Contributor

Fixes #1175

@coveralls
Copy link

coveralls commented May 19, 2020

Coverage Status

Coverage decreased (-0.03%) to 93.791% when pulling 0c6deca on TheStormN:fix-control-chars-encoding-in-utf8emit-mode into 75b360a on open-source-parsers:master.

@@ -310,7 +310,12 @@ static String valueToQuotedStringN(const char* value, unsigned length,
// sequence from occurring.
default: {
if (emitUTF8) {
result += *c;
if (isControlCharacter(*c)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.
But I think there is no need to add more method.
You can refer these code : https://github.com/nlohmann/json/blob/a861bdb81f1c5ef0adc270952001096b189dbe9e/include/nlohmann/detail/output/serializer.hpp#L406-#L428

emitUTF maybe not a good name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the code you are referring I guess you mean that I should embed that code as part of the current one which is doing the utf8 encoding?

And yeah, encodeUTF8 would be much better name, but regardless of the encoding, control and special characters should be encoded/escaped.

Copy link
Contributor Author

@TheStormN TheStormN May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've squeezed it a bit, but utf8ToCodepoint should not be called when we don't want to do special processing for utf-8, because it offsets the pointer.

@TheStormN TheStormN requested a review from dota17 May 20, 2020 15:32
} else if (codepoint <
FIRST_SURROGATE_PAIR_CODEPOINT) { // codepoint is in Basic
// Multilingual Plane
const unsigned int codepoint = emitUTF8 ? *c : utf8ToCodepoint(c, end);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: *c needs to be replaced with static_cast<unsigned char>(*c).

Please take out all the added 'u' numeric literal suffixes. They aren't doing anything and they're adding diff to this change.

Remove the const here so we can keep the surrogate pair encoding clear.

Condition flip caused code to move around more than we need. I do appreciate the de-Yoda-fication on the lower bound part of the condition though.

It's all so unclear! How about we simplify it.

unsigned codepoint;
if (emitUTF8) {
    codepoint = static_cast<unsigned char>(*c);
} else {
    codepoint = utf8ToCodepoint(c, end);  // modifies `c`
}

if (codepoint >= 0x20 && codepoint <= 0x7f) {
    result.append(static_cast<char>(cp));
} else {
    auto appendHexChar = [&result](unsigned ch) {
        result.append("\\u").append(toHex16Bit(ch));
    };
    if (codepoint < 0x10000) {
        // Basic Multilingual Plane
        appendHexChar(codepoint);
    } else {
        // Extended Unicode. Encode 20 bits as a surrogate pair.
        codepoint -= 0x10000;
        appendHexChar(0xd800 + (codepoint >> 10) & 0x3ff);
        appendHexChar(0xdc00 + codepoint & 0x3ff);
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code you've proposed does not pass the unit tests. I've tried to fix but it wasn't something obvious. If you can do it, I'll be glad to close this PR and let you fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some rework on my own code. Check if you like it more now, or if you can fix your example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed my example, at your invitation, (PR #1178)
I'd like to go with my version of the control flow here.

I also added a pretty good test we should integrate.

Copy link
Contributor Author

@TheStormN TheStormN May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We all have our approaches and opinions on what's best. At first I wanted to fix the bug with minimal diff, as you can see in my first commit, later a several refactoring approaches were pointed to which I tried to comply...

Anyway I mostly care about the result, so as long as your PR is merged, this one can be closed.

result += "\\u";
result += toHex16Bit((codepoint >> 10) + 0xD800);
result += toHex16Bit((codepoint >> 10u) + 0xD7C0u);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the previous formulation where the codepoint is modified and then encoded. I guess that this was done so codepoint could be declared const, and I don't agree with that tradeoff.

Copy link
Contributor Author

@TheStormN TheStormN May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the const wasn't the only reason, it was actually one operation (minus) less. I don't see what's the point to do it in 2 steps, but for the sake of this bugfix I don't care much so I will change it as you want.

@TheStormN TheStormN requested a review from BillyDonahue May 20, 2020 20:32
BillyDonahue added a commit to BillyDonahue/jsoncpp that referenced this pull request May 21, 2020
BillyDonahue added a commit to BillyDonahue/jsoncpp that referenced this pull request May 21, 2020
BillyDonahue added a commit to BillyDonahue/jsoncpp that referenced this pull request May 21, 2020
BillyDonahue added a commit to BillyDonahue/jsoncpp that referenced this pull request May 21, 2020
@TheStormN TheStormN requested a review from dota17 May 21, 2020 08:42
BillyDonahue added a commit that referenced this pull request May 21, 2020
* Escape control chars even if emitting UTF8

See #1176
Fixes #1175

* review comments

* fix test by stopping early enough to punt on utf8-input.
@TheStormN
Copy link
Contributor Author

Closed in favor of #1178

@TheStormN TheStormN closed this May 21, 2020
@TheStormN TheStormN deleted the fix-control-chars-encoding-in-utf8emit-mode branch May 21, 2020 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Control characters are not escaped when emitUTF8 = true
4 participants