-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Fixed control chars escaping with enabled emitUTF8 #1176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed control chars escaping with enabled emitUTF8 #1176
Conversation
src/lib_json/json_writer.cpp
Outdated
@@ -310,7 +310,12 @@ static String valueToQuotedStringN(const char* value, unsigned length, | |||
// sequence from occurring. | |||
default: { | |||
if (emitUTF8) { | |||
result += *c; | |||
if (isControlCharacter(*c)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
But I think there is no need to add more method.
You can refer these code : https://github.com/nlohmann/json/blob/a861bdb81f1c5ef0adc270952001096b189dbe9e/include/nlohmann/detail/output/serializer.hpp#L406-#L428
emitUTF
maybe not a good name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the code you are referring I guess you mean that I should embed that code as part of the current one which is doing the utf8 encoding?
And yeah, encodeUTF8
would be much better name, but regardless of the encoding, control and special characters should be encoded/escaped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've squeezed it a bit, but utf8ToCodepoint
should not be called when we don't want to do special processing for utf-8, because it offsets the pointer.
src/lib_json/json_writer.cpp
Outdated
} else if (codepoint < | ||
FIRST_SURROGATE_PAIR_CODEPOINT) { // codepoint is in Basic | ||
// Multilingual Plane | ||
const unsigned int codepoint = emitUTF8 ? *c : utf8ToCodepoint(c, end); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: *c
needs to be replaced with static_cast<unsigned char>(*c)
.
Please take out all the added 'u' numeric literal suffixes. They aren't doing anything and they're adding diff to this change.
Remove the const here so we can keep the surrogate pair encoding clear.
Condition flip caused code to move around more than we need. I do appreciate the de-Yoda-fication on the lower bound part of the condition though.
It's all so unclear! How about we simplify it.
unsigned codepoint;
if (emitUTF8) {
codepoint = static_cast<unsigned char>(*c);
} else {
codepoint = utf8ToCodepoint(c, end); // modifies `c`
}
if (codepoint >= 0x20 && codepoint <= 0x7f) {
result.append(static_cast<char>(cp));
} else {
auto appendHexChar = [&result](unsigned ch) {
result.append("\\u").append(toHex16Bit(ch));
};
if (codepoint < 0x10000) {
// Basic Multilingual Plane
appendHexChar(codepoint);
} else {
// Extended Unicode. Encode 20 bits as a surrogate pair.
codepoint -= 0x10000;
appendHexChar(0xd800 + (codepoint >> 10) & 0x3ff);
appendHexChar(0xdc00 + codepoint & 0x3ff);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code you've proposed does not pass the unit tests. I've tried to fix but it wasn't something obvious. If you can do it, I'll be glad to close this PR and let you fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some rework on my own code. Check if you like it more now, or if you can fix your example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed my example, at your invitation, (PR #1178)
I'd like to go with my version of the control flow here.
I also added a pretty good test we should integrate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We all have our approaches and opinions on what's best. At first I wanted to fix the bug with minimal diff, as you can see in my first commit, later a several refactoring approaches were pointed to which I tried to comply...
Anyway I mostly care about the result, so as long as your PR is merged, this one can be closed.
src/lib_json/json_writer.cpp
Outdated
result += "\\u"; | ||
result += toHex16Bit((codepoint >> 10) + 0xD800); | ||
result += toHex16Bit((codepoint >> 10u) + 0xD7C0u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the previous formulation where the codepoint is modified and then encoded. I guess that this was done so codepoint could be declared const, and I don't agree with that tradeoff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the const
wasn't the only reason, it was actually one operation (minus) less. I don't see what's the point to do it in 2 steps, but for the sake of this bugfix I don't care much so I will change it as you want.
Closed in favor of #1178 |
Fixes #1175