
Commit 8e6be14

Fix problem with CP949 conversion when 0xC9 precedes byte lower than 0xA1
This bug was introduced in e837a88. In that commit, I increased the performance of CP949 text conversion, but accidentally broke the case where 0xC9 (illegal byte to start a character) is followed by a valid character with a first byte less than 0xA1. The 'broken' behavior is that both the 0xC9 byte and the following valid character would be converted to error markers.
1 parent f337c92 commit 8e6be14

2 files changed, +10 -9 lines changed

ext/mbstring/libmbfl/filters/mbfilter_cjk.c

+5-9
@@ -10224,17 +10224,13 @@ static size_t mb_uhc_to_wchar(unsigned char **in, size_t *in_len, uint32_t *buf,
 				w = (c - 0xC7)*94 + c2 - 0xA1;
 				ZEND_ASSERT(w < uhc3_ucs_table_size);
 				w = uhc3_ucs_table[w];
-				if (!w) {
-					/* If c == 0xC9, we shouldn't have tried to read a 2-byte char at all... but it is faster
-					 * to fix up that rare case here rather than include an extra check in the hot path */
-					if (c == 0xC9) {
-						p--;
-					}
-					*out++ = MBFL_BAD_INPUT;
-					continue;
-				}
 			}
 			if (!w) {
+				/* If c == 0xC9, we shouldn't have tried to read a 2-byte char at all... but it is faster
+				 * to fix up that rare case here rather than include an extra check in the hot path */
+				if (c == 0xC9) {
+					p--;
+				}
 				w = MBFL_BAD_INPUT;
 			}
 			*out++ = w;
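
The effect of moving the `p--` rewind into the shared `if (!w)` fallback can be sketched with a toy decoder. This is only an illustration under stated assumptions, not the libmbfl implementation: `toy_decode`, `toy_lookup`, `TOY_BAD_INPUT` and the `rewind_on_c9` flag are hypothetical names, the lookup knows a single mapping (0x9E 0x98 to U+C612, taken from the regression test in this commit), and the flag merely stands in for whether the rewind is reached. In the real code the bug was that the rewind lived in only one branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BAD_INPUT 0xFFFFFFFFu /* hypothetical stand-in for MBFL_BAD_INPUT */

/* Hypothetical one-entry lookup; 0 means "no mapping".
 * The single entry (0x9E 0x98 -> U+C612) comes from the regression test. */
static uint32_t toy_lookup(unsigned char c, unsigned char c2)
{
	return (c == 0x9E && c2 == 0x98) ? 0xC612 : 0;
}

/* Decode len bytes into out (which must hold at least len entries).
 * Returns the number of entries written. */
static size_t toy_decode(const unsigned char *in, size_t len, uint32_t *out, int rewind_on_c9)
{
	const unsigned char *p = in, *e = in + len;
	uint32_t *start = out;

	while (p < e) {
		unsigned char c = *p++;
		if (c < 0x80) { /* ASCII passes through */
			*out++ = c;
			continue;
		}
		if (p == e) { /* lead byte with no trail byte */
			*out++ = TOY_BAD_INPUT;
			break;
		}
		unsigned char c2 = *p++;
		uint32_t w = toy_lookup(c, c2);
		if (!w) {
			/* 0xC9 can never start a character, so the byte after it was
			 * consumed by mistake; step back so it is decoded on its own */
			if (rewind_on_c9 && c == 0xC9) {
				p--;
			}
			w = TOY_BAD_INPUT;
		}
		*out++ = w;
	}
	assert((size_t)(out - start) <= len);
	return (size_t)(out - start);
}
```

With the rewind enabled, the bytes `C9 9E 98` decode to one error marker followed by U+C612. With it disabled, 0x9E is swallowed together with 0xC9, the leftover 0x98 has no trail byte, and both positions become error markers, which is the broken behavior described in the commit message.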

ext/mbstring/tests/uhc_encoding.phpt

+5
@@ -14,6 +14,11 @@ testEncodingFromUTF16ConversionTable(__DIR__ . '/data/CP949.txt', 'UHC');
 // Regression test
 convertInvalidString("\xE4\xA4\xB4<", "\x75\x1A\x00%", "UHC", "UTF-16BE");
 
+// When optimizing performance of CP949 conversion, I accidentally broke the
+// case where 0xC9 appears before a valid character which starts with a
+// byte lower than 0xA1
+convertInvalidString("\xC9\x9E\x98", "%\xEC\x98\x92", "UHC", "UTF-8");
+
 // Test "long" illegal character markers
 mb_substitute_character("long");
 convertInvalidString("\x80", "%", "UHC", "UTF-8");
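
The expected output of the new regression test, `"%\xEC\x98\x92"`, is the error marker followed by the UTF-8 encoding of U+C612. The arithmetic can be checked with a small helper; `utf8_encode3` is a hypothetical name for illustration and is not part of mbstring. It handles only the three-byte range U+0800..U+FFFF and, as a simplification, does not reject surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a code point in U+0800..U+FFFF as three UTF-8 bytes.
 * Returns 3 on success, 0 if cp is outside that range. */
static size_t utf8_encode3(uint32_t cp, unsigned char out[3])
{
	assert(out != NULL);
	if (cp < 0x800 || cp > 0xFFFF) {
		return 0;
	}
	out[0] = (unsigned char)(0xE0 | (cp >> 12));         /* 1110xxxx: top 4 bits */
	out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F)); /* 10xxxxxx: middle 6 bits */
	out[2] = (unsigned char)(0x80 | (cp & 0x3F));        /* 10xxxxxx: low 6 bits */
	return 3;
}
```

For U+C612 the three groups are 0xC, 0x18 and 0x12, giving the bytes EC 98 92 that appear in the test.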
