Commit 22415fa

[ML] Fix character set finder bug with unencodable charsets (#33234)
Some character sets cannot be encoded, and this was tripping up the binary data check in the ML log structure character set finder. The fix is to assume that if ICU4J identifies some bytes as belonging to a character set that cannot be encoded, and those bytes contain zeroes, then the data is binary rather than text.

Fixes #33227
1 parent dd1956c commit 22415fa

1 file changed, 9 additions and 3 deletions

x-pack/plugin/ml/log-structure-finder/src/main/java/org/elasticsearch/xpack/ml/logstructurefinder/LogStructureFinderManager.java

Lines changed: 9 additions & 3 deletions
@@ -163,9 +163,15 @@ CharsetMatch findCharset(List<String> explanation, InputStream inputStream) thro
     // deduction algorithms on binary files is very slow as the binary files generally appear to
     // have very long lines.
     boolean spaceEncodingContainsZeroByte = false;
-    byte[] spaceBytes = " ".getBytes(name);
-    for (int i = 0; i < spaceBytes.length && spaceEncodingContainsZeroByte == false; ++i) {
-        spaceEncodingContainsZeroByte = (spaceBytes[i] == 0);
+    Charset charset = Charset.forName(name);
+    // Some character sets cannot be encoded. These are extremely rare so it's likely that
+    // they've been chosen based on incorrectly provided binary data. Therefore, err on
+    // the side of rejecting binary data.
+    if (charset.canEncode()) {
+        byte[] spaceBytes = " ".getBytes(charset);
+        for (int i = 0; i < spaceBytes.length && spaceEncodingContainsZeroByte == false; ++i) {
+            spaceEncodingContainsZeroByte = (spaceBytes[i] == 0);
+        }
     }
     if (containsZeroBytes && spaceEncodingContainsZeroByte == false) {
         explanation.add("Character encoding [" + name + "] matched the input with [" + charsetMatch.getConfidence() +
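
For readers who want to try the guarded check outside of `findCharset`, the following is a minimal standalone sketch of the same idea. The class name `SpaceEncodingCheck` and the helper `looksBinary` are hypothetical and not part of this commit; the real method also consults ICU4J match confidence and builds the explanation list, which is omitted here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of the commit.
final class SpaceEncodingCheck {

    private SpaceEncodingCheck() {}

    /**
     * Returns true if encoding a single space in the given charset produces
     * at least one zero byte (as UTF-16 and UTF-32 variants do).
     * Charsets that cannot encode are never asked to encode; for them the
     * result stays false, mirroring the guard added in the diff.
     */
    static boolean spaceEncodingContainsZeroByte(Charset charset) {
        if (charset.canEncode() == false) {
            return false;
        }
        for (byte b : " ".getBytes(charset)) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    /**
     * Assumed simplification of the condition in the diff: zero bytes in the
     * sample are treated as a sign of binary data unless the charset would
     * legitimately produce zero bytes when encoding ordinary text.
     */
    static boolean looksBinary(boolean containsZeroBytes, Charset charset) {
        return containsZeroBytes && spaceEncodingContainsZeroByte(charset) == false;
    }
}
```

With this sketch, `looksBinary(true, StandardCharsets.UTF_8)` reports binary data, while `looksBinary(true, StandardCharsets.UTF_16LE)` does not, because a UTF-16 space already contains a zero byte. A charset whose `canEncode()` returns false falls into the first case, which is the "err on the side of rejecting binary data" behaviour described in the code comment.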