Commit 3c223ce

[ML] Fix 2 digit year regex in find_file_structure (#51469)
The DATE and DATESTAMP Grok patterns match 2-digit years as well as 4-digit years. The pattern determination in find_file_structure worked correctly in this case, but the regex used to create a multi-line start pattern assumed a 4-digit year. Also, the quick rule-out patterns did not always correctly consider 2-digit years, so detection was inconsistent. This change fixes both problems, and also extends the tests for DATE and DATESTAMP to check both 2-digit and 4-digit years.
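
To illustrate the regex half of the fix, here is a minimal sketch that compares the old and new DATE regex strings from the diff on a few sample inputs. The two regex strings are copied from the change. The class name, the `found` helper, and the sample dates are hypothetical and exist only for this illustration. It uses `java.util.regex` directly rather than the finder's own machinery.

```java
import java.util.regex.Pattern;

// Illustrative only: class and helper names are not part of the commit.
final class TwoDigitYearRegexDemo {

    // Old regex from the diff: the year must be exactly 4 digits.
    static final Pattern OLD_DATE =
        Pattern.compile("\\b\\d{1,2}[/.-]\\d{1,2}[/.-]\\d{4}\\b");

    // New regex from the diff: (?:\d{2}){1,2} accepts 2 or 4 digits, never 3.
    static final Pattern NEW_DATE =
        Pattern.compile("\\b\\d{1,2}[/.-]\\d{1,2}[/.-](?:\\d{2}){1,2}\\b");

    // True if the pattern occurs anywhere in the text.
    static boolean found(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs: 4-digit, 2-digit and 3-digit years.
        String[] samples = {"05/15/2018", "05/15/18", "05/15/181"};
        for (String sample : samples) {
            System.out.println(sample
                + " old=" + found(OLD_DATE, sample)
                + " new=" + found(NEW_DATE, sample));
        }
        // 05/15/2018 old=true new=true
        // 05/15/18 old=false new=true
        // 05/15/181 old=false new=false
    }
}
```

The group `(?:\\d{2}){1,2}` is used instead of `\\d{2,4}` so that a 3-digit run is still rejected, as the last sample shows.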
1 parent 8559ff7 commit 3c223ce

2 files changed: 184 additions, 23 deletions
x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/filestructurefinder/TimestampFormatFinder.java

Lines changed: 11 additions & 6 deletions
@@ -159,13 +159,18 @@ public final class TimestampFormatFinder {
  "%{MONTH} +%{MONTHDAY} %{YEAR} %{HOUR}:%{MINUTE}:(?:[0-5][0-9]|60)\\b", "CISCOTIMESTAMP",
  Arrays.asList(" 11 1111 11 11 11", " 1 1111 11 11 11"), 1, 0),
  new CandidateTimestampFormat(CandidateTimestampFormat::indeterminateDayMonthFormatFromExample,
- "\\b\\d{1,2}[/.-]\\d{1,2}[/.-]\\d{4}[- ]\\d{2}:\\d{2}:\\d{2}\\b", "\\b%{DATESTAMP}\\b", "DATESTAMP",
- // In DATESTAMP the month may be 1 or 2 digits, but the day must be 2
- Arrays.asList("11 11 1111 11 11 11", "1 11 1111 11 11 11", "11 1 1111 11 11 11"), 0, 10),
+ "\\b\\d{1,2}[/.-]\\d{1,2}[/.-](?:\\d{2}){1,2}[- ]\\d{2}:\\d{2}:\\d{2}\\b", "\\b%{DATESTAMP}\\b", "DATESTAMP",
+ // In DATESTAMP the month may be 1 or 2 digits, the year 2 or 4, but the day must be 2
+ // Also note the Grok pattern search space is set to start one character before a quick rule-out
+ // match because we don't want 11 11 11 matching into 1111 11 11 with this pattern
+ Arrays.asList("11 11 1111 11 11 11", "1 11 1111 11 11 11", "11 1 1111 11 11 11", "11 11 11 11 11 11", "1 11 11 11 11 11",
+ "11 1 11 11 11 11"), 1, 10),
  new CandidateTimestampFormat(CandidateTimestampFormat::indeterminateDayMonthFormatFromExample,
- "\\b\\d{1,2}[/.-]\\d{1,2}[/.-]\\d{4}\\b", "\\b%{DATE}\\b", "DATE",
- // In DATE the month may be 1 or 2 digits, but the day must be 2
- Arrays.asList("11 11 1111", "11 1 1111", "1 11 1111"), 0, 0),
+ "\\b\\d{1,2}[/.-]\\d{1,2}[/.-](?:\\d{2}){1,2}\\b", "\\b%{DATE}\\b", "DATE",
+ // In DATE the month may be 1 or 2 digits, the year 2 or 4, but the day must be 2
+ // Also note the Grok pattern search space is set to start one character before a quick rule-out
+ // match because we don't want 11 11 11 matching into 1111 11 11 with this pattern
+ Arrays.asList("11 11 1111", "11 1 1111", "1 11 1111", "11 11 11", "11 1 11", "1 11 11"), 1, 0),
  UNIX_MS_CANDIDATE_FORMAT,
  UNIX_CANDIDATE_FORMAT,
  TAI64N_CANDIDATE_FORMAT,
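
The code comment about "11 11 11 matching into 1111 11 11" can be made concrete with a small sketch. It rests on several assumptions: that the quick rule-out strings describe a line's shape with each digit written as `1` and each separator as a space, and that the argument changing from 0 to 1 in the diff is the number of characters before the rule-out match at which the stricter search starts. The `toShape` and `dateFoundFrom` helpers are hypothetical, and `java.util.regex` on a substring stands in for the real Grok search, so this is an analogy rather than the actual implementation.

```java
import java.util.regex.Pattern;

// Illustrative only: none of these names come from the commit.
final class QuickRuleOutShapeDemo {

    // New DATE regex string copied from the diff.
    static final Pattern NEW_DATE =
        Pattern.compile("\\b\\d{1,2}[/.-]\\d{1,2}[/.-](?:\\d{2}){1,2}\\b");

    // Hypothetical shape function: digits become '1', everything else ' '.
    static String toShape(String text) {
        StringBuilder shape = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            shape.append(Character.isDigit(c) ? '1' : ' ');
        }
        return shape.toString();
    }

    // Searches for a DATE in the text starting at the given offset.
    static boolean dateFoundFrom(String text, int start) {
        return NEW_DATE.matcher(text.substring(start)).find();
    }

    public static void main(String[] args) {
        String text = "2018-05-15";                 // an ISO date, not a DATE
        String shape = toShape(text);               // "1111 11 11"
        int hit = shape.indexOf("11 11 11");        // 2: inside the 4-digit year

        // Searching from the rule-out hit sees "18-05-15" and wrongly matches.
        System.out.println(dateFoundFrom(text, hit));       // true
        // Starting one character earlier sees "018-05-15"; the leading \b
        // can no longer sit in the middle of the year, so nothing matches.
        System.out.println(dateFoundFrom(text, hit - 1));   // false
    }
}
```

With only 4-digit years in the rule-out list this overlap could not occur, which is presumably why the look-back of one character was introduced together with the 2-digit shapes.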