Skip to content

EQL: Introduce ~ grammar for case-insensitive functions #67869

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jan 28, 2021

Conversation

costin
Copy link
Member

@costin costin commented Jan 24, 2021

By default, operators and string comparisons are case sensitive. This PR
enables support for case insensitive comparisons through the ~ (tilde)
symbol which works on functions.

For example:
process where ~endsWith(file.path, "ExE")

will match all file.path that end with "exe", regardless of their case.

In the process, refactored the code by introducing a common class and
improving code consistency.

Fix #67868

By default, operators and string comparisons are case sensitive. This PR
enables support for case insensitive comparisons through the ~ (tilde)
symbol which works on functions.

For example:
process where ~endsWith(file.path, "ExE")

will match all file.path that end with "exe", regardless of their case.

In the process, refactored the code by introducing a common class and
improving code consistency.

Fix elastic#67868
@elasticmachine elasticmachine added the Team:QL (Deprecated) Meta label for query languages team label Jan 24, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@costin costin added the v7.12.0 label Jan 24, 2021
@costin
Copy link
Member Author

costin commented Jan 24, 2021

This PR is a doozy - I realize it's big and so was working on it :)
The new functionality lies in extending EqlFunctionRegistry to support case insensitive functions which are tagged in the grammar. In the proper however I've ended up refactoring a bit both the functions and their tests since there was significant repetition between them.
In the process I've changed the case sensitivity parameter from case sensitive to case insensitive to align it with the underlying ES/Lucene conventions and some inconsistencies since some functions had it one way, others the opposite.
The tests, both integration and unit have been enabled and in the process realized some were simply incorrect so these had to be updated.
Due to the size and the number of files touched this PR might not address all functionality around case-insensitive functions; I'd rather have a thorough follow-up from somebody else not just around integration testing but also around EQL benchmarks.

@rw-access
Copy link
Contributor

Is this the right syntax? I thought we were leaning towards ~ as a suffix not a prefix. i.e. startswith~(process.name, "net.")

@costin
Copy link
Member Author

costin commented Jan 24, 2021

Hi @rw-access

You are right, now that you mention it I recall that was the outcome. The prefix worked it way as I was modifying the queries and due to ease of use, the ~ ended as a prefix instead of a suffix.
I'll update the grammar latter on.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some comments.


return isStringAndExact(pattern, sourceText(), ParamOrdinal.SECOND);
public EndsWith(Source source, Expression input, Expression pattern, boolean caseInsensitive) {
super(source, input, pattern, caseInsensitive);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the fourth parameter here - caseInsensitive - is in contradiction with the name of the parameter in the superclass' constructor.

@Override
public boolean foldable() {
return input.foldable() && pattern.foldable();
return new EndsWithFunctionPipe(source(), this, Expressions.pipe(left()), Expressions.pipe(right()), isCaseInsensitive());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is not present in this PR, but EndsWithFunctionPipe has isCaseSensitive as a case sensitivity flag. From what I've seen in the PR, you make a case about case-insensitivity. Unless I'm missing something, it would be better if the case sensitivity aspect is consistent throughout the code base (naming wise). Meaning, change isCaseSensitive to isCaseInsensitive everywhere in code.
Also, the following classes still have the same caseSensitive parameter:

  • StringContains
  • StringContainsFunctionPipe
  • BetweenFunctionProcessor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

}

public Expression right() {
return input;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return input;
return right;

@@ -17,22 +19,29 @@

public static final String NAME = "sbtw";

private final Processor input, left, right, greedy, caseSensitive;
private final Processor input, left, right, greedy;
private final boolean isCaseSensitive;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming of this variable is also inconsistent throughout the code from a different angle: here you have isCaseSensitive while in other places (for example BetweenFunctionPipe) it's caseInsensitive (is is used in some places while not used in others).

@costin
Copy link
Member Author

costin commented Jan 25, 2021

@elasticmachine update branch

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

'''
expected_event_ids = [88, 92]
description = "check built-in string functions"

[[queries]]
name = "startsWithCaseInsensitive3"
query = '''
file where opcode==0 and startsWith(file_name, "expLORER.exe")
file where opcode==0 and startsWith~(file_name, "expLORER.exe")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not introduced by this PR, but: is this test bringing anything new to startsWithCaseInsensitive2 ? If so, maybe the name should reflect that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous query checks explorer., the second explorer.exe

Copy link
Contributor

@bpintea bpintea Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both startsWithCaseInsensitive2 and startsWithCaseInsensitive3 test mid-word capitals, but I guess the real difference is that one contains the full word and the other doesn't, which was probably the intend to have them separately. All good.
Edit: so much more as it's been updated :-)

Comment on lines +35 to +38
@Override
protected Builder builder() {
return super.builder();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?

Copy link
Member Author

@costin costin Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the parent method is protected and is not visible outside classes in the same package - that is nobody outside the package in ql has access to it. By overriding it, it becomes accessible in the eql package as well.

return def(function, builder, names);
}

private static Boolean defaultSensitivity(Boolean caseInsensitive) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: my implementation-not-yet-read interpretation is that the function returns the default sensitivity (a constant). Smth in the lines of sensitivityOrDefault(...) might arguably be clearer.

Comment on lines 195 to 197
wildcardQuery = "*" + string + "*";
} else if (f instanceof EndsWith) {
wildcardQuery = "*" + string;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if feasible by now (if already mostly used like that in the codebase), but extracting the wildcard character into a constant might make it cleaner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The String is already internalized by the JVM however I've extracted inside the utils class since we use it there.
The issue with extracting constants is that if they aren't used consistently, they just occupy space.

@@ -33,19 +34,22 @@
public static final String INTERNAL_EQL_SCRIPT_UTILS = "InternalEqlScriptUtils";
public static final String INTERNAL_SQL_SCRIPT_UTILS = "InternalSqlScriptUtils";

private Scripts() {}
private static final int PKG_LENGTH = "org.elasticsearch.xpack.".length();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@palesz palesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of concerns about deser to/from StreamInput/Output, otherwise only a few nitpicks and minor comments.

Comment on lines +230 to +233
TILDE_IDENTIFIER
: LETTER (LETTER | DIGIT | '_')* '~'
;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I expected that the functionName will be IDENTIFIER '~'?. Why are function names allowed to start with _ and @ only if the function is not suffixed with ~?

I guess you don't want to allow ~ for @timestamp, but what about functions starting with _?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Defining a new token just for function names would overlap with the IDENTIIFER and the grammar ends up applying whoever is declared first due to ambiguity leading to rules not matching and thus broken queries.

public enum EqlFunctionResolution implements FunctionResolutionStrategy {

/**
* Indicate the function should be insensitive {@code ~startwith()},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Indicate the function should be insensitive {@code ~startwith()},
* Indicate the function should be insensitive {@code startwith~()},

Comment on lines +124 to +127
[[endswith_sensitive.fold.tests]]
expression = 'endsWith~("FooBarBaz", "baz")'
expected = true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be false?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The naming is misleading (sensitive as in sensitivity not case sensitive vs insensitive) due to how the spec file was created - which should be changed.

caseInsensitive = in.readBoolean();
} else {
in.readNamedWriteable(Processor.class);
caseInsensitive = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before it was caseSensitive = Boolean.TRUE:

Shouldn't this be:

Suggested change
caseInsensitive = true;
caseInsensitive = false;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, before it was caseSensitive true not its caseInsensitive false. Please see again the main comments made on the issue.

}

public EndsWithFunctionProcessor(StreamInput in) throws IOException {
input = in.readNamedWriteable(Processor.class);
pattern = in.readNamedWriteable(Processor.class);
isCaseSensitive = in.readBoolean();
isCaseInsensitive = in.readBoolean();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you do a version check here similar to the BetweenFunctionProcessor? Meaning of the boolean flipped between 7.11 -> 7.12.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that was actually a leftover. The processors in EQL have serialization which we use in testing but there's no serialization on the client.
The SQL ones use it for the cursor however we offer no guarantees, as of yet, on allowing scrolling across nodes with different SQL versions (essentially a rolling upgrade).

return new DummyFunction(l);
}, "DUMMY_FUNCTION");
FunctionRegistry r = new EqlFunctionRegistry(definition);
FunctionDefinition def = r.resolveFunction(ur.name());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing: assertEquals(ur.source(), ur.buildResolved(randomConfiguration(), def).source());

assertThat(e.getMessage(), endsWith("expects exactly two arguments"));
}

public void testTernaryCaseAwareWithOptionalFunction() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case needs an overhaul, it doesn't really test what the name suggests and half of the asserts are never tested.

Comment on lines 31 to +41
public StartsWithFunctionProcessor(StreamInput in) throws IOException {
source = in.readNamedWriteable(Processor.class);
pattern = in.readNamedWriteable(Processor.class);
isCaseSensitive = in.readBoolean();
caseInsensitive = in.readBoolean();
}

@Override
public final void writeTo(StreamOutput out) throws IOException {
out.writeNamedWriteable(source);
out.writeNamedWriteable(pattern);
out.writeBoolean(isCaseSensitive);
out.writeBoolean(caseInsensitive);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@@ -60,13 +60,13 @@ public static Object doProcess(Object source, Object pattern, boolean isCaseSens
throw new QlIllegalArgumentException("A string/char is required; received [{}]", pattern);
}

if (isCaseSensitive) {
if (caseInsensitive == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flip the if.

randomStringLiteral(),
randomStringLiteral(),
randomStringLiteral(),
randomStringLiteral(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: why did the greedy param change to randomStringLIteral?

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

public enum EqlFunctionResolution implements FunctionResolutionStrategy {

/**
* Indicate the function should be insensitive {@code ~startwith()},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Indicate the function should be insensitive {@code ~startwith()},
* Indicate the function should be insensitive {@code startwith~()},

@Override
public boolean foldable() {
return input.foldable() && pattern.foldable();
return new EndsWithFunctionPipe(source(), this, Expressions.pipe(left()), Expressions.pipe(right()), isCaseInsensitive());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@matriv matriv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@costin
Copy link
Member Author

costin commented Jan 28, 2021

@elasticmachine update branch

@costin costin merged commit afa4ce0 into elastic:master Jan 28, 2021
@costin costin deleted the eql/tilde branch January 28, 2021 22:32
@costin
Copy link
Member Author

costin commented Jan 28, 2021

/cc @jrodewig

costin added a commit that referenced this pull request Jan 29, 2021
By default, operators and string comparisons are case sensitive. This PR
enables support for case insensitive comparisons through the ~ (tilde)
symbol which works on functions.

For example:
process where endsWith~(file.path, "ExE")

will match all file.path that end with "exe", regardless of their case.

In the process, refactored the code by introducing a common class and
improving code consistency.

Fix #67868

(cherry picked from commit afa4ce0)
jrodewig added a commit that referenced this pull request Jan 29, 2021
Documents the case-insensitive `~` operator for `in` and string functions.

Relates to #67869 and #68176
jrodewig added a commit that referenced this pull request Jan 29, 2021
Documents the case-insensitive `~` operator for `in` and string functions.

Relates to #67869 and #68176
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

EQL: Introduce case-insensitive string functions through ~
8 participants