-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Add a scripted similarity. #25831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a scripted similarity. #25831
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple of minor comments on the Painless side of things. @rjernst Will have to ensure the rest of the code for correctness :)
@@ -660,7 +660,7 @@ private void addElements() { | |||
} | |||
|
|||
private void addStruct(final String name, final Class<?> clazz) { | |||
if (!name.matches("^[_a-zA-Z][\\.,_a-zA-Z0-9]*$")) { | |||
if (!name.matches("^[_a-zA-Z][\\.,_a-zA-Z0-9\\$]*$")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually don't allow types in Painless to have the '$' as part of the type. If I remember correctly the history here is that we have internal variables that have '$' that we don't want to have conflicts. See later comment for a recommendation on how to resolve this.
If we want to change this we'd also have to modify the existing lexer/parser which I'm not sure is worth it with the workaround I mention in a later comment.
@@ -165,3 +165,30 @@ class org.elasticsearch.search.lookup.FieldLookup -> org.elasticsearch.search.lo | |||
List getValues() | |||
boolean isEmpty() | |||
} | |||
|
|||
class org.elasticsearch.index.similarity.ScriptedSimilarity$Query -> org.elasticsearch.index.similarity.ScriptedSimilarity$Query extends Object { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the whitelisted classes you can change the type name to be like the following:
class org.elasticsearch.index.similarity.ScriptedSimilarity.Query -> org.elasticsearch.index.similarity.ScriptedSimilarity$Query extends Object {
Note the '$' got changed to '.' in the Painless name of the type (first piece) while it still represents the appropriate Java class (second piece). All the rest of the defined types will work this way too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments. Overall the idea is good, we should just be using the new script contexts for this instead of the legacy executable script.
@@ -65,7 +66,7 @@ | |||
|
|||
private final IBSimilarity similarity; | |||
|
|||
public IBSimilarityProvider(String name, Settings settings, Settings indexSettings) { | |||
public IBSimilarityProvider(String name, Settings settings, Settings indexSettings, ScriptService scriptService) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid adding ScriptService to the ctor of every provider? It should only be needed for the new one?
} | ||
} | ||
|
||
public static class Stats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some javadocs on these please?
public final class ScriptedSimilarity extends Similarity { | ||
|
||
private final String scriptString; | ||
private final Supplier<ExecutableScript> scriptSupplier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be using a new script context here. Then this can be SimilarityScript.Factory. The new context can return float
directly, and take Stats as an arg.
super(name); | ||
boolean discountOverlaps = settings.getAsBoolean("discount_overlaps", true); | ||
String lang = settings.get("lang", Script.DEFAULT_SCRIPT_LANG); | ||
String source = settings.get("source"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to use Script.parse? Or we should make it so you can. We should not have to duplicate all this (it is something that has been a pain point in ingest scripts as it must be kept synchronized with other script parsing code).
|
||
public abstract boolean needs_score(); | ||
public abstract boolean needsCtx(); | ||
public abstract boolean needsStats(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You shoudl not need this if you create a new context, SimilarityScript.
@@ -93,6 +96,9 @@ | |||
@Override | |||
public void setNextVar(final String name, final Object value) { | |||
variables.put(name, value); | |||
if (script.needsStats() && "stats".equals(name)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should not be needed
fd16503
to
2ade5de
Compare
Thanks for the notes about how to use the new context API, I know I was doing something wrong but I wasn't sure what I was supposed to do instead. I addressed all comments, would you mind having another look? I'm especially interested to know whether there are things that could be done more efficiently as I'd really like to be as efficient as a similarity plugin. |
public abstract class SimilarityScript { | ||
|
||
/** Compute the score. */ | ||
public abstract double execute(double weight, ScriptedSimilarity.Query query, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not have this return float since you just cast it to that anyways when it is used? Then the script writer can make the explicit choice as to how the precision is reduced to float?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think scripts would use doubles as an intermediate representation almost all the time anyway given that functions like log or sqrt produce doubles, so I felt like returning a double and casting on the elasticsearch side would save one cast from every similarity script.
Also we might switch to doubles for scores in the future (https://issues.apache.org/jira/browse/LUCENE-7517).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jpountz. I left some more comments, but this is much better using the new script context functionality.
|
||
final String initScriptString; | ||
final String scriptString; | ||
final Supplier<SimilarityScript> initScriptSupplier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just use SimilarityScript.Factory here?
/** Statistics that are specific to a given field. */ | ||
public static class Field { | ||
final long docCount; | ||
final long sumDocFreq; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These could be private since the getters are public? Or the getters could be removed and just make these public final?
SimilarityScript.Factory initScriptFactory = null; | ||
if (initScriptSettings.isEmpty() == false) { | ||
initScript = Script.parse(initScriptSettings); | ||
initScriptFactory = scriptService.compile(initScript, SimilarityScript.CONTEXT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a different script context. Then its execute can only take in query, field and term. I also think the name "init_script" is too generic? Maybe term_weight_script?
901849c
to
3073585
Compare
@rjernst Thanks for the review, I pushed a new commit. |
@rjernst Could you take another look please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for all the changes!
*/ | ||
public final class ScriptedSimilarity extends Similarity { | ||
|
||
final String weightScriptString; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we use the suffix "source" instead of "string" for these scripts? That matches what we call it now in scripting code.
public abstract class SimilarityWeightScript { | ||
|
||
/** Compute the weight. */ | ||
public abstract double execute(ScriptedSimilarity.Query query, ScriptedSimilarity.Field field, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if the javadocs here are the right place, but we should document the parameters for users somewhere?
The goal of this similarity is to help users who would like to keep the functionality of the `tf-idf` similarity that we want to remove, or to allow for specific usec-cases (disabling idf, disabling tf, disabling length norm, etc.) to not have to build a custom plugin and familiarize with the low-level Lucene API.
329717b
to
7c73631
Compare
The goal of this similarity is to help users who would like to keep the functionality of the `tf-idf` similarity that we want to remove, or to allow for specific usec-cases (disabling idf, disabling tf, disabling length norm, etc.) to not have to build a custom plugin and familiarize with the low-level Lucene API.
The goal of this similarity is to help users who would like to keep the
functionality of the
tf-idf
similarity that we want to remove, or to allowfor specific use-cases (disabling idf, disabling tf, disabling length norm,
etc.) to not have to build a custom plugin and familiarize with the low-level
Lucene API.
This is a work-in-progress that needs more tests, but I would like to get
early feedback about the impact of this PR on the scripting API and whether
I should do things differently.