Add a scripted similarity. #25831

jpountz · 2017-07-21T13:29:55Z

The goal of this similarity is to help users who would like to keep the
functionality of the tf-idf similarity that we want to remove, and to cover
specific use-cases (disabling idf, disabling tf, disabling length norm,
etc.) without having to build a custom plugin and become familiar with the
low-level Lucene API.
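
To make these use-cases concrete, here is a rough sketch in plain Java of the kind of formula such a script might compute. It is only an illustration: the class name `TfIdfSketch`, the parameter names, and the exact tf, idf and norm definitions are assumptions, not the API or the formula from this PR.

```java
final class TfIdfSketch {

    /**
     * A tf-idf style score in which each factor can be switched off.
     * All parameter names are hypothetical stand-ins for index statistics.
     */
    static double score(double boost, long freq, long docCount, long docFreq, long docLength,
                        boolean useTf, boolean useIdf, boolean useNorm) {
        // Term frequency factor: grows with the square root of the in-document frequency.
        double tf = useTf ? Math.sqrt(freq) : 1.0;
        // Inverse document frequency factor: rarer terms get a larger value.
        double idf = useIdf ? Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0 : 1.0;
        // Length normalization: longer documents are penalized.
        double norm = useNorm ? 1.0 / Math.sqrt(docLength) : 1.0;
        return boost * tf * idf * norm;
    }
}
```

Passing `false` for one of the flags corresponds to "disabling idf", "disabling tf" or "disabling length norm"; in a scripted similarity the user would simply leave that factor out of the script.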

This is a work-in-progress that needs more tests, but I would like to get
early feedback about the impact of this PR on the scripting API and whether
I should do things differently.

jdconrad

Left a couple of minor comments on the Painless side of things. @rjernst will have to check the rest of the code for correctness :)

jdconrad · 2017-07-21T16:29:42Z

modules/lang-painless/src/main/java/org/elasticsearch/painless/Definition.java

@@ -660,7 +660,7 @@ private void addElements() {
    }

    private void addStruct(final String name, final Class<?> clazz) {
-        if (!name.matches("^[_a-zA-Z][\\.,_a-zA-Z0-9]*$")) {
+        if (!name.matches("^[_a-zA-Z][\\.,_a-zA-Z0-9\\$]*$")) {


We actually don't allow types in Painless to have '$' as part of the type name. If I remember correctly, the history here is that we have internal variables containing '$' that we don't want to conflict with. See my later comment for a recommendation on how to resolve this.

If we want to change this, we'd also have to modify the existing lexer/parser, which I'm not sure is worth it given the workaround I mention in a later comment.

jdconrad · 2017-07-21T16:31:48Z

modules/lang-painless/src/main/resources/org/elasticsearch/painless/org.elasticsearch.txt

@@ -165,3 +165,30 @@ class org.elasticsearch.search.lookup.FieldLookup -> org.elasticsearch.search.lo
  List getValues()
  boolean isEmpty()
 }
+
+class org.elasticsearch.index.similarity.ScriptedSimilarity$Query -> org.elasticsearch.index.similarity.ScriptedSimilarity$Query extends Object {


For the whitelisted classes you can change the type name to be like the following:

class org.elasticsearch.index.similarity.ScriptedSimilarity.Query -> org.elasticsearch.index.similarity.ScriptedSimilarity$Query extends Object {

Note the '$' got changed to '.' in the Painless name of the type (first piece) while it still represents the appropriate Java class (second piece). All the rest of the defined types will work this way too.
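
As a small illustration of the naming rule described above, the sketch below checks a name against the original (unchanged) pattern from `addStruct` and derives a dotted name from a Java binary name. The class `PainlessNameSketch` and the helper `toPainlessName` are hypothetical; in the whitelist file the Painless name is written by hand, not derived.

```java
import java.util.regex.Pattern;

final class PainlessNameSketch {

    // Same pattern as the unmodified line in the diff: '$' is not allowed.
    private static final Pattern VALID_TYPE_NAME =
        Pattern.compile("^[_a-zA-Z][\\.,_a-zA-Z0-9]*$");

    /** Returns true if the name passes the original type-name check. */
    static boolean isValidTypeName(String name) {
        return VALID_TYPE_NAME.matcher(name).matches();
    }

    /** Hypothetical helper: turns a Java binary name into a '$'-free name. */
    static String toPainlessName(String javaBinaryName) {
        return javaBinaryName.replace('$', '.');
    }

    public static void main(String[] args) {
        // Class.getName() uses '$' for nested classes, e.g. java.util.Map$Entry.
        String javaName = java.util.Map.Entry.class.getName();
        System.out.println(javaName + " valid: " + isValidTypeName(javaName));
        String painlessName = toPainlessName(javaName);
        System.out.println(painlessName + " valid: " + isValidTypeName(painlessName));
    }
}
```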

rjernst

I left some comments. Overall the idea is good, we should just be using the new script contexts for this instead of the legacy executable script.

rjernst · 2017-07-24T15:21:21Z

core/src/main/java/org/elasticsearch/index/similarity/IBSimilarityProvider.java

@@ -65,7 +66,7 @@

    private final IBSimilarity similarity;

-    public IBSimilarityProvider(String name, Settings settings, Settings indexSettings) {
+    public IBSimilarityProvider(String name, Settings settings, Settings indexSettings, ScriptService scriptService) {


Can we avoid adding ScriptService to the ctor of every provider? It should only be needed for the new one?
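
One possible way to keep the other constructors unchanged is to let only the scripted provider's builder capture the script service. The sketch below is an assumption about how that could look, not the code in this PR; `ProviderRegistrySketch`, `Provider` and the toy `ScriptService` are hypothetical stand-ins for the real classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

final class ProviderRegistrySketch {

    /** Hypothetical stand-in for a similarity provider. */
    interface Provider {
        String describe();
    }

    /** Hypothetical stand-in for the script service. */
    static final class ScriptService {
        String compile(String source) {
            return "compiled(" + source + ")";
        }
    }

    /** Maps a similarity type to a (name, settings) -> provider function. */
    static Map<String, BiFunction<String, Map<String, String>, Provider>> registry(
            ScriptService scriptService) {
        Map<String, BiFunction<String, Map<String, String>, Provider>> builders = new HashMap<>();
        // Existing providers keep their narrow (name, settings) signature.
        builders.put("IB", (name, settings) -> () -> "IB:" + name);
        // Only the scripted provider's builder captures the script service.
        builders.put("scripted", (name, settings) -> {
            String compiled = scriptService.compile(settings.get("source"));
            return () -> "scripted:" + name + ":" + compiled;
        });
        return builders;
    }
}
```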

rjernst · 2017-07-24T15:22:32Z

core/src/main/java/org/elasticsearch/index/similarity/ScriptedSimilarity.java

+        }
+    }
+
+    public static class Stats {


Some javadocs on these please?

rjernst · 2017-07-24T15:24:39Z

core/src/main/java/org/elasticsearch/index/similarity/ScriptedSimilarity.java

+public final class ScriptedSimilarity extends Similarity {
+
+    private final String scriptString;
+    private final Supplier<ExecutableScript> scriptSupplier;


We should be using a new script context here. Then this can be SimilarityScript.Factory. The new context can return float directly, and take Stats as an arg.
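
A rough sketch of the shape being suggested, written without any Elasticsearch dependencies: an abstract script class with a typed `execute` method and a nested `Factory`. Everything here (`SimilarityScriptSketch`, the fields of `Stats`, the `IDF_ONLY` example) is hypothetical and only mirrors the idea of "return float directly, and take Stats as an arg".

```java
abstract class SimilarityScriptSketch {

    /** Hypothetical stand-in for the statistics handed to the script. */
    static final class Stats {
        final long docFreq;
        final long docCount;

        Stats(long docFreq, long docCount) {
            this.docFreq = docFreq;
            this.docCount = docCount;
        }
    }

    /** The context's single entry point: typed argument in, float out. */
    abstract float execute(Stats stats);

    /** A compiled script is represented by a factory that hands out instances. */
    interface Factory {
        SimilarityScriptSketch newInstance();
    }

    /** Example "script": an idf-only score, written directly in Java. */
    static final Factory IDF_ONLY = () -> new SimilarityScriptSketch() {
        @Override
        float execute(Stats stats) {
            return (float) Math.log(1.0 + (double) stats.docCount / stats.docFreq);
        }
    };
}
```

Compared with a generic executable script, the typed signature means no variables map and no string-keyed lookups such as `"stats".equals(name)`.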

rjernst · 2017-07-24T15:26:26Z

core/src/main/java/org/elasticsearch/index/similarity/ScriptedSimilarityProvider.java

+        super(name);
+        boolean discountOverlaps = settings.getAsBoolean("discount_overlaps", true);
+        String lang = settings.get("lang", Script.DEFAULT_SCRIPT_LANG);
+        String source = settings.get("source");


You should be able to use Script.parse? Or we should make it so you can. We should not have to duplicate all this (it is something that has been a pain point in ingest scripts as it must be kept synchronized with other script parsing code).

rjernst · 2017-07-24T15:29:53Z

modules/lang-painless/src/main/java/org/elasticsearch/painless/GenericElasticsearchScript.java


    public abstract boolean needs_score();
    public abstract boolean needsCtx();
+    public abstract boolean needsStats();


You should not need this if you create a new context, SimilarityScript.

rjernst · 2017-07-24T15:30:12Z

modules/lang-painless/src/main/java/org/elasticsearch/painless/ScriptImpl.java

@@ -93,6 +96,9 @@
    @Override
    public void setNextVar(final String name, final Object value) {
        variables.put(name, value);
+        if (script.needsStats() && "stats".equals(name)) {


Should not be needed

jpountz · 2017-07-25T12:35:05Z

Thanks for the notes about how to use the new context API. I knew I was doing something wrong, but I wasn't sure what I was supposed to do instead. I addressed all comments; would you mind having another look? I'm especially interested to know whether there are things that could be done more efficiently, as I'd really like this to be as efficient as a similarity plugin.

rjernst · 2017-07-27T02:59:36Z

core/src/main/java/org/elasticsearch/script/SimilarityScript.java

+public abstract class SimilarityScript  {
+
+    /** Compute the score. */
+    public abstract double execute(double weight, ScriptedSimilarity.Query query,


Why not have this return float since you just cast it to that anyways when it is used? Then the script writer can make the explicit choice as to how the precision is reduced to float?

I think scripts would use doubles as an intermediate representation almost all the time anyway given that functions like log or sqrt produce doubles, so I felt like returning a double and casting on the elasticsearch side would save one cast from every similarity script.

Also we might switch to doubles for scores in the future (https://issues.apache.org/jira/browse/LUCENE-7517).

That makes sense.
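
The trade-off discussed in this thread can be sketched as follows. The names are hypothetical; the only facts relied on are that `Math.log` and `Math.sqrt` return `double` in Java and that narrowing to `float` needs an explicit cast.

```java
final class DoubleScoreSketch {

    /** What a script body might compute; the intermediate values are all doubles. */
    static double scriptScore(long freq, long docCount, long docFreq) {
        return Math.sqrt(freq) * Math.log(1.0 + (double) docCount / docFreq);
    }

    /** The caller narrows once, at the point where a float score is required. */
    static float toFloatScore(double scriptResult) {
        return (float) scriptResult;
    }
}
```

With a `float` return type, every script would need its own `(float)` cast on the final expression; with `double`, the cast lives in one place on the caller's side.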

rjernst

Thanks @jpountz. I left some more comments, but this is much better using the new script context functionality.

rjernst · 2017-08-01T23:24:44Z

core/src/main/java/org/elasticsearch/index/similarity/ScriptedSimilarity.java

+
+    final String initScriptString;
+    final String scriptString;
+    final Supplier<SimilarityScript> initScriptSupplier;


Can we just use SimilarityScript.Factory here?

rjernst · 2017-08-01T23:29:11Z

core/src/main/java/org/elasticsearch/index/similarity/ScriptedSimilarity.java

+    /** Statistics that are specific to a given field. */
+    public static class Field {
+        final long docCount;
+        final long sumDocFreq;


These could be private since the getters are public? Or the getters could be removed and just make these public final?
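
The two options mentioned here look roughly like this. Both classes are hypothetical illustrations that reuse the field names from the diff; neither is the code that was merged.

```java
final class FieldStatsSketch {

    /** Option 1: private state, exposed only through public getters. */
    static final class WithGetters {
        private final long docCount;
        private final long sumDocFreq;

        WithGetters(long docCount, long sumDocFreq) {
            this.docCount = docCount;
            this.sumDocFreq = sumDocFreq;
        }

        public long getDocCount() {
            return docCount;
        }

        public long getSumDocFreq() {
            return sumDocFreq;
        }
    }

    /** Option 2: immutable public fields and no getters. */
    static final class WithPublicFields {
        public final long docCount;
        public final long sumDocFreq;

        WithPublicFields(long docCount, long sumDocFreq) {
            this.docCount = docCount;
            this.sumDocFreq = sumDocFreq;
        }
    }
}
```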

rjernst · 2017-08-01T23:42:44Z

core/src/main/java/org/elasticsearch/index/similarity/ScriptedSimilarityProvider.java

+        SimilarityScript.Factory initScriptFactory = null;
+        if (initScriptSettings.isEmpty() == false) {
+            initScript = Script.parse(initScriptSettings);
+            initScriptFactory = scriptService.compile(initScript, SimilarityScript.CONTEXT);


I think this should be a different script context. Then its execute can only take in query, field and term. I also think the name "init_script" is too generic? Maybe term_weight_script?
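
A minimal sketch of the split being proposed, assuming the weight is computed once from query, field and term level inputs and then passed to the per-document scoring script. The interfaces and parameter lists below are simplified, hypothetical stand-ins, not the signatures from this PR.

```java
final class WeightThenScoreSketch {

    /** Sees only query-, field- and term-level inputs; no per-document data. */
    interface WeightScript {
        double execute(double queryBoost, long fieldDocCount, long termDocFreq);
    }

    /** Sees the precomputed weight plus per-document inputs. */
    interface ScoreScript {
        double execute(double weight, long freq, long docLength);
    }

    /** Runs the weight script once, then the score script once per document. */
    static double[] scoreAll(WeightScript weightScript, ScoreScript scoreScript,
                             double queryBoost, long fieldDocCount, long termDocFreq,
                             long[] freqs, long[] docLengths) {
        double weight = weightScript.execute(queryBoost, fieldDocCount, termDocFreq);
        double[] scores = new double[freqs.length];
        for (int i = 0; i < freqs.length; i++) {
            scores[i] = scoreScript.execute(weight, freqs[i], docLengths[i]);
        }
        return scores;
    }
}
```

Giving the weight script its own context means its signature cannot mention document-level values at all, so a script author cannot accidentally reference them.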

jpountz · 2017-08-02T07:49:15Z

@rjernst Thanks for the review, I pushed a new commit.

jpountz · 2017-08-07T06:55:22Z

@rjernst Could you take another look please?

rjernst

LGTM, thanks for all the changes!

rjernst · 2017-08-07T17:15:01Z

core/src/main/java/org/elasticsearch/index/similarity/ScriptedSimilarity.java

+ */
+public final class ScriptedSimilarity extends Similarity {
+
+    final String weightScriptString;


nit: can we use the suffix "source" instead of "string" for these scripts? That matches what we call it now in scripting code.

rjernst · 2017-08-07T17:18:33Z

core/src/main/java/org/elasticsearch/script/SimilarityWeightScript.java

+public abstract class SimilarityWeightScript  {
+
+    /** Compute the weight. */
+    public abstract double execute(ScriptedSimilarity.Query query, ScriptedSimilarity.Field field,


I'm not sure if the javadocs here are the right place, but we should document the parameters for users somewhere?

jpountz added :Similarities WIP labels Jul 21, 2017

jpountz requested review from rjernst and jdconrad July 21, 2017 13:29

jdconrad reviewed Jul 21, 2017

rjernst reviewed Jul 24, 2017

jpountz force-pushed the feature/scripted_sim branch from fd16503 to 2ade5de Compare July 25, 2017 12:30

jpountz removed the WIP label Jul 25, 2017

rjernst reviewed Jul 27, 2017

rjernst reviewed Aug 1, 2017

jpountz force-pushed the feature/scripted_sim branch from 901849c to 3073585 Compare August 2, 2017 07:47

jpountz mentioned this pull request Aug 2, 2017

Disallow the classic (TF-IDF) similarity on 6.0 indices. #23208

Closed

rjernst approved these changes Aug 7, 2017

jpountz added 6 commits August 8, 2017 08:05

iter

37fe3c4

Fix ScriptServiceTests.

bbb3e4f

iter

4df6283

line length

90c2c70

itre

7c73631

jpountz force-pushed the feature/scripted_sim branch from 329717b to 7c73631 Compare August 8, 2017 06:06

iter

5810de1

jpountz added v6.1.0 v7.0.0 >feature and removed review labels Aug 8, 2017

jpountz merged commit f0cba4f into elastic:master Aug 8, 2017

jpountz deleted the feature/scripted_sim branch August 8, 2017 06:55

jpountz added the release highlight label Aug 8, 2017

clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Similarities labels Feb 14, 2018

jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

pablogps mentioned this pull request Aug 28, 2019

Possible error in similarity documentation #46058

Closed
