
Add a scripted similarity. #25831


Merged: 7 commits, Aug 8, 2017

Contributor

@jpountz jpountz commented Jul 21, 2017

The goal of this similarity is to let users who want to keep the
functionality of the tf-idf similarity that we plan to remove, or who have
specific use-cases (disabling idf, disabling tf, disabling length norm,
etc.), avoid building a custom plugin and familiarizing themselves with the
low-level Lucene API.
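
To make the use-cases above concrete, here is a minimal standalone sketch of one common tf-idf formulation and a variant with idf disabled. It is plain Java rather than an actual similarity script, and the class, method, and parameter names are hypothetical; none of them come from this PR.

```java
// Hypothetical illustration only: these names are not part of the PR.
final class TfIdfSketch {

    /** One common tf-idf form: boost * sqrt(freq) * idf * length norm. */
    static double score(double boost, long freq, long docCount, long docFreq, long docLength) {
        double tf = Math.sqrt(freq);
        double idf = Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
        double norm = 1.0 / Math.sqrt(docLength);
        return boost * tf * idf * norm;
    }

    /** Same formula with idf disabled, one of the use-cases listed above. */
    static double scoreWithoutIdf(double boost, long freq, long docLength) {
        return boost * Math.sqrt(freq) / Math.sqrt(docLength);
    }
}
```

Dropping a factor is a one-line change here, which is the kind of tweak that would otherwise require a custom similarity plugin.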

This is a work-in-progress that needs more tests, but I would like to get
early feedback about the impact of this PR on the scripting API and whether
I should do things differently.

Contributor

@jdconrad jdconrad left a comment

Left a couple of minor comments on the Painless side of things. @rjernst will have to check the rest of the code for correctness :)

@@ -660,7 +660,7 @@ private void addElements() {
}

private void addStruct(final String name, final Class<?> clazz) {
- if (!name.matches("^[_a-zA-Z][\\.,_a-zA-Z0-9]*$")) {
+ if (!name.matches("^[_a-zA-Z][\\.,_a-zA-Z0-9\\$]*$")) {
Contributor

We actually don't allow types in Painless to have the '$' as part of the type. If I remember correctly the history here is that we have internal variables that have '$' that we don't want to have conflicts. See later comment for a recommendation on how to resolve this.

If we want to change this we'd also have to modify the existing lexer/parser which I'm not sure is worth it with the workaround I mention in a later comment.

@@ -165,3 +165,30 @@ class org.elasticsearch.search.lookup.FieldLookup -> org.elasticsearch.search.lo
List getValues()
boolean isEmpty()
}

class org.elasticsearch.index.similarity.ScriptedSimilarity$Query -> org.elasticsearch.index.similarity.ScriptedSimilarity$Query extends Object {
Contributor

For the whitelisted classes you can change the type name to be like the following:

class org.elasticsearch.index.similarity.ScriptedSimilarity.Query -> org.elasticsearch.index.similarity.ScriptedSimilarity$Query extends Object {

Note the '$' got changed to '.' in the Painless name of the type (first piece) while it still represents the appropriate Java class (second piece). All the rest of the defined types will work this way too.
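
A small standalone sketch of the naming workaround: the pattern is the original one quoted in the `addStruct` diff (without `'$'`), and the helper that derives the Painless-side name is hypothetical, written only to illustrate the renaming.

```java
final class PainlessNameSketch {
    // Type-name pattern from the addStruct diff, in its original form without '$'.
    static final String TYPE_NAME_PATTERN = "^[_a-zA-Z][\\.,_a-zA-Z0-9]*$";

    /** Hypothetical helper: turn a nested class's binary name into a '$'-free name. */
    static String painlessName(String javaBinaryName) {
        return javaBinaryName.replace('$', '.');
    }

    public static void main(String[] args) {
        String javaName = "org.elasticsearch.index.similarity.ScriptedSimilarity$Query";
        System.out.println(javaName.matches(TYPE_NAME_PATTERN));               // false
        System.out.println(painlessName(javaName).matches(TYPE_NAME_PATTERN)); // true
    }
}
```

The Java class on the right-hand side of the whitelist entry keeps its `$`, so only the name that scripts see changes.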

Member

@rjernst rjernst left a comment

I left some comments. Overall the idea is good; we should just be using the new script contexts for this instead of the legacy executable script.

@@ -65,7 +66,7 @@

private final IBSimilarity similarity;

- public IBSimilarityProvider(String name, Settings settings, Settings indexSettings) {
+ public IBSimilarityProvider(String name, Settings settings, Settings indexSettings, ScriptService scriptService) {
Member

Can we avoid adding ScriptService to the ctor of every provider? It should only be needed for the new one?

}
}

public static class Stats {
Member

Some javadocs on these please?

public final class ScriptedSimilarity extends Similarity {

private final String scriptString;
private final Supplier<ExecutableScript> scriptSupplier;
Member

We should be using a new script context here. Then this can be SimilarityScript.Factory. The new context can return float directly, and take Stats as an arg.
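
For readers unfamiliar with the pattern, the following is a rough, dependency-free sketch of what "a script class with a `Factory`, a typed return value, and `Stats` as an argument" could look like. It does not use the Elasticsearch scripting API; the `Stats` fields, the `newInstance` method name, and the sample idf computation are assumptions made purely for illustration.

```java
final class ContextSketch {

    /** Hypothetical stand-in for the statistics passed to the script. */
    static final class Stats {
        final long docCount;
        final long docFreq;

        Stats(long docCount, long docFreq) {
            this.docCount = docCount;
            this.docFreq = docFreq;
        }
    }

    /** Typed script: arguments and return type are explicit, no generic variable map. */
    abstract static class SimilarityScript {
        abstract float execute(Stats stats);

        /** Produced once by compilation; hands out script instances on demand. */
        interface Factory {
            SimilarityScript newInstance();
        }
    }

    /** What a compiled script might reduce to: a factory of typed instances. */
    static SimilarityScript.Factory idfFactory() {
        return () -> new SimilarityScript() {
            @Override
            float execute(Stats stats) {
                return (float) (Math.log((stats.docCount + 1.0) / (stats.docFreq + 1.0)) + 1.0);
            }
        };
    }
}
```

With a typed `execute`, the similarity no longer needs to push values through `setNextVar` or ask the script which variables it needs.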

super(name);
boolean discountOverlaps = settings.getAsBoolean("discount_overlaps", true);
String lang = settings.get("lang", Script.DEFAULT_SCRIPT_LANG);
String source = settings.get("source");
Member

You should be able to use Script.parse? Or we should make it so you can. We should not have to duplicate all this (it is something that has been a pain point in ingest scripts as it must be kept synchronized with other script parsing code).


public abstract boolean needs_score();
public abstract boolean needsCtx();
public abstract boolean needsStats();
Member

You should not need this if you create a new context, SimilarityScript.

@@ -93,6 +96,9 @@
@Override
public void setNextVar(final String name, final Object value) {
variables.put(name, value);
if (script.needsStats() && "stats".equals(name)) {
Member

Should not be needed

@jpountz jpountz force-pushed the feature/scripted_sim branch from fd16503 to 2ade5de Compare July 25, 2017 12:30
@jpountz jpountz removed the WIP label Jul 25, 2017
Contributor Author

jpountz commented Jul 25, 2017

Thanks for the notes about how to use the new context API. I knew I was doing something wrong, but I wasn't sure what I was supposed to do instead. I addressed all comments; would you mind having another look? I'm especially interested to know whether there are things that could be done more efficiently, as I'd really like this to be as efficient as a similarity plugin.

public abstract class SimilarityScript {

/** Compute the score. */
public abstract double execute(double weight, ScriptedSimilarity.Query query,
Member

Why not have this return float since you just cast it to that anyways when it is used? Then the script writer can make the explicit choice as to how the precision is reduced to float?

Contributor Author

I think scripts would use doubles as an intermediate representation almost all the time anyway given that functions like log or sqrt produce doubles, so I felt like returning a double and casting on the elasticsearch side would save one cast from every similarity script.

Also we might switch to doubles for scores in the future (https://issues.apache.org/jira/browse/LUCENE-7517).
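
A minimal sketch of the argument, using hypothetical method names: `Math.sqrt` and `Math.log` take and return `double`, so a script body stays in `double` naturally, and the narrowing to `float` can happen once on the calling side.

```java
final class PrecisionSketch {

    /** Script-side computation: everything is already double. */
    static double scriptResult(long freq, long docLength) {
        return Math.sqrt(freq) / Math.sqrt(docLength);
    }

    /** Caller-side narrowing, done in one place instead of in every script. */
    static float toFloatScore(double scriptResult) {
        return (float) scriptResult;
    }
}
```

If the script had to return `float`, each script would end with its own `(float)` cast on the final expression.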

Member

That makes sense.

Member

@rjernst rjernst left a comment

Thanks @jpountz. I left some more comments, but this is much better using the new script context functionality.


final String initScriptString;
final String scriptString;
final Supplier<SimilarityScript> initScriptSupplier;
Member

Can we just use SimilarityScript.Factory here?

/** Statistics that are specific to a given field. */
public static class Field {
final long docCount;
final long sumDocFreq;
Member

These could be private since the getters are public? Or the getters could be removed and just make these public final?

SimilarityScript.Factory initScriptFactory = null;
if (initScriptSettings.isEmpty() == false) {
initScript = Script.parse(initScriptSettings);
initScriptFactory = scriptService.compile(initScript, SimilarityScript.CONTEXT);
Member

I think this should be a different script context. Then its execute can only take in query, field and term. I also think the name "init_script" is too generic? Maybe term_weight_script?
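
The split suggested here can be sketched without any Elasticsearch types: a weight script that sees only query, field, and term statistics runs once per term, and a score script that also receives the precomputed weight runs once per document. The holder classes and their fields below are hypothetical, and the score script's parameters after `weight` and `query` are an assumption, since the signature quoted in this thread is truncated.

```java
final class WeightSplitSketch {
    // Hypothetical statistic holders; the PR's real classes are only partly visible in this thread.
    static final class Query { final float boost; Query(float boost) { this.boost = boost; } }
    static final class Field { final long docCount; Field(long docCount) { this.docCount = docCount; } }
    static final class Term { final long docFreq; Term(long docFreq) { this.docFreq = docFreq; } }
    static final class Doc {
        final int freq;
        final int length;
        Doc(int freq, int length) { this.freq = freq; this.length = length; }
    }

    /** Runs once per term: only document-independent statistics are available. */
    interface WeightScript {
        double execute(Query query, Field field, Term term);
    }

    /** Runs once per matching document and receives the precomputed weight. */
    interface ScoreScript {
        double execute(double weight, Query query, Field field, Term term, Doc doc);
    }

    static float[] scoreAll(WeightScript weightScript, ScoreScript scoreScript,
                            Query query, Field field, Term term, Doc[] docs) {
        double weight = weightScript.execute(query, field, term); // computed a single time
        float[] scores = new float[docs.length];
        for (int i = 0; i < docs.length; i++) {
            scores[i] = (float) scoreScript.execute(weight, query, field, term, docs[i]);
        }
        return scores;
    }
}
```

Keeping the two scripts in separate contexts means the weight script cannot reference per-document values by mistake, and the expensive document-independent part (for example the logarithm in an idf) is not recomputed for every document.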

@jpountz jpountz force-pushed the feature/scripted_sim branch from 901849c to 3073585 Compare August 2, 2017 07:47
Contributor Author

jpountz commented Aug 2, 2017

@rjernst Thanks for the review, I pushed a new commit.

Contributor Author

jpountz commented Aug 7, 2017

@rjernst Could you take another look please?

Member

@rjernst rjernst left a comment

LGTM, thanks for all the changes!

*/
public final class ScriptedSimilarity extends Similarity {

final String weightScriptString;
Member

nit: can we use the suffix "source" instead of "string" for these scripts? That matches what we call it now in scripting code.

public abstract class SimilarityWeightScript {

/** Compute the weight. */
public abstract double execute(ScriptedSimilarity.Query query, ScriptedSimilarity.Field field,
Member

I'm not sure if the javadocs here are the right place, but we should document the parameters for users somewhere?

jpountz added 6 commits August 8, 2017 08:05
The goal of this similarity is to let users who want to keep the
functionality of the `tf-idf` similarity that we plan to remove, or who have
specific use-cases (disabling idf, disabling tf, disabling length norm,
etc.), avoid building a custom plugin and familiarizing themselves with the
low-level Lucene API.
@jpountz jpountz force-pushed the feature/scripted_sim branch from 329717b to 7c73631 Compare August 8, 2017 06:06
@jpountz jpountz merged commit f0cba4f into elastic:master Aug 8, 2017
@jpountz jpountz deleted the feature/scripted_sim branch August 8, 2017 06:55
jpountz added a commit that referenced this pull request Aug 8, 2017
@clintongormley clintongormley added the :Search/Search label and removed the :Similarities label Feb 14, 2018
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Labels
>feature, release highlight, :Search/Search, v6.1.0, v7.0.0-beta1
5 participants