
Disallow the classic (TF-IDF) similarity on 6.0 indices. #23208


Closed
jpountz opened this issue Feb 16, 2017 · 14 comments
Labels
>deprecation discuss good first issue low hanging fruit :Search/Search Search-related issues that do not fall into other categories

Comments

@jpountz
Contributor

jpountz commented Feb 16, 2017

BM25 should generally perform better than TF-IDF. Moreover, Lucene is removing coords and query normalization in 7.0 (that Elasticsearch will be based on) so we should start deprecating the classic similarity.

@jpountz jpountz added :Search/Search Search-related issues that do not fall into other categories >enhancement good first issue low hanging fruit labels Feb 16, 2017
@colings86 colings86 added the help wanted adoptme label Mar 31, 2017
@colings86
Contributor

@jpountz I made this an adoptme as it seems like something we intend to do rather than something that needs further discussion

@Rezaak1024

Hi, I'm new to the project and would like to get started on this issue if no one else is already working on it.

@colings86 colings86 added discuss and removed help wanted adoptme labels Jun 5, 2017
@jpountz
Contributor Author

jpountz commented Jun 5, 2017

Adding the discuss label to figure out whether we should just disallow it on new indices, or also add it in a plugin for users who might really really need TF-IDF scoring.

@clintongormley
Contributor

@jpountz can we really support it without query coordination? If not, then I'd opt for removing it. (Even if we can I'm leaning towards removing it)

@jpountz
Contributor Author

jpountz commented Jun 7, 2017

The removal of query coordination might significantly decrease the quality of this similarity indeed. @eskibars Maybe you have some perspective on this, I think you suggested some users might rely specifically on TF-IDF in the last search&aggs meeting?

@eskibars
Contributor

eskibars commented Jun 7, 2017

@eskibars Maybe you have some perspective on this, I think you suggested some users might rely specifically on TF-IDF in the last search&aggs meeting?

I'm certain some people rely on the full old/classic TF-IDF implementation, but in a few different ways that I'll tease apart.

  1. We've seen people in our forums mention that they have regression tests that include ordering, and my own history leads me to believe that while these users will be the minority, they are not terribly uncommon and they often have extensive, long UAT cycles, which is why they built the tests in the first place. For these users, dropping classic similarity is going to mean one of two things: simply avoiding the upgrade entirely and sticking around on 5.x for as long as possible, or upgrading a test environment and then re-engaging their long UAT cycles, potentially failing them and having to go edit queries, etc. as they go through these cycles.
  2. To a much lesser extent, I've anecdotally heard some users with a fixed catalog of data include the numeric score calculations of some specific queries in their regression tests.
  3. There are people that are convinced TF-IDF is better for their data+queries than BM25. I've heard this a number of times: sometimes where the user has done a lot of testing and shown that TF-IDF is better in one way or another for them, and sometimes not. I also believe TF-IDF has more literature around it, is taught in more schools, and is generally a bit easier to understand (the formula is more or less in the name), so there may be some resistance purely due to how well understood the two are relative to one another.

To be clear, I think that the vast majority of users do not fall in any of these categories, and I've seen that BM25 does perform better in the vast majority of UAT (especially in text search) and is built on sound principles such as resistance to keyword stuffing. Given that Lucene is dropping coord, I'm not sure what we can do for users falling in categories 1 and 2 other than brace them for the impact ASAP. For users in the third category, one thing I was thinking of was the possibility of pulling classic similarity out into a plugin that we could bootstrap for the community (and put it in the community's hands to support). No matter what, we also need to accept that for some portion of users, this change will delay their upgrade (to change queries / go through UAT again) or keep them from upgrading entirely (if they decide they "need" to keep the old behavior).

@jpountz
Contributor Author

jpountz commented Jun 15, 2017

I'm wondering whether the scripted similarity feature that we discussed a couple times could be a good workaround rather than creating a plugin. Reimplementing TF-IDF could actually be a good documentation example of it.

@ghost

ghost commented Jul 13, 2017

Hello I fall into this category and I have empirical testing to back it up. Please contact me if you have any questions. For my data/application for Elasticsearch TF-IDF produces a better MRR (mean reciprocal rank, a common IR score for search engines) than BM25. If TF-IDF is not retained then I will be stuck on version 5.

There are people that are convinced TF-IDF is better for their data+queries than BM25. I've heard this a number of times: sometimes where the user has done a lot of testing and shown that TF-IDF is better in one way or another for them, and sometimes not. I also believe TF-IDF has more literature around it, is taught in more schools, and is generally a bit easier to understand (the formula is more or less in the name), so there may be some resistance purely due to how well understood the two are relative to one another.
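
For readers unfamiliar with the metric mentioned in this comment, here is a minimal sketch of mean reciprocal rank. The function name and the sample ranks are hypothetical and are not taken from the commenter's evaluation; each input is the 1-based rank of the first relevant hit for one query, or `None` when nothing relevant was returned.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank over all queries; a query with no relevant hit counts as 0."""
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    reciprocal = [1.0 / rank if rank else 0.0 for rank in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)


# Hypothetical ranks for four queries under one similarity setting.
example_ranks = [1, 2, None, 4]
print(mean_reciprocal_rank(example_ranks))
```

Running the same query set against a 5.x index once with `classic` and once with `BM25`, then comparing the two values, is roughly the kind of comparison the comment describes.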

@jpountz
Contributor Author

jpountz commented Aug 2, 2017

We will make this deprecation smoother by adding a scripted similarity: #25831.
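
As a rough illustration of what such a scripted similarity might look like once available, the sketch below builds an index-settings body in Python. It is an assumption-laden example, not the implementation from #25831: the similarity name `scripted_tfidf` is arbitrary, and the script variables (`doc.freq`, `field.docCount`, `term.docFreq`, `doc.length`, `query.boost`) are recalled from the scripted similarity documentation and should be checked against the version in use. It covers only the per-term weight and does not bring back coord factors or query normalization.

```python
import json

# Painless source for a TF-IDF-style per-term score.
# ASSUMPTION: the variable names below follow the scripted similarity docs.
TFIDF_SOURCE = (
    "double tf = Math.sqrt(doc.freq); "
    "double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; "
    "double norm = 1 / Math.sqrt(doc.length); "
    "return query.boost * tf * idf * norm;"
)


def scripted_tfidf_settings(name="scripted_tfidf"):
    """Build an index-creation body that declares a scripted similarity."""
    return {
        "settings": {
            "similarity": {
                name: {"type": "scripted", "script": {"source": TFIDF_SOURCE}}
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the index.
    print(json.dumps(scripted_tfidf_settings(), indent=2))
```

A text field would then opt in through its mapping with `"similarity": "scripted_tfidf"`.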

jpountz added a commit to jpountz/elasticsearch that referenced this issue Aug 16, 2017
The `classic` similarity used to rely on specific features like query
normalization and coord factors, which were specific to this similarity and
have been removed as the `bm25` similarity has now been the default similarity
for some time. As a consequence, using the `classic` similarity could lead
to deceptive scores, which is why we want to prevent users from using it on new
indices. `bm25` is generally considered a superior option.

Closes elastic#23208
@cbuescher
Member

@jpountz I'm not entirely sure about the current state of this issue, is the "discuss" label still appropriate? What are the next steps if not?
cc @elastic/es-search-aggs

jpountz added a commit to jpountz/elasticsearch that referenced this issue Mar 21, 2018
This improves the way similarities are plugged in in order to:
 - reject the classic similarity on 7.x indices and emit a deprecation
   warning otherwise
 - reject unknown parameters on 7.x indices and emit a deprecation
   warning otherwise

Even though this breaks the plugin API, I'd like to backport to 7.x so
that users can get deprecation warnings when they are doing something
that will become unsupported in the future.

Closes elastic#23208
Closes elastic#29035
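
The behavior this commit message describes (reject on new indices, warn on older ones) can be pictured with the following Python sketch. It is not the Elasticsearch code: the function name, the version cut-over constant, and the warning text are all hypothetical stand-ins chosen for illustration.

```python
import warnings

# HYPOTHETICAL cut-over: indices created on or after this major version
# reject the classic similarity; older indices only trigger a warning.
REJECT_FROM_MAJOR = 7


def check_similarity(sim_type, index_created_major):
    """Validate a similarity type against the major version that created the index."""
    if sim_type != "classic":
        return sim_type
    # Hypothetical message text, not the wording Elasticsearch emits.
    message = "the classic similarity is deprecated; consider BM25 or a scripted similarity"
    if index_created_major >= REJECT_FROM_MAJOR:
        raise ValueError(message)
    warnings.warn(message, DeprecationWarning, stacklevel=2)
    return sim_type
```

Under these assumptions, `check_similarity("classic", 6)` returns `"classic"` after emitting a `DeprecationWarning`, while `check_similarity("classic", 7)` raises.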
jpountz added a commit that referenced this issue Apr 3, 2018
This improves the way similarities are plugged in in order to:
 - reject the classic similarity on 7.x indices and emit a deprecation
   warning otherwise
 - reject unknown parameters on 7.x indices and emit a deprecation
   warning otherwise

Even though this breaks the plugin API, I'd like to backport to 7.x so
that users can get deprecation warnings when they are doing something
that will become unsupported in the future.

Closes #23208
Closes #29035
jpountz added a commit to jpountz/elasticsearch that referenced this issue Apr 3, 2018
This improves the way similarities are plugged in in order to:
 - reject the classic similarity on 7.x indices and emit a deprecation
   warning otherwise
 - reject unknown parameters on 7.x indices and emit a deprecation
   warning otherwise

Even though this breaks the plugin API, I'd like to backport to 7.x so
that users can get deprecation warnings when they are doing something
that will become unsupported in the future.

Closes elastic#23208
Closes elastic#29035
jpountz added a commit that referenced this issue Apr 4, 2018
This improves the way similarities are plugged in in order to:
 - reject the classic similarity on 7.x indices and emit a deprecation
   warning otherwise
 - reject unknown parameters on 7.x indices and emit a deprecation
   warning otherwise

Even though this breaks the plugin API, I'd like to backport to 7.x so
that users can get deprecation warnings when they are doing something
that will become unsupported in the future.

Closes #23208
Closes #29035
@shmed

shmed commented Sep 19, 2019

Sorry to leave a comment on an older issue, but I was wondering how the "scripts" solution would help replace the missing coordination factors? It seems that to reproduce scores aligned with how the previous classic similarity calculated them, the coordination factors would be essential, and as far as I understand, those were calculated in the boolean weights. Thank you

@jpountz
Contributor Author

jpountz commented Sep 19, 2019

This feature of Lucene's TF-IDF similarity can't be reimplemented with a script indeed. For the record, note that this isn't part of the official definition of TF-IDF but something that has been added on top in order to work around the fact that the TF weighting would allow a document that contains many occurrences of a single query term to score better than documents that contain all query terms. This is no longer an issue with BM25 (and most other modern similarities) whose TF weighting is saturated.
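
The effect described here can be seen in a toy calculation. The sketch below is a deliberate simplification rather than Lucene's scoring code: idf is fixed at 1 for every term, length normalization is switched off (BM25 with `b = 0`), and query normalization is ignored. The documents, query terms, and function names are made up for illustration.

```python
import math


def classic_tf(freq):
    """Unsaturated term-frequency weight used by the classic similarity: sqrt(freq)."""
    return math.sqrt(freq)


def bm25_tf(freq, k1=1.2):
    """BM25 term-frequency weight with length normalization disabled (b = 0)."""
    return freq * (k1 + 1) / (freq + k1)


def score(doc, query_terms, tf, use_coord=False):
    """Sum per-term weights; idf is assumed to be 1 for every term."""
    matched = [term for term in query_terms if doc.get(term, 0) > 0]
    total = sum(tf(doc[term]) for term in matched)
    if use_coord:
        # coord = number of matched query terms / number of query terms
        total *= len(matched) / len(query_terms)
    return total


query = ["quick", "fox"]
stuffed = {"quick": 9}             # one query term, repeated nine times
complete = {"quick": 1, "fox": 1}  # both query terms, once each

for label, tf, coord in [
    ("classic, no coord", classic_tf, False),
    ("classic, with coord", classic_tf, True),
    ("BM25", bm25_tf, False),
]:
    print(label, round(score(stuffed, query, tf, coord), 3),
          round(score(complete, query, tf, coord), 3))
```

In this setup the single-term document outranks the complete one under the unsaturated weight alone, and falls behind once coord is applied or BM25's saturated weight is used. The BM25 weight per term is bounded by `k1 + 1`, which is what keeps repeated occurrences from dominating; real scores also depend on idf and document length, which this sketch leaves out.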

@shmed

shmed commented Sep 19, 2019

Thanks a lot @jpountz for the reply. I understand that coords were a workaround for the fact that TF-IDF isn't great at favoring documents that match all terms over a document that only matches a single term multiple times. I also understand that BM25 has better TF saturation, which naturally helps with those scenarios. Unfortunately, I have the challenging goal of customizing the classic similarity algorithm in Es7 (Lucene8) such that the resulting scores are the same as in Es5 (Lucene6, when coords were still a thing). I was originally hoping that coords could somehow be re-implemented externally (or through the scripting features), but the concept of coords is so entangled with the various Lucene scoring classes (boolean scorers, weights, parsers) that it has proven to be much more challenging than I originally hoped.

@jpountz
Contributor Author

jpountz commented Sep 19, 2019

Replicating version 5 scoring in recent versions is not possible.

7 participants