Disallow the `classic` (TF-IDF) similarity on 6.0 indices. #23208

jpountz · 2017-02-16T14:38:37Z

BM25 should generally perform better than TF-IDF. Moreover, Lucene is removing coords and query normalization in 7.0 (that Elasticsearch will be based on) so we should start deprecating the classic similarity.

colings86 · 2017-03-31T09:57:35Z

@jpountz I made this an adoptme as it seems like something we intend to do rather than something that needs further discussion

Rezaak1024 · 2017-05-14T17:14:10Z

Hi, I'm new to the project and would like to get started on this issue if no one else is already working on it.

jpountz · 2017-06-05T15:24:17Z

Adding the discuss label to figure out whether we should just disallow it on new indices, or also add it in a plugin for users who might really really need TF-IDF scoring.

clintongormley · 2017-06-06T18:21:51Z

@jpountz can we really support it without query coordination? If not, then I'd opt for removing it. (Even if we can I'm leaning towards removing it)

jpountz · 2017-06-07T07:22:02Z

The removal of query coordination might indeed significantly decrease the quality of this similarity. @eskibars Maybe you have some perspective on this? I think you suggested in the last search&aggs meeting that some users might rely specifically on TF-IDF.

eskibars · 2017-06-07T14:53:07Z

> @eskibars Maybe you have some perspective on this, I think you suggested some users might rely specifically on TF-IDF in the last search&aggs meeting?

I'm certain some people rely on the full old/classic TF-IDF implementation, but in a few different ways that I'll tease apart.

1. We've seen people in our forums mention that they have regression tests that include ordering. My own history leads me to believe that while these users will be the minority, they are not terribly uncommon, and they often have extensive and long UAT cycles, which is why they built the tests in the first place. For these users, dropping classic similarity is going to mean one of 2 things: simply avoiding the upgrade entirely and sticking around on 5.x for as long as possible, or upgrading a test environment and then re-engaging their long UAT cycles, potentially failing them and having to go edit queries, etc. as they go through these cycles.
2. To a much lesser extent, I've anecdotally heard of some users with a fixed catalog of data who include the numeric score calculations of some specific queries in their regression tests.
3. There are people who are convinced TF-IDF is better for their data+queries than BM25. I've heard this a number of times: sometimes the user has done a lot of testing and shown that TF-IDF is better in one way or another for them, and sometimes not. I also believe TF-IDF has more literature around it, is taught in more schools, and is generally a bit easier to understand (the formula is more or less in the name), so there may be some resistance purely due to how well understood the two are relative to one another.

To be clear, I think that the vast majority of users do not fall into any of these categories, and I've seen that BM25 does perform better in the vast majority of UAT (especially in text search) and is built on sound principles, avoiding keyword stuffing, etc. Given that Lucene is dropping coord, I'm not sure what we can do for users falling into categories 1 and 2 other than brace them for the impact ASAP. For users in the third category, one thing I was thinking of was the possibility of pulling classic similarity out into a plugin that we could bootstrap for the community (and put it in the community's hands to support). No matter what, we also need to accept that for some portion of users, this change will delay their upgrade (to change queries / go through UAT again) or keep them from upgrading entirely (if they decide they "need" to keep the old behavior).

jpountz · 2017-06-15T08:40:28Z

I'm wondering whether the scripted similarity feature that we discussed a couple times could be a good workaround rather than creating a plugin. Reimplementing TF-IDF could actually be a good documentation example of it.

ghost · 2017-07-13T17:00:40Z

Hello, I fall into this category and I have empirical testing to back it up. Please contact me if you have any questions. For my data/application, TF-IDF in Elasticsearch produces a better MRR (mean reciprocal rank, a common IR score for search engines) than BM25. If TF-IDF is not retained then I will be stuck on version 5.

> There are people that are convinced TF-IDF is better for their data+queries than BM25. I've heard this a number of times: sometimes where the user has done a lot of testing and shown that TF-IDF is better in one way or another for them and sometimes not. I also believe TF-IDF also has more literature around it, is taught in more schools, and is generally a bit easier to understand (the formula is more or less in the name) so there may be some resistance purely due to how understood the two are relative to one another.
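
For readers unfamiliar with the metric mentioned above, here is a minimal sketch of how MRR can be computed when comparing two similarities on the same query set. The function names and the shape of the input (a ranked list of document IDs plus a set of relevant IDs per query) are assumptions made for illustration, not part of any Elasticsearch API.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked_ids: Sequence[Hashable], relevant_ids: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant hit, or 0.0 if none is returned."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Average the reciprocal rank over all (ranking, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical results for three queries under one similarity.
    example_runs = [
        (["a", "b", "c"], {"b"}),  # first relevant hit at rank 2
        (["x", "y"], {"x"}),       # first relevant hit at rank 1
        (["p"], {"q"}),            # no relevant hit returned
    ]
    print(mean_reciprocal_rank(example_runs))
```

Running the same judged query set once against a `classic` index and once against a `BM25` index, then comparing the two MRR values, is one way to back up a claim like the one above.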

jpountz · 2017-08-02T10:16:24Z

We will make this deprecation smoother by adding a scripted similarity: #25831.
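
As a rough sketch of what a TF-IDF reimplementation on top of a scripted similarity could look like, the snippet below builds an index-creation body in Python. The similarity name `scripted_tfidf` is made up, and the script variables (`doc.freq`, `doc.length`, `field.docCount`, `term.docFreq`, `query.boost`) are assumptions about what the scripted similarity exposes, so check them against the documentation of the version you run. Note that this only covers the per-term weight: coord and query normalization are not part of it.

```python
import json

# Painless source approximating Lucene's classic per-term weight:
# sqrt(tf) * idf * length norm, scaled by the query boost.
# The variable names below are assumptions, not verified against a release.
TFIDF_SCRIPT = (
    "double tf = Math.sqrt(doc.freq); "
    "double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; "
    "double norm = 1 / Math.sqrt(doc.length); "
    "return query.boost * tf * idf * norm;"
)


def scripted_tfidf_settings(name: str = "scripted_tfidf") -> dict:
    """Build a create-index body that registers a scripted similarity."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    name: {
                        "type": "scripted",
                        "script": {"source": TFIDF_SCRIPT},
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # The printed JSON could be sent as the body of an index-creation request.
    print(json.dumps(scripted_tfidf_settings(), indent=2))
```

A field would then opt in by referencing the similarity name in its mapping.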

The `classic` similarity used to rely on specific features like query normalization and coord factors, which were specific to this similarity and have been removed as the `bm25` similarity has now been the default similarity for some time. As a consequence, using the `classic` similarity could lead to deceptive scores, which is why we want to prevent users from using it on new indices. `bm25` is generally considered a superior option. Closes elastic#23208

cbuescher · 2018-03-13T17:25:16Z

@jpountz I'm not entirely sure about the current state of this issue, is the "discuss" label still appropriate? What are the next steps if not?
cc @elastic/es-search-aggs

This improves the way similarities are plugged in in order to:
- reject the classic similarity on 7.x indices and emit a deprecation warning otherwise
- reject unknown parameters on 7.x indices and emit a deprecation warning otherwise

Even though this breaks the plugin API, I'd like to backport to 7.x so that users can get deprecation warnings when they are doing something that will become unsupported in the future. Closes elastic#23208 Closes elastic#29035
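
The behavior described in this commit message can be pictured with the small sketch below. It is not the actual Elasticsearch code (which is Java); the function name, the `ALLOWED_PARAMS` table, and the message wording are all hypothetical and only illustrate the "reject on 7.x indices, warn otherwise" rule.

```python
import warnings

# Hypothetical table of accepted parameters per similarity type.
ALLOWED_PARAMS = {
    "BM25": {"k1", "b", "discount_overlaps"},
    "classic": {"discount_overlaps"},
}


def validate_similarity(sim_type: str, params: dict, index_major_version: int) -> None:
    """Reject problems on indices created on 7.x or later, warn on older ones."""
    problems = []
    if sim_type == "classic":
        problems.append(
            "the [classic] similarity is deprecated; use [BM25] or a scripted similarity"
        )
    unknown = sorted(set(params) - ALLOWED_PARAMS.get(sim_type, set()))
    if unknown:
        problems.append(f"unknown parameters {unknown} for similarity [{sim_type}]")

    for problem in problems:
        if index_major_version >= 7:
            # New indices: fail fast instead of silently ignoring the issue.
            raise ValueError(problem)
        # Older indices keep working but the user is told about the issue.
        warnings.warn(problem, DeprecationWarning, stacklevel=2)


if __name__ == "__main__":
    warnings.simplefilter("always")
    validate_similarity("classic", {}, index_major_version=6)  # warns
    try:
        validate_similarity("BM25", {"k2": 1.0}, index_major_version=7)
    except ValueError as error:
        print(f"rejected: {error}")
```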

shmed · 2019-09-19T00:01:32Z

Sorry to leave a comment on an older issue, but I was wondering how the "scripts" solution would help replace the missing coordinating factors? It seems that to reproduce scores that are aligned with how the previous classic similarity calculated them, the coordinating factors would be essential, and as far as I understand, those were calculated in the boolean weights. Thank you

jpountz · 2019-09-19T20:10:31Z

This feature of Lucene's TF-IDF similarity can't be reimplemented with a script indeed. For the record, note that this isn't part of the official definition of TF-IDF but something that has been added on top in order to work around the fact that the TF weighting would allow a document that contains many occurrences of a single query term to score better than documents that contain all query terms. This is no longer an issue with BM25 (and most other modern similarities) whose TF weighting is saturated.
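
To make the point about saturation concrete, here is a toy comparison. It assumes a two-term query where both terms have the same IDF (so IDF is left out), uses `sqrt(freq)` as the classic TF weight, and uses the textbook BM25 TF component with `k1 = 1.2` and an average-length document. The function names are made up for this illustration and the numbers are not Lucene output.

```python
import math


def classic_tf(freq: int) -> float:
    """Unsaturated TF weight of the classic similarity: sqrt(freq)."""
    return math.sqrt(freq) if freq > 0 else 0.0


def bm25_tf(freq: int, k1: float = 1.2, b: float = 0.75, length_ratio: float = 1.0) -> float:
    """Textbook BM25 TF component; it can never exceed k1 + 1."""
    if freq <= 0:
        return 0.0
    norm = k1 * (1 - b + b * length_ratio)
    return freq * (k1 + 1) / (freq + norm)


def coord(matched_terms: int, total_terms: int) -> float:
    """Coordination factor: fraction of query terms the document matches."""
    return matched_terms / total_terms


def score(freqs, tf_weight, use_coord=False):
    """Sum per-term TF weights, optionally scaled by the coord factor."""
    total = sum(tf_weight(f) for f in freqs)
    if use_coord:
        total *= coord(sum(1 for f in freqs if f > 0), len(freqs))
    return total


if __name__ == "__main__":
    stuffed = [9, 0]   # one query term repeated 9 times, the other missing
    complete = [1, 1]  # both query terms present once

    for label, weight, with_coord in [
        ("classic, no coord", classic_tf, False),
        ("classic, with coord", classic_tf, True),
        ("BM25", bm25_tf, False),
    ]:
        print(label, score(stuffed, weight, with_coord), score(complete, weight, with_coord))
```

In this toy case the document that repeats a single term wins under the unsaturated weight alone, while both the coord factor and BM25's saturated weight put the document matching all terms first.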

shmed · 2019-09-19T20:46:10Z

Thanks a lot @jpountz for the reply. I understand that coords were a workaround for the fact that TF-IDF isn't great at favoring documents that match all terms over a document that only matches a single term multiple times. I also understand that BM25 has better TF saturation, which naturally helps with those scenarios. Unfortunately, I have the challenging goal of customizing the classic similarity algorithm in Es7 (Lucene8) such that the resulting scores are the same as in Es5 (Lucene6, when coords were still a thing). I was originally hoping that coords could somehow be re-implemented externally (or through the scripting features), but the concept of coords is so entangled with the various Lucene scoring classes (boolean scorers, weights, parsers) that it's proven to be much more challenging than I originally hoped.

jpountz · 2019-09-19T21:06:34Z

Replicating version 5 scoring in recent versions is not possible.

jpountz added the :Search/Search, >enhancement, good first issue, and low hanging fruit labels Feb 16, 2017

clintongormley added >deprecation and removed >enhancement labels Feb 16, 2017

colings86 added the help wanted adoptme label Mar 31, 2017

colings86 added discuss and removed help wanted adoptme labels Jun 5, 2017

jimczi mentioned this issue Jun 7, 2017

One similarity per index #18971

Closed

jpountz mentioned this issue Aug 2, 2017

Disallow the classic similarity on new indices. #26016

Closed

jpountz mentioned this issue Mar 21, 2018

Improve similarity integration. #29187

Merged

jpountz closed this as completed in #29187 Apr 3, 2018

pablogps mentioned this issue Aug 28, 2019

Possible error in similarity documentation #46058

Closed
