Disallow the classic (TF-IDF) similarity on 6.0 indices. #23208
Comments
@jpountz I made this an adoptme as it seems like something we intend to do rather than something that needs further discussion.
Hi, I'm new to the project and would like to get started on this issue if no one else is already working on it.
Adding the |
@jpountz can we really support it without query coordination? If not, then I'd opt for removing it. (Even if we can, I'm leaning towards removing it.)
The removal of query coordination might indeed significantly decrease the quality of this similarity. @eskibars Maybe you have some perspective on this? I think you suggested in the last search&aggs meeting that some users might rely specifically on TF-IDF.
I'm certain some people rely on the full old/classic TF-IDF implementation, but in a few different ways that I'll tease apart.
To be clear, I think that the vast majority of users do not fall into any of these categories. I've seen that BM25 performs better in the vast majority of UAT (especially in text search) and is built on sound principles that avoid keyword stuffing, etc. Given that Lucene is dropping |
I'm wondering whether the scripted similarity feature that we discussed a couple times could be a good workaround rather than creating a plugin. Reimplementing TF-IDF could actually be a good documentation example of it.
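As a rough illustration of what such a script would need to compute, here is a minimal Python sketch of a per-term TF-IDF score. The function and parameter names are hypothetical, and the formula (square-root term frequency, logarithmic IDF, inverse square-root length norm) is an assumption modelled on Lucene's classic similarity. Coord factors and query normalization are deliberately left out, since they operate at the query level rather than per term.

```python
import math


def classic_term_score(freq, doc_freq, doc_count, doc_length, boost=1.0):
    """Per-term TF-IDF score, assuming a classic-similarity-style formula.

    freq:       occurrences of the term in the document field
    doc_freq:   number of documents containing the term
    doc_count:  number of documents that have the field
    doc_length: number of tokens in the document field
    """
    tf = math.sqrt(freq)                                        # dampened term frequency
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0  # rarer terms weigh more
    norm = 1.0 / math.sqrt(doc_length)                          # shorter fields weigh more
    return boost * tf * idf * norm


# Hypothetical statistics, chosen only to exercise the function.
print(classic_term_score(freq=4, doc_freq=1, doc_count=3, doc_length=16))
```

A script written this way only sees per-term statistics, so it can reproduce the term weighting but not anything that depends on how many query clauses matched.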
Hello, I fall into this category and I have empirical testing to back it up. Please contact me if you have any questions. For my data and application, TF-IDF in Elasticsearch produces a better MRR (mean reciprocal rank, a common IR score for search engines) than BM25. If TF-IDF is not retained then I will be stuck on version 5.
We will make this deprecation smoother by adding a |
The `classic` similarity used to rely on specific features like query normalization and coord factors, which were specific to this similarity and have been removed as the `bm25` similarity has now been the default similarity for some time. As a consequence, using the `classic` similarity could lead to deceptive scores, which is why we want to prevent users from using it on new indices. `bm25` is generally considered a superior option. Closes elastic#23208
@jpountz I'm not entirely sure about the current state of this issue, is the "discuss" label still appropriate? What are the next steps if not?
This improves the way similarities are plugged in in order to:
- reject the classic similarity on 7.x indices and emit a deprecation warning otherwise
- reject unknown parameters on 7.x indices and emit a deprecation warning otherwise

Even though this breaks the plugin API, I'd like to backport to 7.x so that users can get deprecation warnings when they are doing something that will become unsupported in the future. Closes elastic#23208 Closes elastic#29035
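The reject-or-warn behaviour described in this commit message can be sketched as follows. This is not the Elasticsearch implementation: `validate_similarity`, `KNOWN_PARAMS`, and the parameter names listed in it are hypothetical stand-ins used only to show the control flow, with the index's major version deciding between a hard failure and a deprecation warning.

```python
import warnings

# Hypothetical table of accepted parameters per similarity type (an assumption).
KNOWN_PARAMS = {
    "BM25": {"k1", "b", "discount_overlaps"},
    "classic": {"discount_overlaps"},
}


def validate_similarity(index_major_version, sim_type, params):
    """Reject problems on 7.x+ indices, warn about them on older indices."""
    strict = index_major_version >= 7
    problems = []
    if sim_type == "classic":
        problems.append("the [classic] similarity is no longer supported")
    if sim_type in KNOWN_PARAMS:
        unknown = sorted(set(params) - KNOWN_PARAMS[sim_type])
        if unknown:
            problems.append(f"unknown parameters for [{sim_type}]: {unknown}")
    for problem in problems:
        if strict:
            raise ValueError(problem)                # new index: hard failure
        warnings.warn(problem, DeprecationWarning)   # old index: warn only


# A well-formed BM25 configuration passes silently on any index version.
validate_similarity(7, "BM25", {"k1": 1.2, "b": 0.75})
```

Warning on older indices first gives users a release cycle to notice and migrate before the same configuration starts failing.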
Sorry to leave a comment on an older issue, but I was wondering how the "scripts" solution would help replace the missing coordinating factors. It seems that, to reproduce scores that are aligned with how the previous classic similarity calculated them, the coordinating factors would be essential, and as far as I understand, those were calculated in the boolean weights. Thank you
This feature of Lucene's TF-IDF similarity can't be reimplemented with a script indeed. For the record, note that this isn't part of the official definition of TF-IDF but something that has been added on top in order to work around the fact that the TF weighting would allow a document that contains many occurrences of a single query term to score better than documents that contain all query terms. This is no longer an issue with BM25 (and most other modern similarities) whose TF weighting is saturated.
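The effect described above can be seen in a toy model. The sketch below is an assumption-laden simplification, not Lucene's scoring code: it ignores IDF, length norms and query normalization, uses a square-root TF for the classic case and a BM25-style saturated TF with `k1 = 1.2` and no length normalization, and all function names are hypothetical.

```python
import math


def saturated_tf(freq, k1=1.2):
    """BM25-style TF weight without length normalization; bounded by k1 + 1."""
    return freq * (k1 + 1.0) / (freq + k1)


def score(term_freqs, tf_weight, use_coord=False):
    """Sum the TF weights of matching query terms, optionally scaled by coord."""
    matched = [f for f in term_freqs if f > 0]
    total = sum(tf_weight(f) for f in matched)
    if use_coord:
        # coord = matching query terms / total query terms
        total *= len(matched) / len(term_freqs)
    return total


# Two-term query: one document repeats a single term, the other has both once.
stuffed = [9, 0]
balanced = [1, 1]

print("sqrt TF, no coord:", score(stuffed, math.sqrt), score(balanced, math.sqrt))
print("sqrt TF, coord:   ", score(stuffed, math.sqrt, True), score(balanced, math.sqrt, True))
print("saturated TF:     ", score(stuffed, saturated_tf), score(balanced, saturated_tf))
```

With these frequencies the unsaturated TF ranks the repetitive document first unless coord is applied, while the saturated TF ranks the document matching both terms first without any coord factor. Because the model leaves out IDF and length normalization, it only illustrates the trend and does not reproduce real scores.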
Thanks a lot @jpountz for the reply. I understand that coords were a workaround for the fact that TF-IDF isn't great at favoring documents that match all terms over a document that only matches a single term multiple times. I also understand that BM25 has better TF saturation, which naturally helps with those scenarios. Unfortunately, I have the challenging goal of customizing the classic similarity algorithm in Es7 (Lucene8) such that the resulting scores are the same as in Es5 (Lucene6, when coords were still a thing). I was originally hoping that coords could somehow be re-implemented externally (or through the scripting features), but the concept of coords is so entangled with the various Lucene scoring classes (boolean scorers, weights, parsers) that it has proven to be much more challenging than I originally hoped.
Replicating version 5 scoring in recent versions is not possible.
`BM25` should generally perform better than `TF-IDF`. Moreover, Lucene is removing coords and query normalization in 7.0 (that Elasticsearch will be based on) so we should start deprecating the `classic` similarity.