Make "german2" an alias for "german" snowball stemmer #113614
Pinging @elastic/es-search-relevance (Team:Search Relevance)
@benwtrent @javanna since we decided we don't want to preserve any legacy behavior on old indices here, I reverted adding the 9x GermanStemmer with e34d43f.
left two nits, change looks good to me. Maybe we need to mark this breaking and add info to the changelog file in order to have this documented? Can you also check if we mention the german2 stemmer anywhere in the docs?
@javanna thanks for the review, I addressed your comments and added a changelog entry. Let me know if that is all you had in mind before I merge.
looks good! We'll need to backport the deprecation to 8x as well. Thanks!
Thanks, regarding the deprecation backport I'm wondering if that's what we should do though. Deprecation should allow users to safely move to a new feature/API staying on the same version. If we deprecate "german2" in 8.x though, the user would need to move to "german" in order to avoid deprecation, but "german" has the old legacy behavior in 8.x (Lucene 9) still. I wonder if we should only deprecate in 9 and remove in 10 instead, in that case no backport. wdyt?
Although we call it a deprecation in 9, the behaviour changes already? I would think we give the opportunity to move away from german2 before its behavior changes. Not too sure how useful that is. In practice these users would need to reindex to move away from it, and there is no need for them to do it in 8.x. @benwtrent do you have opinions?
I don't understand. Deprecate in Elasticsearch 9 and remove in 10? We should 100% deprecate it in 8 (critical), indicating the behavior will change in 9, and keep it deprecated in 9 indicating the behavior HAS changed.
I probably need to clarify: What was called "german2" in 8x will now be exposed as the "german" stemmer.
The behavioral change happens in the "german" language stemmer. But we intend to remove the "german2" in favour of the "german" version.
I think I understand your concerns. The situation isn't perfect, but I think that the two stemmers have been merged into one and there isn't one that replaces the others? I believe that what we called |
I do see the issue around deprecating |
That's an outcome of the decision to not provide a bwc treatment in 9 because this is a breaking upstream change in Snowball.
I can see how backporting the deprecation may only lead to confusion. As long as release notes highlight the change, and we keep german2 through all of ES9, I think we are ok. Thank you for the due diligence @cbuescher
@cbuescher is this PR relevant to the serverless changelog? [FYI this question is based on 9.0 breaking changes]
Yes I think so.
With Lucene 10, German2Stemmer, which is used as a parameter for the Snowball stemmer,
has been folded into GermanStemmer. The main effect is a different treatment of umlauts:
formerly, "german" would stem "Bücher" -> "Buch" but "Buecher" -> "Buech", while "german2"
would stem both to the same form "Buch". The general "german" stemmer variant now behaves the way "german2" did.
This change makes the defunct "german2" language stemmer an alias for the "german" stemmer, which now
includes the same improved functionality.
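
The aliasing described above could look roughly like the following. This is only a minimal, self-contained sketch and not the code from this PR: the class name `StemmerLanguageAlias`, the `resolve` method, the `deprecationSink` parameter, and the wording of the warning are all hypothetical and chosen for illustration. It shows a "german2" language setting being mapped to "german" while a deprecation message is recorded, with every other language name passed through unchanged.

```java
import java.util.Locale;
import java.util.function.Consumer;

// Hypothetical helper, not part of Elasticsearch: maps the deprecated
// "german2" stemmer language onto "german" and reports a deprecation.
final class StemmerLanguageAlias {

    // Illustrative wording only; the real deprecation message may differ.
    static final String GERMAN2_WARNING =
        "The 'german2' stemmer is deprecated and is now an alias for 'german'.";

    static String resolve(String language, Consumer<String> deprecationSink) {
        // Compare case-insensitively with a fixed locale.
        String normalized = language.toLowerCase(Locale.ROOT);
        if (normalized.equals("german2")) {
            deprecationSink.accept(GERMAN2_WARNING);
            return "german";
        }
        // All other languages are returned as given (lowercased).
        return normalized;
    }

    public static void main(String[] args) {
        // Warnings go to stderr, resolved names to stdout.
        System.out.println(resolve("german2", System.err::println));
        System.out.println(resolve("german", System.err::println));
    }
}
```

Resolving the alias in one place means an index that still configures "german2" keeps working throughout the version in which the alias exists, while the warning tells users which name to configure instead.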