Skip to content

Make "german2" an alias for "german" snowball stemmer #113614

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Oct 1, 2024

Conversation

cbuescher
Copy link
Member

@cbuescher cbuescher commented Sep 26, 2024

With Lucene 10, German2Stemmer, which is used as a parameter for the Snowball stemmer,
has been folded into GermanStemmer. This results mainly in different treatment Umlauts, i.e.
where formerly "german" would stem "Bücher" -> "Buch" but "Buecher" -> "Buech" and "german2"
would stem both to the same form "Buch", this is now true for the general "german" stemmer variant.

This change makes the defunct "german2" language stemmer an alias for the "german" stemmer that now
includes the same improved functionality.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@cbuescher cbuescher requested a review from benwtrent September 26, 2024 14:24
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Sep 26, 2024
@cbuescher
Copy link
Member Author

cbuescher commented Sep 30, 2024

@benwtrent @javanna since we decided we don't want to preserve any legacy behavior on old indices here, I reverted adding the 9x GermanStemmer with e34d43f.
Let me know if there is anything missing.

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left two nits, change looks good to me. Maybe we need to mark this breaking and add info to the changelog file in order to have this documented? Can you also check if we mention the german2 stemmer anywhere in the docs?

@cbuescher cbuescher changed the title Add bwc for "german" and "german2" snowball stemming Make "german2" an alias for "german" snowball stemmer Sep 30, 2024
@cbuescher
Copy link
Member Author

@javanna thanks for the review, I addressed your comments and added a changelog entry. Let me know if that is all you had in mind before I merge.

@javanna
Copy link
Member

javanna commented Sep 30, 2024

looks good! We'll need to backport the deprecation to 8x as well. Thanks!

@cbuescher cbuescher merged commit ef2d0b8 into elastic:lucene_snapshot Oct 1, 2024
15 checks passed
@cbuescher cbuescher deleted the german-stemmer-bwc branch October 1, 2024 08:41
@cbuescher
Copy link
Member Author

Thanks, regarding the deprecation backport I'm wondering if that's what we should do though.

Deprecation should allow users to safely move to a new feature/API staying on the same version. If we deprecate "german2" in 8.x though, the user would need to move to "german" in order to avoid deprecation, but "german" has the old legacy behavior in 8.x (Lucene 9) still. I wonder if we should only deprecate in 9 and remove in 10 instead, in that case no backport. wdyt?

@javanna
Copy link
Member

javanna commented Oct 1, 2024

Although we call it a deprecation in 9, the behaviour changes already? I would think we give the opportunity to move away from german2 before its behavior changes. Not too sure how useful that is. In practice these users would need to reindex to move away from it, and there is no need for them to do it in 8.x. @benwtrent do you have opinions?

@benwtrent
Copy link
Member

benwtrent commented Oct 1, 2024

. I wonder if we should only deprecate in 9 and remove in 10 instead, in that case no backport. wdyt?

I don't understand. Deprecate in Elasticsearch 9 and remove in 10?

We should 100% deprecate it in 8 (critical), indicating the behavior will change in 9, and keep it deprecated in 9 indicating the behavior HAS changed.

@cbuescher
Copy link
Member Author

Although we call it a deprecation in 9, the behaviour changes already?

I probably need to clarify:

What was called "german2" in 8x will now be exposed as the "german" stemmer.
The old "german" stemmer is the one with the behaviour that goes away. Its behaviour is now the stemmer that was formerly called "german2".
If we deprecated the "german2" name in 8x, I would assume users should have a way to migrate away from using that in the same version already. But they cannot because the "german" stemmer has the legacy behaviour.
In 9 both names point to the same functionality, but we can deprecate the "german2" name and remove at some point later.
Does that make more sense?

@cbuescher
Copy link
Member Author

We should 100% deprecate it in 8 (critical), indicating the behavior will change in 9

The behavioral change happens in the "german" language stemmer. But we intend to remove the "german2" in favour of the "german" version.
Should we consider deprecating "german" in 8.x and 9.x and remove that one and leave the "german2" in place, since that is the one that doesn't change?

@javanna
Copy link
Member

javanna commented Oct 1, 2024

I think I understand your concerns. The situation isn't perfect, but I think that the two stemmers have been merged into one and there isn't one that replaces the others? I believe that what we called german will get new behaviour from 9.0 as well? Maybe we need to split how we signal that to users then: there is a behavioral change in 9.0 with german. We should though warn users in 8.x already that german2 will no longer be a thing in 9.0, although we keep it deprecated for the 9.x series.

@javanna
Copy link
Member

javanna commented Oct 1, 2024

I do see the issue around deprecating german2 while german does not have its functionality folded in yet. We'd deprecate but users don't have a clean replacement until 9.0 ?

@cbuescher
Copy link
Member Author

We'd deprecate but users don't have a clean replacement until 9.0

That's an outcome of the decision to not provide a bwc treatment in 9 because this is a breaking upstream change in Snowball.
Without the bwc layer that I removed with e34d43f, users would now need to reindex from an index with "german" stemmer to one with "german2" while on 8.x, then switch that name back to "german" at some point in 9 before the "german2" name is removed. Its a bit odd but that's how I read the decision atm.

@benwtrent
Copy link
Member

I can see how backporting the deprecation may only lead to confusion. As long as release notes highlight the change, and we keep german2 through all of ES9, I think we are ok.

Thank you for the due diligence @cbuescher

@leemthompo
Copy link
Contributor

@cbuescher is this PR relevant to the serverless changelog? [FYI this question is based on 9.0 breaking changes]

@cbuescher
Copy link
Member Author

is this PR relevant to the serverless changelog?

Yes I think so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>breaking :Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants