java.lang.UnsupportedOperationException: null when using custom user dictionary for kuromoji_tokenizer #36100
Pinging @elastic/es-search
@ppf2 I have looked into your issue. getUserDictionary calls into the UserDictionary class, and the issue is within lucene.analysis.ja (JapaneseAnalyzer) and not Elasticsearch: the analyzer expects the concrete UserDictionary class, not the interface, and UserDictionary ends up calling the default Outputs.merge implementation, which produces the unhelpful error message.
@mickyharvey2 the issue is indeed in Lucene. I opened https://issues.apache.org/jira/browse/LUCENE-8584 and will update this issue when it is resolved in Lucene.
Pinging @elastic/es-search-relevance (Team:Search Relevance)
We definitely have a better error message now in newer versions of ES (testing on 8.15.1). I'll spend some cycles here seeing if we can deduplicate, though, as that does seem like it would be ideal.

Related work previously done to handle the duplicates in ES: #103325
@ppf2 I added a flag to the Kuromoji (and Nori) tokenizers to deduplicate user dictionary entries. Example of the new flag:

SGTM. Thanks!
Elasticsearch version: 6.5
To reproduce:
Create an index with a custom user dictionary for kuromoji_tokenizer:
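The original request body was not captured in this copy of the issue. A minimal sketch of such index settings follows (the index, tokenizer, and analyzer names here are illustrative, not the reporter's originals; `test.txt` is the user dictionary file, placed in the node's config directory, where each line is CSV in the form `surface,segmentation,readings,part-of-speech` and a repeated surface form is what triggers the failure):

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "user_dict_tokenizer": {
            "type": "kuromoji_tokenizer",
            "user_dictionary": "test.txt"
          }
        },
        "analyzer": {
          "kuromoji_user_dict": {
            "type": "custom",
            "tokenizer": "user_dict_tokenizer"
          }
        }
      }
    }
  }
}
```

Sending this body with `PUT /<index>` against a node whose `test.txt` contains a duplicated entry reproduces the exception described below.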
The above, with the attached dictionary file test.txt, results in an unsupported_operation_exception with no useful error message; the underlying Lucene/ES exception simply shows null as well.
The issue here appears to be the duplicate entry in the user dictionary. While I have already isolated this down to the duplication being the problem, for large user dictionaries it can be cumbersome to debug. It would be nice if we could handle duplicate entries automatically or provide a more useful error message.
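Until duplicates are handled automatically, a small script can flag them up front. The helper below is hypothetical (not part of Elasticsearch or Lucene); it assumes the Kuromoji user dictionary format where the first CSV field of each non-comment line is the surface form, and reports any surface form that appears more than once:

```python
def find_duplicate_surfaces(lines):
    """Return surface forms that occur on more than one dictionary line.

    Each non-comment line of a Kuromoji user dictionary is CSV whose
    first field is the surface form, e.g.
    "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞".
    """
    seen = set()
    duplicates = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        surface = line.split(",", 1)[0]
        if surface in seen and surface not in duplicates:
            duplicates.append(surface)
        seen.add(surface)
    return duplicates


if __name__ == "__main__":
    entry = "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
    # A dictionary with the same surface form twice, as in this issue:
    print(find_duplicate_surfaces([entry, entry]))  # → ['東京スカイツリー']
```

Running it over test.txt (e.g. `find_duplicate_surfaces(open("test.txt"))`) would pinpoint the offending entry without having to bisect a large dictionary by hand.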