java.lang.UnsupportedOperationException: null when using custom user dictionary for kuromoji_tokenizer #36100

ppf2 · 2018-11-30T01:38:43Z

Elasticsearch version: 6.5

To reproduce:

Create an index with a custom user dictionary for kuromoji_tokenizer:

{
  "analysis": {
    "tokenizer": {
      "custom_user_dictionary": {
        "type": "kuromoji_tokenizer",
        "mode": "normal",
        "discard_punctuation": "false",
        "user_dictionary": "test.csv"
      }
    },
    "analyzer": {
      "japanese": {
        "tokenizer": "custom_user_dictionary"
      },
      "japanese-search": {
        "tokenizer": "custom_user_dictionary"
      }
    }
  }
}

The above with the attached dictionary file test.txt results in an unsupported_operation_exception with no useful error messaging:

{
  "error": {
    "root_cause": [
      {
        "type": "unsupported_operation_exception",
        "reason": null
      }
    ],
    "type": "unsupported_operation_exception",
    "reason": null
  },
  "status": 500
}

The underlying Lucene/ES exception simply shows null as well:

java.lang.UnsupportedOperationException: null
	at org.apache.lucene.util.fst.Outputs.merge(Outputs.java:97) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.util.fst.Builder.add(Builder.java:445) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
	at org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:131) ~[?:?]
	at org.apache.lucene.analysis.ja.dict.UserDictionary.open(UserDictionary.java:81) ~[?:?]
	at org.elasticsearch.index.analysis.KuromojiTokenizerFactory.getUserDictionary(KuromojiTokenizerFactory.java:64) ~[?:?]
	at org.elasticsearch.index.analysis.KuromojiTokenizerFactory.<init>(KuromojiTokenizerFactory.java:50) ~[?:?]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:348) ~[elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenizerFactories(AnalysisRegistry.java:181) ~[elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:158) ~[elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.index.IndexService.<init>(IndexService.java:165) ~[elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:397) ~[elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:503) ~[elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:457) ~[elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$IndexCreationTask.execute(MetaDataCreateIndexService.java:446) ~[elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) [elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.5.0.jar:6.5.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.5.0.jar:6.5.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

The issue here appears to be a duplicate entry in the user dictionary. I have already isolated the problem to the duplication, but for large user dictionaries this can be cumbersome to debug. It would be nice if we could handle duplicate entries automatically or provide a more useful message.
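For anyone hitting this with a large dictionary before a fix lands, a small script can locate the duplicates. The sketch below is not part of Elasticsearch or Lucene. It assumes the usual Kuromoji user dictionary layout, where each entry is a CSV line whose first field is the surface form and lines starting with `#` are comments. The function name `find_duplicate_terms` is hypothetical.

```python
import csv
from collections import defaultdict


def find_duplicate_terms(lines):
    """Return {term: [1-based line numbers]} for terms listed more than once.

    Assumption: the first CSV field of each entry is the term (surface form),
    and blank lines or lines starting with '#' are ignored.
    """
    seen = defaultdict(list)
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Parse a single line as CSV so quoted fields are handled.
        row = next(csv.reader([line]))
        seen[row[0].strip()].append(lineno)
    return {term: nums for term, nums in seen.items() if len(nums) > 1}


if __name__ == "__main__":
    # Illustrative entries only; replace with the lines of your own file, e.g.
    # open("test.csv", encoding="utf-8").read().splitlines()
    sample = [
        "# custom entries",
        "最終契約,最終 契約,サイシュウ ケイヤク,カスタム名詞",
        "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞",
        "最終契約,最終 契約,サイシュウ ケイヤク,カスタム名詞",
    ]
    for term, nums in find_duplicate_terms(sample).items():
        print(f"duplicate term [{term}] at lines {nums}")
```

The line numbers reported here count every line in the file, including comments and blanks, which may differ from how a given Elasticsearch version counts lines in its own error message.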

elasticmachine · 2018-11-30T01:38:44Z

Pinging @elastic/es-search

mickyharvey2 · 2018-12-01T20:37:46Z

@ppf2 I looked into your issue. getUserDictionary calls the UserDictionary class, so the issue is within Lucene (lucene.analysis.ja.JapaneseAnalyzer.java) and not Elasticsearch: the analyzer expects the UserDictionary class rather than the interface, and the UserDictionary class calls methods that end up in the default Outputs.merge implementation, which throws the exception without a message.

jimczi · 2018-12-03T10:31:09Z

@mickyharvey2 the issue is indeed in Lucene, I opened https://issues.apache.org/jira/browse/LUCENE-8584 and will update this issue when it is resolved in Lucene.

elasticsearchmachine · 2024-07-12T10:26:25Z

Pinging @elastic/es-search-relevance (Team:Search Relevance)

john-wagster · 2024-09-09T17:26:29Z

We definitely have a better error message in newer versions of ES (tested on 8.15.1). I'll spend some cycles here seeing whether we can deduplicate, though, as that does seem like it would be ideal.

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Found duplicate term [最終契約] in user dictionary at line [1]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Found duplicate term [最終契約] in user dictionary at line [1]"
  },
  "status": 400
}

john-wagster · 2024-09-09T20:25:43Z

Related work previously done to handle the duplicates in ES: #103325

john-wagster · 2024-09-20T21:30:03Z

@ppf2 I added a new lenient flag to the Kuromoji (and Nori) tokenizers that deduplicates the user_dictionary. It defaults to false; when it is not set, an appropriate error is thrown with the line number of the duplicate. I believe this satisfies your original request. Let me know if you have any other questions or concerns; otherwise I'll close this ticket out.

Example of the new lenient flag:

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt",
            "lenient": "true"
          }
        },
...
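On clusters that predate the lenient flag, the dictionary file can be cleaned before it is deployed. The following is a minimal sketch of that idea, not the Elasticsearch implementation. It assumes the first comma-separated field of each entry is the term, that `#` lines are comments, and that keeping the first occurrence of a term is acceptable. The thread does not say which occurrence the lenient flag keeps, and `dedupe_user_dictionary` is a hypothetical name.

```python
def dedupe_user_dictionary(lines):
    """Return the lines with later duplicates of a term removed.

    Assumptions: the term is the first comma-separated field, comments
    start with '#', and the first occurrence of each term wins.
    Comments and blank lines are passed through unchanged.
    """
    seen = set()
    kept = []
    for raw in lines:
        stripped = raw.strip()
        if stripped and not stripped.startswith("#"):
            term = stripped.split(",", 1)[0].strip()
            if term in seen:
                continue  # drop a repeated term
            seen.add(term)
        kept.append(raw)
    return kept


if __name__ == "__main__":
    # Illustrative entries only.
    entries = [
        "# custom entries",
        "最終契約,最終 契約,サイシュウ ケイヤク,カスタム名詞",
        "最終契約,最終 契約,サイシュウ ケイヤク,カスタム名詞",
    ]
    print("\n".join(dedupe_user_dictionary(entries)))
```

The plain `split(",", 1)` does not handle quoted commas inside the first field; switching to the `csv` module would be the safer choice if the dictionary uses quoting.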

ppf2 · 2024-09-22T04:48:59Z

SGTM. Thanks!

ppf2 added the :Search Relevance/Analysis label (How text is split into tokens) Nov 30, 2018

jimczi added the >bug label Nov 30, 2018

jimczi mentioned this issue Dec 7, 2018: Add support for inlined user dictionary in Nori #36123 (Merged)

rjernst added the Team:Search label (Meta label for search team) May 4, 2020

javanna added the Team:Search Relevance label (Meta label for the Search Relevance team in Elasticsearch) and removed the Team:Search label (Meta label for search team) Jul 12, 2024

javanna added the priority:high label (A label for assessing bug priority to be used by ES engineers) Jul 18, 2024

john-wagster self-assigned this Sep 6, 2024

john-wagster mentioned this issue Sep 11, 2024: Deduplicate Nori and Kuromoji User Dictionary #112768 (Merged)

john-wagster closed this as completed Sep 23, 2024
