Support kuromoji user dictionary set directly in the settings file #25343

Closed
tatsuya opened this issue Jun 21, 2017 · 7 comments · Fixed by #45489
Labels
>feature :Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@tatsuya
tatsuya commented Jun 21, 2017

It would be nice if kuromoji_tokenizer supported loading the user dictionary from an array of dictionary entries directly in the settings JSON, not only from a file.

The current settings look like this:

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

My suggestion is to add a new JSON property named user_dictionary_entries (or similar) at the same level as the current user_dictionary, accepting an array of dictionary entries. If both user_dictionary and user_dictionary_entries are given, we would have to either merge the two inputs or use only one of them; I think simply prioritizing one of them would be simpler. This is quite similar to what the Synonym Token Filter already supports.

So the new JSON format would be:

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary_entries": [
              "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞",
              "..."
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

If this sounds good to you, I can create a pull request anytime. Thank you!

@tlrx tlrx added :Search Relevance/Analysis How text is split into tokens discuss labels Jun 22, 2017
@tlrx
Member

tlrx commented Jun 22, 2017

@tatsuyaoiw Thanks for your suggestion. I marked this issue as "discuss", meaning it is on hold until some team members knowledgeable in analyzers can give a proper answer :)

@tatsuya
Author

tatsuya commented Jun 22, 2017

@tlrx Cool! Looking forward to hearing back.

@cbuescher
Member

@tatsuyaoiw thanks for the suggestion. We discussed this and it sounds reasonable as long as the user dictionary remains relatively small. I don't know how practical that is, though; maybe @johtani has some thoughts about this? In any case, can you briefly describe how you plan to use this feature? Is it only for testing, or to extend the tokenizer dictionary in production?

@tatsuya
Author

tatsuya commented Jun 23, 2017

@cbuescher Thanks. This would primarily be for testing purposes, but if it works with a small number of additional dictionary entries, it could also be used in production. I think this gives a lot more flexibility because we don't need any dependency on the filesystem.

@johtani
Contributor

johtani commented Jun 26, 2017

I think the user dictionary is usually not small.
We can support this, but we should have some limit on the user dictionary size.
We would also have to change Lucene's source, because JapaneseTokenizerFactory only supports loading the dictionary from a file.

@lcawl lcawl added :Search Relevance/Analysis How text is split into tokens and removed :Plugin Analysis Kuromoji labels Feb 13, 2018
@romseygeek
Contributor

cc @elastic/es-search-aggs

@cbuescher cbuescher added the help wanted adoptme label Apr 9, 2019
@cbuescher
Member

Revisiting this: the Nori tokenizer now supports custom inline dictionaries (#35842), so we should treat Kuromoji similarly and add this feature there as well (keeping in mind @johtani's remark that we should perhaps limit the dictionary size).
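For reference, the inline-dictionary settings that #35842 added for Nori look roughly like this (a sketch based on the Elasticsearch Nori documentation; the tokenizer name `nori_user_dict` is illustrative, and each rule is a token optionally followed by its decompounded segments):

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": [
              "c++",
              "세종시 세종 시"
            ]
          }
        }
      }
    }
  }
}
```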

jimczi added a commit to jimczi/elasticsearch that referenced this issue Aug 13, 2019
This change adds a new option called user_dictionary_rules to
Kuromoji's tokenizer. It can be used to set additional tokenization rules
to the Japanese tokenizer directly in the settings (instead of using a file).
This commit also adds a check that no rules are duplicated since this is not allowed
in the UserDictionary.

Closes elastic#25343
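With the `user_dictionary_rules` option described in this commit message, the settings from the original example would become something like the following (a sketch; each rule uses the same CSV format as a line of the user dictionary file):

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary_rules": [
              "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
```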
@jimczi jimczi removed the help wanted adoptme label Aug 13, 2019
jimczi added a commit that referenced this issue Aug 20, 2019
(same commit message as above)

jimczi added a commit that referenced this issue Aug 21, 2019
(same commit message as above)
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024
9 participants