Support kuromoji user dictionary set directly in the settings file #25343

Closed
tatsuya opened this issue Jun 21, 2017 · 7 comments · Fixed by #45489
Labels
>feature :Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@tatsuya
tatsuya commented Jun 21, 2017

It would be nice if kuromoji_tokenizer supported loading the user dictionary from an array of dictionary entries directly in the settings JSON, not only from a file.

The current settings look like this:

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

My suggestion is to add a new JSON property named user_dictionary_entries (or similar) at the same level as the current user_dictionary, accepting an array of dictionary entries. If both user_dictionary and user_dictionary_entries are given, we would have to either merge the two inputs or use only one of them; I think simply prioritizing one of them would be simpler. This is quite similar to what the Synonym Token Filter already supports.

So the new JSON format would be:

{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary_entries": [
              "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞",
              "..."
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

If this sounds good to you, I can create a pull request anytime. Thank you!

@tlrx tlrx added :Search Relevance/Analysis How text is split into tokens discuss labels Jun 22, 2017
@tlrx
Member

tlrx commented Jun 22, 2017

@tatsuyaoiw Thanks for your suggestion. I marked this issue as "discuss", meaning it is on hold until some team members knowledgeable in analyzers can give a proper answer :)

@tatsuya
Author

tatsuya commented Jun 22, 2017

@tlrx Cool! Looking forward to hearing back.

@cbuescher
Member

@tatsuyaoiw thanks for the suggestion. We discussed this and it sounds reasonable as long as the user dictionary remains relatively small. I don't know how practical that is, though; maybe @johtani has some thoughts about this? In any case, can you briefly describe how you plan to use this feature? Is it only for testing, or to extend the tokenizer dictionary in production?

@tatsuya
Author

tatsuya commented Jun 23, 2017

@cbuescher Thanks. This would primarily be for testing purposes, but if it works with a small number of additional dictionary entries, it could also be used in production. I think this gives a lot more flexibility because we don't need any dependency on the filesystem.

@johtani
Contributor

johtani commented Jun 26, 2017

I think the user dictionary is usually not small.
We can support this, but we should have some limit on the user dictionary size.
We would also have to change Lucene's source, because JapaneseTokenizerFactory only supports loading the dictionary from a file.

@lcawl lcawl added :Search Relevance/Analysis How text is split into tokens and removed :Plugin Analysis Kuromoji labels Feb 13, 2018
@romseygeek
Contributor

cc @elastic/es-search-aggs

@cbuescher cbuescher added the help wanted adoptme label Apr 9, 2019
@cbuescher
Member

Revisiting this: the Nori tokenizer now supports custom inline dictionaries (#35842), so we should treat Kuromoji similarly and add this feature there as well (keeping in mind @johtani's remark that we should perhaps limit the dictionary size).
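For reference, the inline-dictionary settings that #35842 added for Nori look roughly like this (a sketch based on the Elasticsearch Nori documentation; the tokenizer name `nori_user_dict` is illustrative, and each rule is a token optionally followed by its decompounded segments):

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": [
              "c++",
              "세종시 세종 시"
            ]
          }
        }
      }
    }
  }
}
```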

jimczi added a commit to jimczi/elasticsearch that referenced this issue Aug 13, 2019
This change adds a new option called user_dictionary_rules to
Kuromoji's tokenizer. It can be used to set additional tokenization rules
to the Japanese tokenizer directly in the settings (instead of using a file).
This commit also adds a check that no rules are duplicated since this is not allowed
in the UserDictionary.

Closes elastic#25343
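With the `user_dictionary_rules` option described in this commit message, the settings from the original example would become something like the following (a sketch; each rule uses the same CSV format as a line of the user dictionary file):

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary_rules": [
              "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
```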
@jimczi jimczi removed the help wanted adoptme label Aug 13, 2019
jimczi added a commit that referenced this issue Aug 20, 2019
(same commit message as above)

jimczi added a commit that referenced this issue Aug 21, 2019
(same commit message as above)
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024
9 participants