Support kuromoji user dictionary set directly in the settings file #25343
Comments
@tatsuyaoiw Thanks for your suggestion. I marked this issue as "discuss", meaning it is on hold until team members knowledgeable about analyzers can give a proper answer :)
@tlrx Cool! Looking forward to hearing back.
@tatsuyaoiw thanks for the suggestion. We discussed this and it sounds reasonable as long as the user dictionary remains relatively small. I don't know how practical this is though; maybe @johtani has some thoughts about this? In any case, can you briefly describe how you are planning to use that feature? Is it only for testing, or to extend the tokenizer dictionary in production?
@cbuescher Thanks. This would primarily be for testing purposes, but if it works well with a small number of additional dictionary entries, it could also be used in production. I think this gives a lot more flexibility because we would not need any dependency on the filesystem.
I think user dictionaries are usually not small.
cc @elastic/es-search-aggs
This change adds a new option called user_dictionary_rules to Kuromoji's tokenizer. It can be used to set additional tokenization rules to the Japanese tokenizer directly in the settings (instead of using a file). This commit also adds a check that no rules are duplicated since this is not allowed in the UserDictionary. Closes elastic#25343
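As a rough illustration of the option this change describes, the fragment below sets two rules inline on a tokenizer. The tokenizer name `kuromoji_user_dict` is an illustrative assumption. The rules assume the same comma-separated format used by lines of a user dictionary file (surface form, segmentation, readings, part-of-speech tag).

```json
{
  "tokenizer": {
    "kuromoji_user_dict": {
      "type": "kuromoji_tokenizer",
      "user_dictionary_rules": [
        "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞",
        "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞"
      ]
    }
  }
}
```

Given the duplicate check mentioned above, listing a second rule for the same surface form (for example, another rule starting with `東京スカイツリー`) would be expected to fail when the tokenizer is created.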
It would be nice if `kuromoji_tokenizer` supported loading a user dictionary from an array of dictionary entries given directly in the settings JSON, not only from a file.
The current settings look like the following:
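A minimal sketch of such a file-based configuration, modeled on the `user_dictionary` option of the kuromoji plugin. The tokenizer name `kuromoji_user_dict`, the analyzer name `my_analyzer`, and the file name `userdict_ja.txt` are illustrative assumptions, not taken from this issue.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
```

With this setup the dictionary file has to exist in the config directory of every node, which is the filesystem dependency the proposal aims to avoid.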
My suggestion is to add a new JSON property named `user_dictionary_entries` (or similar) at the same level as the current `user_dictionary`, which accepts an array of dictionary entries. If both `user_dictionary` and `user_dictionary_entries` are given, then it has to either merge both inputs or use only one of them; I think simply prioritizing one of the inputs would be simpler. This is actually pretty similar to what the Synonym Token Filter already supports.
So the new JSON format would be:
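One possible shape for the proposed setting, as a sketch only. `user_dictionary_entries` is the hypothetical property name suggested in this issue and is not an existing option. The tokenizer and analyzer names are illustrative, and the entry assumes the same comma-separated format as a line in the dictionary file.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "user_dictionary_entries": [
              "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}
```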
If this sounds good to you, I can create a pull request anytime. Thank you!