Add limits for ngram and shingle settings #25887
Comments
I think this is worth having a look at. A soft limit will catch anyone just playing around and give back an error so they know this can be problematic.
We discussed during fixit friday and we all agreed on the fact that there should be a soft limit for the difference between the |
Hello guys, |
I guess we will raise |
@mayya-sharipova If we change the settings in those classes to use |
Analyzers use a group setting ( |
@jimczi @colings86 Do we want to create a new Setting for the difference between When we say "soft limit should be 0" what do we mean? Does it mean always throwing an exception when the limit is not 0? |
@mayya-sharipova I think the soft-limit for the difference between |
Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference between max_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Throw an IllegalArgumentException when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the settings value. Closes elastic#25887
Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference between max_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Throw an IllegalArgumentException from v7.X when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the settings value. Create a deprecation warning for versions < 7.X for this case. Closes elastic#25887
* Add limits for ngram and shingle settings (#27211) Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference between max_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Throw an IllegalArgumentException when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the settings value. Closes #25887
Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference between max_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Log a warning when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the setting's value. Closes elastic#25887
Create index-level settings: max_ngram_diff - maximum allowed difference between max_gram and min_gram in NGramTokenFilter/NGramTokenizer. Default is 1. max_shingle_diff - maximum allowed difference between max_shingle_size and min_shingle_size in ShingleTokenFilter. Default is 3. Log a warning when trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter where difference between max_size and min_size exceeds the setting's value. Closes #25887
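The check that these commit messages describe can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class GramDiffCheck, the validate method, its parameters, and the error text are all hypothetical. Only the default values (1 for max_ngram_diff, 3 for max_shingle_diff) and the use of IllegalArgumentException are taken from the messages above.

```java
// Hypothetical sketch of the limit check; names are illustrative, not Elasticsearch code.
final class GramDiffCheck {
    // Defaults quoted in the commit messages.
    static final int DEFAULT_MAX_NGRAM_DIFF = 1;
    static final int DEFAULT_MAX_SHINGLE_DIFF = 3;

    // Rejects a configuration whose (maxSize - minSize) exceeds the allowed difference.
    static void validate(String kind, int minSize, int maxSize, int maxAllowedDiff) {
        int diff = maxSize - minSize;
        if (diff > maxAllowedDiff) {
            // Assumed wording; the real message may differ.
            throw new IllegalArgumentException(
                "The difference between max and min " + kind + " size must be <= "
                    + maxAllowedDiff + " but was " + diff);
        }
    }

    public static void main(String[] args) {
        // Within the default ngram limit: passes silently.
        validate("ngram", 3, 4, DEFAULT_MAX_NGRAM_DIFF);

        // Exceeds the default ngram limit: rejected.
        try {
            validate("ngram", 1, 13, DEFAULT_MAX_NGRAM_DIFF);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

According to the commit messages, the variants for versions before 7.X log a warning at the point where this sketch throws.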
I feel that this is going to cause issues because of the following: If I need to do a contains search over something like "AA-PP-001-002" I will be restricted by it. For instance - min ngram = 3, max ngram = 4 |
The solution implemented just adds a setting to limit the allowed diff between |
Note that, as detailed in the linked PR (#27411), these limits are backed by an index-level setting that can be changed if needed. In the case of ngrams the setting is |
Also note that in your example a query for, say, |
@colings86 - Thanks for your reply. OK, that's great. I was under the impression that the deprecation message indicated that the diff would eventually disappear and we would be forced into using the max diff of 1. As long as this is just a cautionary message, I am fine.
Currently the options for ngram and shingle tokenizers/token filters allow the user to set min_size and max_size to any values. This is dangerous because users can set values that produce huge numbers of terms, which at best bloats their index and at worst causes problems such as #25841.

I think we should add soft (and/or maybe hard) limits so that neither min_size nor max_size can be more than, say, 6, and the difference between min_size and max_size can't be more than 2 or 3 (we may even want to make this limit 1).

Note that this does not apply to edge_ngrams, where it is useful to have higher values and a larger difference between min and max values. We should probably decide whether there should be different limits here, though.
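To make the term-count concern concrete, here is a rough sketch of how the number of emitted grams grows with the gap between the minimum and maximum sizes. It assumes a plain character n-gram scheme in which a text of length L yields L - n + 1 grams of size n, counted before any deduplication. The class and method names are hypothetical, and the sample string is the one used in the comments.

```java
// Hypothetical helper: counts character n-grams emitted for one input string.
final class NgramTermCount {
    // Sums (textLength - n + 1) over every gram size n in [minGram, maxGram]
    // that fits in the text.
    static long countNgrams(int textLength, int minGram, int maxGram) {
        long total = 0;
        for (int n = minGram; n <= maxGram; n++) {
            if (n <= textLength) {
                total += textLength - n + 1;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int len = "AA-PP-001-002".length(); // 13 characters

        // Small difference (max - min = 1).
        System.out.println("min=3, max=4:  " + countNgrams(len, 3, 4));

        // Unbounded difference: every substring becomes a gram.
        System.out.println("min=1, max=13: " + countNgrams(len, 1, len));
    }
}
```

With a fixed minimum, the count grows roughly linearly in the text length when the difference is small and roughly quadratically when the maximum is allowed to reach the text length, which is the bloat the issue warns about.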