
Add limits for ngram and shingle settings #25887


Closed
colings86 opened this issue Jul 25, 2017 · 13 comments
Labels
:Search Relevance/Analysis How text is split into tokens Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@colings86
Contributor

Currently the options for ngram and shingle tokenizers/token filters allow the user to set min_size and max_size to any values. This is dangerous as users can set values which produce huge numbers of terms and, at best, bloat their index but, at worst, cause problems such as #25841.

I think we should add soft (and/or maybe hard) limits so that neither min_size nor max_size can be more than, say, 6 and the difference between min_size and max_size can't be more than 2 or 3 (we may even want to make this limit 1).

Note that this does not apply to edge_ngrams, where it is useful to have higher values and a larger difference between min and max values. We should probably decide whether there should be different limits there, though.
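To get a feel for the term bloat, here is a quick back-of-the-envelope sketch (Python, illustrative only; `ngram_count` is a hypothetical helper, not Elasticsearch code) counting the terms a single field value would produce:

```python
def ngram_count(text_len, min_size, max_size):
    """Count the ngrams emitted for a string of length text_len
    when every ngram size from min_size to max_size is produced."""
    return sum(max(text_len - size + 1, 0)
               for size in range(min_size, max_size + 1))

# A 50-character field value with modest settings:
print(ngram_count(50, 2, 3))    # 97 terms
# The same field value with a wide min/max spread:
print(ngram_count(50, 1, 20))   # 810 terms
```

Every extra size in the range adds roughly another `text_len` terms, multiplied across every document and field, which is where the index bloat comes from.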

@colings86 colings86 added :Search Relevance/Analysis How text is split into tokens discuss labels Jul 25, 2017
@colings86 colings86 changed the title Add limits for ngram and single settings Add limits for ngram and shingle settings Jul 25, 2017
@nik9000
Member

nik9000 commented Jul 25, 2017

I think this is worth having a look at. A soft limit will catch anyone just playing around and give back an error so they know this can be problematic.

@martijnvg
Member

We discussed this during FixIt Friday and we all agreed that there should be a soft limit for the difference between the min_size and max_size settings. In the case of the ngram tokenizer / token filter the soft limit should be 0, and the shingle token filter's soft limit should be 3. For the edge_ngram tokenizer / token filter there should be no soft limit.

@martijnvg martijnvg added help wanted adoptme and removed discuss labels Jul 28, 2017
@matarrese
Contributor

matarrese commented Aug 7, 2017

Hello,
I was having a look at this issue.
Should a new IllegalArgumentException be raised if the conditions are not satisfied?
Is the constructor of ShingleTokenFilter a good place to do these validations?

@mayya-sharipova mayya-sharipova self-assigned this Oct 25, 2017
@mayya-sharipova mayya-sharipova removed the help wanted adoptme label Oct 25, 2017
@mayya-sharipova
Contributor

mayya-sharipova commented Oct 25, 2017

I guess we will raise an IllegalArgumentException if the conditions are not satisfied.
I was thinking we can do a check at the level of org.elasticsearch.index.analysis.ShingleTokenFilterFactory and org.elasticsearch.index.analysis.NGramTokenizerFactory.

@colings86
Contributor Author

@mayya-sharipova If we change the settings in those classes to use org.elasticsearch.common.settings.Setting.intSetting(String, int, int, int, Property...) instead of Settings.getAsInt(), then we should be able to define and validate the min and max values using the Settings infrastructure, and the exception will be consistent with other similar validation errors thrown from settings.

@jimczi
Contributor

jimczi commented Oct 26, 2017

Analyzers use a group setting (analysis.*) to register their settings. We don't check the inner settings for analyzers, which is why we have to use Settings.getAs.... We should at some point move the analyzer options to checked settings (org.elasticsearch.common.settings.Setting), but that would be a breaking change for plugin devs and not low-hanging fruit.
For this issue I think it's fine to check the conditions in the factories directly and throw an IllegalArgumentException if the conditions are not met, as @mayya-sharipova suggests.
I'll open an issue to discuss how we could validate inner settings for analyzers, but that's not low-hanging fruit ;).

@mayya-sharipova
Contributor

mayya-sharipova commented Oct 27, 2017

@jimczi @colings86 Do we want to create a new Setting for the difference between min_size and max_size? If yes, where do we keep this setting?
If no, does it mean that we will always throw an exception, for example when NGram min_gram != max_gram?

When we say "soft limit should be 0", what do we mean? Does it mean always throwing an exception when the difference exceeds 0?

@colings86
Contributor Author

@mayya-sharipova I think the soft limit for the difference between min_size and max_size should be a setting that is set on the index as a whole rather than a setting on each ngram token filter individually, and its value should default to 0. We should then check this in the constructor of NGramTokenizerFactory and throw an IllegalArgumentException if the difference exceeds the setting's value.
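The check proposed above can be sketched as follows (Python used as pseudocode for the Java factory logic; the function name and message wording are approximations, not the actual Elasticsearch source):

```python
DEFAULT_MAX_NGRAM_DIFF = 1  # the index-level soft limit's eventual default

def validate_ngram_settings(min_gram, max_gram,
                            max_ngram_diff=DEFAULT_MAX_NGRAM_DIFF):
    """Sketch of the check discussed for NGramTokenizerFactory:
    reject min/max pairs whose spread exceeds the index-level setting."""
    diff = max_gram - min_gram
    if diff > max_ngram_diff:
        # The Java side would throw IllegalArgumentException here.
        raise ValueError(
            f"the difference between max_gram and min_gram ({diff}) "
            f"exceeds the [index.max_ngram_diff] setting ({max_ngram_diff})")

validate_ngram_settings(3, 4)                       # ok: diff of 1 is allowed
validate_ngram_settings(1, 20, max_ngram_diff=19)   # ok once the limit is raised
```

The same shape of check applies to the shingle filter, with min_shingle_size / max_shingle_size and its own limit.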

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Nov 3, 2017
Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
 min_shingle_size in ShingleTokenFilter.  Default is 3.

Throw an IllegalArgumentException when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.

Closes elastic#25887
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Nov 3, 2017
Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
min_shingle_size in ShingleTokenFilter. Default is 3.

Throw an IllegalArgumentException from v7.X when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.
Create a deprecation warning for versions < 7.X for this case.

Closes elastic#25887
mayya-sharipova added a commit that referenced this issue Nov 7, 2017
* Add limits for ngram and shingle settings (#27211)

Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
 min_shingle_size in ShingleTokenFilter.  Default is 3.

Throw an IllegalArgumentException when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.

Closes #25887
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Nov 8, 2017
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Nov 16, 2017
Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
 min_shingle_size in ShingleTokenFilter.  Default is 3.

Log a warning when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the setting's value.

Closes elastic#25887
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Nov 16, 2017
Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
min_shingle_size in ShingleTokenFilter.  Default is 3.

Log a warning when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the setting's value.

Closes elastic#25887
mayya-sharipova added a commit that referenced this issue Nov 17, 2017
Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
 min_shingle_size in ShingleTokenFilter.  Default is 3.

Log a warning when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the setting's value.

Closes #25887
@nealvale

nealvale commented Mar 21, 2018

I feel that this is going to cause issues because of the following:

If I need to do a contains search over something like "AA-PP-001-002" I will be restricted by it. For instance, with min ngram = 3 and max ngram = 4,
the result would be AA-, A-P, ..., AA-P, A-PP, ...
I would not find a string like AA-PP or P-001.
And I don't see any suggested alternatives. I saw the extreme case of max = 6000. Clearly, this should be at the developer's discretion. Deprecating without an alternative is not a solution, IMHO.

@jimczi
Contributor

jimczi commented Mar 21, 2018

Clearly, this should be a developer discretion

The solution implemented just adds a setting to limit the allowed diff between min_ngram and max_ngram to 1 by default. You can change this setting at your discretion ;)

@colings86
Contributor Author

Note that as detailed in the linked PR (#27411), these limits are backed by an index level setting that can be changed if needed, in the case of ngrams the setting is index.max_ngram_diff which controls what the maximum difference between max_ngram and min_ngram can be. The intention with these limits is to help cluster administrators have limits on the usage of their cluster that are sensible for the resources available to the cluster. Use cases with a wide difference between min and max ngrams are few and far between and those use cases can still modify the setting from its default.
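As an illustration (a hypothetical index body: `wide_ngram` is an invented tokenizer name, though `index.max_ngram_diff` is the real setting from the linked PR), an administrator who genuinely needs a wide spread could create the index with the limit raised:

```python
# Hypothetical create-index request body; the administrator raises the
# soft limit so a wider min_gram/max_gram spread is accepted.
create_index_body = {
    "settings": {
        "index": {
            "max_ngram_diff": 7,  # default is 1
            "analysis": {
                "tokenizer": {
                    "wide_ngram": {       # hypothetical tokenizer name
                        "type": "ngram",
                        "min_gram": 3,
                        "max_gram": 10,   # diff of 7 <= max_ngram_diff
                    }
                }
            }
        }
    }
}
```

The point of the default being low is that exceeding it becomes an explicit, auditable decision in the index settings rather than a silent footgun.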

@colings86
Contributor Author

Also note that in your example a query for, say, AA-PP would, after analysis of the query, search for AA-, AA-P, A-P, A-PP, -PP, all of which would be present in a document containing a field with AA-PP-001-002, since the ngram analysis is done at both index and search time.
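This can be checked mechanically; a small Python sketch (plain character ngrams, ignoring Lucene's exact tokenization details):

```python
def ngrams(text, min_size, max_size):
    """Set of character ngrams of text for sizes min_size..max_size."""
    return {text[i:i + size]
            for size in range(min_size, max_size + 1)
            for i in range(len(text) - size + 1)}

indexed = ngrams("AA-PP-001-002", 3, 4)   # index-time terms
query = ngrams("AA-PP", 3, 4)             # search-time terms
print(sorted(query))    # ['-PP', 'A-P', 'A-PP', 'AA-', 'AA-P']
print(query <= indexed) # True: every query-time term is also indexed
```

So a match query for AA-PP against an ngram-analyzed field still finds the document even with the narrow 3..4 size range.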

@nealvale
Copy link

@colings86 - Thanks for your reply. OK, that's great. I was under the impression that the deprecation message indicated that the diff would eventually disappear and we would be forced into using the max diff of 1. As long as this is just a cautionary message, I am fine.

@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024
8 participants