Ngram/Edgengram filters don't work with keyword repeat filters #22478
With the tokenizer from your configuration, the output is:

```json
{
  "tokens" : [ {
    "token" : "Is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
```

And then if you add the folding and keyword_repeat filters, the output is:

```json
{
  "tokens" : [ {
    "token" : "is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "is",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "this",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "deja",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "vu",
    "start_offset" : 13,
    "end_offset" : 15,
    "type" : "<ALPHANUM>",
    "position" : 3
  } ]
}
```

If you then try to do edge ngrams on top of that, the edge ngram filter processes both copies of each token (it does not respect the keyword attribute), so you still don't keep the original tokens. If you want to keep the whitespace, perhaps inject a shingle token filter (with appropriate settings).
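For illustration, an _analyze request along the following lines (written in the current request syntax rather than the 2.x form, with an assumed filter order) reproduces the duplicated tokens shown in the second output:

```json
GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "asciifolding", "keyword_repeat" ],
  "text": "Is this déjà vu?"
}
```

keyword_repeat marks one copy of each token with the keyword attribute so that later filters can leave it untouched; stemmers honour that marker, but the ngram and edge ngram filters do not, which is the root of this issue.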
I've run into this same issue before. @mikemccand what do you think?
FWIW, my current workaround is to always use the language-specific analysis field when I think I'm searching in a non-whitespace-separated language (though I don't really trust my language detection), and also to use the language-specific field whenever the text is less than 3 chars or the trailing word is less than 3 chars (e.g. a search like "math pi"). Shingle tokens as a workaround would still have the same problem of not letting me have sub-3-char tokens, I think, and I suspect they would blow up the index size even more than including 1- and 2-char edgengrams would. BTW, if we change this, can it be easily backported to 2.x? ;)
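As a rough sketch of the two-field setup this workaround implies (all names are made up, and this uses current mapping syntax rather than 2.x): the main field carries an edgengram analyzer, a sub-field carries a plain language analyzer, and the query side picks between them.

```json
PUT lang_fallback_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edgegrams": { "type": "edge_ngram", "min_gram": 3, "max_gram": 15 }
      },
      "analyzer": {
        "edgengram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding", "my_edgegrams" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "edgengram_analyzer",
        "fields": {
          "lang": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
}
```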
Heh, and now we've found a case where my workarounds don't work: "Game of Thrones". Any updates here? I guess I could just switch to edgengrams starting from 1 char, but that seems likely to cause lots of inefficiencies. Shingle tokens sound interesting (and might even improve relevancy) but would also significantly increase index size.
Another idea (for anyone following along): I could have one edgengrams field per language and then specify a language analyzer that has stop words for that language. That would fix the worst cases, but still not fix something like "pi".
@gibrown Can you please confirm which tokens you expect when you index "Is this déjà vu?" For example, should a token like "Is t" be produced?
cc @elastic/es-search-aggs
For edgengrams on "Is this déjà vu?" I would only expect the per-word tokens (the ones listed in the issue description); "is t" and "is th" would not be in the index.
No, we are using the icu_tokenizer; we are indexing across all languages. Technically we should even be using special tokenization for Japanese, Korean, and Chinese so we get the tokenization correct there. Thanks for taking a look. The workaround we have deployed is to search both the edgengram field and an ICU-tokenized field that doesn't have any ngrams, using a multi_match query with cross_fields as the type and AND as the operator. It makes for a more expensive query, but it kinda works.
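A minimal sketch of that kind of query (the index and field names here are made up, not the actual ones used):

```json
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "game of thro",
      "type": "cross_fields",
      "operator": "and",
      "fields": [ "content.edgengrams", "content.icu" ]
    }
  }
}
```

The cross_fields type lets the AND operator be satisfied across both fields, so a short trailing term that never made it into the edgengram field can still match via the plain ICU-tokenized field.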
@gibrown Since you have found a workaround, would you mind if I close this issue?
I still think that some way to index edgengrams from X to Y chars plus the original token would be a very worthwhile improvement; I would use it if it were available, and keyword_repeat is still the closest approximation. My workaround breaks when I try to do a phrase match, for instance. Technically, what I would love is a clearer syntax that lets me define multiple flows for extracting tokens.
That would let me do an AND match on multiple tokens as well as a phrase match; having them in multiple fields has a number of drawbacks.
I've been doing some work on making branches possible in TokenStreams (see https://issues.apache.org/jira/browse/LUCENE-8273). If that were combined with a generalisation of KeywordRepeatFilter, we could build an analysis chain that looked something like:
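Purely as an illustration of that idea, such a branching chain might hypothetically be declared like this (the hypothetical_branch filter type and its parameters are invented for illustration and are not a real Elasticsearch API):

```json
{
  "analysis": {
    "filter": {
      "my_edgegrams": { "type": "edge_ngram", "min_gram": 3, "max_gram": 15 },
      "grams_or_original": {
        "type": "hypothetical_branch",
        "branches": [
          [ "lowercase", "asciifolding" ],
          [ "lowercase", "asciifolding", "my_edgegrams" ]
        ]
      }
    }
  }
}
```

Each incoming token would be copied down both branches, so the index would end up with the original token alongside its edge ngrams.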
@romseygeek I love the idea of being able to have multiple paths for processing tokens; I think this would help in a lot of cases I've seen. It feels like the analysis syntax would need a bit more structure than it currently has to handle this sort of thing.
We had exactly the same issue; the problem is that not all filters respect the keyword attribute that keyword_repeat sets.
Added in #31208
Very excited that this is in 6.4. Thanks @romseygeek, nice work.
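As far as I understand, #31208 added the multiplexer token filter in 6.4, which runs each token through several parallel filter chains and keeps the original. A sketch of how it could be configured for the 3-15 char edgengram case described in this issue (index, filter, and analyzer names are made up):

```json
PUT edgengram_multiplex_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edgegrams": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 15
        },
        "grams_plus_original": {
          "type": "multiplexer",
          "filters": [ "my_edgegrams" ],
          "preserve_original": true
        }
      },
      "analyzer": {
        "edgegrams_and_original": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding", "grams_plus_original" ]
        }
      }
    }
  }
}
```

With that chain, tokens shorter than min_gram (such as "is" or "pi") survive in their original form, and longer tokens get both their 3-15 char edge grams and the untouched original.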
Elasticsearch version: 2.3.3
Plugins installed: [analysis-icu, analysis-smartcn, delete-by-query, lang-javascript, whatson, analysis-kuromoji, analysis-stempel, elasticsearch-inquisitor, head, langdetect, statsd]
Description of the problem including expected versus actual behavior:
I want to index edgengrams from 3 to 15 chars, but also keep the original token in the field. This is being used for search-as-you-type functionality. For both speed and relevancy reasons we've settled on 3 as the minimum number of chars that makes sense, but that leaves some gaps for non-whitespace-separated languages and for words like 'pi'.
I thought I could do this using keyword_repeat and unique filters in my analyzer, but that doesn't seem to work with edgengram filters. Maybe I'm doing it wrong, but I haven't come up with a workaround yet.
Steps to reproduce:
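A minimal sketch of the kind of analyzer this report describes (keyword_repeat plus an edge_ngram filter plus unique, written in current syntax with made-up names; the original steps may have differed in detail):

```json
PUT edgengram_repeat_test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edgegrams": { "type": "edge_ngram", "min_gram": 3, "max_gram": 15 }
      },
      "analyzer": {
        "edgegrams_with_original": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding", "keyword_repeat", "my_edgegrams", "unique" ]
        }
      }
    }
  }
}

GET edgengram_repeat_test/_analyze
{
  "analyzer": "edgegrams_with_original",
  "text": "Is this déjà vu?"
}
```

Because the edge_ngram filter ignores the keyword attribute that keyword_repeat sets, both copies of every token get turned into grams, so the original token is never kept, which is the behaviour this issue reports.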
Output:
I'd expect to get the tokens: is, thi, this, dej, deja, vu
The problem gets worse for non-whitespace-separated languages, where tokenization often produces one character per token.
I could search across multiple fields, but that prevents me from matching on phrases and using those phrase matches to boost results. For instance, if the user types in "hi ther", we should be able to match instances where the content contains "hi there" and use that to boost those exact matches. We do this by adding a simple should clause:
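A rough sketch of that kind of should clause (field name, query text, and boost value are made up):

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content.edgengrams": { "query": "hi ther", "operator": "and" } } }
      ],
      "should": [
        { "match_phrase": { "content.edgengrams": { "query": "hi ther", "boost": 2 } } }
      ]
    }
  }
}
```

The must clause still requires all of the typed terms to be present, while documents where the phrase also matches pick up the extra boost.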