Synonyms break fuzziness #25518

Closed
jjfalling opened this issue Jul 3, 2017 · 13 comments
Labels
>bug :Search Relevance/Analysis How text is split into tokens :Search/Search Search-related issues that do not fall into other categories Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@jjfalling

Elasticsearch version: 5.1

Plugins installed: []

JVM version (java -version): Oracle 1.8

OS version (uname -a if on a Unix-like system): OSX

Description of the problem including expected versus actual behavior:

When the search term appears in a synonym list, a match query with a fuzziness value no longer returns documents that it would otherwise match fuzzily. Once the search term is removed from the synonyms list, the query returns the expected document.

Below is a reproduction of the issue. This occurs if the search analyzer is specified in the query or if it is defined on the field in the mapping.

PUT /test_index
{  
   "settings":{  
      "analysis":{  
         "analyzer":{  
            "synonym":{  
               "tokenizer":"standard",
               "filter":[
                  "apostrophe",
                  "synonym"
               ]
            }
         },
         "filter":{  
            "synonym":{  
               "type":"synonym",
               "synonyms":[  
                  "alf,fred"
               ]
            }
         }
      }
   }
}

PUT /test_index/person/1
{
  "name": "ali"
}

GET /test_index/person/_search
{  
   "query":{  
      "match":{  
         "name":{  
            "query":"alf",
            "analyzer":"synonym",
            "fuzziness":"AUTO"
         }
      }
   }
}
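
The reported behavior can be illustrated with a toy model in Python. This is only a sketch, not Elasticsearch code: `toy_match` and the `SYNONYMS` table are hypothetical names, the "synonym rule fired, so fuzziness is skipped" branch is an assumption based on how the behavior is described in this thread, and plain Levenshtein distance is used (no transpositions). The `AUTO` thresholds are the documented defaults (0 edits for 1-2 characters, 1 edit for 3-5, 2 edits beyond that).

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def auto_fuzziness(term: str) -> int:
    """Edits allowed by "fuzziness": "AUTO" with its default thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


# Hypothetical lookup table standing in for the "alf,fred" synonym rule.
SYNONYMS = {"alf": {"alf", "fred"}, "fred": {"alf", "fred"}}


def toy_match(query_term: str, indexed_term: str, use_synonyms: bool) -> bool:
    expanded = SYNONYMS.get(query_term) if use_synonyms else None
    if expanded:
        # Assumption taken from this thread: once a synonym rule fires,
        # the expanded terms are matched exactly and fuzziness is skipped.
        return indexed_term in expanded
    return levenshtein(query_term, indexed_term) <= auto_fuzziness(query_term)


print(toy_match("alf", "ali", use_synonyms=False))  # fuzzy path
print(toy_match("alf", "ali", use_synonyms=True))   # synonym path
```

In this model "ali" is one edit away from "alf", so it matches on the fuzzy path, but it is neither "alf" nor "fred", so it is missed once the synonym rule applies.
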
@jpountz
Contributor

jpountz commented Jul 4, 2017

I'm not sure we should do anything. I think a popular use-case of fuzzy queries is to search regardless of potential typos. If I'm assuming that the query is correct and I want to find docs that have the term mistyped, then applying synonyms might make sense. On the other hand, if the term is mistyped and I want to find documents that have the correct typing, then applying synonyms does not sound like a good idea anymore.

Another perspective on this issue: if you apply synonyms, it means that you want to search based on the meaning of words, and it is likely that you are also applying stemming so that, for instance, plurals and singulars are collapsed into the same token. However, stemming makes little sense in combination with fuzzy queries; otherwise you could end up with weird things such as "jumping" being considered at distance 2 from "jam" because the -ing suffix is removed. It gets even worse when the stemmer also changes letters, e.g. English stemmers often replace a trailing y with i.
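
The distance argument can be checked with a short Python sketch. `naive_stem` is a hypothetical toy stemmer written for this illustration (it only strips "-ing" and rewrites a trailing "y" to "i"); it is not the stemmer Elasticsearch or Lucene uses, and the distance is plain Levenshtein.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def naive_stem(word: str) -> str:
    """Hypothetical toy stemmer, for illustration only."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]          # jumping -> jump
    if word.endswith("y") and len(word) > 2:
        return word[:-1] + "i"    # happy -> happi
    return word


# Distance between the surface forms versus the stemmed token.
print(levenshtein("jumping", "jam"))              # unstemmed
print(levenshtein(naive_stem("jumping"), "jam"))  # after stemming
# A stemmer that changes letters moves a word away from its own spelling.
print(levenshtein("happy", naive_stem("happy")))
```

Stemming shrinks the gap between "jumping" and "jam" from 5 edits to 2, which is within the maximum fuzziness of 2, while "happy" ends up one edit away from its own indexed form.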

@tobymiller

In which case surely synonyms and fuzziness should be completely incompatible, and throw an error, rather than silently not applying fuzziness to words with synonyms?

In my use case, which prompted this bug report, we definitely need to do both. If synonyms and fuzziness become incompatible, we would bool together one query for each, but this would confuse the scoring in correctly spelt cases, so it is not ideal.

@jpountz
Contributor

jpountz commented Jul 4, 2017

Yeah, there are multiple cases like that, for instance prefix queries make little sense if the analyzer has edge ngrams, wildcard queries make little sense if the analyzer has a stemmer, etc. but analyzers are totally opaque (for good reasons) so we can't check for which filters they wrap.

@tobymiller

Fair enough. It does seem that synonyms and fuzziness have a sensible way that they could behave together, even if in most cases it's a bad idea, though.

@cbuescher
Member

@jpountz after reading your comments, I don't know if we should keep this issue open. Do you think there is anything we should do here? Or maybe add another round of discussion?

@cbuescher cbuescher added :Search Relevance/Analysis How text is split into tokens :Search/Search Search-related issues that do not fall into other categories labels Jul 10, 2017
@jpountz
Contributor

jpountz commented Jul 10, 2017

I understand this might be a bit controversial, so I did not want to close the issue right away. @jimczi Do you have an opinion about this?

@al

al commented Jul 25, 2017

FWIW I'd like to add that I also have a use case where I'd like to be able to perform synonym expansion followed by fuzzy matching. Basically, I'd like to define the synonyms google,alphabet so that a search for google would match google and alphabet, but also gooogle, alpabet, etc.

@jimczi
Contributor

jimczi commented Jul 28, 2017

Sorry, I missed the ping. I agree with @jpountz: applying fuzziness to query-time synonyms sounds weird to me. If a synonym rule matches an input, it means that the input is correctly spelt and that the expansion should match exact terms. The other problem is that we can't differentiate the original term(s) from their synonyms after the analysis, so we can't apply fuzziness to the input terms only (which IMO would be an acceptable solution), and it becomes even harder if the synonym is multi-word.
The workaround is to use index-time synonyms, which index both terms google and alphabet whenever google or alphabet is encountered in a document. Fuzziness then works for a query like google, as expected, but also for a query like alpabet with fuzziness enabled.
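
The workaround can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not Elasticsearch code: `index_time_expand`, `fuzzy_match`, and `SYNONYM_GROUPS` are hypothetical names, the query side applies no synonyms at all, plain Levenshtein distance is used, and the `AUTO` thresholds are the documented defaults.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def auto_fuzziness(term: str) -> int:
    """Edits allowed by "fuzziness": "AUTO" with its default thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


# Hypothetical stand-in for the "google,alphabet" synonym rule.
SYNONYM_GROUPS = [{"google", "alphabet"}]


def index_time_expand(token: str) -> set:
    """Index every member of the synonym group the token belongs to."""
    for group in SYNONYM_GROUPS:
        if token in group:
            return set(group)
    return {token}


def fuzzy_match(query_term: str, indexed_terms: set) -> bool:
    """Query side: no synonyms, only fuzziness against the indexed terms."""
    max_edits = auto_fuzziness(query_term)
    return any(levenshtein(query_term, t) <= max_edits for t in indexed_terms)


indexed = index_time_expand("google")  # document text contains only "google"
for query in ("google", "gooogle", "alpabet", "amazon"):
    print(query, fuzzy_match(query, indexed))
```

Because the expansion happens when the document is indexed, the misspelt query "alpabet" can still reach a document that only mentions "google", and the query analyzer never has to tell original terms from synonyms.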

@romseygeek
Contributor

cc @elastic/es-search-aggs

@romseygeek
Contributor

It looks as though we have a consensus here that we want to keep the existing behaviour. Shall I close this @jimczi @jpountz ?

@jpountz
Contributor

jpountz commented Apr 3, 2019

Maybe make it a documentation issue @romseygeek. My mind hasn't changed but I can understand how this can be confusing.

@romseygeek
Contributor

I documented this in #40783

@zainabtareen

@romseygeek Suppose the query contains goooglerandom and the term is indexed as google. I apply edge_ngram to separate gooogle out of the string. Now I need to apply fuzziness to it so that it matches google. What would you suggest?

@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 16, 2024

10 participants