Query DSL: Terms filter to allow for terms lookup from another document #2674

Closed
kimchy opened this issue Feb 22, 2013 · 12 comments

@kimchy
Member

kimchy commented Feb 22, 2013

The terms filter requires providing all the terms as part of the filter itself. This change allows the terms to be extracted automatically from an external document.

Here is an example:

# index the information for user with id 2, specifically, its friends
curl -XPUT localhost:9200/users/user/2 -d '{
   "friends" : ["1", "3"]
}'

# index a tweet, from user with id 2
curl -XPUT localhost:9200/tweets/tweet/1 -d '{
   "user" : "2"
}'

# search on all the tweets that match the friends of user 2
curl -XGET localhost:9200/tweets/_search -d '{
  "query" : {
    "filtered" : {
        "filter" : {
            "terms" : {
                "user" : {
                    "index" : "users",
                    "type" : "user",
                    "id" : "2",
                    "path" : "friends"
                },
                "_cache_key" : "user_2_friends"
            }
        }
    }
  }
}'

The above is highly optimized in two ways: the list of friends is not fetched if the filter is already cached in the filter cache, and an internal LRU cache is used when fetching external values for the terms filter. Also, the entry in the filter cache does not hold all the terms, which reduces the memory it requires.
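
As a rough illustration of the request shape, here is a small Python sketch that assembles the same search body as the curl example. The helper name `terms_lookup_filter` and its parameters are made up for this sketch; only the JSON keys (`index`, `type`, `id`, `path`, `_cache_key`) come from the example above. It only builds and prints the body and does not talk to a cluster.

```python
import json


def terms_lookup_filter(field, index, doc_type, doc_id, path, cache_key=None):
    """Build a filtered-query body whose terms filter looks its terms up
    from another document (hypothetical helper, not part of any client)."""
    terms = {
        field: {
            "index": index,    # index holding the lookup document
            "type": doc_type,  # type of the lookup document
            "id": doc_id,      # id of the lookup document
            "path": path,      # field inside that document holding the terms
        }
    }
    if cache_key is not None:
        # Optional key that makes the cached filter easy to clear later.
        terms["_cache_key"] = cache_key
    return {"query": {"filtered": {"filter": {"terms": terms}}}}


body = terms_lookup_filter(
    "user", "users", "user", "2", "friends", cache_key="user_2_friends"
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the `-d` payload of the search request shown earlier.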

Setting _cache_key is recommended, so that it is simple to clear the cache associated with it using the clear cache API. For example:

curl -XPOST 'localhost:9200/tweets/_cache/clear?filter_keys=user_2_friends'

The structure of the external terms document can also include an array of inner objects, for example:

curl -XPUT localhost:9200/users/user/2 -d '{
   "friends" : [
     {
       "id" : "1"
     },
     {
       "id" : "2"
     }
   ]
}'

In which case, the lookup path will be friends.id.
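
To make the path semantics concrete, here is a minimal Python sketch of how a dotted path such as friends.id could resolve against a document source. This is an assumption about the behaviour described above, not the actual Elasticsearch implementation, and the function name `extract_terms` is invented for the example.

```python
def extract_terms(source, path):
    """Collect every value found at a dotted path, descending into arrays.

    Illustrative only: mimics how "friends" and "friends.id" are described
    to resolve, without claiming to match the server-side code.
    """
    nodes = [source]
    for key in path.split("."):
        next_nodes = []
        for node in nodes:
            # An array of inner objects is searched element by element.
            candidates = node if isinstance(node, list) else [node]
            for item in candidates:
                if isinstance(item, dict) and key in item:
                    next_nodes.append(item[key])
        nodes = next_nodes

    # Flatten a trailing array of plain values into a single list of terms.
    terms = []
    for node in nodes:
        if isinstance(node, list):
            terms.extend(node)
        else:
            terms.append(node)
    return terms


flat_doc = {"friends": ["1", "3"]}
nested_doc = {"friends": [{"id": "1"}, {"id": "2"}]}

print(extract_terms(flat_doc, "friends"))
print(extract_terms(nested_doc, "friends.id"))
```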

There is an additional cache involved, which maps each lookup document to the actual terms extracted from it. It is an LRU cache with a default size of 10mb, which can be set explicitly using indices.cache.filter.terms.size.
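
The idea of an LRU cache in front of the lookup can be sketched as follows. This is a simplified illustration, not the real cache: it bounds the number of entries rather than bytes (the actual setting is a memory size), and the class name `TermsLookupCache` and the `fetch` callback are hypothetical.

```python
from collections import OrderedDict


class TermsLookupCache:
    """Tiny LRU cache keyed by (index, type, id, path).

    Simplification: limited by entry count, whereas the cache described
    above is limited by memory size.
    """

    def __init__(self, max_entries, fetch):
        self.max_entries = max_entries
        self.fetch = fetch  # callable that actually loads the terms
        self.entries = OrderedDict()

    def get(self, index, doc_type, doc_id, path):
        key = (index, doc_type, doc_id, path)
        if key in self.entries:
            # Cache hit: mark as most recently used and skip the fetch.
            self.entries.move_to_end(key)
            return self.entries[key]
        terms = self.fetch(index, doc_type, doc_id, path)
        self.entries[key] = terms
        if len(self.entries) > self.max_entries:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
        return terms


calls = []


def fake_fetch(index, doc_type, doc_id, path):
    # Stand-in for a get request against the lookup document.
    calls.append(doc_id)
    return ["1", "3"]


cache = TermsLookupCache(max_entries=2, fetch=fake_fetch)
cache.get("users", "user", "2", "friends")
cache.get("users", "user", "2", "friends")  # served from the cache
print(calls)
```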

Also, consider using an index with a single shard and fully replicated across all nodes if the "reference" terms data is not large. The lookup terms filter will prefer to execute the get request on a local node if possible, reducing the need for networking.

@kimchy kimchy closed this as completed in 03fdc6a Feb 22, 2013
@Downchuck

This nearly finishes/fixes the feature issue #2671

@telvis07

In this example, shouldn't tweet/1 have user "1" or "3"? This example doesn't return hits for me but it does when I change it to 1 or 3. I have a gist here: https://gist.github.com/telvis07/5469479

@loris

loris commented May 17, 2013

@kimchy Quick question about using this feature vs using the IDs filter
I have a use case where I would need to fetch IDs from an external datastore (mysql and redis) and make some get (with multi get) or search (with the IDs filter) in ElasticSearch against the list of documents matching the IDs.
The amount of IDs per search can vary from some dozens to a few thousands.
That said, will this perform poorly? Should I use the lookup term feature instead (would also mean that I would need to index the IDs and maintain sync with the primary datastores) ?

I will probably implement both for benchmarking purposes, but would love to hear your feedback!

@clintongormley
Contributor

If you index the terms into ES, and use the "external terms filter", you
will get significantly better performance, because:

  1. you greatly reduce the amount of network traffic
  2. you greatly reduce the amount of query parsing
  3. your filter will be cached after the first use, and thus very fast on
    subsequent uses

clint


@junjun-zhang

This is a very useful feature. Just curious whether it's possible to generalize this to support JOIN. In the case of join, the list of lookup terms is not fetched from another document, but rather it's the result of a query from a related document. Replicating this related document in all nodes can also eliminate networking.

I have a use case where I need to embed a particular document under another related document as a nested doc. Because it is a many-to-many relationship, this embedding introduces a huge number of redundant docs. If JOIN were supported, I would not need to embed the actual doc; including a field that holds the related doc IDs would be sufficient.

It seems Solr supports join in a similar fashion: http://wiki.apache.org/solr/Join. It is somewhat limited, but if used properly, it can be very helpful.

@mattweber
Contributor

@junjun-zhang see #3278. Hopefully @martijnvg and @kimchy will get a chance to have a look at this soon.

@brupm

brupm commented Apr 17, 2015

In this example:

curl -XGET localhost:9200/tweets/_search -d '{
  "query" : {
    "filtered" : {
        "filter" : {
            "terms" : {
                "user" : {
                    "index" : "users",
                    "type" : "user",
                    "id" : "2",
                    "path" : "friends"
                }
            }
        }
    }
  }
}'

Say I wanted to pass an array of ids instead of a single id as it's shown "id" : "2"

Reason is I have several documents I want to combine.

@clintongormley
Contributor

@brupm then just use several terms lookup filters, wrapped in a bool.should filter. Doing this lookup is not cheap, so I would prefer not to add syntax that makes it look cheap to the naive user.
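
A sketch of what that suggestion could look like as a request body, built in Python. The helper name `combined_lookup_filter` and the second document id are hypothetical; the structure simply wraps one terms lookup filter per document id in a bool filter's should clause, as described above. Each clause still triggers its own lookup, so the cost grows with the number of ids.

```python
import json


def combined_lookup_filter(field, index, doc_type, doc_ids, path):
    """Wrap one terms lookup filter per lookup document in bool.should
    (hypothetical helper for illustration)."""
    should = [
        {
            "terms": {
                field: {
                    "index": index,
                    "type": doc_type,
                    "id": doc_id,
                    "path": path,
                }
            }
        }
        for doc_id in doc_ids
    ]
    return {"query": {"filtered": {"filter": {"bool": {"should": should}}}}}


# Combine the friends lists of two users; id "5" is made up for the example.
combined = combined_lookup_filter("user", "users", "user", ["2", "5"], "friends")
print(json.dumps(combined, indent=2))
```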

@brupm

brupm commented Apr 25, 2015

Is there an upper limit on how many terms filters I can have wrapped in a bool.should? @clintongormley - thank you!

@clintongormley
Contributor

Probably 1024, which should be more than enough...

@banupriya20

Please provide a suggestion on this: index 1 and index 2 have a common entity (e.g. employee number).
How can I create a join query that searches index 1 and gets the documents from index 2 based on the common entity (employee number)?

@saralamuralikrishna

I am getting only 400 even though the lookup type has 524 documents. Any suggestions on what could be wrong? The query is below:
{"from":0,"size":1000,"sort":[{"Id":{"order":"asc"}}],"query":{"terms":{"Id.Raw":{"index":"myindex","type":"Infos","id":32939,"path":"ArticleNumbers"}}}}

10 participants