Expose proximity boosting #39385
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,214 @@ | ||
[[query-dsl-distance-feature-query]] | ||
=== Distance Feature Query | ||
|
||
The `distance_feature` query is a specialized query that only works | ||
on <<number,`long`>>, <<date, `date`>> or <<geo-point,`geo_point`>> | ||
fields. Its goal is to boost documents' scores based on proximity | ||
to some given origin. For example, use this query if you want to | ||
give more weight to documents with dates closer to a certain date, | ||
or to documents with locations closer to a certain location, | ||
or to documents with a long field closer to a certain number. | ||
|
||
This query is called `distance_feature` query, because it dynamically | ||
calculates distances between the given origin and documents' field values, | ||
and uses these distances as features to boost the documents' scores. | ||
|
||
`distance_feature` query is typically put in a `should` clause of a | ||
Review comment: suggestion: maybe also mention the nearest neighbors use-case, eg. " |
||
<<query-dsl-bool-query,`bool`>> query so that its score is added to the score | ||
of the query. | ||
|
||
Compared to using <<query-dsl-function-score-query,`function_score`>> or other | ||
ways to modify the score, this query has the benefit of being able to | ||
efficiently skip non-competitive hits when | ||
<<search-uri-request,`track_total_hits`>> is not set to `true`. Speedups may be | ||
spectacular. | ||
Review comment: Nice, do we have experiments/numbers/blogs to back this up? No need to change if we haven't but I was wondering if we could add anything in case we have it.
Review comment: @cbuescher I have copied this phrase from rank_feature query which is using the same optimizations, and we also have a blog post on this. This blog post has a link to Lucene benchmarks, but looks like adding this link to these benchmarks would be excessive here.
Review comment: I'd maybe drop the last sentence with |
||
|
||
==== Syntax of distance_feature query | ||
|
||
`distance_feature` query has the following syntax: | ||
[source,js] | ||
-------------------------------------------------- | ||
"distance_feature": { | ||
"field": "my_field", | ||
"origin": <origin>, | ||
"pivot": <pivot>, | ||
"boost" : <boost> | ||
} | ||
-------------------------------------------------- | ||
// NOTCONSOLE | ||
|
||
[horizontal] | ||
`field`:: | ||
Required parameter. Defines the name of the field on which to calculate
Review comment: nit: s/a/the |
||
distances. Must be a field of type `long`, `date`, or `geo_point`, | ||
and must be indexed and have <<doc-values, doc values>>.
Review comment: maybe "must be indexed using <<doc-values, doc values>>" or something similar instead of another "and" clause.
Review comment: Mayya refers to the fact that the field needs both |
||
|
||
`origin`:: | ||
Required parameter. Defines a point of origin used for calculating | ||
distances. Must be a long number for numeric fields, a date for date fields | ||
and a geo point for geo fields. Date math (for example `now-1h`) is | ||
supported for a date origin. | ||
|
||
`pivot`:: | ||
Required parameter. Defines the distance from origin at which the computed | ||
score equals half of the `boost` parameter. Must be a long | ||
number for numeric fields, a `number+date unit` ("1h", "10d",...) for | ||
date fields, and a `number + geo unit` ("1km", "12m",...) for geo fields. | ||
|
||
`boost`:: | ||
Optional parameter with a default value of `1`. Defines the factor by which | ||
to multiply the score. Must be a non-negative float number. | ||
|
||
|
||
The `distance_feature` query computes a document's score as follows: | ||
|
||
`score = boost * pivot / (pivot + distance)` | ||
|
||
where `distance` is the absolute difference between the origin and | ||
a document's field value. For date fields the distance is in | ||
milliseconds; for geo fields the distance is a haversine distance in meters. | ||
Review comment: Units don't matter, do they? |
||
|
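The scoring formula above can be illustrated with a small, self-contained sketch. This is not the Elasticsearch or Lucene implementation; the class and method names are hypothetical, and `double` arithmetic is used purely for readability. It shows the two properties the parameter descriptions rely on: a distance of zero yields the full `boost`, and a distance equal to `pivot` yields half of it.

```java
public class DistanceFeatureScoreSketch {

    /** Hypothetical helper: score = boost * pivot / (pivot + distance). */
    static double score(double boost, double pivot, double distance) {
        if (pivot <= 0) {
            throw new IllegalArgumentException("pivot must be positive");
        }
        if (distance < 0) {
            throw new IllegalArgumentException("distance must be non-negative");
        }
        return boost * pivot / (pivot + distance);
    }

    public static void main(String[] args) {
        // Distance 0: the document sits exactly at the origin.
        System.out.println(score(2.0, 5.0, 0.0)); // prints 2.0
        // Distance equal to pivot: the score is half of boost.
        System.out.println(score(2.0, 5.0, 5.0)); // prints 1.0
    }
}
```

Because `pivot` and `distance` are expressed in the same unit, the ratio is unit-free; only their relative size affects the score.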
||
==== Example using distance_feature query | ||
|
||
Let's look at an example. We index several documents containing | ||
information about sales items, such as name, production date, | ||
price, and location. | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
PUT items | ||
{ | ||
"mappings": { | ||
"properties": { | ||
"item_name": { | ||
"type": "keyword" | ||
}, | ||
"item_production_date": { | ||
"type": "date" | ||
}, | ||
"item_price": { | ||
"type": "long" | ||
}, | ||
"item_location": { | ||
"type": "geo_point" | ||
} | ||
} | ||
} | ||
} | ||
|
||
PUT items/_doc/1 | ||
{ | ||
"item_name" : "chocolate", | ||
"item_production_date": "2018-02-01", | ||
"item_price": 22, | ||
"item_location": [-71.34, 41.12] | ||
} | ||
|
||
PUT items/_doc/2 | ||
{ | ||
"item_name" : "chocolate", | ||
"item_production_date": "2018-01-01", | ||
"item_price": 25, | ||
"item_location": [-71.3, 41.15] | ||
} | ||
|
||
|
||
PUT items/_doc/3 | ||
{ | ||
"item_name" : "chocolate", | ||
"item_production_date": "2017-12-01", | ||
"item_price": 19, | ||
"item_location": [-71.3, 41.12] | ||
} | ||
|
||
POST items/_refresh | ||
-------------------------------------------------- | ||
// CONSOLE | ||
|
||
We look for all chocolate items, but we also want chocolates | ||
Review comment: nit: Maybe in all three cases start with "We can look", or "We now want to look" to make it less repetitive? |
||
whose price is closer to our origin price to come first in the result list. | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
GET items/_search | ||
{ | ||
"query": { | ||
"bool": { | ||
"must": { | ||
"match": { | ||
"item_name": "chocolate" | ||
} | ||
}, | ||
"should": { | ||
"distance_feature": { | ||
"boost": 2, | ||
"field": "item_price", | ||
"pivot": 5, | ||
"origin": 15 | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// CONSOLE | ||
// TEST[continued] | ||
|
||
|
||
We look for all chocolate items, but we also want chocolates | ||
that are produced recently (closer to the date `now`) | ||
to be ranked higher. | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
GET items/_search | ||
{ | ||
"query": { | ||
"bool": { | ||
"must": { | ||
"match": { | ||
"item_name": "chocolate" | ||
} | ||
}, | ||
"should": { | ||
"distance_feature": { | ||
"field": "item_production_date", | ||
"pivot": "7d", | ||
"origin": "now" | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// CONSOLE | ||
// TEST[continued] | ||
|
||
We look for all chocolate items, but we also want chocolates | ||
that are produced locally (closer to our geo origin) | ||
to come first in the result list. | ||
|
||
[source,js] | ||
-------------------------------------------------- | ||
GET items/_search | ||
{ | ||
"query": { | ||
"bool": { | ||
"must": { | ||
"match": { | ||
"item_name": "chocolate" | ||
} | ||
}, | ||
"should": { | ||
"distance_feature": { | ||
"field": "item_location", | ||
"pivot": "1000m", | ||
"origin": [-71.3, 41.15] | ||
} | ||
} | ||
} | ||
} | ||
} | ||
-------------------------------------------------- | ||
// CONSOLE | ||
// TEST[continued] |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -28,6 +28,12 @@ the specified document. | |
A query that computes scores based on the values of numeric features and is | ||
able to efficiently skip non-competitive hits. | ||
|
||
<<query-dsl-distance-feature-query,`distance_feature` query>>:: | ||
|
||
A query that computes scores based on the dynamically computed distances | ||
between the origin and documents' `long`, `date` or `geo_point` fields.
Review comment: maybe "geo-point" instead of "geo" like above
Review comment: should it mention dates? |
||
It is able to efficiently skip non-competitive hits. | ||
|
||
<<query-dsl-wrapper-query,`wrapper` query>>:: | ||
|
||
A query that accepts other queries as json or yaml string. | ||
|
@@ -42,4 +48,6 @@ include::percolate-query.asciidoc[] | |
|
||
include::rank-feature-query.asciidoc[] | ||
|
||
include::distance-feature-query.asciidoc[] | ||
|
||
include::wrapper-query.asciidoc[] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,87 @@ | ||
setup: | ||
- skip: | ||
version: " - 7.9.99" #TODO adjust to 7.0.99 after merging to 7.x | ||
reason: "Implemented in 7.1" | ||
|
||
- do: | ||
indices.create: | ||
index: index1 | ||
body: | ||
settings: | ||
number_of_replicas: 0 | ||
mappings: | ||
properties: | ||
my_long: | ||
type: long | ||
my_date: | ||
type: date | ||
my_geo: | ||
type: geo_point | ||
|
||
- do: | ||
bulk: | ||
refresh: true | ||
body: | ||
- '{ "index" : { "_index" : "index1", "_id" : "1" } }' | ||
- '{ "my_long" : 22, "my_date": "2018-02-01T10:00:30Z", "my_geo": [-71.34, 41.13] }' | ||
- '{ "index" : { "_index" : "index1", "_id" : "2" } }' | ||
- '{ "my_long" : 25, "my_date": "2018-02-01T11:00:30Z", "my_geo": [-71.34, 41.14] }' | ||
- '{ "index" : { "_index" : "index1", "_id" : "3" } }' | ||
- '{ "my_long" : 19, "my_date": "2018-02-01T09:00:30Z", "my_geo": [-71.34, 41.12] }' | ||
|
||
--- | ||
"test distance_feature query on long type": | ||
|
||
- do: | ||
search: | ||
rest_total_hits_as_int: true | ||
index: index1 | ||
body: | ||
query: | ||
distance_feature: | ||
field: my_long | ||
pivot: 5 | ||
origin: 15 | ||
|
||
- length: { hits.hits: 3 } | ||
- match: { hits.hits.0._id: "3" } | ||
- match: { hits.hits.1._id: "1" } | ||
- match: { hits.hits.2._id: "2" } | ||
|
||
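As a sanity check on the hit order asserted in the `long` test above, the following sketch recomputes the ranking from the documented formula `score = boost * pivot / (pivot + distance)`. The class and helper names are hypothetical and the code only mirrors the test data (`my_long` values 22, 25 and 19, `origin` 15, `pivot` 5); it does not call Elasticsearch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LongRankingSketch {

    /** Score for a long field: distance is |value - origin|. */
    static double score(double boost, double pivot, long origin, long value) {
        return boost * pivot / (pivot + Math.abs(value - origin));
    }

    /** Returns document ids ordered by descending score. */
    static List<String> rank(Map<String, Long> docs, long origin, double pivot) {
        List<String> ids = new ArrayList<>(docs.keySet());
        ids.sort(Comparator.comparingDouble(
                (String id) -> score(1.0, pivot, origin, docs.get(id))).reversed());
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Long> docs = new LinkedHashMap<>();
        docs.put("1", 22L); // distance 7
        docs.put("2", 25L); // distance 10
        docs.put("3", 19L); // distance 4
        System.out.println(rank(docs, 15L, 5.0)); // prints [3, 1, 2]
    }
}
```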
--- | ||
"test distance_feature query on date type": | ||
|
||
- do: | ||
search: | ||
rest_total_hits_as_int: true | ||
index: index1 | ||
body: | ||
query: | ||
distance_feature: | ||
field: my_date | ||
pivot: 1h | ||
origin: 2018-02-01T08:00:30Z | ||
|
||
- length: { hits.hits: 3 } | ||
- match: { hits.hits.0._id: "3" } | ||
- match: { hits.hits.1._id: "1" } | ||
- match: { hits.hits.2._id: "2" } | ||
|
||
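The date test above can be checked the same way. The docs in this PR state that date distances are measured in milliseconds, so a `pivot` of `1h` corresponds to 3,600,000 ms. The sketch below uses `java.time` with hypothetical names and is only an illustration of the arithmetic, not of how Elasticsearch parses `pivot` or `origin`.

```java
import java.time.Duration;
import java.time.Instant;

public class DateDistanceSketch {

    /** Score for a date field: distance is the absolute difference in milliseconds. */
    static double score(double boost, Duration pivot, Instant origin, Instant value) {
        double pivotMs = pivot.toMillis();
        double distanceMs = Math.abs(Duration.between(origin, value).toMillis());
        return boost * pivotMs / (pivotMs + distanceMs);
    }

    public static void main(String[] args) {
        Instant origin = Instant.parse("2018-02-01T08:00:30Z");
        Duration pivot = Duration.ofHours(1);
        // Document 3 is exactly one pivot away from the origin, so it scores boost / 2.
        System.out.println(score(1.0, pivot, origin, Instant.parse("2018-02-01T09:00:30Z"))); // prints 0.5
    }
}
```

Documents 1 and 2 are two and three hours from the origin, so they score lower than document 3, which matches the asserted order 3, 1, 2.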
--- | ||
"test distance_feature query on geo_point type": | ||
|
||
- do: | ||
search: | ||
rest_total_hits_as_int: true | ||
index: index1 | ||
body: | ||
query: | ||
distance_feature: | ||
field: my_geo | ||
pivot: 1km | ||
origin: [-71.35, 41.12] | ||
|
||
- length: { hits.hits: 3 } | ||
- match: { hits.hits.0._id: "3" } | ||
- match: { hits.hits.1._id: "1" } | ||
- match: { hits.hits.2._id: "2" } |
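For the `geo_point` test above, the docs in this PR say the distance is a haversine distance in meters. The sketch below is a textbook haversine computation with hypothetical names; the Earth radius constant is an assumption and may differ slightly from what Lucene uses, so treat the absolute distances as approximate. Note that `geo_point` arrays are written as `[lon, lat]`, while the helper takes latitude first.

```java
public class HaversineSketch {

    // Assumed mean Earth radius in meters; the exact constant used by Lucene may differ.
    static final double EARTH_RADIUS_METERS = 6371008.7714;

    /** Great-circle distance in meters between two points given in degrees. */
    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Origin [-71.35, 41.12] is lon, lat; the three documents share lon -71.34.
        double d1 = haversineMeters(41.12, -71.35, 41.13, -71.34);
        double d2 = haversineMeters(41.12, -71.35, 41.14, -71.34);
        double d3 = haversineMeters(41.12, -71.35, 41.12, -71.34);
        // Smaller distance means higher score, so the expected order is 3, 1, 2.
        System.out.println(d3 < d1 && d1 < d2);
    }
}
```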
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -545,6 +545,16 @@ private static GeoPoint parseGeoHash(GeoPoint point, String geohash, EffectivePo | |
} | ||
} | ||
|
||
public static GeoPoint parseFromString(String val) {
Review comment: Can you add one or two lines of javadoc explaining what type of String input this parses under which conditions? e.g, that what type of Strings are expected, that the "," serves as a cue to whether this is a geohash or not. Maybe it's also worth checking some edge conditions and throwing errors if they are not met (e.g. empty String etc...).
Review comment: also nit: whitespace between "){"
Review comment: @cbuescher I have added the documentation for |
||
GeoPoint point = new GeoPoint(); | ||
boolean ignoreZValue = false; | ||
if (val.contains(",")) { | ||
return point.resetFromString(val, ignoreZValue); | ||
} else { | ||
return parseGeoHash(point, val, EffectivePoint.BOTTOM_LEFT); | ||
} | ||
} | ||
|
||
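The dispatch in `parseFromString` above uses the presence of a comma as the cue for a coordinate string versus a geohash. The following standalone sketch shows that decision together with the kind of edge-case check suggested in review. It is illustrative only: `GeoStringFormatSketch`, `Format` and `classify` are hypothetical names, and the sketch does not parse coordinates or touch the real `GeoPoint` class.

```java
public class GeoStringFormatSketch {

    /** Hypothetical labels for the two input shapes the method distinguishes. */
    enum Format { LAT_LON, GEOHASH }

    /**
     * Classifies a geo point string: a value containing "," is treated as
     * comma-separated coordinates, anything else as a geohash.
     */
    static Format classify(String val) {
        if (val == null || val.trim().isEmpty()) {
            throw new IllegalArgumentException("geo point string must not be null or empty");
        }
        return val.contains(",") ? Format.LAT_LON : Format.GEOHASH;
    }

    public static void main(String[] args) {
        System.out.println(classify("41.12,-71.34")); // prints LAT_LON
        System.out.println(classify("drm3btev3e86")); // prints GEOHASH
    }
}
```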
/** | ||
* Parse a precision that can be expressed as an integer or a distance measure like "1km", "10m". | ||
* | ||
|
Review comment: does it work for `date_nanos` too?