Expose proximity boosting #39385

Merged
214 changes: 214 additions & 0 deletions docs/reference/query-dsl/distance-feature-query.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,214 @@
[[query-dsl-distance-feature-query]]
=== Distance Feature Query

The `distance_feature` query is a specialized query that only works
on <<number,`long`>>, <<date, `date`>> or <<geo-point,`geo_point`>>
Review comment (Contributor):
does it work for date_nanos too?

fields. Its goal is to boost documents' scores based on proximity
to some given origin. For example, use this query if you want to
give more weight to documents with dates closer to a certain date,
or to documents with locations closer to a certain location,
or to documents with a long field closer to a certain number.

This query is called `distance_feature` query, because it dynamically
calculates distances between the given origin and documents' field values,
and uses these distances as features to boost the documents' scores.

`distance_feature` query is typically put in a `should` clause of a
Review comment (Contributor):
suggestion: maybe also mention the nearest neighbors use-case, eg. "distance_feature query is typically used on its own to find the nearest neighbors to a given point, or put in a should clause [...]"

<<query-dsl-bool-query,`bool`>> query so that its score is added to the score
of the query.

Compared to using <<query-dsl-function-score-query,`function_score`>> or other
ways to modify the score, this query has the benefit of being able to
efficiently skip non-competitive hits when
<<search-uri-request,`track_total_hits`>> is not set to `true`. Speedups may be
spectacular.
Review comment (Member):
Nice, do we have experiments/numbers/blogs to back this up? No need to change if we haven't but I was wondering if we could add anything in case we have it.

Review comment (Contributor Author):
@cbuescher I have copied this phrase from rank_feature query which is using the same optimizations, and we also have a blog post on this. This blog post has a link to Lucene benchmarks, but looks like adding this link to these benchmarks would be excessive here.

Review comment (Contributor):
I'd maybe drop the last sentence with spectacular since it's going to be a bit less efficient than rank_feature due to the dynamic nature of the feature (it's computed on the fly).


==== Syntax of distance_feature query

`distance_feature` query has the following syntax:
[source,js]
--------------------------------------------------
"distance_feature": {
    "field": "my_field",
    "origin": <origin>,
    "pivot": <pivot>,
    "boost": <boost>
}
--------------------------------------------------
// NOTCONSOLE

[horizontal]
`field`::
Required parameter. Defines a name of the field on which to calculate
Review comment (Member):
nit: s/a/the

distances. Must be a field of type `long`, `date`, or `geo_point`,
and must be indexed and have <<doc-values, doc values>>.
Review comment (Member):
maybe "must be indexed using <<doc-values, doc values>>" or something similar instead of another "and" clause.

Review comment (Contributor):
Mayya refers to the fact that the field needs both indexed:true and doc_values:true. Maybe we could be more explicit by saying eg "[...] and must be indexed (index: true, which is the default) and have doc values (doc_values: true, which is the default too)."


`origin`::
Required parameter. Defines a point of origin used for calculating
distances. Must be a long number for numeric fields, a date for date fields,
and a geo point for geo fields. Date math (for example `now-1h`) is
supported for a date origin.

`pivot`::
Required parameter. Defines the distance from origin at which the computed
score will equal half of the `boost` parameter. Must be a long
number for numeric fields, a `number+date unit` ("1h", "10d",...) for
date fields, and a `number + geo unit` ("1km", "12m",...) for geo fields.

`boost`::
Optional parameter with a default value of `1`. Defines the factor by which
to multiply the score. Must be a non-negative float number.


The `distance_feature` query computes a document's score as follows:

`score = boost * pivot / (pivot + distance)`

where `distance` is the absolute difference between the origin and
a document's field value. For date fields the distance is in
milliseconds; for geo fields the distance is a haversine distance in meters.
Review comment (Contributor):
Units don't matter, do they?
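
The scoring formula above can be sketched in a few lines of Java. This is only an illustration of the documented arithmetic, not the Lucene or Elasticsearch implementation: the class name `DistanceFeatureScoreSketch`, its method names, the argument checks, and the sample field values are all assumptions made for the example. It also shows why the choice of unit is not significant as long as `pivot` and `distance` share it, since scaling both by the same factor leaves the ratio unchanged.

```java
// A minimal sketch of score = boost * pivot / (pivot + distance).
// Hypothetical helper class; not part of Elasticsearch or Lucene.
final class DistanceFeatureScoreSketch {

    /** Applies the documented formula. The argument checks are an assumption of this sketch. */
    static double score(double boost, double pivot, double distance) {
        if (pivot <= 0 || distance < 0 || boost < 0) {
            throw new IllegalArgumentException("pivot must be > 0; distance and boost must be >= 0");
        }
        return boost * pivot / (pivot + distance);
    }

    /** Distance for a long field: absolute difference between origin and value (overflow is ignored here). */
    static long longDistance(long origin, long value) {
        return Math.abs(origin - value);
    }

    public static void main(String[] args) {
        long origin = 15;
        double pivot = 5;
        double boost = 2;
        long[] values = {22, 25, 19}; // hypothetical field values

        for (long value : values) {
            long distance = longDistance(origin, value);
            System.out.println("value=" + value + " distance=" + distance
                + " score=" + score(boost, pivot, distance));
        }

        // At distance == pivot the score is half of the boost.
        System.out.println("score at pivot: " + score(boost, pivot, pivot));

        // Rescaling pivot and distance together (for example ms to s) does not change the score.
        System.out.println(score(boost, 5_000, 7_000) == score(boost, 5, 7));
    }
}
```

With these sample values the value closest to the origin (19) gets the highest score and the farthest (25) the lowest.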


==== Example using distance_feature query

Let's look at an example. We index several documents containing
information about sales items, such as name, production date,
price, and location.

[source,js]
--------------------------------------------------
PUT items
{
  "mappings": {
    "properties": {
      "item_name": {
        "type": "keyword"
      },
      "item_production_date": {
        "type": "date"
      },
      "item_price": {
        "type": "long"
      },
      "item_location": {
        "type": "geo_point"
      }
    }
  }
}

PUT items/_doc/1
{
  "item_name" : "chocolate",
  "item_production_date": "2018-02-01",
  "item_price": 22,
  "item_location": [-71.34, 41.12]
}

PUT items/_doc/2
{
  "item_name" : "chocolate",
  "item_production_date": "2018-01-01",
  "item_price": 25,
  "item_location": [-71.3, 41.15]
}

PUT items/_doc/3
{
  "item_name" : "chocolate",
  "item_production_date": "2017-12-01",
  "item_price": 19,
  "item_location": [-71.3, 41.12]
}

POST items/_refresh
--------------------------------------------------
// CONSOLE

We look for all chocolate items, but we also want chocolates
Review comment (Member):
nit: Maybe in all three cases start with "We can look", or "We now want to look" to make it less repetitive?

whose prices are closer to our origin price to come first in the result list.

[source,js]
--------------------------------------------------
GET items/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "item_name": "chocolate"
        }
      },
      "should": {
        "distance_feature": {
          "boost": 2,
          "field": "item_price",
          "pivot": 5,
          "origin": 15
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]


We look for all chocolate items, but we also want chocolates
that are produced recently (closer to the date `now`)
to be ranked higher.

[source,js]
--------------------------------------------------
GET items/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "item_name": "chocolate"
        }
      },
      "should": {
        "distance_feature": {
          "field": "item_production_date",
          "pivot": "7d",
          "origin": "now"
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
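
For a date field, the same formula is applied to distances measured in milliseconds. The following Java sketch illustrates that arithmetic for a `7d` pivot. It is not the Elasticsearch implementation: the class `DateDistanceFeatureSketch` is hypothetical, the production dates are sample values, and a fixed origin stands in for `now` so that the numbers are reproducible.

```java
import java.time.Duration;
import java.time.Instant;

// A minimal sketch of date-based distance scoring; hypothetical helper, not Elasticsearch code.
final class DateDistanceFeatureSketch {

    /** score = boost * pivot / (pivot + distance), with pivot and distance both in milliseconds. */
    static double score(double boost, Duration pivot, Instant origin, Instant value) {
        double pivotMs = pivot.toMillis();
        if (pivotMs <= 0) {
            throw new IllegalArgumentException("pivot must be positive"); // assumption of this sketch
        }
        double distanceMs = Math.abs(Duration.between(origin, value).toMillis());
        return boost * pivotMs / (pivotMs + distanceMs);
    }

    public static void main(String[] args) {
        // Fixed stand-in for "now" (an assumption, chosen so that the first date is exactly 7 days away).
        Instant origin = Instant.parse("2018-02-08T00:00:00Z");
        Duration pivot = Duration.ofDays(7);

        // Hypothetical production dates.
        String[] productionDates = {
            "2018-02-01T00:00:00Z",
            "2018-01-01T00:00:00Z",
            "2017-12-01T00:00:00Z"
        };
        for (String date : productionDates) {
            System.out.println(date + " -> " + score(1.0, pivot, origin, Instant.parse(date)));
        }
    }
}
```

A document produced exactly one pivot (7 days) before the origin receives half of the boost, and older documents receive progressively smaller scores.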

We look for all chocolate items, but we also want chocolates
that are produced locally (closer to our geo origin)
to come first in the result list.

[source,js]
--------------------------------------------------
GET items/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "item_name": "chocolate"
        }
      },
      "should": {
        "distance_feature": {
          "field": "item_location",
          "pivot": "1000m",
          "origin": [-71.3, 41.15]
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST[continued]
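
For geo fields the documentation above states that the distance is a haversine distance in meters. The Java sketch below shows how such a distance could feed the score formula for the `1000m` pivot. It is an illustration only: the class `GeoDistanceFeatureSketch`, the Earth radius constant, and the sample locations are assumptions, and the actual computation in Lucene may differ in its details. Note that `geo_point` arrays are written as `[lon, lat]`, while the helper takes latitude first.

```java
// A minimal haversine-based sketch; hypothetical helper, not the Lucene implementation.
final class GeoDistanceFeatureSketch {

    // Mean Earth radius in meters (an assumption of this sketch).
    static final double EARTH_RADIUS_M = 6_371_008.8;

    /** Great-circle distance in meters between two points given in degrees. */
    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double sinLat = Math.sin(dLat / 2);
        double sinLon = Math.sin(dLon / 2);
        double a = sinLat * sinLat
            + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2)) * sinLon * sinLon;
        // Clamp to 1.0 to guard against rounding slightly above the asin domain.
        return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** score = boost * pivot / (pivot + distance), with pivot and distance both in meters. */
    static double score(double boost, double pivotMeters, double distanceMeters) {
        return boost * pivotMeters / (pivotMeters + distanceMeters);
    }

    public static void main(String[] args) {
        double originLat = 41.15, originLon = -71.3; // origin [-71.3, 41.15] is [lon, lat]
        double pivotMeters = 1000;

        // Hypothetical document locations as {lat, lon}.
        double[][] locations = { {41.12, -71.34}, {41.15, -71.3}, {41.12, -71.3} };
        for (double[] loc : locations) {
            double d = haversineMeters(originLat, originLon, loc[0], loc[1]);
            System.out.println("lat=" + loc[0] + " lon=" + loc[1]
                + " distance_m=" + d + " score=" + score(1.0, pivotMeters, d));
        }
    }
}
```

A document located exactly at the origin has distance 0 and therefore receives the full boost.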
8 changes: 8 additions & 0 deletions docs/reference/query-dsl/special-queries.asciidoc
@@ -28,6 +28,12 @@ the specified document.
A query that computes scores based on the values of numeric features and is
able to efficiently skip non-competitive hits.

<<query-dsl-distance-feature-query,`distance_feature` query>>::

A query that computes scores based on the dynamically computed distances
between the origin and documents' long numeric, geo or distance fields.
Review comment (Member):
maybe "geo-point" instead of "geo" like above

Review comment (Contributor):
should it mention dates?

It is able to efficiently skip non-competitive hits.

<<query-dsl-wrapper-query,`wrapper` query>>::

A query that accepts other queries as json or yaml string.
@@ -42,4 +48,6 @@ include::percolate-query.asciidoc[]

include::rank-feature-query.asciidoc[]

include::distance-feature-query.asciidoc[]

include::wrapper-query.asciidoc[]
@@ -440,6 +440,7 @@ public enum ValueType {
OBJECT_OR_LONG(START_OBJECT, VALUE_NUMBER),
OBJECT_ARRAY_BOOLEAN_OR_STRING(START_OBJECT, START_ARRAY, VALUE_BOOLEAN, VALUE_STRING),
OBJECT_ARRAY_OR_STRING(START_OBJECT, START_ARRAY, VALUE_STRING),
OBJECT_ARRAY_STRING_OR_NUMBER(START_OBJECT, START_ARRAY, VALUE_STRING, VALUE_NUMBER),
VALUE(VALUE_BOOLEAN, VALUE_NULL, VALUE_EMBEDDED_OBJECT, VALUE_NUMBER, VALUE_STRING),
VALUE_OBJECT_ARRAY(VALUE_BOOLEAN, VALUE_NULL, VALUE_EMBEDDED_OBJECT, VALUE_NUMBER, VALUE_STRING, START_OBJECT, START_ARRAY),
VALUE_ARRAY(VALUE_BOOLEAN, VALUE_NULL, VALUE_NUMBER, VALUE_STRING, START_ARRAY);
@@ -0,0 +1,87 @@
setup:
  - skip:
      version: " - 7.9.99" #TODO adjust to 7.0.99 after merging to 7.x
      reason: "Implemented in 7.1"

  - do:
      indices.create:
        index: index1
        body:
          settings:
            number_of_replicas: 0
          mappings:
            properties:
              my_long:
                type: long
              my_date:
                type: date
              my_geo:
                type: geo_point

  - do:
      bulk:
        refresh: true
        body:
          - '{ "index" : { "_index" : "index1", "_id" : "1" } }'
          - '{ "my_long" : 22, "my_date": "2018-02-01T10:00:30Z", "my_geo": [-71.34, 41.13] }'
          - '{ "index" : { "_index" : "index1", "_id" : "2" } }'
          - '{ "my_long" : 25, "my_date": "2018-02-01T11:00:30Z", "my_geo": [-71.34, 41.14] }'
          - '{ "index" : { "_index" : "index1", "_id" : "3" } }'
          - '{ "my_long" : 19, "my_date": "2018-02-01T09:00:30Z", "my_geo": [-71.34, 41.12] }'

---
"test distance_feature query on long type":

  - do:
      search:
        rest_total_hits_as_int: true
        index: index1
        body:
          query:
            distance_feature:
              field: my_long
              pivot: 5
              origin: 15

  - length: { hits.hits: 3 }
  - match: { hits.hits.0._id: "3" }
  - match: { hits.hits.1._id: "1" }
  - match: { hits.hits.2._id: "2" }

---
"test distance_feature query on date type":

  - do:
      search:
        rest_total_hits_as_int: true
        index: index1
        body:
          query:
            distance_feature:
              field: my_date
              pivot: 1h
              origin: 2018-02-01T08:00:30Z

  - length: { hits.hits: 3 }
  - match: { hits.hits.0._id: "3" }
  - match: { hits.hits.1._id: "1" }
  - match: { hits.hits.2._id: "2" }

---
"test distance_feature query on geo_point type":

  - do:
      search:
        rest_total_hits_as_int: true
        index: index1
        body:
          query:
            distance_feature:
              field: my_geo
              pivot: 1km
              origin: [-71.35, 41.12]

  - length: { hits.hits: 3 }
  - match: { hits.hits.0._id: "3" }
  - match: { hits.hits.1._id: "1" }
  - match: { hits.hits.2._id: "2" }
10 changes: 10 additions & 0 deletions server/src/main/java/org/elasticsearch/common/geo/GeoUtils.java
@@ -545,6 +545,16 @@ private static GeoPoint parseGeoHash(GeoPoint point, String geohash, EffectivePo
}
}

public static GeoPoint parseFromString(String val){
Review comment (Member):
Can you add one or two lines of javadoc explaining what type of String input this parses and under which conditions? E.g., what type of Strings are expected, and that the "," serves as a cue to whether this is a geohash or not. Maybe it's also worth checking some edge conditions and throwing errors if they are not met (e.g. empty String etc...).
Alternatively you could move this helper to DistanceFeatureQueryBuilder and make it package private, in this case I don't think we need so much checking and docs since we are sure to use it only from that class and tests.

Review comment (Member):
also nit: whitespace between "){"

Review comment (Contributor Author):
@cbuescher I have added the documentation for parseFromString. I have not added any checks for invalid input, as these checks are done within internal functions that parseFromString calls.

    GeoPoint point = new GeoPoint();
    boolean ignoreZValue = false;
    if (val.contains(",")) {
        return point.resetFromString(val, ignoreZValue);
    } else {
        return parseGeoHash(point, val, EffectivePoint.BOTTOM_LEFT);
    }
}
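
The dispatch in `parseFromString` (a comma means a coordinate string, otherwise a geohash) can be illustrated with a standalone Java sketch. This is not the `GeoUtils` code and does not use `GeoPoint`: the class `GeoOriginParseSketch`, its methods, and its input checks are hypothetical. It assumes the coordinate string has the form `"lat,lon"` and that `EffectivePoint.BOTTOM_LEFT` refers to the corner of the geohash cell with the minimum latitude and longitude. Unlike the real method, it rejects strings with more than two comma-separated parts.

```java
import java.util.Arrays;
import java.util.Locale;

// A standalone sketch of the "comma or geohash" dispatch; hypothetical, not Elasticsearch code.
final class GeoOriginParseSketch {

    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Returns {lat, lon} parsed from either "lat,lon" or a geohash. */
    static double[] parse(String val) {
        if (val == null || val.isEmpty()) {
            throw new IllegalArgumentException("empty geo point string"); // check added for this sketch
        }
        if (val.contains(",")) {
            String[] parts = val.split(",");
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected \"lat,lon\" but got: " + val);
            }
            return new double[] {
                Double.parseDouble(parts[0].trim()),
                Double.parseDouble(parts[1].trim())
            };
        }
        return geohashBottomLeft(val);
    }

    /** Decodes a geohash to the bottom-left (min lat, min lon) corner of its cell. */
    static double[] geohashBottomLeft(String geohash) {
        double latLo = -90, latHi = 90, lonLo = -180, lonHi = 180;
        boolean lonBit = true; // geohash bits alternate, starting with longitude
        for (char c : geohash.toLowerCase(Locale.ROOT).toCharArray()) {
            int idx = BASE32.indexOf(c);
            if (idx < 0) {
                throw new IllegalArgumentException("invalid geohash character: " + c);
            }
            for (int mask = 16; mask > 0; mask >>= 1) {
                boolean bit = (idx & mask) != 0;
                if (lonBit) {
                    double mid = (lonLo + lonHi) / 2;
                    if (bit) { lonLo = mid; } else { lonHi = mid; }
                } else {
                    double mid = (latLo + latHi) / 2;
                    if (bit) { latLo = mid; } else { latHi = mid; }
                }
                lonBit = !lonBit;
            }
        }
        return new double[] { latLo, lonLo };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("41.15,-71.3")));
        System.out.println(Arrays.toString(parse("ezs42")));
    }
}
```

A geohash never contains a comma, because its base-32 alphabet consists only of digits and lowercase letters, which is what makes the comma usable as a cue.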

/**
* Parse a precision that can be expressed as an integer or a distance measure like "1km", "10m".
*