Order by scripted_metric sub aggregation #8486
Comments
@eryabitskiy this is something that I have been thinking about, but it requires a few outstanding issues to be resolved first, specifically #8421 and #8434. These would allow us to specify much more powerful order paths, and the getProperty method on the scripted metric aggregation could be used to retrieve arbitrary properties of the script's results. |
Hi there, I'm not sure if this is appropriate but I thought you may want to gauge interest. We're very keen to see this as well. Currently we end up doing a lot of sorting on oversized resultsets in Go, whereas being able to sort on scripted metrics would save us this hassle. |
+1 This would be great if solved! |
+1 |
+1 |
👍 sort on scripted metrics aggregations would be a killer feature. In the meantime we're also doing client-side sorting on oversized results. |
+1 |
+1 |
+1 |
+1 |
+1 Aside from the other use-cases mentioned here, this feature would give the Kibana product some major strength, particularly for rapid prototyping. |
+1 |
+1 |
Thank you for your answer! I understand that there is always a trade-off between performance and accuracy, especially in an FTS engine. But if you fetch all buckets from each shard you can always reach 100% accuracy. And I suspect that most folks would actually be OK with all buckets being returned from every shard, trading performance and memory consumption for 100% accuracy (we can hold a vote to check). We do this already on the Java server side anyway. Also, you can always set some protective max size property that triggers an error when such queries consume too much memory. Can you consider such a feature? |
+1. For my use case, I actually need to fetch all buckets, then sort and slice on client-side. |
We only add features to Elasticsearch which are horizontally scalable. Whatever we add should work when you're running one node on your laptop with 50GB of data or 1000 nodes in your data server with 50 PB of data. Fetching all terms from all shards does not scale horizontally, and so we will not add it.
Exactly. This is a problem that should be solved client side instead (where you know the limits of your data and how much you will need to scale). |
I see... Then I can mention one horizontally scalable use case: if each bucket is fully allocated within one shard (by using the terms field also as the routing key), you can simply fetch only the top buckets from each shard and still get accurate results. Unfortunately, it only works together with proper routing. |
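To make the routing argument concrete, here is a minimal Python sketch (not Elasticsearch code). It simulates shards as lists of `(key, value)` pairs; the function names and the sample data are made up for illustration. When every key lives on exactly one shard, merging the shard-local top-N lists gives the exact global top-N. When a key is spread across shards, it can miss every local cut and the merged result is wrong.

```python
import heapq
from collections import defaultdict


def shard_top_n(docs, n):
    """Per-shard step: sum values per key, return only the n largest buckets."""
    totals = defaultdict(float)
    for key, value in docs:
        totals[key] += value
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])


def merged_top_n(shards, n):
    """Coordinator step: merge the shard-local top-n lists and cut again."""
    merged = defaultdict(float)
    for docs in shards:
        for key, total in shard_top_n(docs, n):
            merged[key] += total
    return heapq.nlargest(n, merged.items(), key=lambda kv: kv[1])


# Routed by key (assumption): every key lives on exactly one shard.
routed = [
    [("a", 5.0), ("a", 4.0), ("b", 1.0)],
    [("c", 6.0), ("d", 2.0), ("d", 1.0)],
]

# Not routed: key "b" has the largest global total (8.0) but is only
# second on each shard, so it never survives a local top-1 cut.
unrouted = [
    [("a", 5.0), ("b", 4.0)],
    [("c", 5.0), ("b", 4.0)],
]

exact = merged_top_n(routed, 2)      # matches the true global top 2
wrong = merged_top_n(unrouted, 1)    # does not contain "b"
```

The sketch only covers a simple sum per bucket; it says nothing about how Elasticsearch would have to validate that routing really matches the terms field.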
Is there any plan to deliver this feature? In my opinion, these aggregations, which are calculations like sum/min/max/avg/count etc., are all metrics even if the result is not in numeric format. For example: |
I'm currently moving back my analytics to an SQL platform because of this. |
For now, I will try to translate the different kinds of bucket-script metrics into a map-reduce script form in a general way, so that an ordering phase on these metrics can be supported. |
This requires knowing how many values the aggregation will return. You cannot ask ES to "return all buckets" on an aggregation; you have to specify an arbitrary limit, which is complete nonsense and performance-killing. |
@anhzhi I'm having trouble finding the text that you quoted for context (sorry if I've missed it in the comments above), could you paste the link to the comment or documentation you saw this in? The statement is however very true.
If you are talking about being able to sort the terms aggregation by a pipeline aggregation, then I don't see how this would be possible to implement. The terms aggregation needs to sort the buckets collected on the shard so it can return only the So, given the fact that we have to sort on the shard, the only other solution would be to 'push down' the calculation of pipeline aggregations onto the shards.
But this would not work either, since pipeline aggregations at their heart compare the final results of different aggregations (think of a derivative, where you are comparing the final values in consecutive buckets, or of dividing two sum aggregations). On the shard the final result is not known, because that result depends on the results from all the shards, not just one. If you calculated the value based only on the information on that shard, the result would almost certainly be wrong and, worse, it could be wildly affected by the results of the other shards, so it would not even be a good estimate of the final value. So pushing down the calculation of pipeline aggregations onto the shards doesn't help with sorting the terms aggregation by pipeline aggregations either.
If you were referring instead to sorting the terms aggregation by the scripted_metric aggregation, it was discussed in #8486 (comment), and we decided that we would rather not add more cases where the error in the terms aggregation is unbounded, so we would instead like to try to solve the specific use cases for the scripted_metric aggregation.
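A small worked example of the "dividing two sum aggregations" case may help. The numbers below are invented, and the Python is only arithmetic, not Elasticsearch code: it shows that a ratio computed on a single shard can be far from the ratio computed over all shards, and that averaging the local ratios does not fix it.

```python
def ratio(numerator_sum, denominator_sum):
    """Final value of a 'sum A / sum B' style pipeline calculation."""
    return numerator_sum / denominator_sum


# (sum A, sum B) for the same bucket as seen on two shards (made-up data).
shard_a = (1, 2)    # local ratio 0.5
shard_b = (1, 98)   # local ratio of roughly 0.0102

local_ratios = [ratio(*shard_a), ratio(*shard_b)]

# The correct value needs the sums from all shards before dividing.
global_ratio = ratio(shard_a[0] + shard_b[0], shard_a[1] + shard_b[1])  # 2 / 100

# Neither a single local ratio nor their average matches the global value.
mean_of_locals = sum(local_ratios) / len(local_ratios)
```

Here shard A alone would rank this bucket 25 times higher than its true value, which is the unbounded error the comment above refers to.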
There are no plans at this time. We need a value to compare the buckets in order to sort them and a numeric value gives us a value that can always be compared without complicating the API. On the other hand arbitrary objects do not naturally compare so the API would need to be made more complex and harder to use to support this.
See my comments to your first question above
I'm not sure I understand this, since the output of the min aggregation is a single double value. For the REST API we do wrap this up in an object so you can navigate to it in the response, but for the purposes of comparing when sorting it is a single double value and, importantly, we know that it will always be a single double value, so we can validate it before we execute the request.
Unfortunately this is also true for Elasticsearch itself because of the reasons I outlined answering your first question; that we need the information from all shards to be able to calculate the results of the pipeline aggregations in order to be able to do the sorting. I would love to find a solution to this but I don't see how we could do so at this stage in a reliable, scalable way within the constraints of sharding. |
@colings86 The quoted text is from
Now I get what you are saying. I think Elasticsearch's distributed-calculation algorithms and frameworks need some more improvement. |
@anhzhi , totally agreed. Distributed calculation is a big miss for any big data analytics. At this exact moment I need to do such a thing, and I was disappointed when I found and read this thread. |
@tberne @anzhi I have moved away from ElasticSearch because of this: I needed to sort on scripted metrics and create range facets. Since ES was unable to provide this, I used to retrieve all documents and do that logic on the PHP side, which was completely counterproductive and performance-killing. Last week I delivered a new version of this application, on which I worked for several months, and the storage engine is now based on MariaDb Columnstore (the only open-source SQL engine that stores data in columns instead of rows). Performance is not as good as ElasticSearch (and INSERT/UPDATE/DELETE take ages), but at least creating scripted metrics and sorting on them (SELECT SUM(bytes_in + bytes_out) AS sum_bytes_total [...] ORDER BY sum_bytes_total DESC) is a piece of cake, and performance is acceptable (queries take about ~150ms on tables with tens of millions of rows). Of course, you don't have facets (aggregations) out of the box and you have to create them yourself, which is more complicated than in ES, but at least it works, and all the logic is done within the storage engine; no more tricks to work around unsupported behavior. I formerly thought ElasticSearch was a good choice as the main engine of a BI application, but I was wrong: ElasticSearch is perfect for full-text, not for analytics. |
I couldn't agree more. ES is just perfect when the thing is Full Text Search. Even with some simple aggregation scenario, it is a killing tool. But when we start doing some serious analysis, it lacks some major things. I don't know yet, but I think I will keep ES as a big data storage and, perhaps, I will introduce the use of some stream processor (like Apache Flink) to consolidate some data back to ES. |
@bpolaszek - Can't you get the conversion rate per referer per ad using the script provision in the sum aggregation, and sum it the same way? This can be used later to order the referers.
There's a solution with ES that still requires client-side sorting, but should be more efficient for retrieving all the results you need in order to sort them: try the composite agg which allows you to retrieve all results (with pagination) and then you can do a merge sort client side. |
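A rough Python sketch of that approach follows. The request body uses the composite aggregation's `sources`, `size` and `after` parameters and reads `after_key` from the response. Everything else is an assumption made for illustration: the field names (`referer`, `clicks`, `impressions`), the aggregation name `by_referer`, and the `search` callable, which stands in for whatever client call sends the body and returns the parsed response. A tiny in-memory fake replaces the cluster so the loop can run on its own, and a plain `sorted` is used instead of a hand-written merge sort.

```python
import copy


def fetch_all_buckets(search, body, agg_name):
    """Page through a composite aggregation until no more buckets come back."""
    body = copy.deepcopy(body)  # do not mutate the caller's request
    buckets = []
    while True:
        agg = search(body)["aggregations"][agg_name]
        buckets.extend(agg["buckets"])
        after_key = agg.get("after_key")
        if not agg["buckets"] or after_key is None:
            return buckets
        # Resume after the last key of the previous page.
        body["aggs"][agg_name]["composite"]["after"] = after_key


def make_fake_search(data):
    """Hypothetical stand-in for a real search call, for demonstration only."""
    keys = sorted(data)

    def search(body):
        comp = body["aggs"]["by_referer"]["composite"]
        after = comp.get("after", {}).get("referer")
        remaining = [k for k in keys if after is None or k > after]
        page = remaining[: comp["size"]]
        agg = {"buckets": [
            {"key": {"referer": k},
             "clicks": {"value": data[k][0]},
             "impressions": {"value": data[k][1]}}
            for k in page
        ]}
        if page:
            agg["after_key"] = {"referer": page[-1]}
        return {"aggregations": {"by_referer": agg}}

    return search


body = {
    "size": 0,
    "aggs": {
        "by_referer": {
            "composite": {
                "size": 2,  # page size; tiny here to force several pages
                "sources": [{"referer": {"terms": {"field": "referer"}}}],
            },
            "aggs": {
                "clicks": {"sum": {"field": "clicks"}},
                "impressions": {"sum": {"field": "impressions"}},
            },
        }
    },
}

search = make_fake_search({"a.com": (1, 10), "b.com": (5, 10), "c.com": (3, 10)})
buckets = fetch_all_buckets(search, body, "by_referer")

# Client-side sort on the computed metric (clicks / impressions).
ranked = sorted(
    buckets,
    key=lambda b: b["clicks"]["value"] / b["impressions"]["value"],
    reverse=True,
)
```

The client still holds every bucket in memory, so this only moves the cost; it does not remove it.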
Very interesting discussion! I do wonder how Solr solved this efficiently. Is there any new feature that would allow this without retrieving all results and sorting client side? Our use case is similar: calculate CTR (clicks / impressions) on aggregated fields and sort results by the highest values. |
AFAIK Solr did not solve this either. |
Thanks @bpolaszek. It appears that in our case we can shard/route the data, which means that the data for every search query will always live on the same server and there would be no need to shuffle it around the cluster. We are considering writing a custom plugin to allow us to sort on calculated fields. Any suggestions on how this could be achieved, given the above routing limitation? I had a look at the existing plugins but couldn't find anything similar. |
+1 |
Issue +1. Help add support |
Are there any workarounds for the moment? |
Because there is still some activity on this bug, I would like to mention a solution: We added transform in This solution is scalable and works with large amounts of data. Starting with |
Sounds cool! Do you have an example query? (I'm no longer in ES, but still curious 😀) |
@colings86 Although this issue is very old, I ran into the same limitation, and I was able to solve it (apparently) using the bucket sort aggregation.
What am I missing? |
@AbbassFaytaroony Can confirm. The method of using "bucket_sort" solved the problem for me too. |
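The query from the comment above did not survive in this thread, so the following is not that query. It is a hedged sketch of what a `bucket_sort` request can look like, written as a Python dict so it can be serialized and sent by any client. The field names (`referer`, `clicks`, `impressions`) and aggregation names are hypothetical. One thing to keep in mind: `bucket_sort` is a pipeline aggregation, so it reorders and truncates only the buckets that the parent `terms` aggregation has already returned.

```python
import json


def build_request(top_n, candidate_buckets):
    """Build a terms + bucket_script + bucket_sort request body (sketch)."""
    return {
        "size": 0,
        "aggs": {
            "by_referer": {
                # The parent aggregation decides which buckets exist at all.
                "terms": {"field": "referer", "size": candidate_buckets},
                "aggs": {
                    "clicks": {"sum": {"field": "clicks"}},
                    "impressions": {"sum": {"field": "impressions"}},
                    # Computed per bucket from the two sums above.
                    "ctr": {
                        "bucket_script": {
                            "buckets_path": {"c": "clicks", "i": "impressions"},
                            "script": "params.c / params.i",
                        }
                    },
                    # Sorts the returned buckets by the computed value.
                    "top_ctr": {
                        "bucket_sort": {
                            "sort": [{"ctr": {"order": "desc"}}],
                            "size": top_n,
                        }
                    },
                },
            }
        },
    }


request_body = build_request(top_n=5, candidate_buckets=1000)
payload = json.dumps(request_body)  # what would be sent as the search body
```

If `candidate_buckets` is smaller than the number of distinct terms, buckets outside that set are never seen by the sort, so the result is a ranking of the candidates rather than of all terms.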
@syntax42 @AbbassFaytaroony |
@Stormtv Yes, I found that out too :-( |
It appears that you can now sort by numeric scripted_metrics but cannot sort by strings despite the fact that they are simple types and a viable return value. Does anyone have any insight on that? |
@AbbassFaytaroony do you know if it is possible that the script returns a map and I want to specify on which key to sort. I tried something like but I'm getting request validation error |
Since there is a new Scripted metric aggregation (scripted_metric) in 1.4, it is possible to do a lot of amazing stuff.
For example it is possible to implement Weighted Average aggregation, which we were missing before.
Now we are really missing a possibility to sort by scripted_metric results.
Live example:
We calculate weightedAvgVis with scripted_metric and want to get ids with TOP 5 values of weightedAvgVis. Since script returns double, it looks logically possible.
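The live example is missing from this page, so here is a plain Python simulation of the idea rather than the original script. It mirrors the way a scripted_metric is split into per-shard and final steps: each shard accumulates a partial state, and the reduce step combines the states into one double. The `(value, weight)` document layout and the function names are assumptions for illustration.

```python
def shard_phase(docs):
    """Per-shard work: accumulate the weighted sum and the total weight."""
    state = {"weighted_sum": 0.0, "weight": 0.0}
    for value, weight in docs:
        state["weighted_sum"] += value * weight
        state["weight"] += weight
    return state


def reduce_phase(states):
    """Final step: merge the shard states, then divide once."""
    weighted_sum = sum(s["weighted_sum"] for s in states)
    weight = sum(s["weight"] for s in states)
    return weighted_sum / weight if weight else None


# Two simulated shards holding (value, weight) pairs for one bucket.
shards = [
    [(10.0, 1.0), (20.0, 3.0)],
    [(30.0, 1.0)],
]

weighted_avg = reduce_phase([shard_phase(docs) for docs in shards])
```

Because the division happens only after all shard states are merged, the final value is a single double per bucket, which is what makes sorting by it look feasible from the user's side.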