
Search API response time breakdown #21073


Open
PhaedrusTheGreek opened this issue Oct 21, 2016 · 12 comments
Labels
>enhancement, high hanging fruit, :Search Foundations/Search (Catch all for Search Foundations), Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch)

Comments

@PhaedrusTheGreek
Contributor

There is currently no easy way to troubleshoot how much time each phase of an API request takes.

It is possible for the reported took time to be a few milliseconds while the actual API response takes 60 seconds. This can happen when there are network problems between nodes, or when integrated components such as AD or LDAP are slow to respond or are timing out. In some cases there is no logging or other way to identify the slow phases.

It would be awesome if we could do:

GET /index/_search?vtook

{
   "vtook": {
       "total_api": 60000,
       "security/ldap_realm": 59980,
       "total_query": 20,
       "fetch": 10,
       "collate": 10,
       "threadpool_queue": 0
   },
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1000000,
      "max_score": 1,
      "hits": [
        ...
      ]
   }
}
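For comparison, the existing search Profile API already exposes detailed timing inside the query phase; enabling it looks roughly like this (a sketch; the index and field names are placeholders):

```
GET /index/_search
{
  "profile": true,
  "query": {
    "match": { "message": "timeout" }
  }
}
```

The response then carries a "profile" section with per-shard query and collector timings, but nothing about security, queuing, or network time, which is exactly the gap this issue describes.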
@jpountz
Contributor

jpountz commented Oct 24, 2016

If we do this, I think this would belong to the profile API.

@clintongormley clintongormley added discuss :Search/Search Search-related issues that do not fall into other categories labels Nov 5, 2016
@evanvolgas

The Profile API makes sense but it would also be pretty useful if you could log this and do reporting on it, not just profile one-off cases. For example, maybe there could be a sample threshold in which every Nth query gets profiled with vtook and logged. That sample threshold parameter would be quite handy in general. To accomplish that now, I have to use tools like goreplay and rewrite the requests that I want to sample and profile.
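There is no sampling parameter today; the closest built-in mechanism is the search slow log, which records queries whose phases exceed configured thresholds (a sketch, threshold values are illustrative):

```
PUT /index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}
```

That captures outliers rather than a random sample, and it only covers the query and fetch phases, not the security or network stages discussed above.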

@talevy talevy added :Search/Search Search-related issues that do not fall into other categories and removed :Search/Search Search-related issues that do not fall into other categories labels Mar 26, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@polyfractal
Contributor

polyfractal commented Jul 19, 2018

Chatted about this in Fixit Friday Thursday. Adding some of the extra search phases (fetch, aggregation reduce, etc) wouldn't be too bad, just needs some work to implement.

Adding non-search phases like security or general network stages would probably need a dedicated framework to hook into phases of a request's lifecycle. The current search profiler is pretty hardwired to just work with Search... extending it would likely be more hassle than just sitting down to work on an overarching framework.

@PhaedrusTheGreek Was the interest more in the non-search phases (security, queuing, etc.), in adding the additional search phases (fetch, agg reduce, etc.), or all of the above?

@PhaedrusTheGreek
Contributor Author

@polyfractal this matter has come up again today, in regards to wanting to understand how much time a query spends in the queue. Previously, the matter was raised by a ~1m LDAP authentication delay.

@PhaedrusTheGreek
Contributor Author

PhaedrusTheGreek commented Dec 6, 2018

@polyfractal - adding a network stage metric could help resolve this matter; closed in favour of #36127

{
   "vtook": {
       "total" : 1024,
       "network" : 1020,
       "security" : 0,
       "query": 4,
       "fetch" : 2,
       "collate": 2,
       "threadpool_queue" : 0
    }
}

I wonder if something like this could be added to indices stats also, if so it would tie into monitoring well
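On that note, the indices stats API does already report cumulative search timings per index, which monitoring could aggregate over time, although it offers no breakdown beyond the query and fetch phases (illustrative request, with an abridged response excerpt):

```
GET /index/_stats/search

{
  ...
  "query_time_in_millis": 1234,
  "fetch_time_in_millis": 56,
  ...
}
```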

@polyfractal
Contributor

We're still unfortunately at where #21073 (comment) left off... the profiler framework is purely in the Search Phase and would need some fairly substantial overhaul to take into account other things like network. :(

I still think this is an enhancement that the Profiler should get, it'd be very useful (and also related to #23114 which could provide some of this data too)

That said, the new Adaptive Replica Selection sorta provides these details, albeit in a more roundabout manner.

E.g. ARS tracks the overall latencies between nodes so that it can choose better replicas for search.

"adaptive_selection" : {
        "<nodeID>" : {
          "outgoing_searches" : 0,
          "avg_queue_size" : 0,
          "avg_service_time" : "342.5ms",
          "avg_service_time_ns" : 342511123,
          "avg_response_time" : "1.2ms",
          "avg_response_time_ns" : 1256317,
          "rank" : "1.3"
        }
}

This essentially represents the node-to-node service and response time from a 10,000 ft view. It doesn't help with individual queries, but it might provide some insight into the effect of network (and load) on the cluster.
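For reference, the adaptive_selection snippet above comes from the node stats API; something like the following should return just that section (endpoint shape from memory, so treat it as a sketch):

```
GET /_nodes/stats/adaptive_selection
```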

@PhaedrusTheGreek
Contributor Author

PhaedrusTheGreek commented May 31, 2019

Another option to track the problem of unresponsive authentication services could be to track timeouts/errors in some counter instead of having to rely on finding errors in the logs.

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@kaliljoao

Hi guys, can I work on this issue?

@PhaedrusTheGreek
Contributor Author

Surfacing this information could potentially be solved with APM via elastic/apm-agent-java#1561

@javanna
Member

javanna commented Mar 7, 2022

I believe that what will be implemented as part of #84369 will help solve this issue. We'll have to evaluate whether more search-specific info needs to be exposed as a follow-up.

@javanna javanna changed the title API response time breakdown Search API response time breakdown Jun 17, 2022
@javanna javanna added :Search Foundations/Search Catch all for Search Foundations and removed :Search/Search Search-related issues that do not fall into other categories labels Jul 17, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 17, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine elasticsearchmachine removed the Team:Search Meta label for search team label Jul 17, 2024
@javanna javanna removed the help wanted adoptme label Apr 10, 2025