Surface detailed/internal timings from Elasticsearch #1561

Open
dgieselaar opened this issue Dec 8, 2020 · 4 comments

@dgieselaar (Member)

It would be valuable to have more detailed and/or internal timings about Elasticsearch search requests, rather than just timings from the perspective of the client.

The profile API returns how much time is spent doing what on each shard, but it has a lot of overhead and doesn't give a great timeline of a search.
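
For reference, a profiled search looks roughly like this (a sketch only; the host, index name and query are placeholders), and the response is a per-shard breakdown rather than a timeline of the request:

```python
# Sketch: run a search with the profile API enabled and print the per-shard
# query breakdown. Host, index and query are placeholders.
import requests

body = {
    "profile": True,
    "query": {"match": {"message": "error"}},
}
resp = requests.post("http://localhost:9200/my-index/_search", json=body).json()

for shard in resp["profile"]["shards"]:
    print(shard["id"])
    for search in shard["searches"]:
        for query in search["query"]:
            # time_in_nanos is the time spent in this part of the query on this shard
            print(" ", query["type"], query["time_in_nanos"], "ns")
```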

We can collect use cases here and see what is feasible and what is needed from other teams to make it happen.

Examples:

  • Show how much time is spent networking between nodes
  • Show different phases/timings of a search request, e.g. canMatch
  • Show how long a search task was stuck in a queue
@nik9000 (Member) commented Dec 8, 2020

We intentionally don't measure lots of stuff, and a lot of what we do measure is exposed over the _cluster/stats and _nodes/stats APIs. I think it might be a good idea to go back over those to see what we measure and whether you can hook into it.
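
For example, the search thread pool counters that _nodes/stats already exposes can be pulled like this (a sketch; the host is a placeholder, and these are cumulative per-node counters rather than per-request timings):

```python
# Sketch: read the search thread pool stats that _nodes/stats already exposes.
# These are per-node counters, not a timeline for a single search request.
import requests

stats = requests.get("http://localhost:9200/_nodes/stats/thread_pool").json()

for node_id, node in stats["nodes"].items():
    search_pool = node["thread_pool"]["search"]
    print(
        node["name"],
        "active:", search_pool["active"],
        "queue:", search_pool["queue"],
        "rejected:", search_pool["rejected"],
    )
```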

@SylvainJuge added the new-feature label on Feb 2, 2021
@dgieselaar (Member, Author)

@nik9000 I'd like per-request timings, not necessarily per-cluster/node/index/shard stats, if that makes sense. I'll give an example for search to start the discussion. Please keep in mind that I'm not very familiar with ES internals, so it won't be a correct representation, but hopefully it makes clearer what I'm looking for.

In the context of this discussion, a transaction is an event that describes the time it took for Elasticsearch to respond to a search request. A span is an (async) operation that happens in the context of this request, and is a child of the transaction. Both events have a start time (@timestamp) and a total duration.

Suppose service-a calls /_search. The service creates an elasticsearch span. The coordinating (client) node that receives this request starts a transaction that is a child of this elasticsearch span. From there:

  • The coordinating node starts a query_shards span, with a transport span as a child.
  • When a node receives a shard search request, it starts a shard_query_response span. This might be preceded by a wait_for_available_search_thread span - I'm not sure whether that happens at the shard level or on the coordinating node. When the shard has responded, it stops the span.
  • When the coordinating node receives the shard response, it stops the transport and query_shards spans and starts a reduce span.
  • When the results from the shards have been reduced, that span stops and a fetch_shards span starts, again with a transport span as a child. The shard starts a shard_fetch_response span and stops it when it responds to the coordinating node's request.
  • When the coordinating node has received all the fetch responses, it starts a final_reduce (?) span, followed by a serialise_response span.
  • After the response has been sent, the elasticsearch service stops its transaction. When service-a receives the response, it closes the elasticsearch span.
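
Purely to illustrate the nesting described above (not an implementation - the real instrumentation would have to live inside Elasticsearch itself, and every span name here is hypothetical), the hierarchy could be sketched with the Python APM agent API like this:

```python
# Illustrative sketch only: the span names come from the description above and
# are hypothetical; real instrumentation would be emitted from within
# Elasticsearch, not from a Python service.
import elasticapm

client = elasticapm.Client(service_name="elasticsearch")

client.begin_transaction("request")  # coordinating node handles /_search
with elasticapm.capture_span("query_shards"):
    with elasticapm.capture_span("transport"):
        # per shard: optionally wait_for_available_search_thread, then the query phase
        with elasticapm.capture_span("shard_query_response"):
            pass
with elasticapm.capture_span("reduce"):
    pass  # coordinating node reduces the shard query results
with elasticapm.capture_span("fetch_shards"):
    with elasticapm.capture_span("transport"):
        with elasticapm.capture_span("shard_fetch_response"):
            pass  # shards return documents for the fetch phase
with elasticapm.capture_span("final_reduce"):
    pass
with elasticapm.capture_span("serialise_response"):
    pass
client.end_transaction("search", "success")
```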

I think this would be useful for detecting network issues, thread pool exhaustion and optimisation opportunities (e.g. a search request that hits many shards because no range query is included). Having these timings visible in the context of a trace would be very helpful, because it lets people correlate slowness for a specific request with Elasticsearch issues. While cluster/node stats are useful, they report averages, which might not help when analysing performance issues for a specific request. I also think this would help people working with Elasticsearch understand its architecture better.

@PhaedrusTheGreek

Today, the tasks API and the slow query log are the two best ways to hook into this information, but tasks are transient and the slow query log is a narrow lens.
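
For reference, the slow query log has to be enabled per index via thresholds, roughly like this (a sketch; host, index name and thresholds are arbitrary), and it only logs the shard searches that exceed the threshold:

```python
# Sketch: enable the search slow log on one index. Index name and thresholds
# are arbitrary; only shard searches slower than the thresholds get logged,
# which is part of what makes it a narrow lens.
import requests

requests.put(
    "http://localhost:9200/my-index/_settings",
    json={
        "index.search.slowlog.threshold.query.warn": "5s",
        "index.search.slowlog.threshold.fetch.warn": "1s",
    },
)
```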

@matschaffer

We've actually resorted to polling the tasks API on a schedule to help spot bulk requests that are taking too long.
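
A rough sketch of that kind of polling (host, threshold and the *bulk* action filter are placeholders):

```python
# Rough sketch: poll the tasks API and flag bulk tasks that have been running
# longer than a threshold. Host and threshold are placeholders.
import requests

THRESHOLD_SECONDS = 30

resp = requests.get(
    "http://localhost:9200/_tasks",
    params={"actions": "*bulk*", "detailed": "true"},
).json()

for node in resp["nodes"].values():
    for task_id, task in node["tasks"].items():
        seconds = task["running_time_in_nanos"] / 1e9
        if seconds > THRESHOLD_SECONDS:
            print(f"{task_id}: {task['action']} running for {seconds:.1f}s")
```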
