Timeout Exception diagnosis #2486
The main problem you have here is thread pool exhaustion. If you take a memory dump under load, you can see where all those threads are stuck; it may be 1 or 2 hot spots worth making async ASAP.
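As an illustration of what "making a hot spot async" can look like (a hypothetical example, not code from this issue; the ProfileCache name and key format are made up), a synchronous Redis call pins a thread-pool thread for the whole round trip, while the async form releases it:

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

// Hypothetical hot spot: names and key format are illustrative only.
public class ProfileCache
{
    private readonly IDatabase _db;
    public ProfileCache(IDatabase db) => _db = db;

    // Synchronous call: the calling thread is blocked for the full Redis round trip,
    // which under load is exactly the kind of hot spot that exhausts the pool.
    public string GetProfileBlocking(string userId) =>
        _db.StringGet($"profile:{userId}");

    // Async version: the thread goes back to the pool while the reply is in flight.
    public async Task<string> GetProfileAsync(string userId) =>
        await _db.StringGetAsync($"profile:{userId}");
}
```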
Thanks a lot, Nick.
To prevent a potential duplicate issue, I'll reuse this one to ask for help figuring out a (potentially) similar problem. During the last 14 days we have observed an increased amount of timeout exceptions.
We are using Azure Cache for Redis and we haven't made any changes to this resource; we haven't made any changes to how we access the data in Redis; the only related change we made during this period was a version update. I realize that the issue is most probably the result of some internal change in the Azure Cache for Redis resource. @NickCraver, can you please point us to where to look for the issue?
@vladislav-karamfilov I'd check SLOWLOG on the server in your case; is it possible that hash is quite a large payload?
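For anyone following along, here is a minimal sketch of pulling the slow log from the client side with StackExchange.Redis (the connection string, allowAdmin flag, and entry count are illustrative only and not from this thread):

```csharp
using System;
using StackExchange.Redis;

// Illustrative connection only; adjust the endpoint/credentials to your environment.
var muxer = ConnectionMultiplexer.Connect("localhost:6379,allowAdmin=true");
var server = muxer.GetServer(muxer.GetEndPoints()[0]);

// Fetch the 10 most recent SLOWLOG entries recorded by the server.
foreach (var entry in server.SlowlogGet(10))
{
    Console.WriteLine(
        $"{entry.Time:u} took {entry.Duration.TotalMilliseconds:F1} ms: {string.Join(" ", entry.Arguments)}");
}
```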
@NickCraver, we have investigated with SLOWLOG and found that there were some commands with too many params, which we have split into multiple commands; we have also optimized the stored value sizes for all Redis collections (although the hashes where we store them are very small - values in them are constantly being added and then removed, so at any one moment there are very few records); and we have increased the minimum number of worker threads so they are created on demand without waiting, because we have seen some timeout logs with busy worker threads > min worker threads. We hope that this will fix the issue, because every other suggestion for improvement we found is already in place.

PS The exception message contains "mgr: 10 of 10 available", which at first I thought were the threads in the dedicated thread pool that are free to do some work, but having looked at the source code I found out that these are actually the threads that are actively doing some work. I know it's a nit, but the message is somewhat misleading.
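For reference, a minimal sketch of the kind of min-worker-thread bump described above (the numbers are illustrative only; the same floor can also be set via runtime configuration rather than code):

```csharp
using System;
using System.Threading;

// Read the current minimums so only the worker floor is changed.
ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
Console.WriteLine($"Current minimums: worker={minWorker}, IOCP={minIocp}");

// Raise the worker floor so bursts of work don't wait for the pool's gradual
// thread injection above the minimum. 200 is an illustrative value, not a recommendation.
if (!ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: minIocp))
{
    Console.WriteLine("SetMinThreads was rejected; check the values against GetMaxThreads().");
}
```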
Good to hear you found goodies!
This is incorrect; the message is correct. It does, however, look like the comment in Pipelines.Sockets.Unofficial is backwards... (cc @mgravell): https://github.com/mgravell/Pipelines.Sockets.Unofficial/blob/0438f06057b0fc2e4edb9d7b4d2d6019e2933261/src/Pipelines.Sockets.Unofficial/DedicatedThreadPoolPipeScheduler.cs#L147
Yes, that's exactly where I found the info that these active threads in the message are the ones that are busy with work, not the ones that are free to do some work. But if that's incorrect, why does the library use .NET's thread pool?
These are for different things. The dedicated thread pool is specifically for handling socket IO and making sure that stays flowing, which often unblocks some bad sync-over-async code upstream in applications calling us (preventing extended deadlocks that would otherwise occur with that pattern). It takes the data off the socket and hands the task to the .NET thread pool for whatever completion work you want to do with it. It doesn't do all the work end to end; that would defeat the entire purpose of a dedicated thread pool for socket IO, because it'd be exhausted pretty much constantly.

You're assuming that because something is in an error message to help us diagnose, we're using it directly. That isn't a valid assumption. What's in the error message is information to help us best advise those hitting it on how they can improve the situation, and that includes some factors completely outside our control that do manifest in stalls/timeouts/etc. Thread pool exhaustion is the absolute #1 thing people hit in their applications... we're just the ones alerting them that it's happening and that's why they get latency spikes.
@NickCraver, thanks for clarifying this! 🙇‍♂️ We are monitoring our app logs and the Redis resource in Azure, and it seems to be working fine after the improvements we've made. Thank you for the suggestion!
Adding more to this thread to avoid duplication - I'm getting timeouts of this type:

(outbound=1KiB, inbound=0KiB, 30468ms elapsed, timeout is 30000ms), command=GET, next: INFO, inst: 0, qu: 0, qs: 109, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 131072, in-pipe: 0, out-pipe: 0, last-in: 1847, cur-in: 0, sync-ops: 223438, async-ops: 723026, serverEndpoint: 10.71.87.234:6379, conn-sec: 10458.31, aoc: 1, mc: 1/1/0, mgr: 9 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 8459, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

next: INFO, inst: 222, qu: 0, qs: 619, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 131072, in-pipe: 0, out-pipe: 0, last-in: 8191, cur-in: 0, sync-ops: 223435, async-ops: 723010, serverEndpoint: 10.71.86.100:6379, conn-sec: 10458.3, aoc: 1, mc: 1/1/0, mgr: 10 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

next: GET U_IPES.1:S_3017822, inst: 0, qu: 4, qs: 617, aw: True, bw: CheckingForTimeoutComplete, rs: ReadAsync, ws: Flushing, in: 131072, in-pipe: 0, out-pipe: 32, last-in: 256, cur-in: 1211, sync-ops: 223449, async-ops: 723073, serverEndpoint: 10.71.87.227:6379, conn-sec: 10458.57, aoc: 1, mc: 1/1/0, mgr: 8 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=4,Free=996,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

next: INFO, inst: 45, qu: 0, qs: 619, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 131072, in-pipe: 0, out-pipe: 0, last-in: 404, cur-in: 0, sync-ops: 223435, async-ops: 723010, serverEndpoint: 10.71.87.227:6379, conn-sec: 10458.3, aoc: 1, mc: 1/1/0, mgr: 10 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

next: INFO, inst: 0, qu: 1, qs: 624, aw: True, bw: CheckingForTimeoutComplete, rs: ReadAsync, ws: Flushing, in: 131072, in-pipe: 661, out-pipe: 59, last-in: 8191, cur-in: 0, sync-ops: 223445, async-ops: 723058, serverEndpoint: 10.71.86.100:6379, conn-sec: 10458.34, aoc: 1, mc: 1/1/0, mgr: 8 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=2,Free=998,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

They are very similar. The main issue I see is the high number of busy WORKER threads (Busy=1090). Any additional insights? Thanks!

PS: BTW, the sync operations don't necessarily have to be Redis-related, correct?
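In case it helps others correlate numbers like Busy=1090 with their own timeouts, here is a minimal sketch of logging the same thread-pool counters the error message reports (the last three properties require .NET Core 3.0 or later):

```csharp
using System;
using System.Threading;

// Snapshot the thread pool the same way the timeout message does:
// "busy" is the configured maximum minus the currently available threads.
ThreadPool.GetAvailableThreads(out int freeWorkers, out int freeIocp);
ThreadPool.GetMaxThreads(out int maxWorkers, out int maxIocp);

Console.WriteLine($"Busy workers:      {maxWorkers - freeWorkers} (IOCP busy: {maxIocp - freeIocp})");
Console.WriteLine($"Live pool threads: {ThreadPool.ThreadCount}");
Console.WriteLine($"Queued work items: {ThreadPool.PendingWorkItemCount}");
Console.WriteLine($"Completed items:   {ThreadPool.CompletedWorkItemCount}");
```

A sustained high busy count together with a growing queue points at thread pool exhaustion on the client rather than at the Redis server itself.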
Hello,
Could you please help diagnose the following timeout exception:
StackExchange.Redis.RedisTimeoutException: Timeout performing EXISTS (30000ms), next: INFO, inst: 61, qu: 0, qs: 294, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 95891, in-pipe: 0, out-pipe: 0, last-in: 58, cur-in: 0, sync-ops: 25616118, async-ops: 400996552, serverEndpoint: 10.71.88.247:6379, conn-sec: 128327.66, mc: 1/1/0, mgr: 10 of 10 available, clientName: IP-0A475317(SE.Redis-v2.6.90.64945), PerfCounterHelperkeyHashSlot: 16275, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=173,Free=32594,Min=16,Max=32767), POOL: (Threads=191,QueuedItems=5686,CompletedItems=2991371194), v: 2.6.90.64945
What caught my eye is in: 95891 - I assume a lot of data is pending in the queue, so maybe that has to do with limited resources on the machine running the client?
BTW - could this be somehow related to upgrading the OS on the client machines to Windows Server 2019 (it was 2012 before)?
If you could add any more insights that would be great.
We didn't see any increased load on the Redis servers that coincided with the exception time.
Thanks a lot!