Timeout Exception diagnosis #2486
The main problem you have here is thread pool exhaustion. If you take a memory dump under load, you can see where all those threads are stuck; it may be 1 or 2 hot spots worth making async ASAP.
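As an illustration of what "making a hot spot async" can look like (a hypothetical example, not code from this issue; the ProfileCache name and key format are made up), a synchronous Redis call pins a thread-pool thread for the whole round trip, while the async form releases it:

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

// Hypothetical hot spot: names and key format are illustrative only.
public class ProfileCache
{
    private readonly IDatabase _db;
    public ProfileCache(IDatabase db) => _db = db;

    // Synchronous call: the calling thread is blocked for the full Redis round trip,
    // which under load is exactly the kind of hot spot that exhausts the pool.
    public string GetProfileBlocking(string userId) =>
        _db.StringGet($"profile:{userId}");

    // Async version: the thread goes back to the pool while the reply is in flight.
    public async Task<string> GetProfileAsync(string userId) =>
        await _db.StringGetAsync($"profile:{userId}");
}
```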
Thanks a lot, Nick.
To prevent a potential duplicate issue, I'll reuse this one to ask for help figuring out a (potentially) similar problem. During the last 14 days we have observed an increased amount of timeout exceptions.
We are using Azure Cache for Redis and we haven't made any changes to this resource; we haven't made any changes to how we access the data in Redis; the only related change we made during this period was a version update. I realize that the issue is most probably the result of some internal change in the Azure Cache for Redis resource. @NickCraver, can you please point us to where to look for the issue?
@vladislav-karamfilov I'd check SLOWLOG on the server in your case; is it possible that hash is quite a large payload?
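For anyone following along, here is a minimal sketch of pulling the slow log from the client side with StackExchange.Redis (the connection string, allowAdmin flag, and entry count are illustrative only and not from this thread):

```csharp
using System;
using StackExchange.Redis;

// Illustrative connection only; adjust the endpoint/credentials to your environment.
var muxer = ConnectionMultiplexer.Connect("localhost:6379,allowAdmin=true");
var server = muxer.GetServer(muxer.GetEndPoints()[0]);

// Fetch the 10 most recent SLOWLOG entries recorded by the server.
foreach (var entry in server.SlowlogGet(10))
{
    Console.WriteLine(
        $"{entry.Time:u} took {entry.Duration.TotalMilliseconds:F1} ms: {string.Join(" ", entry.Arguments)}");
}
```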
@NickCraver, we have investigated with SLOWLOG and found that there were some commands with too many params, which we have split into multiple commands; we have also optimized the stored value sizes for all Redis collections (although the hashes where we store them are very small - values in them are constantly being added and then removed, so at any one moment there are very few records); and we have increased the minimum number of worker threads so they are created on demand without waiting, because we have seen some timeout logs with busy worker threads > min worker threads. We hope that this will fix the issue, because every other suggestion for improvement we found is already in place.

PS The exception message contains "mgr: 10 of 10 available", which at first I thought were the threads in the dedicated thread pool that are free to do some work, but having looked at the source code I found out that these are actually the threads that are actively doing some work. I know it's a nit, but the message is somewhat misleading.
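For reference, a minimal sketch of the kind of min-worker-thread bump described above (the numbers are illustrative only; the same floor can also be set via runtime configuration rather than code):

```csharp
using System;
using System.Threading;

// Read the current minimums so only the worker floor is changed.
ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
Console.WriteLine($"Current minimums: worker={minWorker}, IOCP={minIocp}");

// Raise the worker floor so bursts of work don't wait for the pool's gradual
// thread injection above the minimum. 200 is an illustrative value, not a recommendation.
if (!ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: minIocp))
{
    Console.WriteLine("SetMinThreads was rejected; check the values against GetMaxThreads().");
}
```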
Good to hear you found goodies!
This is incorrect; the message is correct. It does, however, look like the comment in Pipelines.Sockets.Unofficial is backwards... (cc @mgravell): https://github.com/mgravell/Pipelines.Sockets.Unofficial/blob/0438f06057b0fc2e4edb9d7b4d2d6019e2933261/src/Pipelines.Sockets.Unofficial/DedicatedThreadPoolPipeScheduler.cs#L147
Yes, that's exactly where I found the info that these active threads in the message are the ones that are busy with work, not the ones that are free to do some work. But if that's incorrect, why does the library use .NET's thread pool?
These are for different things. The dedicated thread pool is specifically for handling socket IO and making sure that stays flowing, which often unblocks some bad sync-over-async code upstream in applications calling us (preventing extended deadlocks that would otherwise occur with that pattern). It takes the data off the socket and hands the task to the .NET thread pool for whatever completion work you want to do with it. It doesn't do all the work end to end; that would defeat the entire purpose of a dedicated thread pool for socket IO, because it'd be exhausted pretty much constantly.

You're assuming that because something is in an error message to help us diagnose, we're using it directly. That isn't a valid assumption. What's in the error message is information to help us best advise those hitting it on how they can improve the situation, and that includes some factors completely outside our control that do manifest in stalls/timeouts/etc. Thread pool exhaustion is the absolute #1 thing people hit in their applications... we're just the ones alerting them that it's happening and that's why they get latency spikes.
@NickCraver, thanks for clarifying this! 🙇‍♂️ We are monitoring our app logs and the Redis resource in Azure, and it seems to be working fine after the improvements we've made. Thank you for the suggestion!
Adding more to this thread to avoid duplication - I'm getting timeouts of this type:

(outbound=1KiB, inbound=0KiB, 30468ms elapsed, timeout is 30000ms), command=GET, next: INFO, inst: 0, qu: 0, qs: 109, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 131072, in-pipe: 0, out-pipe: 0, last-in: 1847, cur-in: 0, sync-ops: 223438, async-ops: 723026, serverEndpoint: 10.71.87.234:6379, conn-sec: 10458.31, aoc: 1, mc: 1/1/0, mgr: 9 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 8459, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

next: INFO, inst: 222, qu: 0, qs: 619, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 131072, in-pipe: 0, out-pipe: 0, last-in: 8191, cur-in: 0, sync-ops: 223435, async-ops: 723010, serverEndpoint: 10.71.86.100:6379, conn-sec: 10458.3, aoc: 1, mc: 1/1/0, mgr: 10 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

next: GET U_IPES.1:S_3017822, inst: 0, qu: 4, qs: 617, aw: True, bw: CheckingForTimeoutComplete, rs: ReadAsync, ws: Flushing, in: 131072, in-pipe: 0, out-pipe: 32, last-in: 256, cur-in: 1211, sync-ops: 223449, async-ops: 723073, serverEndpoint: 10.71.87.227:6379, conn-sec: 10458.57, aoc: 1, mc: 1/1/0, mgr: 8 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=4,Free=996,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

next: INFO, inst: 45, qu: 0, qs: 619, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 131072, in-pipe: 0, out-pipe: 0, last-in: 404, cur-in: 0, sync-ops: 223435, async-ops: 723010, serverEndpoint: 10.71.87.227:6379, conn-sec: 10458.3, aoc: 1, mc: 1/1/0, mgr: 10 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

next: INFO, inst: 0, qu: 1, qs: 624, aw: True, bw: CheckingForTimeoutComplete, rs: ReadAsync, ws: Flushing, in: 131072, in-pipe: 661, out-pipe: 59, last-in: 8191, cur-in: 0, sync-ops: 223445, async-ops: 723058, serverEndpoint: 10.71.86.100:6379, conn-sec: 10458.34, aoc: 1, mc: 1/1/0, mgr: 8 of 10 available, clientName: IP-0A475518(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 260, IOCP: (Busy=2,Free=998,Min=8,Max=1000), WORKER: (Busy=1090,Free=31677,Min=8,Max=32767), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

They are very similar. The main issue I see is the high number of busy WORKER threads (Busy=1090). Any additional insights? Thanks!

PS: BTW, the sync operations don't necessarily have to be Redis-related, correct?
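In case it helps others correlate numbers like Busy=1090 with their own timeouts, here is a minimal sketch of logging the same thread-pool counters the error message reports (the last three properties require .NET Core 3.0 or later):

```csharp
using System;
using System.Threading;

// Snapshot the thread pool the same way the timeout message does:
// "busy" is the configured maximum minus the currently available threads.
ThreadPool.GetAvailableThreads(out int freeWorkers, out int freeIocp);
ThreadPool.GetMaxThreads(out int maxWorkers, out int maxIocp);

Console.WriteLine($"Busy workers:      {maxWorkers - freeWorkers} (IOCP busy: {maxIocp - freeIocp})");
Console.WriteLine($"Live pool threads: {ThreadPool.ThreadCount}");
Console.WriteLine($"Queued work items: {ThreadPool.PendingWorkItemCount}");
Console.WriteLine($"Completed items:   {ThreadPool.CompletedWorkItemCount}");
```

A sustained high busy count together with a growing queue points at thread pool exhaustion on the client rather than at the Redis server itself.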
Hello,
Could you please help diagnose the following timeout exception:
StackExchange.Redis.RedisTimeoutException: Timeout performing EXISTS (30000ms), next: INFO, inst: 61, qu: 0, qs: 294, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 95891, in-pipe: 0, out-pipe: 0, last-in: 58, cur-in: 0, sync-ops: 25616118, async-ops: 400996552, serverEndpoint: 10.71.88.247:6379, conn-sec: 128327.66, mc: 1/1/0, mgr: 10 of 10 available, clientName: IP-0A475317(SE.Redis-v2.6.90.64945), PerfCounterHelperkeyHashSlot: 16275, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=173,Free=32594,Min=16,Max=32767), POOL: (Threads=191,QueuedItems=5686,CompletedItems=2991371194), v: 2.6.90.64945
What caught my eye is in: 95891 - I assume a lot of data is pending in the queue, so maybe that has to do with limited resources on the machine running the client?
BTW - could this be somehow related to upgrading the OS on the client machines to Windows Server 2019 (it was 2012 before)?
If you could add any more insights that would be great.
We didn't see any increased load on the Redis servers that coincided with the exception time.
Thanks a lot!