Get & Set Failing After Error Cascade #1438
If the timeout errors would be helpful, I can provide them too. Redacting them is more time-consuming, and I wanted to get this up quickly. |
That's... very curious; the stack trace gives me a lot of context here - I
can see that somehow we're attempting concurrent IO from flushbacklog. Is this
repeatable? I'm trying to think whether there's anything I can do to directly
repro it.
|
FYI - we have a customer getting this problem on version 2.1.30.38891. I will see if we can get any details on how to reproduce this... |
We are also on v: 2.1.30.38891. Our repro is inconsistent at best, and only occurs after a soak under production load levels. Unfortunately, we can't go live until I have a clear path to resolution. It does seem likely that this is happening during .NET threadpool exhaustion. We are also seeing fairly regular time-outs (reproduced at the end) from symptomatic machines prior to and during the fatal cascade. I believe this is an orthogonal issue, but I mention it for completeness. These appear to be both caused by and contributing to the threadpool exhaustion, as indicated by the WORKER counts. Our Redis metrics and slowlog do not report issues. Timeout Error:
|
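The WORKER counts mentioned above are consistent with threadpool starvation. A commonly suggested mitigation for that symptom, separate from the concurrency bug itself, is to raise the threadpool minimums at application startup. A minimal sketch, assuming .NET Framework and using a placeholder floor value that would need tuning per host:

```csharp
using System;
using System.Threading;

static class ThreadPoolTuning
{
    // Hypothetical startup helper: raise the threadpool floor so bursts of
    // async completions are not stuck waiting for slow thread injection.
    // The floor value is a placeholder, not a recommendation from this thread.
    public static void RaiseMinimums(int floor = 200)
    {
        ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
        ThreadPool.SetMinThreads(Math.Max(minWorker, floor), Math.Max(minIocp, floor));
    }
}
```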
I may have a small additional lead. We're seeing another form of the same error: much rarer, but perhaps a bit more informative, since it shows what's happening to the task.
|
Small update: Once a machine is symptomatic, we see extremely rapid, repeated concurrent read or write errors on the same key; it looks like the metastable task is getting repeatedly flushed. The failure mode is always preceded by a series of time-outs for the keys that are then repeated endlessly in the concurrency errors. |
If this is a regression, it might be related to the recent update to 2.1.11 of Pipelines.Sockets.Unofficial: mgravell/Pipelines.Sockets.Unofficial@3f1a6d9. The new version of the package updates MutexSlim, which is used by the PhysicalBridge appearing in the stack trace above. This is just speculation, and I will look into it further when I get a chance. |
We've a rollback headed into test shortly. It is possible that the misbehavior is a fairness bug, actually. |
Well, damn. Yeah, that sounds like the MutexSlim changes aren't 100%. I'll
try to figure out how.
|
Thanks. Let me know if you need us to turn on any logging or add any instrumentation - we don't have a true minrepro, but we can reproduce it reliably. I've resisted speculating so far, but it does seem like something is letting a task get sliced in a way that causes it to be both failed and flushed, which then triggers the transparent retry handling. Because this only arises during a wave of simultaneous time-outs, it's likely there's contention in this circumstance that violates an assumed execution order: either a guarantee has failed, or the guarantee never actually existed. |
@mgravell: A comparable issue occurs with the rolled-back version. SE Redis client is 2.0.601. Error cascade of:
Looks like it could be thread theft rather than the same pathology, but it does mean that rollback isn't an option for us, unfortunately. Anything we can do to help? |
I've just seen this happen in our environment using SE.Redis. This began around the time a Redis server failed over to a replica, so there were connection interruptions. The connection has remained in a broken state and not recovered for 15+ minutes (the application was restarted to fix it). We got these kinds of messages:
|
This happened again for us today on 3 separate machines, shortly after a Redis-side socket disconnection occurred as above. Same stack traces. |
I've been trying to reproduce this and isolate the problem, but it's proving difficult. I tried injecting random faults in the connection and lock-management code to try to get into a broken state, but no luck. I do wonder, though, whether the following is a potential issue that could get the pipes flushing concurrently, even if I could not reproduce it:
However, the writer could still be trying to make progress (the timeout didn't cancel the write; it's proceeding in the background). At this point, another thread could acquire the lock and try another write, leading to concurrent writes to the pipe. |
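To make the interleaving described above concrete, here is a generic, hedged sketch of the hazard. This is not SE.Redis code; the stream, semaphore, and five-second timeout are illustrative stand-ins for the pipe, the bridge lock, and the write timeout.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class TimeoutVsCancellation
{
    static readonly SemaphoreSlim writeLock = new SemaphoreSlim(1, 1);

    // Hazard: a WhenAny-style timeout abandons the awaiting caller, but the
    // underlying write keeps running. If the lock is released on timeout, a
    // second writer can start while the first is still touching the stream.
    public static async Task WriteWithTimeoutAsync(Stream destination, byte[] payload)
    {
        await writeLock.WaitAsync();
        try
        {
            Task write = destination.WriteAsync(payload, 0, payload.Length);
            Task completed = await Task.WhenAny(write, Task.Delay(TimeSpan.FromSeconds(5)));
            if (completed != write)
            {
                // Timed out: 'write' has NOT been cancelled and may still be
                // running after we return to the caller.
                throw new TimeoutException("write timed out but is still in flight");
            }
            await write; // observe any exception from the completed write
        }
        finally
        {
            // Releasing here while the write may still be in flight is exactly
            // the kind of window that can lead to concurrent writes.
            writeLock.Release();
        }
    }
}
```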
We are also seeing this issue consistently right after we upgraded from 2.0.513 to 2.1.30. Will this issue be fixed soon? If not, we will consider rolling back to the old version. |
As a short-term fix, to continue with testing, we now tear down the multiplexer when we see the error. The error only arises every few hours, and the teardown is low cost. It's not acceptable for production operation, however, and I'd still like to see this fixed. The fact that this resolves the issue is interesting in and of itself. |
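For anyone needing the same stop-gap, here is a minimal sketch of the tear-down-and-recreate workaround described above. The wrapper class, the connection string, and the decision of when to call Reset() are assumptions for illustration, not part of the library or the poster's code.

```csharp
using System;
using StackExchange.Redis;

static class RedisConnection
{
    // One shared multiplexer, held behind a Lazy<> so it can be swapped out.
    private static Lazy<ConnectionMultiplexer> connection = Create();

    private static Lazy<ConnectionMultiplexer> Create() =>
        new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect("localhost:6379"));

    public static IDatabase Db => connection.Value.GetDatabase();

    // Call this when an operation surfaces the "Concurrent reads or writes are
    // not supported" cascade: swap in a fresh multiplexer, then dispose the old
    // one. Note: this sketch does not guard the swap against concurrent callers.
    public static void Reset()
    {
        var broken = connection;
        connection = Create();
        if (broken.IsValueCreated)
        {
            try { broken.Value.Dispose(); } catch { /* already faulted; ignore */ }
        }
    }
}
```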
Bumping this. Unfortunately, it's becoming a blocking issue for us to deploy to production. Are there next steps we can take that would assist in reproduction? We have some load-test code that induces the fault reliably after a prolonged soak, but I'm not sure how we'd grant you access to it since it hinges on our cache implementation. I'm open to suggestions. |
Are there any updates on this issue? We're also running into this issue under prod load with version 2.1.30. |
@mgravell besides my earlier thoughts in #1438 (comment) about a potential issue, another thought is that perhaps the locking code for PhysicalBridge is correct, but (due to a disconnect/reconnect cycle) there now happen to be two or more instances re-using the same IOPipe. Is it possible that, during connection issues and the reconnect logic, a new physical connection is re-created that is given the same I see code in |
Is downgrading Redis.dll the workaround for this? |
Hi @furlongmt, how often/reliably are you seeing this error, and can I confirm it is the "Failed to write" and "Concurrent reads or writes are not supported" error? Do you get it when there is a socket issue/disconnection? I've tried to reproduce this multiple times by editing the code base and injecting faults, but have not been able to get this exact issue. I'm still looking, as it hits our production environment during socket issues. If you can reliably reproduce the problem, perhaps I can make a StackExchange.Redis build you could use that has extra logging in place, so we get more information about what's happening during this error? |
Hi @Plasma, yes, it is the "Failed to write" and "Concurrent reads or writes are not supported" error. It seems to trigger periodically, but I'm not sure at this point whether it's due to a socket disconnection. I also have not been able to reproduce the problem locally; we only see it in our production environment from time to time. |
Could the latest release possibly fix this? I see it was just released recently: https://stackexchange.github.io/StackExchange.Redis/ReleaseNotes#2155 We are also having this issue during regression load test runs. We are going to run our tests against 2.1.55 and see if it is resolved. I should know by tomorrow. |
It isn't a specific target of that release. I'm hoping to look into
connection stability issues now that I've freed up my plate a little -
should be starting this week.
|
@bdlee420 If you are able to reliably reproduce the issue using your tests, would you be keen to install a client build for your test CI that includes some extra debug messages I can add? Printing some variables may help track down the cause of this problem. Thanks |
I received the results from the test run with 2.1.55 and they are mixed. We didn't get any concurrency errors, but instead received 28 timeouts in a 20-minute window: StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=3385KiB, inbound=11KiB, 5063ms elapsed, timeout is 5000ms), command=DEL, next: HGETALL somekey, inst: 0, qu: 0, qs: 7315, aw: True, rs: DequeueResult, ws: Writing, in: 65536, serverEndpoint: someserver, mc: 1/1/0, mgr: 0 of 10 available, clientName: someclientname, PerfCounterHelperkeyHashSlot: 9024, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=2552,Free=30215,Min=4,Max=32767), v: 2.1.55.31085 Now I have a couple of options:
I am sort of leaning towards #1 so I can make sure what I am seeing is consistent, it is just painful as these regression tests take several hours :). @Plasma, I am definitely OK with installing a preview edition with some additional logging and trying it out. |
I decided to go ahead and run the tests on 2.1.30 again. Some further notes: all tests have been run against an API that is running on .NET Framework 4.6.1. We have a .NET Core 2.2 version that we have just released, and we will now start testing against that version as well. I will post the results once I get them; it will probably take a couple of days. |
@bdlee420 I have put together a fork at https://github.com/StackExchange/StackExchange.Redis/compare/master...Plasma:v2.1.30-debug-logs?w=1 that is based off the v2.1.30 tag and includes a bit more info that may help us track down the issue the next time it re-occurs. The tree is at https://github.com/Plasma/StackExchange.Redis/tree/v2.1.30-debug-logs. If you are able to reproduce this in your CI environment, perhaps you can pull my branch above, do a Release build, and use that binary for CI; hopefully the issue will reproduce with a bit more detailed info. |
Can I get a pulse on this? |
Also suffering this error in a production environment, would appreciate an update! |
I am also seeing this in our production environment. @bdlee420 @jessieli-ad, you both seem to indicate that 2.0.519/2.0.513 may not suffer from the issue; have you rolled back your version? I am thinking that is the best option until this is resolved. |
@arsnyder16, we rolled back to 2.0.513, but 2.0.513 has a timeout problem. We are looking to upgrade when this issue is resolved. |
@jessieli-ad Did you try 2.0.588 by chance? Release notes seem to indicate a fix around timeouts |
We tried 2.0.588 and that seemed to have the following issue: I am playing version Whac-A-Mole |
@Plasma Any update from your end? We are working to stabilize our production environment first; then we might be able to help you with your forked branch, repro the issue, and get you the extra debug information. We could probably get to this quicker if you were able to publish this fork to NuGet for us to pull; it would be easier for us to tweak our deployment to pull that version than to manually inject it. |
Hey @arsnyder16, I'm only encountering these kinds of errors during the bi-monthly (?) moments when Azure introduces networking/connectivity/failover issues; it's rare, but when it happens it's regular. I hear you about the NuGet feed, I'll have a think. Until then, to help stabilize your prod environment if you are still having issues, can I suggest ensuring you have keep-alives and PINGs turned on (as part of the Redis connection options) so that idle sockets aren't disconnected (which can lead to these kinds of issues too, I think)? Typically our environment is stable; it's just during failover/networking issues that things can become interrupted. |
@Plasma Thanks for the advice; we do not have keep-alive configured. Do you have a recommendation for a good ping interval? Is there a reason this isn't the default behavior, to make sure these sockets don't become stale? |
You want to turn on keep-alive as part of ConfigurationOptions:
var options = ConfigurationOptions.Parse(...);
options.KeepAlive = 10; // time in seconds
We have ours set to 10 seconds. If you're running in Azure (or on other cloud providers, or anywhere configured this way), servers/infrastructure can quietly disconnect connections that are idle and aren't receiving traffic. The keep-alive above actually sends a Redis PING command down the socket to make it look busy and to health-check it. Idle disconnects can be a problem if you have periods of idleness in your app where Redis may not be used: the connection is idled/disconnected, then your app goes to use it and starts getting errors. I don't know if that's what you are experiencing, but my suggestion for best practice would be to have keep-alive enabled, and of course to check that you aren't overloading your Redis server/s in terms of Mbps or CPU. |
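Putting the snippet above into a fuller, hedged sketch; the endpoint and password are placeholders, and AbortOnConnectFail is an extra, commonly paired option rather than something prescribed in this thread.

```csharp
using StackExchange.Redis;

static class RedisSetup
{
    public static ConnectionMultiplexer Connect()
    {
        // Placeholder endpoint and credentials.
        var options = ConfigurationOptions.Parse("contoso.redis.cache.windows.net:6380,ssl=true,password=...");
        options.KeepAlive = 10;             // PING every 10 seconds so idle sockets aren't silently dropped
        options.AbortOnConnectFail = false; // keep retrying in the background instead of failing the first connect
        return ConnectionMultiplexer.Connect(options);
    }
}
```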
@Plasma Thank you, this is very helpful; we are going to set KeepAlive. We are using Azure Cache for Redis, so it seems likely this will help us out. |
@arsnyder16 out of interest did that happen to help your prod environment? |
@Plasma I'm new to this thread, but we experienced this a couple of weeks ago. I implemented the KeepAlive change you suggested above when I first stumbled upon this thread, but this morning the processes that use SE.Redis (run from Docker images in containers on a bare-metal Linux machine we have) went back to throwing these errors over and over. Everything was fixed after I executed docker-compose down/up on all the services using StackExchange.Redis against our instance. We do have quite a few processes writing to and reading from this Redis instance pretty much all the time. Looking at what you said above, is this not supported?
Btw we're presently on |
Hi @dgunderson, the keep-alive suggestion was specifically for arsnyder16, who was getting some other kind of error; the original error in this thread (the never-recovering "concurrent reads or writes not supported" state) is, as far as I know, still a potential issue. We are planning to update to the latest SE.Redis in the future, but we haven't had it re-occur yet. Restarting the process is my only known fix, too. |
@Plasma Sorry for the delay; it took us a little while to get this into production, but no, it did not help us - we are still seeing this. We are using 2.1.58 on Azure Cache for Redis, and based on the server metrics it appears there is very little load on the server at that time. We got roughly 425 of these within a 5-minute window just this morning around 7am ET:
We also saw this around the same time, though not as high a count (~15 times):
|
Looking closer at our logs, it looks like we missed something: some of our timeouts are logged as warnings, so I was missing them. Around that same time we are seeing:
|
Out of interest, if you load up the Azure portal metrics for your cache instance, set the metric to Errors, and then split the metric in the UI, does it say a failover happened around that time? In our experience the 'Concurrent reads or writes are not supported' error (which we have not been able to fix) happens as a side-effect of some connectivity issue. It's only happened twice, and sometimes a failover happens without issue; it seems to be a race condition that causes the problem, and I don't have a solution (I'm just a user of the library, not a maintainer). I did try to reproduce this and step through the code, but was unable to. Earlier comments in this issue suggested it was a downstream library issue that was rolled back in later versions of SE.Redis, but we haven't upgraded just yet. |
@Plasma I am not seeing a correlation. Over the past 30 days this has happened 19 times; there have been 6 Redis errors, all failovers. The closest failover to the 'Concurrent reads or writes' errors was 3 hours away; otherwise they were at least 12 hours apart. |
This is a client-side race condition caused by a regression after 2.0.513. Right now, the easiest resolution is to walk back to the earlier version, but that version has non-trivial issues in thread management, among other things, and has its own time-out problems as a result. While those don't cascade, they are pretty rough, so we're really at an impasse regarding deployability. The earliest incidence of the issue I'm aware of is in 2.0.601, and while that's a significant number of code changes, it's a finite number. If it would help, I'm willing to put up a bug bounty. --J |
I was able to put together a console app that reproduces the issue. I haven't begun debugging yet, but I thought I would pass it along in case someone more familiar with the code base would be interested in jumping in. It is not super consistent, and some of the settings might need adjusting per machine. I would recommend running under the Release configuration. It seems to occur around the time there start to be a lot of TimeoutExceptions. Attached is a .cs file; I am using v2.1.58 with an Azure Redis cache, Standard 2.5 GB. |
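The attached .cs file is not reproduced in this thread. Purely to illustrate the general shape of such a soak test, here is a hedged sketch; the endpoint, key count, payload size, and concurrency are placeholders, and this is not the attached repro.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using StackExchange.Redis;

class LoadRepro
{
    static async Task Main()
    {
        // Placeholder connection string; point at a test cache, not production.
        var muxer = await ConnectionMultiplexer.ConnectAsync("contoso.redis.cache.windows.net:6380,ssl=true,password=...");
        var db = muxer.GetDatabase();

        // Fire large batches of overlapping async GET/SET operations so that
        // timeouts start to appear once the threadpool and socket back up.
        while (true)
        {
            var tasks = Enumerable.Range(0, 5000).Select(async i =>
            {
                var key = $"soak:{i % 500}";
                try
                {
                    await db.StringSetAsync(key, new string('x', 4096));
                    await db.StringGetAsync(key);
                }
                catch (RedisTimeoutException) { /* expected under saturation */ }
                catch (RedisConnectionException) { /* count/log in a real harness */ }
            });
            await Task.WhenAll(tasks);
        }
    }
}
```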
@Plasma @sassywarsat @mgravell I think I got to the root of the issue; I have submitted PR #1585. |
Aloha!
The Problem
Under certain loads and after long up-times, we've begun seeing a curious error during addorset (reproduced at the end).
This is accompanied by a much higher number of roughly identical errors for Gets (reproduced at the end), though we do use Get far more than Set.
Additional
Once a client becomes symptomatic, it rapidly begins throwing these errors on most or all operations. Bouncing that client causes another client to become symptomatic in a similar way, which may be a red herring but was interesting enough to bear mention.
We currently use only one multiplexer, which seems like a possible cause, but before we refactor, I thought I'd check in. Our application is pretty much pure C#. We are seeing time-outs as well during symptomatic periods. My assumption is that StackExchange.Redis/ConnectionMultiplexer.cs#L2601 is the top-level call in the multiplexer for the error, given that line 2622 is part of the error-handling deferral.
Set Error:
Get Error: