Get & Set Failing After Error Cascade #1438
If the timeout errors would be helpful, I can provide them too. Redacting them is more time-consuming, and I wanted to get this up quickly. |
That's... very curious; the stack trace gives me a lot of context here - I
can see that somehow we're attempting concurrent IO from flushbacklog. Is this
repeatable? I'm trying to think whether there's anything I can do to directly
repro it.
|
FYI - we have a customer getting this problem on version 2.1.30.38891. I will see if we can get any details on how to reproduce this... |
We are also on v: 2.1.30.38891. Our repro is inconsistent at best, and only occurs after a soak under production load levels. Unfortunately, we can't go live until I have a clear path to resolution. It does seem likely that this is happening during .NET threadpool exhaustion. We are also seeing fairly regular time-outs (reproduced at the end) from symptomatic machines prior to and during the fatal cascade. I believe this is an orthogonal issue, but I mention it for completeness. These appear to be both caused by and contributing to the threadpool exhaustion, as indicated by the WORKER counts. Our Redis metrics and slowlog do not report issues. Timeout Error:
|
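The WORKER counts mentioned above are consistent with threadpool starvation. A commonly suggested mitigation for that symptom, separate from the concurrency bug itself, is to raise the threadpool minimums at application startup. A minimal sketch, assuming .NET Framework and using a placeholder floor value that would need tuning per host:

```csharp
using System;
using System.Threading;

static class ThreadPoolTuning
{
    // Hypothetical startup helper: raise the threadpool floor so bursts of
    // async completions are not stuck waiting for slow thread injection.
    // The floor value is a placeholder, not a recommendation from this thread.
    public static void RaiseMinimums(int floor = 200)
    {
        ThreadPool.GetMinThreads(out int minWorker, out int minIocp);
        ThreadPool.SetMinThreads(Math.Max(minWorker, floor), Math.Max(minIocp, floor));
    }
}
```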
I may have a small additional lead. We're seeing another form of the same error: much rarer, but perhaps a bit more informative, since it shows what's happening to the task.
|
Small update: Once a machine is symptomatic, we see extremely rapid, repeated concurrent read or write errors on the same key; it looks like the metastable task is getting repeatedly flushed. The failure mode is always preceded by a series of time-outs for the keys that are then repeated endlessly in the concurrency errors. |
If this is a regression, it might be related to the recent update to 2.1.11 of Pipelines.Sockets.Unofficial: mgravell/Pipelines.Sockets.Unofficial@3f1a6d9. The new version of the package updates MutexSlim, which is used by the PhysicalBridge appearing in the stack trace above. This is just speculation, and I will look into it further when I get a chance. |
We've a rollback headed into test shortly. It is possible that the misbehavior is a fairness bug, actually. |
Well, damn. Yeah, that sounds like the MutexSlim changes aren't 100%. I'll
try to figure out how.
|
Thanks. Let me know if you need us to turn on any logging or add any instrumentation - we don't have a true minrepro, but we can reproduce it reliably. I've resisted speculating so far, but it does seem like something is letting a task get sliced in a way that causes it to be both failed and flushed, which then triggers the transparent retry handling. Because this only arises during a wave of simultaneous time-outs, it's likely there's contention in this circumstance that violates an assumed execution order: either a guarantee has failed, or the guarantee never actually existed. |
@mgravell: A comparable issue occurs with the rolled-back version. SE Redis client is 2.0.601. Error cascade of:
Looks like it could be thread theft rather than the same pathology, but it does mean that rollback isn't an option for us, unfortunately. Anything we can do to help? |
I've just seen this happen in our environment using SE.Redis. This began around the time a Redis server failed over to a replica, so there were connection interruptions. The connection has remained in a broken state and not recovered for 15+ minutes (the application was restarted to fix it). We got these kinds of messages:
|
This happened again for us today on 3 separate machines, shortly after a Redis-side socket disconnection occurred as above. Same stack traces. |
I've been trying to reproduce this and isolate the problem, but it's proving difficult. I tried injecting random faults in the connection and lock-management code to try to get into a broken state, but no luck. I do wonder, though, whether the following is a potential issue that could get the pipes flushing concurrently, even if I could not reproduce it:
However, the writer could still be trying to make progress (the timeout didn't cancel the write; it's proceeding in the background). At this point, another thread could acquire the lock and try another write, leading to concurrent writes to the pipe. |
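To make the interleaving described above concrete, here is a generic, hedged sketch of the hazard. This is not SE.Redis code; the stream, semaphore, and five-second timeout are illustrative stand-ins for the pipe, the bridge lock, and the write timeout.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class TimeoutVsCancellation
{
    static readonly SemaphoreSlim writeLock = new SemaphoreSlim(1, 1);

    // Hazard: a WhenAny-style timeout abandons the awaiting caller, but the
    // underlying write keeps running. If the lock is released on timeout, a
    // second writer can start while the first is still touching the stream.
    public static async Task WriteWithTimeoutAsync(Stream destination, byte[] payload)
    {
        await writeLock.WaitAsync();
        try
        {
            Task write = destination.WriteAsync(payload, 0, payload.Length);
            Task completed = await Task.WhenAny(write, Task.Delay(TimeSpan.FromSeconds(5)));
            if (completed != write)
            {
                // Timed out: 'write' has NOT been cancelled and may still be
                // running after we return to the caller.
                throw new TimeoutException("write timed out but is still in flight");
            }
            await write; // observe any exception from the completed write
        }
        finally
        {
            // Releasing here while the write may still be in flight is exactly
            // the kind of window that can lead to concurrent writes.
            writeLock.Release();
        }
    }
}
```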
We are also seeing this issue consistently right after we upgraded from 2.0.513 to 2.1.30. Will this issue be fixed soon? If not, we will consider rolling back to the old version. |
As a short-term fix, to continue with testing, we now tear down the multiplexer when we see the error. The error only arises every few hours, and the teardown is low cost. It's not acceptable for production operation, however, and I'd still like to see this fixed. The fact that this resolves the issue is interesting in and of itself. |
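For anyone needing the same stop-gap, here is a minimal sketch of the tear-down-and-recreate workaround described above. The wrapper class, the connection string, and the decision of when to call Reset() are assumptions for illustration, not part of the library or the poster's code.

```csharp
using System;
using StackExchange.Redis;

static class RedisConnection
{
    // One shared multiplexer, held behind a Lazy<> so it can be swapped out.
    private static Lazy<ConnectionMultiplexer> connection = Create();

    private static Lazy<ConnectionMultiplexer> Create() =>
        new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect("localhost:6379"));

    public static IDatabase Db => connection.Value.GetDatabase();

    // Call this when an operation surfaces the "Concurrent reads or writes are
    // not supported" cascade: swap in a fresh multiplexer, then dispose the old
    // one. Note: this sketch does not guard the swap against concurrent callers.
    public static void Reset()
    {
        var broken = connection;
        connection = Create();
        if (broken.IsValueCreated)
        {
            try { broken.Value.Dispose(); } catch { /* already faulted; ignore */ }
        }
    }
}
```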
Bumping this. Unfortunately, it's becoming a blocking issue for us to deploy to production. Are there next steps we can take that would assist in reproduction? We have some load-test code that induces the fault reliably after a prolonged soak, but I'm not sure how we'd grant you access to it since it hinges on our cache implementation. I'm open to suggestions. |
Are there any updates on this issue? We're also running into this issue under prod load with version 2.1.30. |
@mgravell besides my earlier thoughts in #1438 (comment) about a potential issue, another thought is that perhaps the locking code for PhysicalBridge is correct, but (due to a disconnect/reconnect cycle) there now happen to be two or more instances re-using the same IOPipe. Is it possible that, during connection issues and the reconnect logic, a new physical connection is re-created that is given the same I see code in |
Is downgrading Redis.dll the workaround for this? |
Hi @furlongmt, how often/reliably are you seeing this error, and can I confirm it is the "Failed to write" and "Concurrent reads or writes are not supported" error? Do you get it when there is a socket issue/disconnection? I've tried to reproduce this multiple times by editing the code base and injecting faults, but have not been able to get this exact issue. I'm still looking, as it hits our production environment during socket issues. If you can reliably reproduce the problem, perhaps I can make a StackExchange.Redis build you could use that has extra logging in place, so we get more information about what's happening during this error? |
Hi @Plasma, yes, it is the "Failed to write" and "Concurrent reads or writes are not supported" error. It seems to trigger periodically, but I'm not sure at this point whether it's due to a socket disconnection. I also have not been able to reproduce the problem locally; we only see it in our production environment from time to time. |
Could the latest release possibly fix this? I see it was just released recently: https://stackexchange.github.io/StackExchange.Redis/ReleaseNotes#2155 We are also having this issue during regression load test runs. We are going to run our tests against 2.1.55 and see if it is resolved. I should know by tomorrow. |
It isn't a specific target of that release. I'm hoping to look into
connection stability issues now that I've freed up my plate a little -
should be starting this week.
|
@bdlee420 If you are able to reliably reproduce the issue using your tests, would you be keen to install a client build for your test CI that includes some extra debug messages I can add? Printing some variables may help track down the cause of this problem. Thanks |
I received the results from the test run with 2.1.55 and they are mixed. We didn't get any concurrency errors, but instead received 28 timeouts in a 20-minute window: StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=3385KiB, inbound=11KiB, 5063ms elapsed, timeout is 5000ms), command=DEL, next: HGETALL somekey, inst: 0, qu: 0, qs: 7315, aw: True, rs: DequeueResult, ws: Writing, in: 65536, serverEndpoint: someserver, mc: 1/1/0, mgr: 0 of 10 available, clientName: someclientname, PerfCounterHelperkeyHashSlot: 9024, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=2552,Free=30215,Min=4,Max=32767), v: 2.1.55.31085 Now I have a couple of options:
I am sort of leaning towards #1 so I can make sure what I am seeing is consistent, it is just painful as these regression tests take several hours :). @Plasma, I am definitely OK with installing a preview edition with some additional logging and trying it out. |
I decided to go ahead and run the tests on 2.1.30 again. Some further notes: all tests have been run against an API that is running on .NET Framework 4.6.1. We have a .NET Core 2.2 version that we have just released, and we will now start testing against that version as well. I will post the results once I get them; it will probably take a couple of days. |
@bdlee420 I have put together a fork at https://github.com/StackExchange/StackExchange.Redis/compare/master...Plasma:v2.1.30-debug-logs?w=1 that is based off the v2.1.30 tag and includes a bit more info that may help us track down the issue the next time it re-occurs. The tree is at https://github.com/Plasma/StackExchange.Redis/tree/v2.1.30-debug-logs. If you are able to reproduce this in your CI environment, perhaps you can pull my branch above, do a Release build, and use that binary for CI; hopefully the issue will reproduce with a bit more detailed info. |
Can I get a pulse on this? |
Also suffering this error in a production environment, would appreciate an update! |
I am also seeing this in our production environment. @bdlee420 @jessieli-ad, you both seem to indicate that 2.0.519/2.0.513 may not suffer from the issue; have you rolled back your version? I am thinking that is the best option until this is resolved. |
@arsnyder16, we rolled back to 2.0.513, but 2.0.513 has a timeout problem. We are looking to upgrade when this issue is resolved. |
@jessieli-ad Did you try 2.0.588 by chance? Release notes seem to indicate a fix around timeouts |
We tried 2.0.588 and that seemed to have the following issue: I am playing version Whac-A-Mole |
@Plasma Any update from your end? We are working to stabilize our production environment first; then we might be able to help you with your forked branch, repro the issue, and get you the extra debug information. We could probably get to this quicker if you were able to publish this fork to NuGet for us to pull; it would be easier for us to tweak our deployment to pull that version than to manually inject it. |
Hey @arsnyder16, I'm only encountering these kinds of errors during the bi-monthly (?) moments when Azure introduces networking/connectivity/failover issues; it's rare, but when it happens it's regular. I hear you about the NuGet feed, I'll have a think. Until then, to help stabilize your prod environment if you are still having issues, can I suggest ensuring you have keep-alives and PINGs turned on (as part of the Redis connection options) so that idle sockets aren't disconnected (which can lead to these kinds of issues too, I think)? Typically our environment is stable; it's just during failover/networking issues that things can become interrupted. |
@Plasma Thanks for the advice; we do not have keep-alive configured. Do you have a recommendation for a good ping interval? Is there a reason this isn't the default behavior, to make sure these sockets don't become stale? |
You want to turn on keep-alive as part of ConfigurationOptions:
var options = ConfigurationOptions.Parse(...);
options.KeepAlive = 10; // time in seconds
We have ours set to 10 seconds. If you're running in Azure (or on other cloud providers, or anywhere configured this way), servers/infrastructure can quietly disconnect connections that are idle and aren't receiving traffic. The keep-alive above actually sends a Redis PING command down the socket to make it look busy and to health-check it. Idle disconnects can be a problem if you have periods of idleness in your app where Redis may not be used: the connection is idled/disconnected, then your app goes to use it and starts getting errors. I don't know if that's what you are experiencing, but my suggestion for best practice would be to have keep-alive enabled, and of course to check that you aren't overloading your Redis server/s in terms of Mbps or CPU. |
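Putting the snippet above into a fuller, hedged sketch; the endpoint and password are placeholders, and AbortOnConnectFail is an extra, commonly paired option rather than something prescribed in this thread.

```csharp
using StackExchange.Redis;

static class RedisSetup
{
    public static ConnectionMultiplexer Connect()
    {
        // Placeholder endpoint and credentials.
        var options = ConfigurationOptions.Parse("contoso.redis.cache.windows.net:6380,ssl=true,password=...");
        options.KeepAlive = 10;             // PING every 10 seconds so idle sockets aren't silently dropped
        options.AbortOnConnectFail = false; // keep retrying in the background instead of failing the first connect
        return ConnectionMultiplexer.Connect(options);
    }
}
```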
@Plasma Thank you, this is very helpful; we are going to set KeepAlive. We are using Azure Cache for Redis, so it seems likely this will help us out. |
@arsnyder16 out of interest did that happen to help your prod environment? |
@Plasma I'm new to this thread, but we experienced this a couple of weeks ago. I implemented the KeepAlive change you suggested above when I first stumbled upon this thread, but this morning the processes that use SE.Redis (run from Docker images in containers on a bare-metal Linux machine we have) went back to throwing these errors over and over. Everything was fixed after I executed docker-compose down/up on all the services using StackExchange.Redis against our instance. We do have quite a few processes writing to and reading from this Redis instance pretty much all the time. Looking at what you said above, is this not supported?
Btw we're presently on |
Hi @dgunderson, the keep-alive suggestion was specifically for arsnyder16, who was getting some other kind of error; the original error in this thread (the never-recovering "concurrent reads or writes not supported" state) is, as far as I know, still a potential issue. We are planning to update to the latest SE.Redis in the future, but we haven't had it re-occur yet. Restarting the process is my only known fix, too. |
@Plasma Sorry for the delay; it took us a little while to get this into production, but no, it did not help us - we are still seeing this. We are using 2.1.58 on Azure Cache for Redis, and based on the server metrics it appears there is very little load on the server at that time. We got roughly 425 of these within a 5-minute window just this morning around 7am ET:
We also saw this around the same time, though not as high a count (~15 times):
|
Looking closer at our logs, it looks like we missed something: some of our timeouts are logged as warnings, so I was missing them. Around that same time we are seeing:
|
Out of interest, if you load up the Azure portal metrics for your cache instance, set the metric to Errors, and then split the metric in the UI, does it say a failover happened around that time? In our experience the 'Concurrent reads or writes are not supported' error (which we have not been able to fix) happens as a side-effect of some connectivity issue. It's only happened twice, and sometimes a failover happens without issue; it seems to be a race condition that causes the problem, and I don't have a solution (I'm just a user of the library, not a maintainer). I did try to reproduce this and step through the code, but was unable to. Earlier comments in this issue suggested it was a downstream library issue that was rolled back in later versions of SE.Redis, but we haven't upgraded just yet. |
@Plasma I am not seeing a correlation. Over the past 30 days this has happened 19 times; there have been 6 Redis errors, all failovers. The closest failover to the 'Concurrent reads or writes' errors was 3 hours away; otherwise they were at least 12 hours apart. |
This is a client-side race condition caused by a regression after 2.0.513. Right now, the easiest resolution is to walk back to the earlier version, but that version has non-trivial issues in thread management, among other things, and has its own time-out problems as a result. While those don't cascade, they are pretty rough, so we're really at an impasse regarding deployability. The earliest incidence of the issue I'm aware of is in 2.0.601, and while that's a significant number of code changes, it's a finite number. If it would help, I'm willing to put up a bug bounty. --J |
I was able to put together a console app that reproduces the issue. I haven't begun debugging yet, but I thought I would pass it along in case someone more familiar with the code base would be interested in jumping in. It is not super consistent, and some of the settings might need adjusting per machine. I would recommend running under the Release configuration. It seems to occur around the time there start to be a lot of TimeoutExceptions. Attached is a .cs file; I am using v2.1.58 with an Azure Redis cache, Standard 2.5 GB. |
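The attached .cs file is not reproduced in this thread. Purely to illustrate the general shape of such a soak test, here is a hedged sketch; the endpoint, key count, payload size, and concurrency are placeholders, and this is not the attached repro.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using StackExchange.Redis;

class LoadRepro
{
    static async Task Main()
    {
        // Placeholder connection string; point at a test cache, not production.
        var muxer = await ConnectionMultiplexer.ConnectAsync("contoso.redis.cache.windows.net:6380,ssl=true,password=...");
        var db = muxer.GetDatabase();

        // Fire large batches of overlapping async GET/SET operations so that
        // timeouts start to appear once the threadpool and socket back up.
        while (true)
        {
            var tasks = Enumerable.Range(0, 5000).Select(async i =>
            {
                var key = $"soak:{i % 500}";
                try
                {
                    await db.StringSetAsync(key, new string('x', 4096));
                    await db.StringGetAsync(key);
                }
                catch (RedisTimeoutException) { /* expected under saturation */ }
                catch (RedisConnectionException) { /* count/log in a real harness */ }
            });
            await Task.WhenAll(tasks);
        }
    }
}
```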
@Plasma @sassywarsat @mgravell I think I got to the root of the issue; I have submitted PR #1585. |
Aloha!
The Problem
Under certain loads and after long up-times, we've begun seeing a curious error during addorset (reproduced at the end).
This is accompanied by a much higher number of roughly identical errors for Gets (reproduced at the end), though we do use Get far more than Set.
Additional
Once a client becomes symptomatic, it rapidly begins throwing these errors on most or all operations. Bouncing that client causes another client to become symptomatic in a similar way, which may be a red herring but was interesting enough to bear mention.
We currently use only one multiplexer, which seems like a possible cause, but before we refactor, I thought I'd check in. Our application is pretty much pure C#. We are seeing time-outs as well during symptomatic periods. My assumption is that StackExchange.Redis/ConnectionMultiplexer.cs#L2601 is the top-level call in the multiplexer for the error, given that line 2622 is part of the error-handling deferral.
Set Error:
Get Error: