Get & Set Failing After Error Cascade #1438

Closed · JKurzer opened this issue Apr 18, 2020 · 50 comments · Fixed by #1585

JKurzer commented Apr 18, 2020

Aloha!

The Problem

Under certain loads and after long up-times, we've begun seeing a curious error during AddOrSet (reproduced at the end).
This is accompanied by a much higher number of roughly identical errors for Gets (also reproduced at the end), though we do use get far more than set.

Additional

Once a client becomes symptomatic, it rapidly begins throwing these errors on most or all operations. Bouncing that client causes another client to become symptomatic in a similar way, which may be a red herring but was interesting enough to bear mention.

We currently use only one multiplexer, which seems like a possible cause, but before we refactor, I thought I'd check in. Our application is pretty much pure C#. We are seeing time-outs as well, during symptomatic periods. My assumption is that StackExchange.Redis/ConnectionMultiplexer.cs#L2601 is the top level call in the multiplexer for the error, given that 2622 is part of error handling deferral.

Set Error:

REDACTED REDIS AddOrSet Failure on key REDACTED
    StackExchange.Redis.RedisConnectionException: Failed to write ---> System.InvalidOperationException: Concurrent reads or writes are not supported.
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
       at System.IO.Pipelines.Pipe.GetFlushResult(FlushResult& result)
       at System.IO.Pipelines.Pipe.PrepareFlush(CompletionData& completionData, ValueTask`1& result, CancellationToken cancellationToken)
       at System.IO.Pipelines.Pipe.FlushAsync(CancellationToken cancellationToken)
       at System.IO.Pipelines.Pipe.DefaultPipeWriter.FlushAsync(CancellationToken cancellationToken)
       at Pipelines.Sockets.Unofficial.SocketConnection.WrappedWriter.FlushAsync(CancellationToken cancellationToken) in /_/src/Pipelines.Sockets.Unofficial/SocketConnection.cs:line 434
       at StackExchange.Redis.PhysicalConnection.FlushAsync(Boolean throwOnFailure) in /_/src/StackExchange.Redis/PhysicalConnection.cs:line 907
       at StackExchange.Redis.PhysicalBridge.ProcessBacklog() in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 889
       --- End of inner exception stack trace ---
       at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 2622
       at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in /_/src/StackExchange.Redis/RedisBase.cs:line 54

Get Error:

REDACTED REDIS GetAsync Failure on key REDACTED
    StackExchange.Redis.RedisConnectionException: Failed to write ---> System.InvalidOperationException: Concurrent reads or writes are not supported.
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
       at System.IO.Pipelines.Pipe.GetFlushResult(FlushResult& result)
       at System.IO.Pipelines.Pipe.PrepareFlush(CompletionData& completionData, ValueTask`1& result, CancellationToken cancellationToken)
       at System.IO.Pipelines.Pipe.FlushAsync(CancellationToken cancellationToken)
       at System.IO.Pipelines.Pipe.DefaultPipeWriter.FlushAsync(CancellationToken cancellationToken)
       at Pipelines.Sockets.Unofficial.SocketConnection.WrappedWriter.FlushAsync(CancellationToken cancellationToken) in /_/src/Pipelines.Sockets.Unofficial/SocketConnection.cs:line 434
       at StackExchange.Redis.PhysicalConnection.FlushAsync(Boolean throwOnFailure) in /_/src/StackExchange.Redis/PhysicalConnection.cs:line 907
       at StackExchange.Redis.PhysicalBridge.WriteMessageTakingWriteLockAsync(PhysicalConnection physical, Message message) in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 987
       --- End of inner exception stack trace ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

JKurzer commented Apr 18, 2020

If the timeout errors would be helpful, I can provide them too. Redacting them is more time-consuming, and I wanted to get this up.

mgravell commented Apr 18, 2020 via email

JonCole commented Apr 20, 2020

FYI - we have a customer getting this problem on version 2.1.30.38891. I will see if we can get any details on how to reproduce this...

JKurzer commented Apr 20, 2020

We are also on v: 2.1.30.38891. Our repro is inconsistent at best, and only occurs after a soak under production load levels. Unfortunately, we can't go live until I have a clean line to resolve. It does seem likely that this is happening during .NET thread-pool exhaustion.

We are also seeing fairly regular time-outs (reproduced at the end) from symptomatic machines prior to and during the fatal cascade. I believe this is an orthogonal issue, but I mention it for completeness. These appear to be both caused by and contributing to the thread-pool exhaustion, as indicated by the WORKER numbers. Our Redis metrics and slowlog do not report issues.

Timeout Error:

REDIS GetAsync Failure on key REDACTED
    StackExchange.Redis.RedisTimeoutException: The timeout was reached before the message could be written to the output buffer, and it was not sent, command=GET,
timeout: 5000, outbound: 9293KiB, inbound: 0KiB, next: SETEX REDACTED, inst: 20, qu: 192, qs: 14, aw: True, bw: RecordingTimeout, rs: ReadAsync, ws: Flushed, in: 0, in-pipe: 0, out-pipe: 243792,
serverEndpoint: 10.127.11.35:6001, mc: 1/1/0, mgr: 9 of 10 available, clientName: REDACTED, PerfCounterHelperkeyHashSlot: 11679, IOCP: (Busy=1,Free=999,Min=24,Max=1000), WORKER: (Busy=56,Free=32711,Min=24,Max=32767),
v: 2.1.30.38891 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at REDACTED.Engine.Common.Caching.RedisCache.GetAsync()
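
For reference, the WORKER line above (Busy=56 against Min=24) is what points at pool exhaustion; the Timeouts article linked in the error message suggests raising the minimum thread counts at startup as one mitigation. A minimal sketch, with purely illustrative numbers:

    using System;
    using System.Threading;

    static class ThreadPoolWarmup
    {
        // Call once at startup. The 200s are illustrative; compare against the Busy/Min
        // values reported in the RedisTimeoutException text before picking real numbers.
        public static void Apply()
        {
            ThreadPool.GetMinThreads(out int worker, out int iocp);
            ThreadPool.SetMinThreads(Math.Max(worker, 200), Math.Max(iocp, 200));
        }
    }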

JKurzer commented Apr 21, 2020

I may have a small additional lead. We're seeing another form of the same error, much rarer but perhaps more informative, since it shows what's happening to the task.

Unobserved Task Exception caught, sender is System.Threading.Tasks.Task`1[StackExchange.Redis.WriteResult]:
    System.AggregateException: A Task's exception(s) were not observed either by Waiting on the Task or accessing its Exception property. As a result, the unobserved exception was rethrown by the finalizer thread. ---> System.InvalidOperationException: Concurrent reads or writes are not supported.
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
       at System.IO.Pipelines.Pipe.GetFlushResult(FlushResult& result)
       at System.IO.Pipelines.Pipe.PrepareFlush(CompletionData& completionData, ValueTask`1& result, CancellationToken cancellationToken)
       at System.IO.Pipelines.Pipe.FlushAsync(CancellationToken cancellationToken)
       at System.IO.Pipelines.Pipe.DefaultPipeWriter.FlushAsync(CancellationToken cancellationToken)
       at Pipelines.Sockets.Unofficial.SocketConnection.WrappedWriter.FlushAsync(CancellationToken cancellationToken) in /_/src/Pipelines.Sockets.Unofficial/SocketConnection.cs:line 434
       at StackExchange.Redis.PhysicalConnection.FlushAsync(Boolean throwOnFailure) in /_/src/StackExchange.Redis/PhysicalConnection.cs:line 907
       at StackExchange.Redis.PhysicalBridge.ProcessBacklog() in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 889
       --- End of inner exception stack trace ---
    ---> (Inner Exception #0) System.InvalidOperationException: Concurrent reads or writes are not supported.
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
       at System.IO.Pipelines.Pipe.GetFlushResult(FlushResult& result)
       at System.IO.Pipelines.Pipe.PrepareFlush(CompletionData& completionData, ValueTask`1& result, CancellationToken cancellationToken)
       at System.IO.Pipelines.Pipe.FlushAsync(CancellationToken cancellationToken)
       at System.IO.Pipelines.Pipe.DefaultPipeWriter.FlushAsync(CancellationToken cancellationToken)
       at Pipelines.Sockets.Unofficial.SocketConnection.WrappedWriter.FlushAsync(CancellationToken cancellationToken) in /_/src/Pipelines.Sockets.Unofficial/SocketConnection.cs:line 434
       at StackExchange.Redis.PhysicalConnection.FlushAsync(Boolean throwOnFailure) in /_/src/StackExchange.Redis/PhysicalConnection.cs:line 907
       at StackExchange.Redis.PhysicalBridge.ProcessBacklog() in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 889

JKurzer commented Apr 22, 2020

Small update:

Once a machine is symptomatic, we see extremely rapid repeated concurrent read or write errors on the same key, as it looks like the metastable task is getting repeatedly flushed.

A failure mode is always preceded by a series of time-outs for the keys that are then repeated endlessly in the concurrency errors.

sspeaks610 commented Apr 24, 2020

If this is a regression, it might be related to the recent update to 2.1.11 for Pipelines.Sockets.Unofficial.

mgravell/Pipelines.Sockets.Unofficial@3f1a6d9

The new version of the package updates MutexSlim which is used by PhysicalBridge appearing in the stack trace above.

This is just speculation, and I will look into it further when I get a chance.

JKurzer commented Apr 24, 2020

We've a rollback headed into test shortly. It is possible that the misbehavior is a fairness bug, actually.

mgravell commented Apr 24, 2020 via email

JKurzer commented Apr 24, 2020

Thanks. Let me know if you need us to turn on any logging or add any instrumentation - we don't have a true minrepro but we can reproduce it reliably.

I've resisted speculating so far, but it does seem like something is letting a task get sliced in a way that causes it to be both failed and flushed, which then triggers the transparent retry handling.

Because this only arises during a wave of simultaneous time-outs, it's likely there's some contention in this circumstance that's causing a fatal violation of assumed execution order where either a guarantee has failed or a guarantee did not actually exist.

JKurzer commented Apr 30, 2020

@mgravell : Comparable issue occurs with rolled back version.

SE.Redis client is 2.0.601
Pipelines.Sockets.Unofficial is 2.0.22
.NET Framework ~4.7.2

Error cascade of:

REDACTED REDIS GetAsync Failure on key REDACTED
    StackExchange.Redis.RedisConnectionException: No connection is available to service this operation: GET REDACTED; IOCP: (Busy=0,Free=1000,Min=256,Max=1000), WORKER: (Busy=6,Free=32761,Min=256,Max=32767), Local-CPU: 0%
       at StackExchange.Redis.ConnectionMultiplexer.ThrowFailed[T] in C:\projects\stackexchange-redis\src\StackExchange.Redis\ConnectionMultiplexer.cs:line 2201
    --- End of stack trace from previous location where exception was thrown ---
       at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
       at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
       at REDACTED.Engine.Common.Caching.RedisCache.GetAsync() in d:\REDACTED\next\REDACTED.Engine.Common\Caching\RedisCache.cs:line 708

Looks like it could be thread theft rather than the same pathology, but it does mean that rollback isn't an option for us, unfortunately. Anything we can do to help?

Plasma commented May 11, 2020

I've just seen this happen in our environment using SE.Redis 2.1.0.

This began around the time a Redis server failed over to a replica and so there were connection interruptions. This connection has remained in a broken state and not recovered for 15+ minutes (application was restarted to fix).

We got these kinds of messages:

Timeout awaiting response (outbound=228KiB, inbound=9393KiB, 10875ms elapsed, timeout is 10000ms), command=EVAL, next: ZADD XXXX, inst: 0, qu: 1568, qs: 77358, aw: True, bw: Flushing, rs: ReadAsync, ws: Flushing, in: 0, serverEndpoint: XXXX, mgr: 9 of 10 available, clientName: WorkerRole_IN_23, IOCP: (Busy=0,Free=1000,Min=10,Max=1000), WORKER: (Busy=3,Free=32764,Min=10,Max=32767), v: 2.1.0.1 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Message: Concurrent reads or writes are not supported.
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw():12
System.IO.Pipelines.PipeCompletion.ThrowLatchedException():11
System.IO.Pipelines.Pipe.GetFlushResult(FlushResult& result):24
System.IO.Pipelines.Pipe.PrepareFlush(CompletionData& completionData, ValueTask`1& result, CancellationToken cancellationToken):54
System.IO.Pipelines.Pipe.FlushAsync(CancellationToken cancellationToken):17
System.IO.Pipelines.Pipe+DefaultPipeWriter.FlushAsync(CancellationToken cancellationToken)
 
StackExchange.Redis.PhysicalConnection.FlushAsync(Boolean throwOnFailure) in C:\projects\stackexchange-redis\src\StackExchange.Redis\PhysicalConnection.cs:907
StackExchange.Redis.PhysicalConnection.FlushSync(Boolean throwOnFailure, Int32 millisecondsTimeout) in C:\projects\stackexchange-redis\src\StackExchange.Redis\PhysicalConnection.cs:879
StackExchange.Redis.PhysicalBridge.WriteMessageTakingWriteLockSync(PhysicalConnection physical, Message message) in C:\projects\stackexchange-redis\src\StackExchange.Redis\PhysicalBridge.cs:736

Message: Concurrent reads or writes are not supported.
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw():12
System.IO.Pipelines.PipeCompletion.ThrowLatchedException():11
System.IO.Pipelines.Pipe.GetFlushResult(FlushResult& result):24
System.IO.Pipelines.Pipe.PrepareFlush(CompletionData& completionData, ValueTask`1& result, CancellationToken cancellationToken):54
System.IO.Pipelines.Pipe.FlushAsync(CancellationToken cancellationToken):17
System.IO.Pipelines.Pipe+DefaultPipeWriter.FlushAsync(CancellationToken cancellationToken)
 
StackExchange.Redis.PhysicalConnection.FlushAsync(Boolean throwOnFailure) in C:\projects\stackexchange-redis\src\StackExchange.Redis\PhysicalConnection.cs:907
StackExchange.Redis.PhysicalConnection.FlushSync(Boolean throwOnFailure, Int32 millisecondsTimeout) in C:\projects\stackexchange-redis\src\StackExchange.Redis\PhysicalConnection.cs:879
StackExchange.Redis.PhysicalBridge.ProcessBacklog() in C:\projects\stackexchange-redis\src\StackExchange.Redis\PhysicalBridge.cs:888
Message: SocketClosed on XXX/Interactive, Flushing/MarkProcessed, last: EVAL, origin: ReadFromPipe, outstanding: 158705, last-read: 0s ago, last-write: 39s ago, keep-alive: 10s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, last-heartbeat: 1s ago, last-mbeat: 1s ago, global: 0s ago, v: 2.1.0.1

Plasma commented May 13, 2020

This happened again for us today, 3 separate machines, shortly after a redis-side socket disconnection occurred as above. Same stack traces.

Plasma commented May 18, 2020

I've been trying to reproduce this and isolate the problem, but it's proving difficult.

I tried injecting random faults in the connection and lock management code to try and get into a broken state, but no luck.

I do wonder, though, whether the following is a potential issue that could get the pipes flushing concurrently, but I could not reproduce it:

  1. The stack traces above show ProcessBacklog and WriteMessageTakingWriteLockSync on competing threads, both somehow in PhysicalConnection.FlushSync (sync, not async)

  2. The lock acquire/release code seems correct from my testing, so I don't think a lock is being mis-taken/not-taken

  3. In PhysicalConnection.FlushSync, which both stack traces call, if a flush times out, a task.Wait() gives up and throws an exception. When bubbling up, this would release the lock held by the writer.

See https://github.com/StackExchange/StackExchange.Redis/blob/master/src/StackExchange.Redis/PhysicalConnection.cs#L882

However, the writer could still be trying to progress (the timeout didn't cancel the write; it's proceeding in the background). At this point, another thread could acquire the lock and try another write, leading to concurrent writes to the pipe.
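
To illustrate point 3, here is a schematic sketch of that hazard. This is not SE.Redis code; the lock, method name, and timeout handling are invented for illustration. A Wait() that gives up on a flush still in flight releases the write lock, and the next caller flushes the same PipeWriter concurrently:

    using System;
    using System.IO.Pipelines;
    using System.Threading;
    using System.Threading.Tasks;

    static class FlushRaceSketch
    {
        static readonly SemaphoreSlim WriteLock = new SemaphoreSlim(1, 1);

        public static void FlushSyncLike(PipeWriter writer, int millisecondsTimeout)
        {
            WriteLock.Wait();
            try
            {
                ValueTask<FlushResult> flush = writer.FlushAsync();
                if (!flush.IsCompleted && !flush.AsTask().Wait(millisecondsTimeout))
                {
                    // Giving up here does NOT cancel the flush; it keeps running in the background.
                    throw new TimeoutException("flush timed out");
                }
            }
            finally
            {
                // Released on the timeout path too, so another thread can now take the lock and
                // call writer.FlushAsync() while the first flush is still in flight -- which the
                // pipe reports as "Concurrent reads or writes are not supported".
                WriteLock.Release();
            }
        }
    }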

@jessieli-ad

We are also seeing this issue consistently, right after we upgraded from 2.0.513 to 2.1.30. Will this issue be fixed soon? If not, we will consider rolling back to the old version.

JKurzer commented May 20, 2020

For a short-term fix, to continue with testing, we now tear down the multiplexer when we see the error (roughly as sketched below). The error only arises every few hours, and the teardown is a low cost. It's not acceptable for production operation, however, and I'd still like to see this fixed.

The fact that this resolves the issue is interesting in and of itself.
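
This is not our exact code, and the class shape, member names, and connection string here are placeholders, but the stop-gap amounts to swapping out and disposing a single shared Lazy<ConnectionMultiplexer> whenever the concurrent read/write failure is observed:

    using System;
    using System.Threading;
    using StackExchange.Redis;

    static class RedisConnection
    {
        private static Lazy<ConnectionMultiplexer> _muxer = new Lazy<ConnectionMultiplexer>(Create);

        private static ConnectionMultiplexer Create() =>
            ConnectionMultiplexer.Connect("REDACTED,abortConnect=false");

        public static IDatabase Database => _muxer.Value.GetDatabase();

        // Called when a "Concurrent reads or writes are not supported" failure is observed.
        public static void Recycle()
        {
            var old = Interlocked.Exchange(ref _muxer, new Lazy<ConnectionMultiplexer>(Create));
            if (old.IsValueCreated)
            {
                old.Value.Dispose(); // drop the wedged multiplexer; callers pick up the new one
            }
        }
    }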

JKurzer commented May 21, 2020

Bumping this. Unfortunately, it's becoming a blocking issue for us to deploy to production. Are there next steps we can take that would assist in reproduction? We have some load-test code that induces the fault reliably after a prolonged soak, but I'm not sure how we'd grant you access to it since it hinges on our cache implementation.

I'm open to suggestions.

@furlongmt

Are there any updates on this issue? We're also running into this issue under prod load with version 2.1.30.

Plasma commented Jun 4, 2020

@mgravell besides the potential issue I raised in #1438 (comment), another thought is that perhaps the locking code in PhysicalBridge is correct, and instead (due to the disconnect/reconnect cycle) two or more instances end up re-using the same IOPipe _ioPipe variable found in the referenced PhysicalConnection?

Is it possible that, during connection issues and reconnect logic, a new physical connection is re-created and given the same _ioPipe reference as another instance, so both think they hold valid locks but are accidentally sharing the same resource, and this is why this is happening?

I see code in ConnectionMultiplexer, for example, that has Timers and WeakReferences to things, and I wonder if there's a moment during disconnect/GC/reconnect/queued commands/etc. where more than one instance references the same _ioPipe in PhysicalConnection.

perosb commented Jun 5, 2020

Is downgrading Redis.dll the workaround for this?
Is it only Redis.dll, or also its dependencies?

Plasma commented Jun 7, 2020

Hi @furlongmt how often/reliably are you seeing this error, and can I confirm, is it the "Failed to write" and "Concurrent reads or writes are not supported" error?

Do you get it when there is a socket issue/disconnection?

I've tried to reproduce this multiple times by editing the code base and injecting faults, but have not been able to get this exact issue. I'm still looking, as it hits our production environment during socket issues.

If you can reliably get the problem, perhaps I can make a redis build you could use that has extra logging in place, so we get more information about what's happening during this error?

@furlongmt

Hi @Plasma, yes it is the "Failed to write" and "Concurrent reads or writes are not supported" error.

It seems to trigger periodically, but I'm not sure at this point if it's due to a socket disconnection or not yet.

I also have not been able to reproduce the problem locally; we only see it in our production environment from time to time.

bdlee420 commented Jun 9, 2020

Could the latest release possibly fix this? I see it was just released recently: https://stackexchange.github.io/StackExchange.Redis/ReleaseNotes#2155

We are also having this issue during regression load test runs. We are going to run our tests against 2.1.55 and see if it is resolved. I should know by tomorrow.

mgravell commented Jun 9, 2020 via email

Plasma commented Jun 9, 2020

@bdlee420 If you are able to reliably reproduce the issue in your tests, would you be keen to install a StackExchange.Redis build with some extra debug logging added, to print some variables that may help track down the cause of this problem? Thanks

@bdlee420

I received the results from the test run with 2.1.55 and they are mixed. We didn't get any concurrency errors, but instead received 28 timeouts in a 20-minute window:

StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=3385KiB, inbound=11KiB, 5063ms elapsed, timeout is 5000ms), command=DEL, next: HGETALL somekey, inst: 0, qu: 0, qs: 7315, aw: True, rs: DequeueResult, ws: Writing, in: 65536, serverEndpoint: someserver, mc: 1/1/0, mgr: 0 of 10 available, clientName: someclientname, PerfCounterHelperkeyHashSlot: 9024, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=2552,Free=30215,Min=4,Max=32767), v: 2.1.55.31085

Now I have a couple options:

  1. Go back to 2.1.30 and see if the concurrency errors come back
  2. Go back to 2.0.519 and see if all errors are clean again

I am sort of leaning towards option 1 so I can make sure what I am seeing is consistent; it is just painful as these regression tests take several hours :).

@Plasma, I am definitely OK with installing a preview edition with some additional logging and trying it out.

bdlee420 commented Jun 10, 2020

I decided to go ahead and run the tests on 2.1.30 again.

Some further notes, all tests have been run against an API that is running on .NET Framework 4.6.1. We have a .NET Core 2.2 version that we have just released and we will now start testing on that version as well. I will post the results once I get them, it will probably take a couple days.

Plasma commented Jun 11, 2020

@bdlee420 I have put together a fork at https://github.com/StackExchange/StackExchange.Redis/compare/master...Plasma:v2.1.30-debug-logs?w=1 that is based off v2.1.30 tag, that includes a bit more info that may help us track down the issue next time it re-occurs.

The tree is at https://github.com/Plasma/StackExchange.Redis/tree/v2.1.30-debug-logs

If you were able to reproduce this in your CI environment, perhaps you can pull my branch above, do a Release build, and use that binary for CI, and hopefully the issue reproduces with a bit more detailed info?

JKurzer commented Jun 22, 2020

Can I get a pulse on this?

owennw commented Jun 22, 2020

Also suffering this error in a production environment, would appreciate an update!

@arsnyder16

I am also seeing this in our production environment.

@bdlee420 @jessieli-ad You both seem to indicate that 2.0.519/2.0.513 may not suffer from the issue; have you rolled back your version? I am thinking that is the best option until this is resolved.

@jessieli-ad

@arsnyder16, we rolled back to 2.0.513. But 2.0.513 has a timeout problem. We are looking to upgrade when this issue is resolved.

@arsnyder16

@jessieli-ad Did you try 2.0.588 by chance?

Release notes seem to indicate a fix around timeouts

https://github.com/StackExchange/StackExchange.Redis/blob/main/docs/ReleaseNotes.md#20588

@arsnyder16

We tried 2.0.588 and that seemed to have the following issue:
StackExchange.Redis.RedisConnectionException: No connection is available to service this operation: SETEX

I am playing version Whac-A-Mole

arsnyder16 commented Aug 12, 2020

@Plasma Any update from your end? We are working to stabilize our production environment first, then we might be able to help you with your forked branch and repro the issue and get you the extra debug information.

We could probably get to this quicker if you were able to publish the fork to NuGet for us to pull; it would be easier for us to tweak our deployment to pull that version than to try to inject it manually.

Plasma commented Aug 16, 2020

Hey @arsnyder16 I'm only encountering these kinds of errors in the bi-monthly (?) moments when Azure introduces networking/connectivity/failover issues; it's rare, but when it happens it's regular.

I hear you about the NuGet feed, I'll have a think. Until then, to help stabilize your prod env if you are still having issues, can I suggest ensuring you have keep-alives and PINGs turned on (as part of the Redis connection options) to ensure idle sockets aren't disconnected (which can lead to these kinds of issues too, I think).

Typically our environment is stable; it's just during failover/networking issues that things can become interrupted.

@arsnyder16

@Plasma Thanks for the advice, we do not have keep-alive configured. Do you have a recommendation on a good ping interval? Is there a reason this isn't the default behavior, to make sure these sockets don't become stale?

Plasma commented Aug 17, 2020

You want to turn on keep-alive as part of ConfigurationOptions:

var options = ConfigurationOptions.Parse(...);
options.KeepAlive = 10; // time in seconds

We have ours set to 10 seconds. If you're running in Azure (or other cloud providers), or your infrastructure is configured this way, servers can quietly disconnect connections that are idle and aren't receiving traffic.

The keep-alive above actually sends a Redis PING command down the socket to make it look busy and to health-check it.

It can be a problem if you have periods of idleness in your app where Redis may not be used, so the connection is idled/disconnected, then your app goes to use it and starts getting errors.

I don't know if that's what you are experiencing, but my suggestion for best practice would be to have that enabled, and of course check that you aren't overloading your redis server/s in terms of Mbps or CPU.
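
Putting that together, a fuller (hypothetical) version of the snippet above; the connection string is a placeholder:

    var options = ConfigurationOptions.Parse("REDACTED:6380,password=REDACTED,ssl=true,abortConnect=false");
    options.KeepAlive = 10; // PING every 10 seconds on an otherwise-idle connection
    var muxer = ConnectionMultiplexer.Connect(options);
    var db = muxer.GetDatabase();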

@arsnyder16

@Plasma Thank you, this is very helpful; we are going to set KeepAlive. We are using Azure Redis Cache, so it seems likely this will help us out.

Plasma commented Sep 9, 2020

@arsnyder16 out of interest did that happen to help your prod environment?

dgunderson commented Sep 21, 2020

@Plasma I'm new to this thread but we experienced this a couple weeks ago. I implemented the KeepAlive change like you suggested above when I first stumbled upon this thread, but this morning processes that use SE.Redis (run from docker images in containers on a bare-metal linux machine we have) went back to throwing these errors over and over. Everything was fixed after I executed docker-compose down/up on all the services using StackExchange.Redis to our instance.

We do have quite a few processes writing to and reading from this Redis instance pretty much all the time. Looking at what you said above, is this not supported?

I don't know if that's what you are experiencing, but my suggestion for best practice would be to have that enabled, and of course check that you aren't overloading your redis server/s in terms of Mbps or CPU.

Btw we're presently on 2.1.28

Plasma commented Sep 22, 2020

Hi @dgunderson the keep-alive suggestion was specifically for arsnyder16, who was getting some other kind of error, but the original error in this thread (the never-recovering "concurrent reads or writes are not supported") is, as far as I know, still a potential issue. We are planning to update to the latest SE.Redis in the future, but we haven't had it re-occur yet. Restarting the process is my only known fix, too.

@rayansinha

@Plasma @mgravell Do we have any fix for this issue? We are getting this error and it's holding us back from release. We are using Docker to build the images and deploying to Azure Kubernetes Service.

arsnyder16 commented Oct 9, 2020

@Plasma Sorry for the delay, it took us a little while to get this into production, but no, it did not help; we are still seeing this.

We are using 2.1.58 on Azure Cache for Redis; based on the server metrics, it appears there is very little load on the server at that time.

We got roughly 425 of these within a 5-minute window this morning around 7am ET:

StackExchange.Redis.RedisConnectionException: Failed to write ---> System.InvalidOperationException: Concurrent reads or writes are not supported.
 at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
 at System.IO.Pipelines.PipeCompletion.ThrowLatchedException()
 at System.IO.Pipelines.Pipe.GetFlushResult(FlushResult& result) 
 at System.IO.Pipelines.Pipe.PrepareFlush(CompletionData& completionData, ValueTask`1& result, CancellationToken cancellationToken)
 at System.IO.Pipelines.Pipe.FlushAsync(CancellationToken cancellationToken) 
 at System.IO.Pipelines.Pipe.DefaultPipeWriter.FlushAsync(CancellationToken cancellationToken) 
 at StackExchange.Redis.PhysicalConnection.FlushAsync(Boolean throwOnFailure) 
 at StackExchange.Redis.PhysicalBridge.WriteMessageTakingWriteLockAsync(PhysicalConnection physical, Message message) 
 --- End of inner exception stack trace ---
 at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() 
 at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) 

We also saw this around the same time, though not that high a count, ~15 times:

StackExchange.Redis.RedisConnectionException: An unknown error occurred when writing the message 
 at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) 
 at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) 

arsnyder16 commented Oct 9, 2020

Looking closer at our logs, it looks like something we missed: we log some of our timeouts as warnings, so I was missing them. Around that same time we are seeing:

StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=0KiB, inbound=1KiB, 2500ms elapsed, timeout is 2500ms), command=SETEX, next: SETEX _redacted_, inst: 0, qu: 497, qs: 684, aw: True, bw: Inactive, rs: ReadAsync, ws: Flushing, in: 1092, serverEndpoint: _redacted_, mc: 1/1/0, mgr: 9 of 10 available, clientName: _redacted_, IOCP: (Busy=2,Free=998,Min=300,Max=1000), WORKER: (Busy=3,Free=32764,Min=250,Max=32767), v: 2.1.58.34321 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
StackExchange.Redis.RedisTimeoutException: Timeout performing MGET (2500ms), next: SETEX _redacted_, inst: 0, qu: 530, qs: 684, aw: True, bw: Inactive, rs: ReadAsync, ws: Flushing, in: 1092, serverEndpoint: _redacted_, mc: 1/1/0, mgr: 9 of 10 available, clientName: _redacted_, IOCP: (Busy=2,Free=998,Min=300,Max=1000), WORKER: (Busy=0,Free=32767,Min=250,Max=32767), v: 2.1.58.34321 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server)

Plasma commented Oct 11, 2020

Out of interest, if you load up the Azure portal metrics for your cache instance, set the metric to Errors, and then split the metric in the UI, does it say a failover happened around that time?

In our experience the 'Concurrent reads or writes are not supported' error (which we have not been able to fix) happens as a side-effect of some connectivity issue. It has only happened twice, and sometimes a failover happens without issue; it seems to be a race condition that can cause the problem, and I don't have a solution (I'm just a user of the library, not a maintainer).

I did try and reproduce this and step through the code but was unable to reproduce it. Earlier comments in this issue suggested it was a downstream library issue that was rolled back in later versions of SE.Redis, but we haven't upgraded just yet.

@arsnyder16

@Plasma I am not seeing a correlation. Over the past 30 days this has happened 19 times; there have been 6 Redis errors, all failovers. The closest one to the 'Concurrent reads or writes' errors was 3 hours away; otherwise they were at least 12 hours apart.

ghost commented Oct 12, 2020

This is a client-side race condition and is caused by a regression post 2.0.513. Right now, the easiest resolution is to walk back to the earlier version, but that version has non-trivial issues in thread management among other things and has its own time-out problems as a result. While those don't cascade, they are pretty rough. So we're really at an impasse re: deployability. The earliest incidence of the issue I'm aware of is in 2.0.601, and while that's a significant number of code changes, it's a finite number.

If it would help, I'm willing to put up a bug bounty.

--J

arsnyder16 commented Oct 13, 2020

@sassywarsat @Plasma

I was able to put together a console app that reproduces the issue. I haven't begun debugging yet, but I thought I would pass it along in case someone more familiar with the code base would also be interested in jumping in. It is not super consistent and some of the settings might need adjusting per machine. I would recommend running under the Release configuration.

It seems to occur around the time a lot of TimeoutExceptions start appearing.

Attached is a .cs file; I am using v2.1.58 with an Azure Redis cache, Standard 2.5 GB.
Program.zip
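
For context, the shape of that repro is roughly the sketch below. The actual code is in the attached Program.zip; the endpoint, key names, payload size, and burst size here are placeholders. It just hammers the shared multiplexer with bursts of async gets/sets under a short timeout until TimeoutExceptions pile up and, eventually, the concurrent read/write failure appears:

    using System;
    using System.Linq;
    using System.Threading.Tasks;
    using StackExchange.Redis;

    class Program
    {
        static async Task Main()
        {
            // Placeholder endpoint; the original repro targeted an Azure Cache for Redis (Standard 2.5 GB) on v2.1.58.
            var options = ConfigurationOptions.Parse("REDACTED:6380,password=REDACTED,ssl=true,abortConnect=false");
            options.SyncTimeout = 2500;   // short timeouts so stalls surface as RedisTimeoutException
            options.AsyncTimeout = 2500;

            using (var muxer = ConnectionMultiplexer.Connect(options))
            {
                var db = muxer.GetDatabase();
                var payload = new string('x', 16 * 1024);

                while (true)
                {
                    // Burst of concurrent operations, large enough to back up the outbound pipe.
                    var burst = Enumerable.Range(0, 2000).Select(async i =>
                    {
                        try
                        {
                            var key = "load:" + (i % 500);
                            await db.StringSetAsync(key, payload, TimeSpan.FromMinutes(5));
                            await db.StringGetAsync(key);
                        }
                        catch (RedisTimeoutException) { /* expected once the pipe backs up */ }
                        catch (RedisConnectionException ex)
                        {
                            // The failure in this issue shows up here as "Failed to write --->
                            // System.InvalidOperationException: Concurrent reads or writes are not supported."
                            Console.WriteLine(ex);
                        }
                    });
                    await Task.WhenAll(burst);
                }
            }
        }
    }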

@arsnyder16

@Plasma @sassywarsat @mgravell

I think I got the root of the issue, I have submitted a PR #1585
