Issues with Nest Connection Pool retry mechanism #1080

satishmallik · 2014-11-28T15:46:10Z

Currently nest is not able to differentiate between Service Unavailable and Timeout exception.

Problem description

Think of 3 query nodes (q1,q2,q3) in cluster. One of query node q1 is booted out of cluster for some reasone like OOM.
Now all query requests to q1 will fail with error status code 503 service not available.
A typical response where this situation arises look like,

             ConnectionStatus = {StatusCode: 503, 
              Method: POST, 
               Url: http://localhost:9200/codeindex/sourceNoDedupeFileContract/_search?routing=0&pretty=true,

Only In this case query should be routed to another query node q2.

Currently nest retires all retryable exception like timeouts on all nodes of the connection pool.
So say a wildcard query is sent to query node q1. If it timesout the query should timeout.
But in this scenario nest still sends this query to another node q2. It again timesout and in turn it is sent to q3. So query times out after 3 mins instead of 1 min.

This also makes a case of DOS attack.

Proper fix should be to differentiate between timeout and ServiceNotAvailable exception. Query should be sent on another node from connection pool only if service is not available on first node.

The text was updated successfully, but these errors were encountered:

wbsimms · 2014-12-01T16:52:29Z

Just trying to understand the code.

Is this really a problem with updating the NodeList during ping/sniff?

Mpdreamz · 2014-12-03T15:12:28Z

Hi @satishmallik been thinking about this all week there are two ways we can solve this.

The retry logic is contained in Transport and the RequestHandlers which coordinate the requests over IConnection's, they do not know what exception from the IConnection is a timeout exception (these are different for i.e thrift and webrequests).

We can solve it globbally by introducing a RetryTimeout parameter which will prevent a retry to occur if the the time between the first call and the point where we attempt to retry (again) is within 90% of the RetryTimeout. Making it a soft limit so that if we specify 1minute we wont retry after 55seconds either.

Or extend the IConnection interface with a bool FailEarly(Exception e) so that IConnections dicate short circuiting based on exceptions.

My personal preference is option 1 although i'm struggling to find sane defaults.

We could pair RetryTimeout when not explicitly set to Timeout which by default is 60s.

…luster Conflicts: src/Elasticsearch.Net/Connection/Configuration/IConnectionConfigurationValues.cs src/Elasticsearch.Net/Connection/RequestHandlers/RequestHandlerBase.cs src/Tests/Elasticsearch.Net.Tests.Unit/Elasticsearch.Net.Tests.Unit.csproj

Mpdreamz mentioned this issue Dec 5, 2014

Do not retry when timeout is hit #1093

Merged

Mpdreamz closed this as completed in 5600b11 Dec 5, 2014

gmarz added Bug labels Dec 5, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with Nest Connection Pool retry mechanism #1080

Issues with Nest Connection Pool retry mechanism #1080

satishmallik commented Nov 28, 2014

wbsimms commented Dec 1, 2014

Mpdreamz commented Dec 3, 2014

Issues with Nest Connection Pool retry mechanism #1080

Issues with Nest Connection Pool retry mechanism #1080

Comments

satishmallik commented Nov 28, 2014

wbsimms commented Dec 1, 2014

Mpdreamz commented Dec 3, 2014