Skip to content

Issues with Nest Connection Pool retry mechanism #1080

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
satishmallik opened this issue Nov 28, 2014 · 2 comments
Closed

Issues with Nest Connection Pool retry mechanism #1080

satishmallik opened this issue Nov 28, 2014 · 2 comments

Comments

@satishmallik
Copy link

Currently nest is not able to differentiate between Service Unavailable and Timeout exception.

Problem description

  1. Think of 3 query nodes (q1,q2,q3) in cluster. One of query node q1 is booted out of cluster for some reasone like OOM.
  2. Now all query requests to q1 will fail with error status code 503 service not available.
    A typical response where this situation arises look like,
             ConnectionStatus = {StatusCode: 503, 
              Method: POST, 
               Url: http://localhost:9200/codeindex/sourceNoDedupeFileContract/_search?routing=0&pretty=true, 
  1. Only In this case query should be routed to another query node q2.

Currently nest retires all retryable exception like timeouts on all nodes of the connection pool.
So say a wildcard query is sent to query node q1. If it timesout the query should timeout.
But in this scenario nest still sends this query to another node q2. It again timesout and in turn it is sent to q3. So query times out after 3 mins instead of 1 min.

This also makes a case of DOS attack.

Proper fix should be to differentiate between timeout and ServiceNotAvailable exception. Query should be sent on another node from connection pool only if service is not available on first node.

@wbsimms
Copy link
Contributor

wbsimms commented Dec 1, 2014

Just trying to understand the code.

Is this really a problem with updating the NodeList during ping/sniff?

@Mpdreamz
Copy link
Member

Mpdreamz commented Dec 3, 2014

Hi @satishmallik been thinking about this all week there are two ways we can solve this.

The retry logic is contained in Transport and the RequestHandlers which coordinate the requests over IConnection's, they do not know what exception from the IConnection is a timeout exception (these are different for i.e thrift and webrequests).

We can solve it globbally by introducing a RetryTimeout parameter which will prevent a retry to occur if the the time between the first call and the point where we attempt to retry (again) is within 90% of the RetryTimeout. Making it a soft limit so that if we specify 1minute we wont retry after 55seconds either.

Or extend the IConnection interface with a bool FailEarly(Exception e) so that IConnections dicate short circuiting based on exceptions.

My personal preference is option 1 although i'm struggling to find sane defaults.

We could pair RetryTimeout when not explicitly set to Timeout which by default is 60s.

@gmarz gmarz added Bug labels Dec 5, 2014
Mpdreamz added a commit that referenced this issue Dec 9, 2014
…luster

Conflicts:
	src/Elasticsearch.Net/Connection/Configuration/IConnectionConfigurationValues.cs
	src/Elasticsearch.Net/Connection/RequestHandlers/RequestHandlerBase.cs
	src/Tests/Elasticsearch.Net.Tests.Unit/Elasticsearch.Net.Tests.Unit.csproj
Mpdreamz added a commit that referenced this issue Dec 9, 2014
…luster

Conflicts:
	src/Elasticsearch.Net/Connection/Configuration/IConnectionConfigurationValues.cs
	src/Elasticsearch.Net/Connection/RequestHandlers/RequestHandlerBase.cs
	src/Tests/Elasticsearch.Net.Tests.Unit/Elasticsearch.Net.Tests.Unit.csproj
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants