-
Notifications
You must be signed in to change notification settings - Fork 3.9k
core: use exponential backoff for name resolution #4105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: use exponential backoff for name resolution #4105
Conversation
This will also resolve #4028 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems there should be a documentation update to NameResolver saying that grpc is responsible for calling refresh after an error.
return; | ||
} | ||
if (nameResolverBackoffFuture != null) { | ||
cancelNameResolverBackoff(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, could you add an assert nameResolverStarted;
? That also helps during auditing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
+@zhangkun83 for the NameResolver error+refresh semantic change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation LGTM except for a comment on naming. Left a few comments on tests.
@@ -402,6 +404,45 @@ public void run() { | |||
idleTimeoutMillis, TimeUnit.MILLISECONDS); | |||
} | |||
|
|||
// Run from channelExecutor | |||
private class NameResolverBackoff implements Runnable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this Runnable doesn't actually back off. It is the refresh task that is backed off. Maybe something like "NameResolverBackedOffRetry"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (changed to NameResolverRefresh
)
|
||
timer.forwardNanos(RECONNECT_BACKOFF_INTERVAL_NANOS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To verify refresh() is not called sooner than expected, you will need to forward by RECONNECT_BACKOFF_INTERVAL_NANOS - 1
and verify that refresh() is not called, then forward by 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
resolver.error = null; | ||
|
||
// For the second attempt, the backoff should occur at RECONNECT_BACKOFF_INTERVAL_NANOS * 2 | ||
timer.forwardNanos(RECONNECT_BACKOFF_INTERVAL_NANOS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto. The forward time here should be RECONNECT_BACKOFF_INTERVAL_NANOS * 2 - 1, and the next forward should be 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// Verify that the successful resolution reset the backoff policy | ||
resolver.listener.onError(error); | ||
timer.forwardNanos(RECONNECT_BACKOFF_INTERVAL_NANOS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
assertEquals(1, resolver.refreshCalled); | ||
verify(mockLoadBalancer, times(2)).handleNameResolutionError(same(error)); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To verify that you will not schedule duplicate timers, maybe also call refresh() here, and verify handleNameResolutionError()
is called twice, and there is still one timer scheduled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
assertNotNull(nameResolverBackoff); | ||
assertFalse(nameResolverBackoff.isCancelled()); | ||
|
||
channel.shutdown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shutdown() doesn't cancel the timer. delayedTransport termination does. To verify this, you'd need to start a call that stops delayedTransport from terminating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
@Override public String getServiceAuthority() { | ||
return expectedUri.getAuthority(); | ||
} | ||
|
||
@Override public void start(final Listener listener) { | ||
this.listener = listener; | ||
if (error != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we consolidate this with the error simulation in refresh() and put it in resolved()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Also made a larger scale refactor to switch to a builder for the FakeNameResolverFactory, as it was confusing sorting through the four different constructors and their different default values.
private FakeClock.ScheduledTask getNameResolverBackoff() { | ||
FakeClock.ScheduledTask nameResolverBackoff = null; | ||
for (FakeClock.ScheduledTask task : timer.getPendingTasks()) { | ||
if (task.command.toString().contains("NameResolverBackoff")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may consider using getPendingTasks(TaskFilter filter)
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please wait for Kun's re-review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a minor comment.
assertFalse(task.isDone()); | ||
nameResolverRefresh = task; | ||
} | ||
for (FakeClock.ScheduledTask task : timer.getPendingTasks(NAME_RESOLVER_REFRESH_TASK_FILTER)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified to:
nameResolverRefresh = Iterables.getOnlyElement(timer.getPendingTasks(NAME_RESOLVER_REFRESH_TASK_FILTER));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks
This addresses #3685.
Drops the fixed 60-second backoff timer from
DnsNameResolver
. Instead,ManagedChannelImpl
will use its backoff policy to invokeNameResolver.refresh()
whenNameResolver.Listener.onError
signals that an error occurred during resolution.