core: use exponential backoff for name resolution #4105

Merged
merged 9 commits into grpc:master from exponential_backoff_in_dns on Mar 6, 2018

Conversation

ericgribkoff
Contributor

This addresses #3685.

Drops the fixed 60-second backoff timer from DnsNameResolver. Instead, ManagedChannelImpl will use its backoff policy to invoke NameResolver.refresh() when NameResolver.Listener.onError signals that an error occurred during resolution.
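The mechanism described above can be sketched roughly as follows. This is an illustrative model only, not ManagedChannelImpl's actual code; the constants and names are assumptions, loosely modeled on gRPC's exponential backoff (initial delay grown by a multiplier up to a cap, reset on success):

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch of a per-channel backoff policy driving
// NameResolver.refresh() retries after resolution errors.
public class BackoffSketch {
    static final long INITIAL_BACKOFF_NANOS = TimeUnit.SECONDS.toNanos(1);
    static final long MAX_BACKOFF_NANOS = TimeUnit.MINUTES.toNanos(2);
    static final double MULTIPLIER = 1.6;

    private long nextBackoffNanos = INITIAL_BACKOFF_NANOS;

    // Called when the resolver reports an error: returns the delay before the
    // next refresh() attempt, growing geometrically up to a cap.
    long nextBackoffNanos() {
        long current = nextBackoffNanos;
        nextBackoffNanos = Math.min((long) (current * MULTIPLIER), MAX_BACKOFF_NANOS);
        return current;
    }

    // Called on successful resolution: reset so the next failure starts small.
    void reset() {
        nextBackoffNanos = INITIAL_BACKOFF_NANOS;
    }

    public static void main(String[] args) {
        BackoffSketch b = new BackoffSketch();
        System.out.println(b.nextBackoffNanos()); // first retry delay: 1s
        System.out.println(b.nextBackoffNanos()); // second retry delay: 1.6s
        b.reset();
        System.out.println(b.nextBackoffNanos()); // back to 1s after success
    }
}
```

The key difference from the old behavior is that the delay is owned by the channel's backoff policy rather than hard-coded in DnsNameResolver, so repeated failures back off geometrically and a successful resolution resets the schedule.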

@ericgribkoff
Contributor Author

This will also resolve #4028

Member

@ejona86 ejona86 left a comment

It seems there should be a documentation update to NameResolver saying that gRPC is responsible for calling refresh() after an error.

return;
}
if (nameResolverBackoffFuture != null) {
cancelNameResolverBackoff();
Member

For clarity, could you add an assert nameResolverStarted;? That also helps during auditing.

Contributor Author

Done

@ejona86 ejona86 requested a review from zhangkun83 March 1, 2018 17:46
@ejona86
Member

ejona86 commented Mar 1, 2018

+@zhangkun83 for the NameResolver error+refresh semantic change

Contributor

@zhangkun83 zhangkun83 left a comment

Implementation LGTM except for a comment on naming. Left a few comments on tests.

@@ -402,6 +404,45 @@ public void run() {
idleTimeoutMillis, TimeUnit.MILLISECONDS);
}

// Run from channelExecutor
private class NameResolverBackoff implements Runnable {
Contributor

nit: this Runnable doesn't actually back off. It is the refresh task that is backed off. Maybe something like "NameResolverBackedOffRetry"

Contributor Author

Done (changed to NameResolverRefresh)


timer.forwardNanos(RECONNECT_BACKOFF_INTERVAL_NANOS);
Contributor

To verify refresh() is not called sooner than expected, you will need to forward by RECONNECT_BACKOFF_INTERVAL_NANOS - 1 and verify that refresh() is not called, then forward by 1.
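The boundary technique the reviewer describes can be shown with a minimal fake timer. This is an illustrative stand-in, not grpc-java's FakeClock API: forwarding by exactly the interval cannot distinguish "fired on time" from "fired early", so the test forwards by interval - 1, checks nothing ran, then forwards by 1:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal fake-timer sketch showing why the test needs two forwards:
// interval - 1 (nothing fires), then + 1 (the task fires exactly on time).
public class FakeTimerSketch {
    static class Task {
        final long dueNanos;
        final Runnable command;
        Task(long dueNanos, Runnable command) {
            this.dueNanos = dueNanos;
            this.command = command;
        }
    }

    private long nowNanos;
    private final List<Task> tasks = new ArrayList<>();

    void schedule(long delayNanos, Runnable command) {
        tasks.add(new Task(nowNanos + delayNanos, command));
    }

    // Advance virtual time and run every task that has become due.
    void forwardNanos(long nanos) {
        nowNanos += nanos;
        tasks.removeIf(t -> {
            if (t.dueNanos <= nowNanos) {
                t.command.run();
                return true;
            }
            return false;
        });
    }

    public static void main(String[] args) {
        long interval = 1_000_000_000L;
        FakeTimerSketch timer = new FakeTimerSketch();
        int[] refreshCalled = {0};
        timer.schedule(interval, () -> refreshCalled[0]++);

        timer.forwardNanos(interval - 1);
        System.out.println(refreshCalled[0]); // 0: the retry did not fire early
        timer.forwardNanos(1);
        System.out.println(refreshCalled[0]); // 1: fired exactly at the interval
    }
}
```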

Contributor Author

Done

resolver.error = null;

// For the second attempt, the backoff should occur at RECONNECT_BACKOFF_INTERVAL_NANOS * 2
timer.forwardNanos(RECONNECT_BACKOFF_INTERVAL_NANOS);
Contributor

Ditto. The forward time here should be RECONNECT_BACKOFF_INTERVAL_NANOS * 2 - 1, and the next forward should be 1.

Contributor Author

Done


// Verify that the successful resolution reset the backoff policy
resolver.listener.onError(error);
timer.forwardNanos(RECONNECT_BACKOFF_INTERVAL_NANOS);
Contributor

Ditto.

Contributor Author

Done


assertEquals(1, resolver.refreshCalled);
verify(mockLoadBalancer, times(2)).handleNameResolutionError(same(error));

Contributor

To verify that you will not schedule duplicate timers, maybe also call refresh() here, and verify handleNameResolutionError() is called twice, and there is still one timer scheduled.
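The property this test checks can be sketched with a hypothetical guard (the field and method names here are illustrative, not ManagedChannelImpl's actual members): an error reported while a refresh retry is already pending must not schedule a second timer.

```java
// Hypothetical guard showing the no-duplicate-timer property under test:
// a second resolution error while a retry is scheduled is still reported,
// but it must not stack another scheduled refresh.
public class DedupSketch {
    private Object nameResolverRefreshFuture; // non-null while a retry is pending
    int timersScheduled;

    void onNameResolutionError() {
        if (nameResolverRefreshFuture != null) {
            return; // retry already scheduled; don't schedule a duplicate
        }
        nameResolverRefreshFuture = new Object();
        timersScheduled++;
    }

    public static void main(String[] args) {
        DedupSketch channel = new DedupSketch();
        channel.onNameResolutionError();
        channel.onNameResolutionError(); // second error while the timer is pending
        System.out.println(channel.timersScheduled); // 1: still only one timer
    }
}
```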

Contributor Author

Done

assertNotNull(nameResolverBackoff);
assertFalse(nameResolverBackoff.isCancelled());

channel.shutdown();
Contributor

shutdown() doesn't cancel the timer. delayedTransport termination does. To verify this, you'd need to start a call that stops delayedTransport from terminating.

Contributor Author

Done


@Override public String getServiceAuthority() {
return expectedUri.getAuthority();
}

@Override public void start(final Listener listener) {
this.listener = listener;
if (error != null) {
Contributor

Can we consolidate this with the error simulation in refresh() and put it in resolved()?

Contributor Author

Done. Also made a larger scale refactor to switch to a builder for the FakeNameResolverFactory, as it was confusing sorting through the four different constructors and their different default values.

private FakeClock.ScheduledTask getNameResolverBackoff() {
FakeClock.ScheduledTask nameResolverBackoff = null;
for (FakeClock.ScheduledTask task : timer.getPendingTasks()) {
if (task.command.toString().contains("NameResolverBackoff")) {
Contributor

You may consider using getPendingTasks(TaskFilter filter) instead.
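The suggestion replaces string-matching against command.toString() with a filter applied to the task itself. Roughly (an illustrative stand-in using java.util.function.Predicate in place of FakeClock's TaskFilter):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Illustrative stand-in for FakeClock.getPendingTasks(TaskFilter): select
// pending tasks with a predicate rather than by matching toString() output.
public class TaskFilterSketch {
    static class ScheduledTask {
        final String name; // stands in for the scheduled Runnable
        ScheduledTask(String name) { this.name = name; }
    }

    static List<ScheduledTask> getPendingTasks(
            List<ScheduledTask> pending, Predicate<ScheduledTask> filter) {
        List<ScheduledTask> matched = new ArrayList<>();
        for (ScheduledTask task : pending) {
            if (filter.test(task)) {
                matched.add(task);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<ScheduledTask> pending = Arrays.asList(
                new ScheduledTask("NameResolverRefresh"),
                new ScheduledTask("IdleTimer"));
        // Replaces the fragile task.command.toString().contains(...) check.
        List<ScheduledTask> matches =
                getPendingTasks(pending, t -> t.name.equals("NameResolverRefresh"));
        System.out.println(matches.size()); // 1
    }
}
```

A predicate-based filter keeps the test working even if the Runnable's toString() representation changes.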

Contributor Author

Done

Member

@ejona86 ejona86 left a comment

Please wait for Kun's re-review.

Contributor

@zhangkun83 zhangkun83 left a comment

LGTM with a minor comment.

assertFalse(task.isDone());
nameResolverRefresh = task;
}
for (FakeClock.ScheduledTask task : timer.getPendingTasks(NAME_RESOLVER_REFRESH_TASK_FILTER)) {
Contributor

This can be simplified to:

nameResolverRefresh = Iterables.getOnlyElement(timer.getPendingTasks(NAME_RESOLVER_REFRESH_TASK_FILTER));

Contributor Author

Done, thanks

@ericgribkoff ericgribkoff merged commit ae1fb94 into grpc:master Mar 6, 2018
@ericgribkoff ericgribkoff deleted the exponential_backoff_in_dns branch March 6, 2018 05:46