Skip to content

AndroidChannels: failing requests right after network recovery #5610

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
AntKos opened this issue Apr 17, 2019 · 17 comments
Open

AndroidChannels: failing requests right after network recovery #5610

AntKos opened this issue Apr 17, 2019 · 17 comments
Assignees
Milestone

Comments

@AntKos
Copy link

AntKos commented Apr 17, 2019

What version of gRPC are you using?

1.20.0

What did you expect to see?

In our project we use android JobService, which starts right after device establish network connection. Inside that service we run Grpc request over AndroidChannel. We expect to see successful network calls at that point

What actually happens.

Most of the time, this request fails few times in a row (as JobService reschedules itself on failure). At first it seems that though JobService is started, no connection is available for some reason, but then we tried to add simple http request via OkHttpClient - and they work well. Additional digging in logs and debugging reveals, that this might be related to DnsNameResolver, as it might resolve address when request is already in progress. This issue seems highly related to #4028, but occurs in AndroidChannel.

Steps to reproduce

Given mentioned setup (JobService + AndroidChannel), it requires android device with API 24+. Issue can be reproduced by toggling off device internet connection (both WiFi and mobile, seems no difference), wait around 10-15 seconds, and toggling on connection.
Example project, which can demonstrate this issue can be found at https://github.com/AntKos/android-grpc-jobservice.

@voidzcy
Copy link
Contributor

voidzcy commented Apr 19, 2019

@AntKos Can you provide us the .proto for HealthServiceGrpc? I cannot find it anywhere.

@AntKos
Copy link
Author

AntKos commented Apr 19, 2019

Sure thing:

syntax = "proto3";
package <enter any>;

import "google/api/annotations.proto";

service HealthService {
    rpc echo (HealthPacket.EchoRequest) returns (HealthPacket.EchoResponse) {
        option (google.api.http) = {
            post: "/health/echo"
            body: "*"
        };
    }
}

message HealthPacket {
    message EchoRequest {
        string message = 1;
    }

    message EchoResponse {
        string echo = 1;
    }
}

By the way, a workaround for this is to return any big value from DnsNameResolver::getNetworkAddressCacheTtlNanos. That's not right, but just in case anyone looks for it.

@voidzcy
Copy link
Contributor

voidzcy commented Apr 19, 2019

@AntKos I tried to run your example on Android emulator and it works for localhost (gonna try with some remote server later). Does this issue only occurs on physical devices? If it does, can you tell us the specific device model you tested on? Also, for the OkHttpClient used as comparison, have you ever tried a different target instead of "www.google.com" (as google.com is highly likely to be cached by DNS)?

Update: We will try to consistently reproduce the issue and dig into it.

@voidzcy
Copy link
Contributor

voidzcy commented Apr 22, 2019

@AntKos

Previously I thought the issue get reproduced, but later I realized that the failed requests were due to remote service running on an unreachable network.

Later, I tried with the loopback hostname (loopback46.unittest.grpc.io, which corresponds to an entry in dns server that points to ip address 127.0.0.1) with a local server. From what I've seen, DnsNameResolver has no problem resolving the hostname (and it does perform resolution) every time the connection comes back.
(Note, with this approach, you won't be able to succeed the call as the resolved ip address would be 127.0.0.1, which is not the localhost ip for Android emulator. You could intercept the resolved address to 10.0.2.2 to make the call successful).

I did all my testing on Android emulator. That's said, I cannot reproduce the issue you mentioned. I am suspecting this might be a device specific issue. So I would like you to do me a favor as the following:

  • Try to reproduce the issue on an emulator on your side as well to see if this only happens on physical devices.
  • If the issue only happens on physical devices, could you please provide us the specific model you tested on?

@AntKos
Copy link
Author

AntKos commented Apr 23, 2019

@voidzcy

I tried to systemize my observations, and here is what we have:

I didin't test WiFi/Mobile networking switches, but if those are required - please write me. Did tests on three real devices and one emulator, all above API25 (as that is target API in our project for now).

  1. NEXUS 5X API 27 (Android 8.1.0) - only WiFi testing, not reproduced.
  2. Honor 10 Lite API 28 (Android 9) - not reproduced on WiFi, reproduced on mobile internet (4G).
  3. IS520.1 API 25 (Android 7.1.2) - only WiFi testing, reproduced.
  4. Emulator Nexus 6 API 27 - not reproduced on WiFi, but a different behaviour on mobile network compared to real devices.

Tests were started on fresh application run with enabled network by default. Host is remote. onAvailable and onLost mentioned below are corresponding methods in AndroidChannelBuilder.NetworkCallback attached to AndroidChannel instance. SyncService is an implementation of JobService, which starts by OS on network connection establisment.

Detailed observations on Honor 10 Lite API 28 (Android 9):

WiFi case:
IDLE -> turn off wifi -> onLost triggered -> {nothing happened} -> turn on -> onAvailable triggered
-> SyncService started by system -> start requests -> resolve dns(successfully) -> end requests(successfully) -> READY ->turn off wifi -> immediately start dns resolving(failed, restart with backoff) -> onLost triggered-> TRANSIENT_FAILURE -> turn on -> onAvailable triggered ->
-> try resolve dns(success here) -> SyncService started by system -> start requests -> end requests(successfully).

Mobile network case:
IDLE -> turn off mobile net -> onLost triggered -> {nothing happened} -> turn on -> onAvailable triggered -> SyncService started by system -> start requests -> resolve dns(successfully)
-> end requests(successfully) -> READY -> turn off mobile net -> immedialely start dns resolving(failed, restart with backoff) -> onLost triggered -> TRANSIENT_FAILURE -> turn on -> onAvailable triggered -> try resolve dns(in progress here) -> SyncService started by system -> start requests -> end requests(failed) -> resolve dns successfully -> SyncService restarted by system(10s backoff) -> requests started -> success.

The difference in mobile and WiFi cases is in network recovery part - it starts always before system starts SyncService, but with WiFi dns resolution goes fast, and with 3G at the moment we try to send request we still have no resolved dns.

What happens on Emulator(Nexus 6 API 27)

WiFi case: works OK.
Mobile network:
IDLE -> turn off mobile net -> onLost triggered -> {nothing happened} -> turn on -> onAvailable triggered -> SyncService started by system -> start requests -> resolve dns(successfully)
-> end requests(successfully) -> READY -> turn off mobile net -> channel remains READY, nothing happened -> turn on -> onAvailable triggered -> SyncService started by system -> start requests -> request never ends, no dns resolution happens at all.

My 5 cents are that when channel is IDLE or READY - then dns resolution starts synchronously with rpc request. But when in TRANSIENT_FAILURE, it runs with backoff asynchronously, and outcoming requests may not be in sync with dns. Have no idea to what happened on emulated device.

@voidzcy
Copy link
Contributor

voidzcy commented Apr 23, 2019

@AntKos

Thanks so much for such a detailed observation! Now I got exactly the same behaviour as you said in emulator for mobile network. We will dig into it.

@ejona86 ejona86 added this to the Next milestone Apr 23, 2019
@ejona86 ejona86 added the bug label Apr 23, 2019
@voidzcy
Copy link
Contributor

voidzcy commented Apr 24, 2019

@AntKos

For emulator:
This is related to an Android emulator bug we've seen before. When toggling off mobile network (either toggling off the cellular data or turning on the airplane mode), the underlying socket is not closed by Android OS until TCP connection timeout occurs (which is about 10~15 min). So this would not trigger channel state change. That's why you see channel remains READY. After you turn on the mobile net again, the channel would assume the connection is still open and is ready to process the call though it is actually not. That's why you see the request never ends.

The above should only happen on emulators and would not occur on real devices.

We will look into the behaviors on real devices later once we are able to test on physical devices.

@voidzcy
Copy link
Contributor

voidzcy commented Apr 29, 2019

@AntKos

For your observation on physical devices under cellular network. At the moment you turn on mobile network (channel is in TRANSIENT_FAILURE), it first refreshes name resolution before changing channel state to CONNECTING. This refresh might be slow under cellular network, while on the other hand SyncService starts to send RPCs. They fail immediately for TRANSIENT_FAILURE channel state as you observed since the default failure strategy is fail-fast. If you do want to avoid this somewhat unnecessary failure, you can configure with withWaitForReady() in creating your stub.

By the way, are you consistently getting the same behavior for IS520.1 API 25 (Android 7.1.2) devices under WiFi network? The above could also cause the same behavior (probably with fewer chances) even under WiFi network if refreshing the name resolution and channel state change are not completed before requests are initiated to send. We are currently thinking about if there are things we could do to reconcile this behavior.

@hijamoya
Copy link

hijamoya commented Mar 9, 2020

I faced a similar issue, say we have a foreground service which always running on the phone, then we turn off screen and wait a period (~4 or 5 hours) until the device enter doze idle mode.
After that, we turn on the screen and the foreground service can not send any request by the AndroidChannel (but the rest api by retrofit is still working), the only way to recover is to turn off then turn on the wifi to make the channel reconnect again.

The follow properties we logged for the channel when the situation happened:
isShutdown: false
isTerminated: false
state: READY
Where the AndroidChannel is a global singleton for the application.

Thanks.

@sanjaypujare
Copy link
Contributor

@hijamoya pls add your Android and API versions. Also @voidzcy was looking into this so let's wait for his comments.

@hijamoya
Copy link

hijamoya commented Mar 10, 2020

@sailat
Device: Samsung galaxy S8
Android version: 8.0.0
grpc version: 1.24.0
And the breadcrumbs for the situation report by bugsnag:
Screen Shot 2020-03-10 at 9 06 08 AM

Our foreground service send request every 5 minutes, it seems the connection for grpc does not break when enter doze mode, after a while (~2 hour) the connection just lost but the application didn't know that. Thus, the reconnection never happens since the AndroidChannel's state is still READY.

@hijamoya
Copy link

hijamoya commented Mar 27, 2020

Currently, we just recreate an AndroidChannel if deadline exceed and retry after the channel is built. However, this is just a workaround, we still expect the AndroidChannel can reconnect automatically when the situation occurs.

@AlbertePeng
Copy link

I also encountered a similar problem. Is it going to solve this problem?

@voidzcy
Copy link
Contributor

voidzcy commented Apr 20, 2020

Yes, it is still in our bug list and we had brief understanding for this issue, while we are currently working on more prioritized issues. PRs or any debugging setup to easily reproduce the issue is welcome.

@jworldexpo
Copy link

Currently, we just recreate an AndroidChannel if deadline exceed and retry after the channel is built. However, this is just a workaround, we still expect the AndroidChannel can reconnect automatically when the situation occurs.

hello i also encountered a similar problem. how do you handle it ?

@hijamoya

@hijamoya
Copy link

@jworldexpo Just new a channel when error occurs and retry...

@voidzcy
Copy link
Contributor

voidzcy commented Dec 10, 2020

@jworldexpo, @hijamoya You may want to try with 1.34.0 release (the release may have other issues that affect you, see release note. Or you can wait for the soon-to-be-release 1.34.1), we had some change that intend to fix this issue. Please let us know if the problem persists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants