AndroidChannels: failing requests right after network recovery #5610

AntKos · 2019-04-17T12:44:07Z

What version of gRPC are you using?

1.20.0

What did you expect to see?

In our project we use android JobService, which starts right after device establish network connection. Inside that service we run Grpc request over AndroidChannel. We expect to see successful network calls at that point

What actually happens.

Most of the time, this request fails few times in a row (as JobService reschedules itself on failure). At first it seems that though JobService is started, no connection is available for some reason, but then we tried to add simple http request via OkHttpClient - and they work well. Additional digging in logs and debugging reveals, that this might be related to DnsNameResolver, as it might resolve address when request is already in progress. This issue seems highly related to #4028, but occurs in AndroidChannel.

Steps to reproduce

Given mentioned setup (JobService + AndroidChannel), it requires android device with API 24+. Issue can be reproduced by toggling off device internet connection (both WiFi and mobile, seems no difference), wait around 10-15 seconds, and toggling on connection.
Example project, which can demonstrate this issue can be found at https://github.com/AntKos/android-grpc-jobservice.

The text was updated successfully, but these errors were encountered:

voidzcy · 2019-04-19T17:27:39Z

@AntKos Can you provide us the .proto for HealthServiceGrpc? I cannot find it anywhere.

AntKos · 2019-04-19T18:13:56Z

Sure thing:

syntax = "proto3";
package <enter any>;

import "google/api/annotations.proto";

service HealthService {
    rpc echo (HealthPacket.EchoRequest) returns (HealthPacket.EchoResponse) {
        option (google.api.http) = {
            post: "/health/echo"
            body: "*"
        };
    }
}

message HealthPacket {
    message EchoRequest {
        string message = 1;
    }

    message EchoResponse {
        string echo = 1;
    }
}

By the way, a workaround for this is to return any big value from DnsNameResolver::getNetworkAddressCacheTtlNanos. That's not right, but just in case anyone looks for it.

voidzcy · 2019-04-19T19:01:32Z

@AntKos I tried to run your example on Android emulator and it works for localhost (gonna try with some remote server later). Does this issue only occurs on physical devices? If it does, can you tell us the specific device model you tested on? Also, for the OkHttpClient used as comparison, have you ever tried a different target instead of "www.google.com" (as google.com is highly likely to be cached by DNS)?

Update: We will try to consistently reproduce the issue and dig into it.

voidzcy · 2019-04-22T22:05:26Z

@AntKos

Previously I thought the issue get reproduced, but later I realized that the failed requests were due to remote service running on an unreachable network.

Later, I tried with the loopback hostname (loopback46.unittest.grpc.io, which corresponds to an entry in dns server that points to ip address 127.0.0.1) with a local server. From what I've seen, DnsNameResolver has no problem resolving the hostname (and it does perform resolution) every time the connection comes back.
(Note, with this approach, you won't be able to succeed the call as the resolved ip address would be 127.0.0.1, which is not the localhost ip for Android emulator. You could intercept the resolved address to 10.0.2.2 to make the call successful).

I did all my testing on Android emulator. That's said, I cannot reproduce the issue you mentioned. I am suspecting this might be a device specific issue. So I would like you to do me a favor as the following:

Try to reproduce the issue on an emulator on your side as well to see if this only happens on physical devices.
If the issue only happens on physical devices, could you please provide us the specific model you tested on?

AntKos · 2019-04-23T10:47:46Z

@voidzcy

I tried to systemize my observations, and here is what we have:

I didin't test WiFi/Mobile networking switches, but if those are required - please write me. Did tests on three real devices and one emulator, all above API25 (as that is target API in our project for now).

NEXUS 5X API 27 (Android 8.1.0) - only WiFi testing, not reproduced.
Honor 10 Lite API 28 (Android 9) - not reproduced on WiFi, reproduced on mobile internet (4G).
IS520.1 API 25 (Android 7.1.2) - only WiFi testing, reproduced.
Emulator Nexus 6 API 27 - not reproduced on WiFi, but a different behaviour on mobile network compared to real devices.

Tests were started on fresh application run with enabled network by default. Host is remote. onAvailable and onLost mentioned below are corresponding methods in AndroidChannelBuilder.NetworkCallback attached to AndroidChannel instance. SyncService is an implementation of JobService, which starts by OS on network connection establisment.

Detailed observations on Honor 10 Lite API 28 (Android 9):

WiFi case:
IDLE -> turn off wifi -> onLost triggered -> {nothing happened} -> turn on -> onAvailable triggered
-> SyncService started by system -> start requests -> resolve dns(successfully) -> end requests(successfully) -> READY ->turn off wifi -> immediately start dns resolving(failed, restart with backoff) -> onLost triggered-> TRANSIENT_FAILURE -> turn on -> onAvailable triggered ->
-> try resolve dns(success here) -> SyncService started by system -> start requests -> end requests(successfully).

Mobile network case:
IDLE -> turn off mobile net -> onLost triggered -> {nothing happened} -> turn on -> onAvailable triggered -> SyncService started by system -> start requests -> resolve dns(successfully)
-> end requests(successfully) -> READY -> turn off mobile net -> immedialely start dns resolving(failed, restart with backoff) -> onLost triggered -> TRANSIENT_FAILURE -> turn on -> onAvailable triggered -> try resolve dns(in progress here) -> SyncService started by system -> start requests -> end requests(failed) -> resolve dns successfully -> SyncService restarted by system(10s backoff) -> requests started -> success.

The difference in mobile and WiFi cases is in network recovery part - it starts always before system starts SyncService, but with WiFi dns resolution goes fast, and with 3G at the moment we try to send request we still have no resolved dns.

What happens on Emulator(Nexus 6 API 27)

WiFi case: works OK.
Mobile network:
IDLE -> turn off mobile net -> onLost triggered -> {nothing happened} -> turn on -> onAvailable triggered -> SyncService started by system -> start requests -> resolve dns(successfully)
-> end requests(successfully) -> READY -> turn off mobile net -> channel remains READY, nothing happened -> turn on -> onAvailable triggered -> SyncService started by system -> start requests -> request never ends, no dns resolution happens at all.

My 5 cents are that when channel is IDLE or READY - then dns resolution starts synchronously with rpc request. But when in TRANSIENT_FAILURE, it runs with backoff asynchronously, and outcoming requests may not be in sync with dns. Have no idea to what happened on emulated device.

voidzcy · 2019-04-23T16:59:15Z

@AntKos

Thanks so much for such a detailed observation! Now I got exactly the same behaviour as you said in emulator for mobile network. We will dig into it.

voidzcy · 2019-04-24T00:08:52Z

@AntKos

For emulator:
This is related to an Android emulator bug we've seen before. When toggling off mobile network (either toggling off the cellular data or turning on the airplane mode), the underlying socket is not closed by Android OS until TCP connection timeout occurs (which is about 10~15 min). So this would not trigger channel state change. That's why you see channel remains READY. After you turn on the mobile net again, the channel would assume the connection is still open and is ready to process the call though it is actually not. That's why you see the request never ends.

The above should only happen on emulators and would not occur on real devices.

We will look into the behaviors on real devices later once we are able to test on physical devices.

voidzcy · 2019-04-29T20:17:06Z

@AntKos

For your observation on physical devices under cellular network. At the moment you turn on mobile network (channel is in TRANSIENT_FAILURE), it first refreshes name resolution before changing channel state to CONNECTING. This refresh might be slow under cellular network, while on the other hand SyncService starts to send RPCs. They fail immediately for TRANSIENT_FAILURE channel state as you observed since the default failure strategy is fail-fast. If you do want to avoid this somewhat unnecessary failure, you can configure with withWaitForReady() in creating your stub.

By the way, are you consistently getting the same behavior for IS520.1 API 25 (Android 7.1.2) devices under WiFi network? The above could also cause the same behavior (probably with fewer chances) even under WiFi network if refreshing the name resolution and channel state change are not completed before requests are initiated to send. We are currently thinking about if there are things we could do to reconcile this behavior.

hijamoya · 2020-03-09T10:07:40Z

I faced a similar issue, say we have a foreground service which always running on the phone, then we turn off screen and wait a period (~4 or 5 hours) until the device enter doze idle mode.
After that, we turn on the screen and the foreground service can not send any request by the AndroidChannel (but the rest api by retrofit is still working), the only way to recover is to turn off then turn on the wifi to make the channel reconnect again.

The follow properties we logged for the channel when the situation happened:
isShutdown: false
isTerminated: false
state: READY
Where the AndroidChannel is a global singleton for the application.

Thanks.

sanjaypujare · 2020-03-09T16:33:15Z

@hijamoya pls add your Android and API versions. Also @voidzcy was looking into this so let's wait for his comments.

hijamoya · 2020-03-10T01:04:06Z

@sailat
Device: Samsung galaxy S8
Android version: 8.0.0
grpc version: 1.24.0
And the breadcrumbs for the situation report by bugsnag:

Our foreground service send request every 5 minutes, it seems the connection for grpc does not break when enter doze mode, after a while (~2 hour) the connection just lost but the application didn't know that. Thus, the reconnection never happens since the AndroidChannel's state is still READY.

hijamoya · 2020-03-27T10:43:27Z

Currently, we just recreate an AndroidChannel if deadline exceed and retry after the channel is built. However, this is just a workaround, we still expect the AndroidChannel can reconnect automatically when the situation occurs.

AlbertePeng · 2020-04-18T03:55:10Z

I also encountered a similar problem. Is it going to solve this problem?

voidzcy · 2020-04-20T19:22:14Z

Yes, it is still in our bug list and we had brief understanding for this issue, while we are currently working on more prioritized issues. PRs or any debugging setup to easily reproduce the issue is welcome.

jworldexpo · 2020-12-10T07:31:02Z

Currently, we just recreate an AndroidChannel if deadline exceed and retry after the channel is built. However, this is just a workaround, we still expect the AndroidChannel can reconnect automatically when the situation occurs.

hello i also encountered a similar problem. how do you handle it ?

@hijamoya

hijamoya · 2020-12-10T07:51:26Z

@jworldexpo Just new a channel when error occurs and retry...

voidzcy · 2020-12-10T18:33:19Z

@jworldexpo, @hijamoya You may want to try with 1.34.0 release (the release may have other issues that affect you, see release note. Or you can wait for the soon-to-be-release 1.34.1), we had some change that intend to fix this issue. Please let us know if the problem persists.

ejona86 assigned voidzcy Apr 19, 2019

ejona86 added this to the Next milestone Apr 23, 2019

ejona86 added the bug label Apr 23, 2019

voidzcy mentioned this issue Nov 10, 2020

android: make Channel always enterIdle upon network recover #7611

Merged

larssn mentioned this issue Jan 28, 2022

🐛 [cloud_firestore] After some time, Cloud Firestore sync stops working firebase/flutterfire#6749

Closed

ejona86 added the android label Feb 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AndroidChannels: failing requests right after network recovery #5610

AndroidChannels: failing requests right after network recovery #5610

AntKos commented Apr 17, 2019 •

edited

Loading

voidzcy commented Apr 19, 2019

AntKos commented Apr 19, 2019

voidzcy commented Apr 19, 2019 •

edited

Loading

voidzcy commented Apr 22, 2019

AntKos commented Apr 23, 2019 •

edited

Loading

voidzcy commented Apr 23, 2019

voidzcy commented Apr 24, 2019 •

edited

Loading

voidzcy commented Apr 29, 2019

hijamoya commented Mar 9, 2020 •

edited

Loading

sanjaypujare commented Mar 9, 2020

hijamoya commented Mar 10, 2020 •

edited

Loading

hijamoya commented Mar 27, 2020 •

edited

Loading

AlbertePeng commented Apr 18, 2020

voidzcy commented Apr 20, 2020

jworldexpo commented Dec 10, 2020

hijamoya commented Dec 10, 2020

voidzcy commented Dec 10, 2020 •

edited

Loading

AndroidChannels: failing requests right after network recovery #5610

AndroidChannels: failing requests right after network recovery #5610

Comments

AntKos commented Apr 17, 2019 • edited Loading

What version of gRPC are you using?

What did you expect to see?

What actually happens.

Steps to reproduce

voidzcy commented Apr 19, 2019

AntKos commented Apr 19, 2019

voidzcy commented Apr 19, 2019 • edited Loading

voidzcy commented Apr 22, 2019

AntKos commented Apr 23, 2019 • edited Loading

Detailed observations on Honor 10 Lite API 28 (Android 9):

What happens on Emulator(Nexus 6 API 27)

voidzcy commented Apr 23, 2019

voidzcy commented Apr 24, 2019 • edited Loading

voidzcy commented Apr 29, 2019

hijamoya commented Mar 9, 2020 • edited Loading

sanjaypujare commented Mar 9, 2020

hijamoya commented Mar 10, 2020 • edited Loading

hijamoya commented Mar 27, 2020 • edited Loading

AlbertePeng commented Apr 18, 2020

voidzcy commented Apr 20, 2020

jworldexpo commented Dec 10, 2020

hijamoya commented Dec 10, 2020

voidzcy commented Dec 10, 2020 • edited Loading

AntKos commented Apr 17, 2019 •

edited

Loading

voidzcy commented Apr 19, 2019 •

edited

Loading

AntKos commented Apr 23, 2019 •

edited

Loading

voidzcy commented Apr 24, 2019 •

edited

Loading

hijamoya commented Mar 9, 2020 •

edited

Loading

hijamoya commented Mar 10, 2020 •

edited

Loading

hijamoya commented Mar 27, 2020 •

edited

Loading

voidzcy commented Dec 10, 2020 •

edited

Loading