
Assertion Failure in 1.16.0 for code that worked in 1.15.0 #1598


Closed
edelabar opened this issue May 18, 2023 · 8 comments · Fixed by #1603
Labels
kind/support Adopter support requests.

Comments

@edelabar

What are you trying to achieve?

We have a GRPC channel in our iOS app that our load balancer times out after 2 minutes of inactivity. The following code worked until 1.16.0:

let eventsStream = client.readEvents(readEventsRequest)
do {
    for try await grpcEvent in eventsStream {
        delegate?.receivedEvent(grpcEvent)
    }
} catch {
    // Timeout is configured on the load balancer, currently set to 2 min.
    NSLog("Disconnected from GRPC Channel: \(error)")
    // Restart the eventsStream
}

The channel is relatively inactive, so when the load balancer disconnects we just reestablish the connection. This worked fine from roughly grpc-swift 1.8.2 through 1.15.0, but as of 1.16.0, when the error is caught, the code that closes the connection also fails an assertion in the GRPC package target, which crashes the app.
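
As a minimal sketch (reusing the names from the snippet above; the no-backoff retry policy is hypothetical), the restart is just the same loop wrapped so it reestablishes the stream after each failure:

func listenForEvents() async {
    while !Task.isCancelled {
        let eventsStream = client.readEvents(readEventsRequest)
        do {
            for try await grpcEvent in eventsStream {
                delegate?.receivedEvent(grpcEvent)
            }
        } catch {
            // Timeout is configured on the load balancer; log, then loop
            // around to reestablish the stream.
            NSLog("Disconnected from GRPC Channel: \(error)")
        }
    }
}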

The assertion failure itself doesn't hit our code (it's all on a background thread), but the failing code path passes through a change introduced in 1.16.0, which is what led me to revert to 1.15.0. The full commit is here: 75b390e, but unfortunately I don't have enough time or understanding to dig into the guts of all this. Any help is greatly appreciated.

What have you tried so far?

I've looked through the docs to confirm I'm doing things correctly, and best I can tell I am. Reverting to 1.15.0 solves the problem for now, but I was hoping to figure out if I was doing something wrong or if a regression was introduced in 1.16.0.

edelabar added the kind/support label May 18, 2023
@glbrntt
Collaborator

glbrntt commented May 19, 2023

Interesting, we must not be correctly counting streams somehow. We appear to be over-returning streams; we'll need to look into this to understand it better.

@edelabar
Author

If it matters, that log prints: "Disconnected from GRPC Channel: Invalid HTTP response status, but gRPC status was present". If there is anything I can do to help, or anything I can try, please let me know.

@glbrntt
Collaborator

glbrntt commented May 24, 2023

I wasn't able to reproduce this. @edelabar would you be able to run this with a connection pool delegate which logs all the relevant info (something like the snippet below)? This should hopefully provide enough context so we can understand how we reach the assertion failure.

import GRPC

struct LoggingPoolDelegate: GRPCConnectionPoolDelegate {
  func connectionAdded(id: GRPCConnectionID) {
    print(#function, id)
  }
  
  func connectionRemoved(id: GRPCConnectionID) {
    print(#function, id)
  }
  
  func startedConnecting(id: GRPCConnectionID) {
    print(#function, id)
  }
  
  func connectFailed(id: GRPCConnectionID, error: Error) {
    print(#function, id, error)
  }
  
  func connectSucceeded(id: GRPCConnectionID, streamCapacity: Int) {
    print(#function, id, streamCapacity)
  }
  
  func connectionUtilizationChanged(id: GRPCConnectionID, streamsUsed: Int, streamCapacity: Int) {
    print(#function, id, streamsUsed, streamCapacity)
  }
  
  func connectionQuiescing(id: GRPCConnectionID) {
    print(#function, id)
  }
  
  func connectionClosed(id: GRPCConnectionID, error: Error?) {
    print(#function, String(describing: error))
  }
}
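
For completeness, a hedged sketch of how such a delegate might be attached when the pool is created, assuming GRPCChannelPool.with(target:transportSecurity:eventLoopGroup:) and a configuration.delegate property; the host, port, and plaintext transport are placeholders for your own setup:

import GRPC
import NIOPosix

// Sketch only: attach the logging delegate when building the channel pool.
// Host, port, thread count, and plaintext transport are placeholders.
let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
let channel = try GRPCChannelPool.with(
  target: .host("notifications.example.com", port: 443),
  transportSecurity: .plaintext,
  eventLoopGroup: group
) { configuration in
  configuration.delegate = LoggingPoolDelegate()
}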

@edelabar
Author

edelabar commented May 25, 2023

Here are the logs I collected:

2023-05-24 16:48:08.336611-0400 Otter-QA[58272:531747] Disconnected from Notification Service listener! Retrying...: Invalid HTTP response status, but gRPC status was present [NotificationClientGRPC.swift:136]
🌊: notifications connectionClosed(id:error:) Optional(NIOHTTP2.NIOHTTP2Errors.NoSuchStream(streamID: HTTP2StreamID(1), location: "NIOHTTP2/ConnectionStreamsState.swift:305"))
🌊: notifications startedConnecting(id:) ObjectIdentifier(0x0000000130a0de00)
🌊: notifications connectionClosed(id:error:) nil
🌊: notifications connectionRemoved(id:) ObjectIdentifier(0x0000000130a0de00)
🌊: notifications connectionAdded(id:) ObjectIdentifier(0x0000000130a0ed10)
🌊: notifications startedConnecting(id:) ObjectIdentifier(0x0000000130a0ed10)
🌊: notifications connectSucceeded(id:streamCapacity:) ObjectIdentifier(0x0000000130a0ed10) 128
🌊: notifications connectionUtilizationChanged(id:streamsUsed:streamCapacity:) ObjectIdentifier(0x0000000130a0ed10) 1 128
[DATADOG SDK] 🐶 → 16:48:30.664 💡 (logging) No upload. Batch to upload: NO, System conditions: ✅
[DATADOG SDK] 🐶 → 16:49:25.670 💡 (logging) No upload. Batch to upload: NO, System conditions: ✅
🌊: transaction-history connectionClosed(id:error:) Optional(NIOHTTP2.NIOHTTP2Errors.NoSuchStream(streamID: HTTP2StreamID(1), location: "NIOHTTP2/ConnectionStreamsState.swift:305"))
🌊: transaction-history startedConnecting(id:) ObjectIdentifier(0x000000016b13ea40)
🌊: transaction-history connectionClosed(id:error:) nil
🌊: transaction-history connectionRemoved(id:) ObjectIdentifier(0x000000016b13ea40)
[DATADOG SDK] 🐶 → 16:50:20.668 💡 (logging) No upload. Batch to upload: NO, System conditions: ✅
{... repetitive DataDog logs removed ...}
[DATADOG SDK] 🐶 → 16:57:30.656 💡 (logging) No upload. Batch to upload: NO, System conditions: ✅
2023-05-24 16:58:09.826837-0400 Otter-QA[58272:544319] Disconnected from Notification Service listener! Retrying...: Invalid HTTP response status, but gRPC status was present [NotificationClientGRPC.swift:136]
🌊: notifications connectionClosed(id:error:) Optional(NIOHTTP2.NIOHTTP2Errors.NoSuchStream(streamID: HTTP2StreamID(1), location: "NIOHTTP2/ConnectionStreamsState.swift:305"))
GRPC/PoolManagerStateMachine+PerPoolState.swift:57: Assertion failed
2023-05-24 16:58:09.828524-0400 Otter-QA[58272:544317] GRPC/PoolManagerStateMachine+PerPoolState.swift:57: Assertion failed

I added the 🌊: {server_name} prefix to your delegate example to make the entries stand out a bit more. At some point our load balancer config changed, so the timeout is now 10 minutes, but this is the activity from a successfully handled timeout through to a failed one. I left some of the DataDog logs in for timestamps.

A bit more context: we're running multiple GRPC connections to multiple servers in the app. They each have their own GRPCChannelPool, but they share an EventLoopGroup. Only "notifications" uses streaming requests; all the other connections are just open-and-close RPCs, so in the case of this log the "transaction-history" client made a successful request shortly after the "notifications" client restarted.

Anything else I can try or provide?

glbrntt added a commit to glbrntt/grpc-swift that referenced this issue May 26, 2023
Motivation:

The connection pool manager manages a pool of connections per
event-loop. It spreads load across these pools by tracking how many
streams a pool has capacity for and how many streams are in use. To
facilitate this each pool reports back to the pool manager when streams
have been reserved and when they have been returned.

If connections are closed unexpectedly (due to an error, for example)
then the pool reports this in bulk. However when the streams are closed
they are also reported back to the pool manager. This means the manager
can end up thinking a pool has a negative number of reserved streams
which results in an assertion failure.

Modifications:

- Check if the connection a stream is being returned to is available
  before reporting stream closures to the pool manager.

Result:

- Better stream accounting.
- Resolved grpc#1598
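
To make the failure mode concrete, here is a purely illustrative sketch (hypothetical types, not grpc-swift internals) of how reporting the same stream back twice, once in bulk when the connection errors and once when the stream itself closes, drives the reserved-stream count negative and trips an assertion:

// Hypothetical illustration only; grpc-swift's real state machine differs.
struct PerPoolState {
    var reservedStreams = 0

    mutating func streamReserved() {
        reservedStreams += 1
    }

    // Bulk return reported when a connection dies unexpectedly.
    mutating func connectionClosed(returningStreams count: Int) {
        reservedStreams -= count
    }

    // Per-stream return reported when each stream later closes.
    mutating func streamReturned() {
        reservedStreams -= 1
        assert(reservedStreams >= 0, "more streams returned than reserved")
    }
}

var pool = PerPoolState()
pool.streamReserved()                       // 1 stream in use
pool.connectionClosed(returningStreams: 1)  // connection error: bulk return, count back to 0
pool.streamReturned()                       // stream close also reported: -1, assertion fails

The fix referenced above avoids the second report by checking whether the stream's connection is still available before telling the pool manager about the closure.
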
@glbrntt
Collaborator

glbrntt commented May 26, 2023

Thanks for providing that. I think I've tracked down the bug and have a fix in #1603.

Lukasa pushed a commit that referenced this issue May 26, 2023
@edelabar
Author

@glbrntt Sorry for the delay, but I just confirmed against the main branch and it looks like it's working for me. Thanks for the help! ETA on a versioned release?

@glbrntt
Collaborator

glbrntt commented May 30, 2023

Awesome, thanks for confirming. Likely to be end of week / early next week.

@edelabar
Author

Thanks!

WendellXY pushed a commit to sundayfun/grpc-swift that referenced this issue Aug 24, 2023
pinlin168 pushed a commit to sundayfun/grpc-swift that referenced this issue Aug 24, 2023