Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bump grpc to 1.60.1, reconnect idle connections #3147

Merged

Conversation

joelanford
Copy link
Member

Description of the change:
When we detect that the catalog operator's GPRC client connections become IDLE, we now always attempt to reconnect. This wouldn't typically be necessary for GRPC connections (GRPC clients automatically reconnect to the server when the next RPC call is made). However OLM reports the state of the GRPC connection in the CatalogSource API and makes assumptions about the connection being READY (or not).

IDLE connections are sort of in an unknown limbo. For example, one reason for an idle connection is if the server disconnects. In that case the client doesn't actually know if the server is still there until it tries to reconnect. Another reason is if the client has not made an RPC call within the idle timeout duration. There is no way to distinguish looking at the connection state the reason that the connection is idle. OLM clients want to know whether the server is actually present and connected, so this PR ensures that we attempt re-establishment of the TCP connection when the GRPC connection goes idle.

Motivation for the change:
Fixes #3146

Also unblocks us from upgrading other dependencies that are incompatible with grpc v1.40

Architectural changes:
None

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 5, 2024
@@ -228,6 +228,11 @@ func (s *SourceStore) watch(ctx context.Context, key registry.CatalogKey, source
s.sources[key] = *src
s.sourcesLock.Unlock()

// always try to reconnect if the connection goes idle
if newState == connectivity.Idle {
source.Conn.Connect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this already no-ops if the underlying connections are not idle, so it might be better to call this unconditionally. A comment to note that this is non-blocking would be useful as well as it impacts the understanding of what consumers of s.notify see.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying this in #3144, which should bring OLM to the minimal version of GRPC necessary for the next set of updates

stevekuznetsov
stevekuznetsov previously approved these changes Jan 5, 2024
Copy link

openshift-ci bot commented Jan 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1 similar comment
Copy link

openshift-ci bot commented Jan 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joelanford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stevekuznetsov stevekuznetsov added this pull request to the merge queue Jan 8, 2024
Merged via the queue into operator-framework:master with commit e6d8fb0 Jan 8, 2024
@joelanford joelanford deleted the fix-flake-3146 branch January 8, 2024 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
Projects
None yet
3 participants