OCPBUGS-44693: Revert "disable ResilientWatchCacheInitialization feature" #2192

@benluddy benluddy commented Jan 30, 2025

When this feature is enabled, watch requests that are to be served from the watch cache immediately return 429 if the cache is not initialized, and the client retries. When disabled, the same watch requests "hang" until they either time out or complete successfully.
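
The difference in request volume can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the apiserver's code: `WatchCache`, `count_watch_requests`, the tick-based clock, and the fixed retry interval are all hypothetical names and simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass
class WatchCache:
    """Toy stand-in for a watch cache that becomes ready at a given tick."""
    ready_at: int

    def initialized(self, now: int) -> bool:
        return now >= self.ready_at


def count_watch_requests(cache: WatchCache, resilient_init: bool,
                         retry_interval: int = 1, timeout: int = 60) -> int:
    """Return how many watch requests one client issues within `timeout` ticks.

    resilient_init=True models the feature being enabled: an uninitialized
    cache answers 429 immediately and the client retries (assumed here to
    happen after a fixed retry_interval).
    resilient_init=False models the feature being disabled: the request
    hangs until the cache is ready or the timeout elapses.
    """
    now = 0
    requests = 0
    while now < timeout:
        requests += 1
        if cache.initialized(now):
            return requests  # watch established
        if resilient_init:
            now += retry_interval  # 429 received, try again later
        else:
            # The request blocks until the cache is ready or it times out.
            now = min(cache.ready_at, timeout)
            if cache.initialized(now):
                return requests
    return requests


slow_cache = WatchCache(ready_at=5)
print(count_watch_requests(slow_cache, resilient_init=True))   # several requests
print(count_watch_requests(slow_cache, resilient_init=False))  # a single request
```

In this toy model the client ends up with a working watch at the same time either way; what changes is how many separate requests are issued before it gets there.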

There is an OCP test that counts the number of watch requests during a job on a per-user basis by scraping audit logs. The test fails if a user exceeds an arbitrary threshold that has been selected based on historical observations. With this feature enabled, any issue that delays watch cache initialization or forces a watch cache to reinitialize now results in an increase in the number of watch requests appearing in the audit logs (due to the retries), which in turn causes the test thresholds to be exceeded.
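
The kind of check described above can be sketched as follows. This is an illustrative approximation, not the actual OCP test: the function names, the threshold values, and the choice to deduplicate by `auditID` are assumptions. The event fields used (`verb`, `auditID`, `user.username`) are those of Kubernetes `audit.k8s.io/v1` audit events written as JSON lines.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def count_watches_per_user(audit_lines: Iterable[str]) -> Counter:
    """Count distinct watch requests per username from JSON-lines audit events."""
    seen_ids = set()
    counts: Counter = Counter()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("verb") != "watch":
            continue
        audit_id = event.get("auditID")
        if audit_id is not None:
            if audit_id in seen_ids:
                continue  # same request logged at another stage
            seen_ids.add(audit_id)
        username = event.get("user", {}).get("username", "<unknown>")
        counts[username] += 1
    return counts


def users_over_threshold(counts: Counter, thresholds: Dict[str, int],
                         default_limit: int) -> Dict[str, int]:
    """Return the users whose watch count exceeds their (hypothetical) limit."""
    return {user: n for user, n in counts.items()
            if n > thresholds.get(user, default_limit)}
```

Because every retried watch is a new request with its own audit entry, a client that retries after 429 responses is counted once per attempt, which is how delayed cache initialization pushes a user over the limit.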

This was temporarily disabled for kube-apiserver to improve the CI signal-to-noise ratio during the 1.31 rebase. It was not disabled for openshift-apiserver.

Sample job from the 1.31 rebase process before the feature was disabled: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-kube-apiserver-operator-1734-openshift-kubernetes-2055-openshift-cluster-kube-apiserver-operator-1734-nightly-4.18-e2e-aws-ovn-single-node-serial/1835775665903767552

@openshift-ci-robot openshift-ci-robot added the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Jan 30, 2025
@openshift-ci-robot

@benluddy: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@benluddy
Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 30, 2025
@openshift-ci openshift-ci bot requested review from jerpeter1 and tkashem January 30, 2025 22:25
@openshift-ci openshift-ci bot added the vendor-update Touching vendor dir or related files label Jan 30, 2025
@benluddy benluddy force-pushed the drop-temporary-disablement-resilientwatchcacheinitialization branch from 3a3459a to c07d5a2 Compare March 10, 2025 19:20
@benluddy
Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 10, 2025
@openshift-ci-robot

@benluddy: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 10, 2025
@benluddy
Author

/assign @bertinatto

@benluddy
Author

/cc @p0lyn0mial

@openshift-ci openshift-ci bot requested a review from p0lyn0mial March 10, 2025 19:21
@benluddy benluddy changed the title Revert "disable ResilientWatchCacheInitialization feature" OCPBUGS-44693: Revert "disable ResilientWatchCacheInitialization feature" Mar 10, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 10, 2025
@openshift-ci-robot

@benluddy: This pull request references Jira Issue OCPBUGS-44693, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gangwgr

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from gangwgr March 10, 2025 20:13

openshift-ci bot commented Mar 10, 2025

@benluddy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/verify-commits | c07d5a2 | link | true | /test verify-commits |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@bertinatto
Member

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance 10


openshift-ci bot commented Mar 10, 2025

@bertinatto: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7cab8890-fe06-11ef-9ce0-fd46b737721c-0

@bertinatto
Member

/payload 4.19 nightly blocking


openshift-ci bot commented Mar 10, 2025

@bertinatto: trigger 12 job(s) of type blocking for the nightly release of OCP 4.19

  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.19-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-driver-toolkit
  • periodic-ci-openshift-release-master-nightly-4.19-fips-payload-scan
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-bm
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8d03f380-fe06-11ef-96de-41de0e739063-0

@bertinatto
Member

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance 10

The previous run failed with:

time="2025-03-11T04:42:10.099Z" level=info msg="found 0 finished jobRuns:  and 0 unfinished jobRuns: "


openshift-ci bot commented Mar 11, 2025

@bertinatto: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9ae043d0-fe67-11ef-8c58-f8ae53f72ac7-0

@bertinatto
Member

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance 10


openshift-ci bot commented Mar 11, 2025

@bertinatto: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6a170ce0-fea1-11ef-8cb8-136ff5981dc6-0

@bertinatto
Member

/payload 4.19 nightly informing

@openshift openshift deleted a comment from openshift-ci bot Mar 11, 2025
@openshift openshift deleted a comment from openshift-ci bot Mar 11, 2025
@openshift openshift deleted a comment from openshift-ci bot Mar 11, 2025
@bertinatto
Member

/payload 4.19 nightly informing


openshift-ci bot commented Mar 11, 2025

@bertinatto: trigger 65 job(s) of type informing for the nightly release of OCP 4.19

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-compact-fips
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-single-node-ipv6
  • periodic-ci-openshift-release-master-nightly-4.19-console-aws
  • periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.19-periodics-e2e-aws
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-csi
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-cgroupsv2
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-fips
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node-csi
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node-serial
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node-techpreview-serial
  • periodic-ci-openshift-release-master-nightly-4.19-upgrade-from-stable-4.18-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade-out-of-change
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-upi
  • periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.19-periodics-e2e-azure
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-azure-csi
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-azure-ovn-upgrade-out-of-change
  • periodic-ci-openshift-release-master-cnv-nightly-4.19-deploy-azure-kubevirt-ovn
  • periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.19-periodics-e2e-gcp
  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-csi
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-rt
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-bm-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-dualstack-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-ipv6-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-serial-ipv4
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-serial-virtualmedia
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-upgrade-from-stable-4.18-e2e-metal-ipi-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-serial-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-serial-ovn-dualstack
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-upgrade-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.19-upgrade-from-stable-4.18-e2e-metal-ipi-upgrade-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ovn-assisted
  • periodic-ci-openshift-release-master-nightly-4.19-metal-ovn-single-node-recert-cluster-rename
  • periodic-ci-openshift-osde2e-main-nightly-4.19-osd-aws
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-osd-ccs-gcp
  • periodic-ci-openshift-osde2e-main-nightly-4.19-osd-gcp
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-proxy
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ovn-single-node-live-iso
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-telco5g
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-csi
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-serial
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-techpreview-serial
  • periodic-ci-openshift-release-master-ci-4.19-e2e-vsphere-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-vsphere-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-upi
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-ovn-upi-serial
  • periodic-ci-openshift-release-master-nightly-4.19-e2e-vsphere-static-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b1236530-feb4-11ef-8fa4-da9e74b5539a-0

@bertinatto
Member

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance 10


openshift-ci bot commented Mar 12, 2025

@bertinatto: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b7405660-fed9-11ef-9e61-89eec6697b98-0

@benluddy benluddy force-pushed the drop-temporary-disablement-resilientwatchcacheinitialization branch from c07d5a2 to 12b2c09 Compare March 12, 2025 13:02
@openshift-ci-robot

@benluddy: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@bertinatto
Member

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance 10


openshift-ci bot commented Mar 12, 2025

@bertinatto: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c882c300-ff64-11ef-8455-d7605d0b6295-0

@bertinatto
Member

/payload-aggregate periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance 10


openshift-ci bot commented Mar 13, 2025

@bertinatto: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ccb6b400-fff9-11ef-9b2a-c23d7650b39f-0

@bertinatto
Member

/retest-required

@bertinatto
Member

  1. I manually checked all nightly blocking/informing jobs, and none of them failed the "[sig-arch][Late] operators should not create watch channels very often" test.
  2. The aggregated periodic-ci-openshift-release-master-nightly-4.19-e2e-agent-ha-dualstack-conformance job, which was previously close to perma-failing, is green as well.
  3. Component Readiness also shows no regressions.

Based on this, I agreed with TRT we should give it a shot. We'll revert if things get bad.

/remove-label backports/unvalidated-commits
/label backports/validated-commits
/lgtm

@openshift-ci openshift-ci bot added backports/validated-commits Indicates that all commits come to merged upstream PRs. lgtm Indicates that a PR is ready to be merged. and removed backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. labels Mar 13, 2025

openshift-ci bot commented Mar 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, bertinatto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit d6f2dd2 into openshift:master Mar 13, 2025
20 of 22 checks passed
@openshift-ci-robot

@benluddy: Jira Issue OCPBUGS-44693: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-44693 has not been moved to the MODIFIED state.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-pod
This PR has been included in build openshift-enterprise-pod-container-v4.19.0-202503131810.p0.gd6f2dd2.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot

[ART PR BUILD NOTIFIER]

Distgit: kube-proxy
This PR has been included in build kube-proxy-container-v4.19.0-202503131810.p0.gd6f2dd2.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-hyperkube
This PR has been included in build openshift-enterprise-hyperkube-container-v4.19.0-202503131810.p0.gd6f2dd2.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot

[ART PR BUILD NOTIFIER]

Distgit: ose-installer-kube-apiserver-artifacts
This PR has been included in build ose-installer-kube-apiserver-artifacts-container-v4.19.0-202503131810.p0.gd6f2dd2.assembly.stream.el9.
All builds following this will include this PR.

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • backports/validated-commits: Indicates that all commits come to merged upstream PRs.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • vendor-update: Touching vendor dir or related files