OCPBUGS-23514: Failing=Unknown upon long CO updating #1165
base: main
Conversation
@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
cc @dis016
Force-pushed from bd6ed4c to 577f975
Testing with build1: and build2:
### upgrade to build1
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-nmgvdzt/release:latest --force --allow-explicit-upgrade
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 41m Working towards 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest:
711 of 903 done (78% complete), waiting on image-registry
### upgrade to build2
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-zx89yxb/release:latest --force --allow-explicit-upgrade --allow-upgrade-with-warnings
### Issue1: After a couple of mins, we see "longer than expected" on etcd and kube-apiserver.
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 47m Working towards 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest:
111 of 903 done (12% complete), waiting on etcd, kube-apiserver over 30 minutes which is longer than expected
### Be patient: Expected result showed up.
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 99m Working towards 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected
### Issue2: status-command showed nothing about image-registry.
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=operators
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment: Progressing
Target Version: 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest (from incomplete 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest)
Completion: 85% (29 operators updated, 0 updating, 5 waiting)
Duration: 56m (Est. Time Remaining: 1h22m)
Operator Health: 34 Healthy
Control Plane Nodes
NAME ASSESSMENT PHASE VERSION EST MESSAGE
ip-10-0-11-19.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-26-148.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-95-12.ec2.internal Outdated Pending 4.19.0-ec.2 ?
= Worker Upgrade =
WORKER POOL ASSESSMENT COMPLETION STATUS
worker Pending 0% (0/3) 3 Available, 0 Progressing, 0 Draining
Worker Pool Nodes: worker
NAME ASSESSMENT PHASE VERSION EST MESSAGE
ip-10-0-24-13.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-54-250.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-80-23.ec2.internal Outdated Pending 4.19.0-ec.2 ?
= Update Health =
SINCE LEVEL IMPACT MESSAGE
55m56s Warning None Previous update to 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest never completed, last complete update was 4.19.0-ec.2
Run with --details=health for additional description and links to related online documentation
$ oc get co kube-apiserver image-registry
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-apiserver 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest True False False 130m
image-registry 4.19.0-ec.2 True False False 124m
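As a side note for reproducing the check above: the operators that have not reached the target yet can be read from the `operator` entry in each ClusterOperator's `status.versions`. A minimal sketch using openshift/client-go, assuming a kubeconfig in `$KUBECONFIG`; the target string is copied from the transcript and this is only a cross-check, not part of the PR:

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

func main() {
	// Target version taken from the transcript above; adjust for your cluster.
	target := "4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest"

	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)

	cos, err := client.ConfigV1().ClusterOperators().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, co := range cos.Items {
		for _, v := range co.Status.Versions {
			// The operand named "operator" carries the operator's own version.
			if v.Name == "operator" && v.Version != target {
				fmt.Printf("%s is still at %s\n", co.Name, v.Version)
			}
		}
	}
}
```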
Force-pushed from 54e9f08 to 34a5a59
Triggered build3: build 4.19,openshift/cluster-image-registry-operator#1184,openshift/cluster-version-operator#1165
Repeated the test, updating to build1 and then build3:
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-r67x3s2/release:latest --force --allow-explicit-upgrade --allow-upgrade-with-warnings
### the non-zero guard on the CO update start times seems to be working, as issue1 is gone
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 34m Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 111 of 903 done (12% complete), waiting on etcd, kube-apiserver
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 75m Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest:
711 of 903 done (78% complete), waiting on image-registry
### be patient and there it goes
$ oc get clusterversion version
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.19.0-ec.2 True True 90m Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected
$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected
Upgradeable=False
Reason: UpdateInProgress
Message: An update is already in progress and the details are in the Progressing condition
Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.19
warning: Cannot display available updates:
Reason: VersionNotFound
Message: Unable to retrieve available updates: currently reconciling cluster version 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest not found in the "candidate-4.19" channel
$ oc get co kube-apiserver image-registry
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
kube-apiserver 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest True False False 121m
image-registry 4.19.0-ec.2 True False False 113m
### issue2 is still there as expected
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=all
Unable to fetch alerts, ignoring alerts in 'Update Health': failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment: Progressing
Target Version: 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest (from incomplete 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest)
Completion: 85% (29 operators updated, 0 updating, 5 waiting)
Duration: 57m (Est. Time Remaining: 18m)
Operator Health: 34 Healthy
Control Plane Nodes
NAME ASSESSMENT PHASE VERSION EST MESSAGE
ip-10-0-11-67.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-34-60.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-71-225.ec2.internal Outdated Pending 4.19.0-ec.2 ?
= Worker Upgrade =
WORKER POOL ASSESSMENT COMPLETION STATUS
worker Pending 0% (0/3) 3 Available, 0 Progressing, 0 Draining
Worker Pool Nodes: worker
NAME ASSESSMENT PHASE VERSION EST MESSAGE
ip-10-0-24-151.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-57-13.ec2.internal Outdated Pending 4.19.0-ec.2 ?
ip-10-0-97-93.ec2.internal Outdated Pending 4.19.0-ec.2 ?
= Update Health =
Message: Previous update to 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest never completed, last complete update was 4.19.0-ec.2
Since: 56m59s
Level: Warning
Impact: None
Reference: https://docs.openshift.com/container-platform/latest/updating/troubleshooting_updates/gathering-data-cluster-update.html#gathering-clusterversion-history-cli_troubleshooting_updates
Resources:
clusterversions.config.openshift.io: version
Description: Current update to 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest was initiated while the previous update to version 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest was still in progress

I think issue 2 above is caused by
Yeah, I agree - the crash is actually the easy case, because then there is at least some symptom that can be noticed and it is somewhat clear that if CVO says it is waiting for an
Surfacing the condition in the message makes the situation slightly better, but my concern is that we cannot easily consume this data for the Status API / command. I'd like to come up with at least something. Personally I would do
I do not want this for the reasons above.
I am going to do the following (the other three options above may lead to questions/trouble from users; I might come back to them if neither of the following goes through, knock knock knock):
SGTM
Force-pushed from 34a5a59 to 1c87755
@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
Triggered a build with:
Then upgrade from a cluster of version
The "failing" in the Message is not very precise, but I think we can fix it in the status cmd. Also took some screenshots at the time of "longer than expected" (screenshots omitted).
/test okd-scos-e2e-aws-ovn
[cvo#1165](openshift/cluster-version-operator#1165) introduced uncertainty of the Failing condition (`Failing=Unknown`). This PR adjusts the impact summary in updateInsight accordingly.
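For readers unfamiliar with the status command's health insights: they pair an impact level with a short summary that the command prints under "Update Health". The sketch below uses simplified stand-in types (the real definitions live in oc's `adm upgrade status` code and are richer; the names and wording here are illustrative, not what the linked PR merges) to show the idea of adjusting the summary when Failing is merely Unknown rather than True:

```go
package main

import "fmt"

// Simplified stand-ins for the status command's insight types.
type impactLevel string

const (
	infoLevel    impactLevel = "Info"
	warningLevel impactLevel = "Warning"
)

type updateInsight struct {
	level   impactLevel
	summary string
}

// failingInsight picks a summary depending on whether Failing is True or
// merely Unknown (the case introduced by cvo#1165). Wording is illustrative.
func failingInsight(failingStatus, message string) updateInsight {
	switch failingStatus {
	case "True":
		return updateInsight{level: warningLevel, summary: "Cluster Version Operator is failing: " + message}
	case "Unknown":
		return updateInsight{level: warningLevel, summary: "Cluster Version Operator may be failing (status unknown): " + message}
	default:
		return updateInsight{level: infoLevel, summary: "Update is progressing"}
	}
}

func main() {
	i := failingInsight("Unknown", "waiting on image-registry over 30 minutes which is longer than expected")
	fmt.Printf("[%s] %s\n", i.level, i.summary)
}
```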
Force-pushed from 1c87755 to ba6bfdc
Force-pushed from ba6bfdc to ba7f4f7
/retest-required
I still think we should not communicate through the condition message but through some of the machine-readable fields (or introduce explicit data to pass) #1165 (comment)
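To make the distinction in this review comment concrete: one simple form of machine-readable signaling would be a dedicated Reason on the Failing=Unknown condition, so consumers such as `oc adm upgrade status` do not have to parse the human-readable Message. A minimal sketch assuming the openshift/api config/v1 types; the locally declared condition type constant and the "SlowClusterOperator" reason string are illustrative, not necessarily what the PR ends up with:

```go
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// "Failing" is a CVO-specific ClusterVersion condition type; the constant is
// declared locally for this sketch.
const clusterStatusFailing = configv1.ClusterStatusConditionType("Failing")

// slowOperatorCondition builds a Failing=Unknown condition carrying a
// machine-readable Reason in addition to the free-form Message.
func slowOperatorCondition(operator string) configv1.ClusterOperatorStatusCondition {
	return configv1.ClusterOperatorStatusCondition{
		Type:               clusterStatusFailing,
		Status:             configv1.ConditionUnknown,
		Reason:             "SlowClusterOperator", // hypothetical reason string
		Message:            fmt.Sprintf("waiting on %s over 30 minutes which is longer than expected", operator),
		LastTransitionTime: metav1.Now(),
	}
}

func main() {
	c := slowOperatorCondition("image-registry")
	fmt.Printf("%s=%s reason=%s message=%q\n", c.Type, c.Status, c.Reason, c.Message)
}
```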
Force-pushed from 01cef29 to 292e409
Force-pushed from 292e409 to 8892f42
LGTM
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: hongkailiu, petr-muller.
Thanks! I think this does not meet the
Hongkai did some decent testing: #1165 (comment)
/retest
@hongkailiu: The following test failed:
Hongkai's test SGTM; especially the break-upgrade approach from Petr's PR is wonderful and enlightening. After reading through the bug and the PR, the following test scenarios come to mind.
^^ is my personal understanding, so I'll leave it to Dinesh for more thoughts on the final pre-merge test.
Thanks @jiajliu, I also feel the above scenarios are good for this bug.
When it takes too long (90m+ for machine-config and 30m+ for
others) to upgrade a cluster operator, ClusterVersion shows
a message indicating that the upgrade might have hit some issue.

This covers the case in the related OCPBUGS-23538: for some
reason, the pod under the deployment that manages the CO hit
CrashLoopBackOff. The Deployment controller does not give useful
conditions in this situation [1]; otherwise, checkDeploymentHealth [2]
would detect it.

Instead of the CVO figuring out the underlying pod's
CrashLoopBackOff, which might be better implemented by the
Deployment controller, the expectation is that a cluster admin
starts to dig into the cluster when such a message pops up.

In addition to the condition's message, we propagate Failing=Unknown
to make it available for other automation, such as the update-status
command.
[1]. kubernetes/kubernetes#106054
[2]. lib/resourcebuilder/apps.go, lines 79 to 136 at commit 08c0459 (openshift/cluster-version-operator)
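To tie the commit message back to the testing notes above: the check combines a per-operator threshold (90m for machine-config, 30m for the rest) with the non-zero guard on the CO update start time that fixed issue1. A minimal, self-contained sketch; the function names and values are taken from this page, not from the CVO's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// slowThreshold returns how long an operator update may run before it is
// reported as "longer than expected": 90m for machine-config, 30m otherwise
// (values from the commit message above).
func slowThreshold(operator string) time.Duration {
	if operator == "machine-config" {
		return 90 * time.Minute
	}
	return 30 * time.Minute
}

// updatingTooLong reports whether the operator has been updating for longer
// than its threshold. A zero start time means the update has not been
// observed starting yet, so it is never flagged (the "non-zero guard"
// mentioned in the test notes).
func updatingTooLong(operator string, started, now time.Time) bool {
	if started.IsZero() {
		return false
	}
	return now.Sub(started) > slowThreshold(operator)
}

func main() {
	now := time.Now()
	fmt.Println(updatingTooLong("image-registry", now.Add(-45*time.Minute), now)) // true
	fmt.Println(updatingTooLong("machine-config", now.Add(-45*time.Minute), now)) // false
	fmt.Println(updatingTooLong("etcd", time.Time{}, now))                        // false: zero start time
}
```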