OCPBUGS-23514: Failing=Unknown upon long CO updating #1165

Open
wants to merge 3 commits into base: main

Conversation

@hongkailiu (Member) commented Feb 28, 2025

When a cluster operator takes too long to upgrade (90m+ for machine-config, 30m+ for the others), the ClusterVersion shows a message indicating that the upgrade might have hit an issue.

This covers the case in the related OCPBUGS-23538: for some reason, the pod under the Deployment that manages the CO hits CrashLoopBackOff. The Deployment controller does not surface useful conditions in this situation [1]; otherwise, checkDeploymentHealth [2] would detect it.

Rather than having the CVO figure out the underlying pod's CrashLoopBackOff, which is arguably better implemented in the Deployment controller, the expectation is that the cluster admin starts digging into the cluster when such a message pops up.

In addition to the condition's message, we propagate Failing=Unknown to make the signal available to other automation, such as the update-status command.

[1]. kubernetes/kubernetes#106054

[2].

func (b *builder) checkDeploymentHealth(ctx context.Context, deployment *appsv1.Deployment) error {
	if b.mode == InitializingMode {
		return nil
	}
	iden := fmt.Sprintf("%s/%s", deployment.Namespace, deployment.Name)
	if deployment.DeletionTimestamp != nil {
		return fmt.Errorf("deployment %s is being deleted", iden)
	}
	var availableCondition *appsv1.DeploymentCondition
	var progressingCondition *appsv1.DeploymentCondition
	var replicaFailureCondition *appsv1.DeploymentCondition
	for idx, dc := range deployment.Status.Conditions {
		switch dc.Type {
		case appsv1.DeploymentProgressing:
			progressingCondition = &deployment.Status.Conditions[idx]
		case appsv1.DeploymentAvailable:
			availableCondition = &deployment.Status.Conditions[idx]
		case appsv1.DeploymentReplicaFailure:
			replicaFailureCondition = &deployment.Status.Conditions[idx]
		}
	}
	if replicaFailureCondition != nil && replicaFailureCondition.Status == corev1.ConditionTrue {
		return &payload.UpdateError{
			Nested:  fmt.Errorf("deployment %s has some pods failing; unavailable replicas=%d", iden, deployment.Status.UnavailableReplicas),
			Reason:  "WorkloadNotProgressing",
			Message: fmt.Sprintf("deployment %s has a replica failure %s: %s", iden, replicaFailureCondition.Reason, replicaFailureCondition.Message),
			Name:    iden,
		}
	}
	if availableCondition != nil && availableCondition.Status == corev1.ConditionFalse && progressingCondition != nil && progressingCondition.Status == corev1.ConditionFalse {
		return &payload.UpdateError{
			Nested:  fmt.Errorf("deployment %s is not available and not progressing; updated replicas=%d of %d, available replicas=%d of %d", iden, deployment.Status.UpdatedReplicas, deployment.Status.Replicas, deployment.Status.AvailableReplicas, deployment.Status.Replicas),
			Reason:  "WorkloadNotAvailable",
			Message: fmt.Sprintf("deployment %s is not available %s (%s) or progressing %s (%s)", iden, availableCondition.Reason, availableCondition.Message, progressingCondition.Reason, progressingCondition.Message),
			Name:    iden,
		}
	}
	if progressingCondition != nil && progressingCondition.Status == corev1.ConditionFalse && progressingCondition.Reason == "ProgressDeadlineExceeded" {
		return &payload.UpdateError{
			Nested:  fmt.Errorf("deployment %s is %s=%s: %s: %s", iden, progressingCondition.Type, progressingCondition.Status, progressingCondition.Reason, progressingCondition.Message),
			Reason:  "WorkloadNotProgressing",
			Message: fmt.Sprintf("deployment %s is %s=%s: %s: %s", iden, progressingCondition.Type, progressingCondition.Status, progressingCondition.Reason, progressingCondition.Message),
			Name:    iden,
		}
	}
	if availableCondition == nil && progressingCondition == nil && replicaFailureCondition == nil {
		klog.Warningf("deployment %s is not setting any expected conditions, and is therefore in an unknown state", iden)
	}
	return nil
}
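
For context, here is a minimal sketch of the threshold behavior described above; the function names, the way the start time is tracked, and the wiring into conditions are assumptions for illustration, not the actual cluster-version-operator implementation:

// Sketch only: hypothetical helper illustrating the "longer than expected"
// check described in this PR. Names and structure are assumptions, not the
// real CVO code.
package main

import (
	"fmt"
	"time"
)

// slowThreshold is how long a cluster operator may take to update before the
// CVO hints that the update might be stuck.
func slowThreshold(operator string) time.Duration {
	if operator == "machine-config" {
		return 90 * time.Minute
	}
	return 30 * time.Minute
}

// slowOperatorMessage returns a non-empty message when the operator has been
// updating longer than expected. A caller would append it to the Progressing
// message and set Failing=Unknown with Reason=SlowClusterOperator.
func slowOperatorMessage(operator string, updateStart, now time.Time) string {
	if updateStart.IsZero() {
		// Non-zero guard on the CO update start time (see the test notes below).
		return ""
	}
	threshold := slowThreshold(operator)
	if now.Sub(updateStart) > threshold {
		return fmt.Sprintf("waiting on %s over %d minutes which is longer than expected",
			operator, int(threshold.Minutes()))
	}
	return ""
}

func main() {
	start := time.Now().Add(-45 * time.Minute)
	fmt.Println(slowOperatorMessage("image-registry", start, time.Now()))
}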

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 28, 2025
@openshift-ci-robot (Contributor)

@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

When a cluster operator takes too long to upgrade (90m+ for machine-config, 30m+ for the others), the ClusterVersion shows a message indicating that the upgrade might have hit an issue.

This covers the case in the related OCPBUGS-23538: for some reason, the pod under the Deployment that manages the CO hits CrashLoopBackOff. The Deployment controller does not surface useful conditions in this situation [1]; otherwise, checkDeploymentHealth [2] would detect it.

Rather than having the CVO figure out the underlying pod's CrashLoopBackOff, which is arguably better implemented in the Deployment controller, the expectation is that the cluster admin starts digging into the cluster when such a message pops up.

For now, we just modify the condition's message. We could propagate Failing=True if such requirements are collected from customers.

[1]. kubernetes/kubernetes#106054

[2]. checkDeploymentHealth (quoted in full in the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from petr-muller and wking February 28, 2025 01:56
@hongkailiu hongkailiu changed the title OCPBUGS-23514: Better a message upon long CO updating [wip]OCPBUGS-23514: Better a message upon long CO updating Feb 28, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 28, 2025
@hongkailiu (Member, Author)

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 28, 2025
@openshift-ci-robot (Contributor)

@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from jiajliu February 28, 2025 01:58
@jiajliu commented Feb 28, 2025

cc @dis016

@hongkailiu (Member, Author)

Testing with

launch 4.19.0-ec.2 aws

and build1:

build 4.19,openshift/cluster-image-registry-operator#1184

and build2:

build 4.19,openshift/cluster-image-registry-operator#1184,#1165

### upgrade to build1
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-nmgvdzt/release:latest --force --allow-explicit-upgrade

$ oc get clusterversion version                                               
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          41m     Working towards 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest:
 711 of 903 done (78% complete), waiting on image-registry

### upgrade to build2
$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-zx89yxb/release:latest --force --allow-explicit-upgrade --allow-upgrade-with-warnings

### Issue1: After a couple of mins, we see "longer than expected" on etcd and kube-apiserver.
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          47m     Working towards 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest:
 111 of 903 done (12% complete), waiting on etcd, kube-apiserver over 30 minutes which is longer than expected

### Be patient: Expected result showed up.
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          99m     Working towards 4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

### Issue2: status-command showed nothing about image-registry.
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=operators
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest (from incomplete 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest)
Completion:      85% (29 operators updated, 0 updating, 5 waiting)
Duration:        56m (Est. Time Remaining: 1h22m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-11-19.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-26-148.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-95-12.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-24-13.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-54-250.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-80-23.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?

= Update Health =
SINCE    LEVEL     IMPACT   MESSAGE
55m56s   Warning   None     Previous update to 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest never completed, last complete update was 4.19.0-ec.2

Run with --details=health for additional description and links to related online documentation

$ oc get co kube-apiserver image-registry
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.19.0-0.test-2025-02-28-152027-ci-ln-zx89yxb-latest   True        False         False      130m
image-registry   4.19.0-ec.2                                            True        False         False      124m

@hongkailiu hongkailiu force-pushed the OCPBUGS-23514 branch 5 times, most recently from 54e9f08 to 34a5a59 Compare March 1, 2025 05:12
@hongkailiu (Member, Author)

Triggered build3:

build 4.19,openshift/cluster-image-registry-operator#1184,openshift/cluster-version-operator#1165

Repeated the test, updating to build1 and then build3:

$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-r67x3s2/release:latest --force --allow-explicit-upgrade --allow-upgrade-with-warnings

### the non-zero guard on the CO update start times seems to be working, as issue1 is gone
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          34m     Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 111 of 903 done (12% complete), waiting on etcd, kube-apiserver

$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          75m     Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest:
711 of 903 done (78% complete), waiting on image-registry

### be patient and there it goes
$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          90m     Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

Upgradeable=False

  Reason: UpdateInProgress
  Message: An update is already in progress and the details are in the Progressing condition

Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.19
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest not found in the "candidate-4.19" channel

$ oc get co kube-apiserver image-registry
NAME             VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-apiserver   4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest   True        False         False      121m
image-registry   4.19.0-ec.2                                            True        False         False      113m

### issue2 is still there as expected
$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=all
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest (from incomplete 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest)
Completion:      85% (29 operators updated, 0 updating, 5 waiting)
Duration:        57m (Est. Time Remaining: 18m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-11-67.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-34-60.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-71-225.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                          ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-24-151.ec2.internal   Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-57-13.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-97-93.ec2.internal    Outdated     Pending   4.19.0-ec.2   ?

= Update Health =
Message: Previous update to 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest never completed, last complete update was 4.19.0-ec.2
  Since:       56m59s
  Level:       Warning
  Impact:      None
  Reference:   https://docs.openshift.com/container-platform/latest/updating/troubleshooting_updates/gathering-data-cluster-update.html#gathering-clusterversion-history-cli_troubleshooting_updates
  Resources:
    clusterversions.config.openshift.io: version
  Description: Current update to 4.19.0-0.test-2025-03-01-152723-ci-ln-r67x3s2-latest was initiated while the previous update to version 4.19.0-0.test-2025-02-27-234720-ci-ln-nmgvdzt-latest was still in progress

I think issue 2 above is caused by --details=operators working only when a CO's Progressing=True, and for CO/image-registry that was not the case at the time. We may want the status cmd to pick it up again from cv/version.
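
A rough sketch of that hypothesis (illustrative only, not the actual oc code; the filter shown is an assumption):

// Hypothetical filter illustrating why --details=operators might skip
// image-registry: if only operators currently reporting Progressing=True are
// listed as updating, an operator the CVO is still waiting on but that is not
// Progressing never shows up.
package main

import "fmt"

type operatorStatus struct {
	name        string
	progressing bool
}

func updatingOperators(operators []operatorStatus) []string {
	var out []string
	for _, op := range operators {
		if op.progressing {
			out = append(out, op.name)
		}
	}
	return out
}

func main() {
	ops := []operatorStatus{
		{name: "kube-apiserver", progressing: false},
		{name: "image-registry", progressing: false}, // CVO waits on it, but Progressing=False
	}
	fmt.Println(updatingOperators(ops)) // prints [] even though the update is stuck on image-registry
}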

@hongkailiu hongkailiu changed the title [wip]OCPBUGS-23514: Better a message upon long CO updating OCPBUGS-23514: Better a message upon long CO updating Mar 1, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 1, 2025
@petr-muller (Member) commented Mar 6, 2025

Instead of CVO's figuring out the underlying pod's CrashLoopBackOff which might be better to be implemented by deployment controller, it is expected that our cluster admin starts to dig into the cluster when such a message pops up.

Yeah, I agree - the crash is actually the easy case, because then there is at least some symptom that can be noticed and it is somewhat clear that if CVO says it is waiting for an operator-x, and you discover it is not running, then something is wrong. I think our fix needs to handle the worst case of this failure case - CVO waits for operator-x to bump a version and it never does that, but there is no bad symptom otherwise - operator-x is running, not degraded, available. Just no progress (like the reproducer I made).

For now, we just modify the condition's message. We could propagate Fail=True in case such requirements are collected from customers.

Surfacing the condition in the message makes the situation slightly better, but my concern is that we cannot easily consume this data for Status API / command. I want oc adm upgrade status to be at least a bit loud when there is a stuck update. In the current state, status cannot detect the stuck update, because all machine-readable parts of the API are identical to the happy update (Progressing=True, Failing=False).

I'd like to come up with at least something. Personally I would do Failing=True and would be happy about it :D Some proposals:

  • Failing=Unknown with a Reason=SlowClusterOperator or similar. Technically "Unknown" describes the state well, but I am not sure whether consumers can handle the Unknown value well (it would need to be tested with at least the Web Console; see the sketch after this list).
  • Progressing=Unknown with a Reason=SlowClusterOperator or similar. You could argue that if we wait for something to happen, and it does not happen for a long time, we do not know whether we are really progressing anymore. The downside is that Progressing now has the high-level meaning of "cluster started updating and has not finished yet", and it is likely that some consumers do not expect those semantics to change (i.e. they expect Progressing=True to hold consistently until the update is completed).
  • Progressing=False with a Reason=SlowClusterOperator or similar. A stronger variant of the above: if we do not see progress, then we are not progressing. Same downside, but stronger (Progressing=True means an ongoing update, Progressing=False means no ongoing update).
  • Progressing=True with a Reason=SlowClusterOperator or similar. A minimal variant that does not change important signals but at least allows us to differentiate Reason=HappyProgress from Reason=SlowClusterOperator. Technically this does not meet the condition contract, because the reason should be tied to the last transition, not to the current fine state (it should say why Progressing went from False to True, not what the current fine reason for being True is).
  • UpdateProcedureProgressing=False with a Reason=SlowClusterOperator. Basically a new condition with semantics similar to, but much finer than, Failing=True. If it is True, it means that everything is 100% healthy from the CVO standpoint. If it is False, we are waiting for something to happen, and the Reason and Message would say what.
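
To make the first option concrete, here is a minimal sketch of how a consumer such as the status command could detect the stuck update from machine-readable fields alone; the local Condition type stands in for the ClusterVersion status condition, and everything here is an assumption for illustration:

// Sketch: detecting a possibly-stuck update from ClusterVersion conditions,
// assuming the CVO sets Failing=Unknown with Reason=SlowClusterOperator while
// Progressing stays True. The Condition type is a stand-in, not the real API.
package main

import "fmt"

type Condition struct {
	Type    string // e.g. "Progressing", "Failing"
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// stuckUpdateMessage returns a non-empty message when the conditions indicate
// an update that is progressing but has been slow for longer than expected.
func stuckUpdateMessage(conditions []Condition) string {
	var progressing, failing *Condition
	for i := range conditions {
		switch conditions[i].Type {
		case "Progressing":
			progressing = &conditions[i]
		case "Failing":
			failing = &conditions[i]
		}
	}
	if progressing != nil && progressing.Status == "True" &&
		failing != nil && failing.Status == "Unknown" && failing.Reason == "SlowClusterOperator" {
		return fmt.Sprintf("update may be stuck: %s", failing.Message)
	}
	return ""
}

func main() {
	conds := []Condition{
		{Type: "Progressing", Status: "True", Reason: "ClusterOperatorUpdating"},
		{Type: "Failing", Status: "Unknown", Reason: "SlowClusterOperator",
			Message: "waiting on image-registry over 30 minutes which is longer than expected"},
	}
	fmt.Println(stuckUpdateMessage(conds))
}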

We may want the status cmd to pick it up again from cv/version.

I do not want this for the reasons above.

@hongkailiu (Member, Author)

I am going to do the following (the other three options above may lead to questions/trouble from users; I might come back to them if neither of the following goes through :knock :knock :knock):

  • Failing=Unknown: I would go for this (Plan A) if the console is happy with it OR the console's unhappiness can be easily fixed.
  • UpdateProcedureProgressing=False: Plan B, because learning/maintaining a new thing is always expensive for both us (maintainers) and others (users).

@petr-muller (Member)

SGTM

@openshift-ci-robot (Contributor)

@hongkailiu: This pull request references Jira Issue OCPBUGS-23514, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @dis016

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

(The PR description, quoted in full above.)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@hongkailiu hongkailiu changed the title OCPBUGS-23514: Better a message upon long CO updating OCPBUGS-23514: Failing=Unknown upon long CO updating Mar 7, 2025
@openshift-ci openshift-ci bot requested a review from dis016 March 7, 2025 20:31
@hongkailiu (Member, Author) commented Mar 8, 2025

Triggered a build with:

build 4.19,openshift/cluster-image-registry-operator#1184,openshift/cluster-version-operator#1165

Then upgraded from a cluster at version 4.19.0-ec.2:

$ oc adm upgrade --to-image registry.build06.ci.openshift.org/ci-ln-jyv0ydk/release:latest --force --allow-explicit-upgrade

$ oc get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.19.0-ec.2   True        True          56m     Working towards 4.19.0-0.test-2025-03-07-224529-ci-ln-jyv0ydk-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected

$ oc get clusterversion version -o yaml | yq '.status.conditions[]|select(.type=="Failing" or .type=="Progressing")'
{
  "lastTransitionTime": "2025-03-08T00:25:42Z",
  "message": "waiting on image-registry over 30 minutes which is longer than expected",
  "reason": "SlowClusterOperator",
  "status": "Unknown",
  "type": "Failing"
}
{
  "lastTransitionTime": "2025-03-07T23:30:24Z",
  "message": "Working towards 4.19.0-0.test-2025-03-07-224529-ci-ln-jyv0ydk-latest: 711 of 903 done (78% complete), waiting on image-registry over 30 minutes which is longer than expected",
  "reason": "ClusterOperatorUpdating",
  "status": "True",
  "type": "Progressing"
}

$ OC_ENABLE_CMD_UPGRADE_STATUS=true oc adm upgrade status --details=all
Unable to fetch alerts, ignoring alerts in 'Update Health':  failed to get alerts from Thanos: no token is currently in use for this session
= Control Plane =
Assessment:      Progressing
Target Version:  4.19.0-0.test-2025-03-07-224529-ci-ln-jyv0ydk-latest (from 4.19.0-ec.2)
Completion:      85% (29 operators updated, 0 updating, 5 waiting)
Duration:        58m (Est. Time Remaining: 15m)
Operator Health: 34 Healthy

Control Plane Nodes
NAME                                        ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-26-177.us-east-2.compute.internal   Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-58-15.us-east-2.compute.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-78-1.us-east-2.compute.internal     Outdated     Pending   4.19.0-ec.2   ?

= Worker Upgrade =

WORKER POOL   ASSESSMENT   COMPLETION   STATUS
worker        Pending      0% (0/3)     3 Available, 0 Progressing, 0 Draining

Worker Pool Nodes: worker
NAME                                        ASSESSMENT   PHASE     VERSION       EST   MESSAGE
ip-10-0-26-169.us-east-2.compute.internal   Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-3-173.us-east-2.compute.internal    Outdated     Pending   4.19.0-ec.2   ?
ip-10-0-81-168.us-east-2.compute.internal   Outdated     Pending   4.19.0-ec.2   ?

= Update Health =
Message: Cluster Version version is failing to proceed with the update (SlowClusterOperator)
  Since:       2m24s
  Level:       Warning
  Impact:      Update Stalled
  Reference:   https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ClusterOperatorDegraded.md
  Resources:
    clusterversions.config.openshift.io: version
  Description: waiting on image-registry over 30 minutes which is longer than expected

The "failing" in Message is not very precise. But I think we can fix it in the status cmd.
I am not sure about Reference tho.
(I will check those two in oc code later)

I also grabbed some screenshots at the time of "longer than expected":
The console did not show the message, but at least it was not upset about the changes in this PR.
"An update is already in progress and the details are in the Progressing condition" on the UI will direct the user to the right place (see the Progressing condition in the output above).

(Three console screenshots, taken 2025-03-07 at 7:33 PM.)

@hongkailiu (Member, Author)

/test okd-scos-e2e-aws-ovn

hongkailiu added a commit to hongkailiu/oc that referenced this pull request Mar 11, 2025
[cvo#1165](openshift/cluster-version-operator#1165)
introduced uncertainty of the Failing condition (`Failing=Unknown`).
This PR adjusts the impact summary in updateInsight accordingly.
hongkailiu added a commit to hongkailiu/oc that referenced this pull request Mar 11, 2025
[cvo#1165](openshift/cluster-version-operator#1165)
introduced uncertainty of the Failing condition (`Failing=Unknown`).
This PR adjusts the impact summary in updateInsight accordingly.
(The commit message repeats the PR description above; its reference [2] points to https://github.com/openshift/cluster-version-operator/blob/08c0459df5096e9f16fad3af2831b62d06d415ee/lib/resourcebuilder/apps.go#L79-L136.)
@hongkailiu (Member, Author)

/retest-required

@petr-muller (Member) left a comment

I still think we should not communicate through the condition message but through some of the machine-readable fields (or introduce explicit data to pass) #1165 (comment)

@hongkailiu hongkailiu force-pushed the OCPBUGS-23514 branch 3 times, most recently from 01cef29 to 292e409 Compare March 26, 2025 13:51
@petr-muller (Member) left a comment

LGTM

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 26, 2025
openshift-ci bot (Contributor) commented Mar 26, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 26, 2025
@petr-muller (Member)

Thanks! I think this does not meet the acknowledge-critical-fixes-only bar, so I'd wait, but in the meantime I wonder what we can do to pre-merge test this @jiajliu @dis016

Hongkai did some decent testing #1165 (comment)

@hongkailiu (Member, Author)

/retest

openshift-ci bot (Contributor) commented Mar 28, 2025

@hongkailiu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-agnostic-operator-devpreview
Commit: 8892f42
Required: false
Rerun command: /test e2e-agnostic-operator-devpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jiajliu commented Mar 31, 2025

but in the meantime I wonder what we can do to pre-merge test this @jiajliu @dis016

Hongkai's test SGTM; the break-the-upgrade approach from Petr's PR is especially wonderful and enlightening. After reading through the bug and the PR, the following test scenarios come to mind.

  1. Test that the Failing=Unknown condition appears as expected when a slow update happens on the MCO and on a non-MCO operator, using the approach introduced in the cluster-image-registry-operator PR
  2. Regression-test that the Failing=True condition still works well (not a slow-update issue, e.g. an operator degraded)

The above is my personal understanding, so I'll leave it to Dinesh for more thoughts on the final pre-merge test.

@dis016 commented Apr 2, 2025

Thanks @jiajliu, I also feel the above scenarios are good for this bug.
