Bump k8s dependencies to 1.23 to align with o-f/api #2574

Closed
wants to merge 1 commit

Conversation

fgiloux
Contributor

@fgiloux fgiloux commented Jan 19, 2022

Signed-off-by: Frederic Giloux [email protected]

Description of the change:
Bump k8s dependencies to 1.23 to align with o-f/api
Also, as a side effect, a newer logr version: v0.4 => v1.2

Motivation for the change:

Alignment with o-f/api and with newer k8s versions.
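
For orientation, a bump like this is mostly a coordinated go.mod move; a minimal sketch (the module set and patch versions are illustrative, not this PR's exact diff):

    require (
        github.com/go-logr/logr v1.2.0
        k8s.io/api v0.23.0
        k8s.io/apiextensions-apiserver v0.23.0
        k8s.io/apimachinery v0.23.0
        k8s.io/client-go v0.23.0
        sigs.k8s.io/controller-runtime v0.11.0
    )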

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

Closes: #2573

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 19, 2022
@openshift-ci

openshift-ci bot commented Jan 19, 2022

Hi @fgiloux. Thanks for your PR.

I'm waiting for an operator-framework member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci

openshift-ci bot commented Jan 19, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fgiloux
To complete the pull request process, please assign timflannagan after the PR has been reviewed.
You can assign the PR to them by writing /assign @timflannagan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fgiloux
Contributor Author

fgiloux commented Jan 19, 2022

@timflannagan as discussed, here is the PR for the dependency bump. I am not sure why the build fails.
It seems to pick a different commit:

go build -mod=vendor -ldflags "-X github.com/operator-framework/operator-lifecycle-manager/pkg/version.GitCommit=495a1d5513067fce1fb60d456dd3e9cfc718df4a -X github.com/operator-framework/operator-lifecycle-manager/pkg/version.OLMVersion=`cat OLM_VERSION`" -tags "json1" -o bin/catalog github.com/operator-framework/operator-lifecycle-manager/cmd/catalog

Building locally succeeds; make build then uses GitCommit=121c31f17f4ea121a7c1a7a3fda58bd03446aedf

This is most likely the culprit (go 1.17 should be used):

golang                   x86_64   1.16.13-1.fc34               updates   608 k
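
For context on the mismatch: the GitCommit baked into the binary comes from whatever ref the build environment has checked out at build time, so a CI checkout and a local checkout can embed different hashes. A hypothetical Makefile excerpt illustrating the pattern (not necessarily OLM's exact Makefile):

    # The embedded commit is whatever HEAD resolves to in the build context.
    GIT_COMMIT := $(shell git rev-parse HEAD)
    OLM_VERSION := $(shell cat OLM_VERSION)
    LDFLAGS := -X github.com/operator-framework/operator-lifecycle-manager/pkg/version.GitCommit=$(GIT_COMMIT) \
               -X github.com/operator-framework/operator-lifecycle-manager/pkg/version.OLMVersion=$(OLM_VERSION)

    bin/catalog:
    	go build -mod=vendor -ldflags "$(LDFLAGS)" -tags "json1" -o bin/catalog ./cmd/catalog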

@fgiloux
Contributor Author

fgiloux commented Jan 19, 2022

This is most likely the culprit (go 1.17 should be used):

golang                   x86_64   1.16.13-1.fc34               updates   608 k

I have amended the Dockerfile so that go 1.17 gets installed (it is not available in Fedora 34).
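
A hypothetical sketch of that kind of Dockerfile change, fetching the upstream toolchain instead of the distribution package (the exact point release and install method in the PR may differ):

    # Fedora 34 only packages golang 1.16, so install Go 1.17 from upstream.
    RUN curl -fsSL https://go.dev/dl/go1.17.6.linux-amd64.tar.gz -o /tmp/go.tar.gz \
        && tar -C /usr/local -xzf /tmp/go.tar.gz \
        && rm /tmp/go.tar.gz
    ENV PATH=/usr/local/go/bin:$PATH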

@fgiloux fgiloux force-pushed the 2573 branch 15 times, most recently from 836686d to fbcb096 on January 19, 2022 at 21:32
@dinhxuanvu
Member

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 20, 2022
@fgiloux
Contributor Author

fgiloux commented Jan 20, 2022

I had to amend the github action for unit tests:

  • the envtest binaries were very, very old (k8s 1.16 IIRC)
  • the distribution of kubebuilder binaries and envtest changed in the meantime
  • kubebuilder is not really needed for the tests and is not part of the envtest installation, so I disabled its check in the Makefile (see the sketch below)
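
A sketch of the newer envtest installation flow, assuming controller-runtime's setup-envtest helper (the version selector is illustrative):

    # Fetch the envtest helper and matching test binaries for a modern k8s version.
    go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
    # Download kube-apiserver/etcd/kubectl binaries and print their location.
    export KUBEBUILDER_ASSETS=$(setup-envtest use -p path 1.23.x)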

@fgiloux
Contributor Author

fgiloux commented Jan 20, 2022

@timflannagan

Next issue:

The CustomResourceDefinition "clusterserviceversions.operators.coreos.com" is invalid: metadata.annotations: Too long: must have at most 262144 bytes

It seems that the main change is the addition of a grpc field, which increases the size of the CSV CRD. I am missing some background; I will have a look, but any hint is welcome.

@dinhxuanvu
Member

dinhxuanvu commented Jan 20, 2022

@timflannagan

Next issue:

The CustomResourceDefinition "clusterserviceversions.operators.coreos.com" is invalid: metadata.annotations: Too long: must have at most 262144 bytes

It seems that the main change is the addition of a grpc field, which increases the size of the CSV CRD. I am missing some background; I will have a look, but any hint is welcome.

I don't think this is a grpc problem. This seems to be part of the install script, where we use kubectl apply, which saves the entire CRD YAML into an annotation, but that should only happen when you apply to an already-existing CRD. At startup time, that CRD shouldn't exist. Not sure why.

@fgiloux
Contributor Author

fgiloux commented Jan 20, 2022

we use kubectl apply which will save the entire CRD yaml into annotation but that should only happen if you apply to already-existing CRD.

If you don't want the annotation, kubectl create or kubectl replace should be used. It may not have happened earlier due to the old version in use. I will see whether I can amend the script.
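
A minimal sketch of that install-script pattern (file name illustrative): kubectl create keeps the CRD free of the kubectl.kubernetes.io/last-applied-configuration annotation that client-side apply writes, which is what overflows the 262144-byte annotation limit for a very large CRD.

    # Create the CRD without the last-applied-configuration annotation;
    # fall back to replace if it already exists.
    if ! kubectl create -f crds.yaml 2>/dev/null; then
        kubectl replace -f crds.yaml
    fi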

*crds.OperatorGroup(),
*crds.Operator(),
*crds.OperatorCondition(),
CRDs: []*apiextensionsv1.CustomResourceDefinition{
Member

Interesting - it looks like this was changed in kubernetes-sigs/controller-runtime@a14a68c, but if I'm reading the commit tagging right, it's only available in the v0.11.0 versions of c-r, while we still replace-pin to my fork's v0.10.x version. Any idea why we need to make this change? Is it in order to make CI happy, since we're pulling in the latest envtest binary now, so there's some potential skew between our c-r version and envtest?

Contributor Author

I may have missed something:

require (
    [...]
    sigs.k8s.io/controller-runtime v0.11.0
    [...]
)

replace (
    sigs.k8s.io/controller-runtime v0.10.0 => github.com/timflannagan/controller-runtime v0.10.1-0.20211210161403-6756a4203e70
)

The replace version v0.10.0 does not match the required version v0.11.0, which means that v0.11.0 was in use. With your version of controller-runtime there are incompatibilities with the logr version brought in by o-f/api, for instance:

vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:79:12: undefined: logr.WithCallDepth
vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:163:2: cannot use res (type *DelegatingLogger) as type logr.Logger in return argument

I am looking at: #2353
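
Worth spelling out for anyone following along: a replace directive with a version on its left-hand side only takes effect when the build requires exactly that version, so once the requirement moved to v0.11.0 the fork pin above became inert. Keeping the fork in effect would have required an unversioned pin, along the lines of:

    replace sigs.k8s.io/controller-runtime => github.com/timflannagan/controller-runtime v0.10.1-0.20211210161403-6756a4203e70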

Contributor Author

@fgiloux fgiloux Jan 21, 2022

I looked at #2368 but did not see why you decided to stay on a fork. The fix for the issue you mentioned regarding the way metadata is handled in controller-runtime was merged in commit 74ba5d77bf99d8968543f3536e4d49a2e3818fee, which is in v0.11.
Am I missing something?

Member

I'm somewhat blanking on the overall context as to why that fork was necessary when bumping the c-r dependency in OLM to v0.10.x at the time. I remember e2e constantly failing despite the changes being fairly innocuous, and we were under the impression that Joe's upstream fix solved the issue we originally had to hack around. Let me see if I can dig up any more context today, but given that the current e2e checks are passing, it's possible we had misdiagnosed the root cause at the time.

@timflannagan
Member

@fgiloux I think we're getting closer to the finish line with these changes. Apologies for the slow turnaround over the past week, but the GitHub UI still doesn't handle large changesets well, so overall review time is substantially increased by pages becoming unresponsive.

I think we need to do the following to push this along:

  • Any PR this size will need another reviewer to review these changes
  • We need to be certain that we're not regressing with the controller-runtime bump w.r.t. metadata-only watchers
  • Iron out the envtest installation in CI

Is there anything that I'm missing?

@fgiloux
Contributor Author

fgiloux commented Jan 21, 2022

Is there anything that I'm missing?

Besides the Kube and Go versions, for which there is a fair amount of test coverage, the biggest change is to the logging stack. It may be good to do a sanity check to ensure that logs have not degraded.
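
For reference, the core of the logr v0.4 => v1.2 change is that logr.Logger went from an interface to a struct wrapping a LogSink, which is exactly what broke the pinned controller-runtime earlier in this thread. A minimal sketch using logr's own funcr package:

    package main

    import (
        "fmt"

        "github.com/go-logr/logr"
        "github.com/go-logr/logr/funcr"
    )

    func main() {
        // In logr v1.x, Logger is a concrete struct, so code that stored or
        // returned *logr.Logger (as the old DelegatingLogger did) no longer
        // compiles.
        var log logr.Logger = funcr.New(func(prefix, args string) {
            fmt.Println(prefix, args)
        }, funcr.Options{})

        // WithCallDepth is now a package-level helper rather than an optional
        // interface implemented by the logger itself.
        log = logr.WithCallDepth(log, 1)
        log.Info("logging stack sanity check")
    }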

@fgiloux
Contributor Author

fgiloux commented Jan 25, 2022

@timflannagan @perdasilva I have rebased the PR. It would be good to get it merged now.

@timflannagan
Member

I think this is largely ready to go, but I haven't verified the controller-runtime changes yet. If we're bumping the c-r version, we should also remove the replace pin from the root go.mod.

@fgiloux
Contributor Author

fgiloux commented Jan 25, 2022

If we're bumping the c-r version, we should also remove the replace pin from the root go.mod.

I have done that as well. Let's see how the tests go.

@timflannagan
Member

IIRC, we ran into these two specific test case failures the last time we tried to bump the c-r version (hence why the replace pin to my fork was needed): https://github.com/operator-framework/operator-lifecycle-manager/runs/4942485265?check_suite_focus=true

@fgiloux fgiloux mentioned this pull request Jan 26, 2022
@timflannagan
Member

@fgiloux I'm going to respin the e2e tests again to double-check whether we're running into the same test case failures as last time we tried bumping the c-r version to v0.11.x.

@fgiloux
Contributor Author

fgiloux commented Jan 26, 2022

@timflannagan I have been investigating locally. "OperatorCondition Upgradeable type and overrides" passed, but "Operator API when a subscription to a package exists [It] should automatically adopt components" consistently failed.
As far as I have seen, the reference to the service account is missing in the Operator resource, which prevents this assertion from succeeding:

Eventually(func() (*operatorsv1.Operator, error) {
    o := &operatorsv1.Operator{}
    err := client.Get(clientCtx, operatorName, o)
    return o, err
}).Should(ReferenceComponents([]*corev1.ObjectReference{
    getReference(scheme, sub),
    getReference(scheme, ip),
    getReference(scheme, testobj.WithNamespacedName(
        &types.NamespacedName{Namespace: sub.GetNamespace(), Name: "kiali-operator.v1.4.2"},
        &operatorsv1alpha1.ClusterServiceVersion{},
    )),
    getReference(scheme, testobj.WithNamespacedName(
        &types.NamespacedName{Namespace: sub.GetNamespace(), Name: "kiali-operator"},
        &corev1.ServiceAccount{},
    )),
    getReference(scheme, testobj.WithName("kialis.kiali.io", &apiextensionsv1.CustomResourceDefinition{})),
    getReference(scheme, testobj.WithName("monitoringdashboards.monitoring.kiali.io", &apiextensionsv1.CustomResourceDefinition{})),
}))

Looking at the logs after the failure, I can see:

E0126 13:54:10.097690       1 queueinformer_operator.go:290] sync {"update" "ns-wh857/cs-62lnj"} failed: couldn't ensure registry server - error ensuring service account: cs-62lnj: serviceaccounts "cs-62lnj" is forbidden: unable to create new content in namespace ns-wh857 because it is being terminated

@fgiloux
Contributor Author

fgiloux commented Jan 27, 2022

@timflannagan I spent more time debugging the e2e test mentioned in my previous comment. I got it to pass locally, but there are quite a few things that don't look right (sorry if I am missing something). Here are the points that are not what I would expect:

  • The resources get added to the references in the Operator resource only if they are created as part of the installation and have the right label, something like operators.coreos.com/kiali.ns-67rrb: "". That means that if the SA or the CRDs were created beforehand, the test will fail.
  • The AfterEach of the e2e test does not fully clean up the resources it creates; it only deletes the namespace. Note that the Operator resource created here is used in the next context. This means that the CRDs and the Operator resource are left behind and, due to the first point, prevent any second run in the same cluster from succeeding. As per the doc, I run the specific e2e test with:
    GO111MODULE=on GOFLAGS="-mod=vendor" go run github.com/onsi/ginkgo/ginkgo -focus "should automatically adopt components" -v --progress ./test/e2e -- -namespace=operators -olmNamespace=olm -dummyImage=bitnami/nginx:latest
  • When an Operator resource gets deleted, the reconciliation does not rebuild it from the cluster state but from an old version of the Operator resource stored in memory. I had to restart the OLM pods to get rid of an Operator resource after I had deleted all the resources listed in its references. Otherwise it would get recreated, again referencing resources that no longer existed.

The e2e test could be tweaked to pass by ensuring a pristine environment before it runs, but it seems to highlight something that is genuinely suspicious. I doubt that this is directly due to the dependency bump; it may only exacerbate an existing behavior.
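
To make the first bullet above concrete: component references are label-driven, so the objects a test run will adopt can be inspected directly with a label selector (key copied from the failing run above; a sketch using standard kubectl selectors):

    # Namespaced candidate components carrying the operator's label:
    kubectl get serviceaccounts --all-namespaces -l 'operators.coreos.com/kiali.ns-67rrb'
    # Cluster-scoped candidates:
    kubectl get crd -l 'operators.coreos.com/kiali.ns-67rrb'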

@timflannagan
Member

@fgiloux Nice, I've seen similar issues locally when debugging these tests further. I had opened #2518, which aimed to fix that AfterEach issue, but I haven't had time to revisit it recently. That Operator controller has some known caching issues due to the way we track the last resourceVersion that we processed, so it's possible to delete all the requisite underlying resources (e.g. SA, CRDs, etc.) and still have the Operator resource be created. I tried to work around that behavior (without fixing the root cause) by firing off a DeleteCollection call at the top-level e2e spec.

If I remember correctly, with the last c-r bump we were running into constant e2e test failures in the adoption and operator controllers, where both of those controllers weren't able to adopt resources (it was the ServiceAccount resource nearly 100% of the time) and propagate them to the Operator resource's status sub-resource. That ServiceAccount resource was registered as a metadata-only watch when instantiating both controllers, and I was under the impression that we might've been missing cache events at the informer level and that the new upstream fix in the v0.11.x release didn't work for us.

We don't have to fix the test setup issues in this PR, but I'd like to avoid OLM's existing e2e suite, which is already extremely flaky, becoming even more flaky due to a dependency bump.
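
For context, a minimal sketch of what a metadata-only watch looks like in controller-runtime (the reconciler and wiring are illustrative, not OLM's actual adoption controller):

    package adoption

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/builder"
        "sigs.k8s.io/controller-runtime/pkg/reconcile"
    )

    // Illustrative reconciler; OLM's real adoption controller is more involved.
    type adoptionReconciler struct{}

    func (r *adoptionReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
        // A metadata-only watch delivers PartialObjectMetadata events, so only
        // labels, annotations, and ownerRefs are available, not the full object.
        return reconcile.Result{}, nil
    }

    func setup(mgr ctrl.Manager) error {
        // builder.OnlyMetadata makes the cache watch and store ServiceAccount
        // metadata only; the missed-event suspicion above is tied to this mode.
        return ctrl.NewControllerManagedBy(mgr).
            For(&corev1.ServiceAccount{}, builder.OnlyMetadata).
            Complete(&adoptionReconciler{})
    }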

@fgiloux
Contributor Author

fgiloux commented Jan 27, 2022

@timflannagan I forgot one point:

  • The kiali CSV is full of RBAC policies referencing resources that don't exist on a KinD cluster; an extract:
I0127 05:54:06.485179       1 rbac.go:118] RBAC DENY: user "system:serviceaccount:ns-4vxkx:kiali-operator" groups [] cannot "create" resource "fluentds.config.istio.io" cluster-wide
I0127 05:54:06.485232       1 rbac.go:118] RBAC DENY: user "system:serviceaccount:ns-4vxkx:kiali-operator" groups [] cannot "create" resource "sidecars.networking.istio.io" cluster-wide
I0127 05:54:06.485268       1 rbac.go:118] RBAC DENY: user "system:serviceaccount:ns-4vxkx:kiali-operator" groups [] cannot "watch" resource "meshpolicies.authentication.istio.io" cluster-wide

There are many points in #2518. Too bad that it has not been merged yet.

For my own education: what is the point of tracking the last resourceVersion and having different logic based on it? It does not seem aligned with general practice. What am I missing?
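
For readers without the code at hand, the pattern being questioned looks roughly like the following paraphrased sketch (not OLM's actual implementation). The general-practice concern: resourceVersion is an opaque string that is only safe to compare for equality, and in-memory state like this outlives the objects it describes:

    package operators

    // Paraphrased sketch of a resourceVersion-gating anti-pattern.
    type operatorTracker struct {
        lastResourceVersion map[string]string
    }

    func (t *operatorTracker) shouldProcess(name, rv string) bool {
        // Equality-based dedupe is the only safe use of resourceVersion;
        // anything resembling ordering logic on it is undefined, and this
        // map entry lingers after the object is deleted from the cluster.
        if t.lastResourceVersion[name] == rv {
            return false
        }
        t.lastResourceVersion[name] = rv
        return true
    }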

That ServiceAccount resource was registered as a metadata-only watch when instantiating both controllers, and I was under the impression we might've been missing cache events at the informer level,

It could be, but when I checked, the SA did not have the label. If the controller logic for the Operator resource were working, it should add the reference to the Operator when the label is added to the SA, but it did not in my test.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2022
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2022
@awgreene
Member

awgreene commented Feb 3, 2022

The following test is failing locally for me; I am looking into it now.

[Fail] Catalog represents a store of bundles which OLM can use to install Operators [It] registry polls on the correct interval

Commit message from the force-pushed commit:

- logr updated from v0.4.0 to v1.2.0 (cascading implications)
- Bumped to go 1.17 in Dockerfile
- Amended the github action for unit tests:
  - the envtest binaries were very old (k8s 1.16)
  - the distribution of kubebuilder binaries and envtest changed in the meantime
  - kubebuilder is not really needed for the tests and not part of the envtest installation -> check disabled in Makefile
- Used `kubectl create` instead of `kubectl apply` to avoid too-long annotations for CRDs

Signed-off-by: Frederic Giloux <[email protected]>
@awgreene
Member

Closed in favor of #2617

@awgreene awgreene closed this Feb 22, 2022
Linked issue: Bumping operator-framework API dependency to v0.11.1 breaks OLM (#2573)