Bump k8s dependencies to 1.23 to align with o-f/api #2574

Closed
wants to merge 1 commit

Conversation

fgiloux
Contributor

@fgiloux fgiloux commented Jan 19, 2022

Signed-off-by: Frederic Giloux [email protected]

Description of the change:
Bump k8s dependencies to 1.23 to align with o-f/api
Also, as a side effect, a newer logr version: v0.4 => v1.2

Motivation for the change:

Alignment with o-f/api and with newer k8s versions.
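
For orientation, a bump like this is mostly a coordinated go.mod move; a minimal sketch (the module set and patch versions are illustrative, not this PR's exact diff):

    require (
        github.com/go-logr/logr v1.2.0
        k8s.io/api v0.23.0
        k8s.io/apiextensions-apiserver v0.23.0
        k8s.io/apimachinery v0.23.0
        k8s.io/client-go v0.23.0
        sigs.k8s.io/controller-runtime v0.11.0
    )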

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

Closes: #2573

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 19, 2022
@openshift-ci

openshift-ci bot commented Jan 19, 2022

Hi @fgiloux. Thanks for your PR.

I'm waiting for an operator-framework member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci

openshift-ci bot commented Jan 19, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fgiloux
To complete the pull request process, please assign timflannagan after the PR has been reviewed.
You can assign the PR to them by writing /assign @timflannagan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fgiloux
Contributor Author

fgiloux commented Jan 19, 2022

@timflannagan as discussed, here is the PR for the dependency bump. I am not sure why the build fails.
It seems to pick a different commit:

go build -mod=vendor -ldflags "-X github.com/operator-framework/operator-lifecycle-manager/pkg/version.GitCommit=495a1d5513067fce1fb60d456dd3e9cfc718df4a -X github.com/operator-framework/operator-lifecycle-manager/pkg/version.OLMVersion=`cat OLM_VERSION`" -tags "json1" -o bin/catalog github.com/operator-framework/operator-lifecycle-manager/cmd/catalog

Building locally succeeds; make build then uses GitCommit=121c31f17f4ea121a7c1a7a3fda58bd03446aedf

This is most likely the culprit (go 1.17 should be used):

golang                   x86_64   1.16.13-1.fc34               updates   608 k
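
For context on the mismatch: the GitCommit baked into the binary comes from whatever ref the build environment has checked out at build time, so a CI checkout and a local checkout can embed different hashes. A hypothetical Makefile excerpt illustrating the pattern (not necessarily OLM's exact Makefile):

    # The embedded commit is whatever HEAD resolves to in the build context.
    GIT_COMMIT := $(shell git rev-parse HEAD)
    OLM_VERSION := $(shell cat OLM_VERSION)
    LDFLAGS := -X github.com/operator-framework/operator-lifecycle-manager/pkg/version.GitCommit=$(GIT_COMMIT) \
               -X github.com/operator-framework/operator-lifecycle-manager/pkg/version.OLMVersion=$(OLM_VERSION)

    bin/catalog:
    	go build -mod=vendor -ldflags "$(LDFLAGS)" -tags "json1" -o bin/catalog ./cmd/catalog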

@fgiloux
Contributor Author

fgiloux commented Jan 19, 2022

This is most likely the culprit (go 1.17 should be used):

golang                   x86_64   1.16.13-1.fc34               updates   608 k

I have amended the Dockerfile so that go 1.17 gets installed (it is not available in Fedora 34).
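
A hypothetical sketch of that kind of Dockerfile change, fetching the upstream toolchain instead of the distribution package (the exact point release and install method in the PR may differ):

    # Fedora 34 only packages golang 1.16, so install Go 1.17 from upstream.
    RUN curl -fsSL https://go.dev/dl/go1.17.6.linux-amd64.tar.gz -o /tmp/go.tar.gz \
        && tar -C /usr/local -xzf /tmp/go.tar.gz \
        && rm /tmp/go.tar.gz
    ENV PATH=/usr/local/go/bin:$PATH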

@fgiloux fgiloux force-pushed the 2573 branch 15 times, most recently from 836686d to fbcb096 on January 19, 2022 at 21:32
@dinhxuanvu
Member

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 20, 2022
@fgiloux
Contributor Author

fgiloux commented Jan 20, 2022

I had to amend the github action for unit tests:

  • the envtest binaries were very, very old (k8s 1.16 IIRC)
  • the distribution of kubebuilder binaries and envtest changed in the meantime
  • kubebuilder is not really needed for the tests and is not part of the envtest installation, so I disabled its check in the Makefile (see the sketch below)
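
A sketch of the newer envtest installation flow, assuming controller-runtime's setup-envtest helper (the version selector is illustrative):

    # Fetch the envtest helper and matching test binaries for a modern k8s version.
    go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
    # Download kube-apiserver/etcd/kubectl binaries and print their location.
    export KUBEBUILDER_ASSETS=$(setup-envtest use -p path 1.23.x)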

@fgiloux
Contributor Author

fgiloux commented Jan 20, 2022

@timflannagan

Next issue:

The CustomResourceDefinition "clusterserviceversions.operators.coreos.com" is invalid: metadata.annotations: Too long: must have at most 262144 bytes

It seems that the main change is the addition of a grpc field, which increases the size of the CSV CRD. I am missing some background; I will have a look, but any hint is welcome.

@dinhxuanvu
Member

dinhxuanvu commented Jan 20, 2022

@timflannagan

Next issue:

The CustomResourceDefinition "clusterserviceversions.operators.coreos.com" is invalid: metadata.annotations: Too long: must have at most 262144 bytes

It seems that the main change is the addition of a grpc field, which increases the size of the CSV CRD. I am missing some background; I will have a look, but any hint is welcome.

I don't think this is a grpc problem. This seems to be part of the install script, where we use kubectl apply, which saves the entire CRD YAML into an annotation, but that should only happen when you apply to an already-existing CRD. At startup time, that CRD shouldn't exist. Not sure why.

@fgiloux
Contributor Author

fgiloux commented Jan 20, 2022

we use kubectl apply which will save the entire CRD yaml into annotation but that should only happen if you apply to already-existing CRD.

If you don't want the annotation, kubectl create or kubectl replace should be used. It may not have happened earlier due to the old version in use. I will see whether I can amend the script.
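
A minimal sketch of that install-script pattern (file name illustrative): kubectl create keeps the CRD free of the kubectl.kubernetes.io/last-applied-configuration annotation that client-side apply writes, which is what overflows the 262144-byte annotation limit for a very large CRD.

    # Create the CRD without the last-applied-configuration annotation;
    # fall back to replace if it already exists.
    if ! kubectl create -f crds.yaml 2>/dev/null; then
        kubectl replace -f crds.yaml
    fi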

*crds.OperatorGroup(),
*crds.Operator(),
*crds.OperatorCondition(),
CRDs: []*apiextensionsv1.CustomResourceDefinition{
Member

Interesting - it looks like this was changed in kubernetes-sigs/controller-runtime@a14a68c, but if I'm reading the commit tagging right, it's only available in the v0.11.0 versions of c-r, while we still replace-pin to my fork's v0.10.x version. Any idea why we need to make this change? Is it in order to make CI happy, since we're pulling in the latest envtest binary now, so there's some potential skew between our c-r version and envtest?

Contributor Author

I may have missed something:

require (
    [...]
    sigs.k8s.io/controller-runtime v0.11.0
    [...]
)

replace (
    sigs.k8s.io/controller-runtime v0.10.0 => github.com/timflannagan/controller-runtime v0.10.1-0.20211210161403-6756a4203e70
)

The replace version v0.10.0 does not match the required version v0.11.0, which means that v0.11.0 was in use. With your version of controller-runtime there are incompatibilities with the logr version brought in by o-f/api, for instance:

vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:79:12: undefined: logr.WithCallDepth
vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:163:2: cannot use res (type *DelegatingLogger) as type logr.Logger in return argument

I am looking at: #2353
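
Worth spelling out for anyone following along: a replace directive with a version on its left-hand side only takes effect when the build requires exactly that version, so once the requirement moved to v0.11.0 the fork pin above became inert. Keeping the fork in effect would have required an unversioned pin, along the lines of:

    replace sigs.k8s.io/controller-runtime => github.com/timflannagan/controller-runtime v0.10.1-0.20211210161403-6756a4203e70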

Contributor Author

@fgiloux fgiloux Jan 21, 2022

I looked at #2368 but did not see why you decided to stay on a fork. The fix for the issue you mentioned regarding the way metadata is handled in controller-runtime was merged in commit 74ba5d77bf99d8968543f3536e4d49a2e3818fee, which is in v0.11.
Am I missing something?

Member

I'm somewhat blanking on the overall context as to why that fork was necessary when bumping the c-r dependency in OLM to v0.10.x at the time. I remember e2e constantly failing despite the changes being fairly innocuous, and we were under the impression that Joe's upstream fix solved the issue we originally had to hack around. Let me see if I can dig up any more context today, but given that the current e2e checks are passing, it's possible we had misdiagnosed the root cause at the time.

@timflannagan
Member

@fgiloux I think we're getting closer to the finish line with these changes. Apologies for the slow turnaround over the past week, but the GitHub UI still doesn't handle large changesets well, so overall review time is substantially increased by pages becoming unresponsive.

I think we need to do the following to push this along:

  • Any PR this size will need another reviewer to review these changes
  • We need to be certain that we're not regressing with the controller-runtime bump w.r.t. metadata-only watchers
  • Iron out the envtest installation in CI

Is there anything that I'm missing?

@fgiloux
Contributor Author

fgiloux commented Jan 21, 2022

Is there anything that I'm missing?

Besides the Kube and Go versions, for which there is a fair amount of test coverage, the biggest change is to the logging stack. It may be good to do a sanity check to ensure that logs have not degraded.
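
For reference, the core of the logr v0.4 => v1.2 change is that logr.Logger went from an interface to a struct wrapping a LogSink, which is exactly what broke the pinned controller-runtime earlier in this thread. A minimal sketch using logr's own funcr package:

    package main

    import (
        "fmt"

        "github.com/go-logr/logr"
        "github.com/go-logr/logr/funcr"
    )

    func main() {
        // In logr v1.x, Logger is a concrete struct, so code that stored or
        // returned *logr.Logger (as the old DelegatingLogger did) no longer
        // compiles.
        var log logr.Logger = funcr.New(func(prefix, args string) {
            fmt.Println(prefix, args)
        }, funcr.Options{})

        // WithCallDepth is now a package-level helper rather than an optional
        // interface implemented by the logger itself.
        log = logr.WithCallDepth(log, 1)
        log.Info("logging stack sanity check")
    }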

@fgiloux
Contributor Author

fgiloux commented Jan 25, 2022

@timflannagan @perdasilva I have rebased the PR. It would be good to get it merged now.

@timflannagan
Member

I think this is largely ready to go, but I haven't verified the controller-runtime changes yet. If we're bumping the c-r version, we should also remove the replace pin from the root go.mod.

@fgiloux
Contributor Author

fgiloux commented Jan 25, 2022

If we're bumping the c-r version, we should also remove the replace pin from the root go.mod.

I have done that as well. Let's see how the tests go.

@timflannagan
Member

IIRC, we ran into these two specific test case failures the last time we tried to bump the c-r version (hence why the replace pin to my fork was needed): https://github.com/operator-framework/operator-lifecycle-manager/runs/4942485265?check_suite_focus=true

@fgiloux fgiloux mentioned this pull request Jan 26, 2022
@timflannagan
Member

@fgiloux I'm going to respin the e2e tests again to double-check whether we're running into the same test case failures as last time we tried bumping the c-r version to v0.11.x.

@fgiloux
Contributor Author

fgiloux commented Jan 26, 2022

@timflannagan I have been investigating locally. "OperatorCondition Upgradeable type and overrides" passed, but "Operator API when a subscription to a package exists [It] should automatically adopt components" consistently failed.
As far as I have seen, the reference to the service account is missing in the Operator resource, which prevents this assertion from succeeding:

Eventually(func() (*operatorsv1.Operator, error) {
    o := &operatorsv1.Operator{}
    err := client.Get(clientCtx, operatorName, o)
    return o, err
}).Should(ReferenceComponents([]*corev1.ObjectReference{
    getReference(scheme, sub),
    getReference(scheme, ip),
    getReference(scheme, testobj.WithNamespacedName(
        &types.NamespacedName{Namespace: sub.GetNamespace(), Name: "kiali-operator.v1.4.2"},
        &operatorsv1alpha1.ClusterServiceVersion{},
    )),
    getReference(scheme, testobj.WithNamespacedName(
        &types.NamespacedName{Namespace: sub.GetNamespace(), Name: "kiali-operator"},
        &corev1.ServiceAccount{},
    )),
    getReference(scheme, testobj.WithName("kialis.kiali.io", &apiextensionsv1.CustomResourceDefinition{})),
    getReference(scheme, testobj.WithName("monitoringdashboards.monitoring.kiali.io", &apiextensionsv1.CustomResourceDefinition{})),
}))

Looking at the logs after the failure, I can see:

E0126 13:54:10.097690       1 queueinformer_operator.go:290] sync {"update" "ns-wh857/cs-62lnj"} failed: couldn't ensure registry server - error ensuring service account: cs-62lnj: serviceaccounts "cs-62lnj" is forbidden: unable to create new content in namespace ns-wh857 because it is being terminated

@fgiloux
Contributor Author

fgiloux commented Jan 27, 2022

@timflannagan I spent more time debugging the e2e test mentioned in my previous comment. I got it to pass locally, but there are quite a few things that don't look right (sorry if I am missing something). Here are the points that are not what I would expect:

  • The resources get added to the references in the Operator resource only if they are created as part of the installation and have the right label, something like operators.coreos.com/kiali.ns-67rrb: "". That means that if the SA or the CRDs were created beforehand, the test will fail.
  • The AfterEach of the e2e test does not fully clean up the resources it creates; it only deletes the namespace. Note that the Operator resource created here is used in the next context. This means that the CRDs and the Operator resource are left behind and, due to the first point, prevent any second run in the same cluster from succeeding. As per the doc, I run the specific e2e test with:
    GO111MODULE=on GOFLAGS="-mod=vendor" go run github.com/onsi/ginkgo/ginkgo -focus "should automatically adopt components" -v --progress ./test/e2e -- -namespace=operators -olmNamespace=olm -dummyImage=bitnami/nginx:latest
  • When an Operator resource gets deleted, the reconciliation does not rebuild it from the cluster state but from an old version of the Operator resource stored in memory. I had to restart the OLM pods to get rid of an Operator resource after I had deleted all the resources listed in its references. Otherwise it would get recreated, again referencing resources that no longer existed.

The e2e test could be tweaked to pass by ensuring a pristine environment before it runs, but it seems to highlight something that is genuinely suspicious. I doubt that this is directly due to the dependency bump; it may only exacerbate an existing behavior.
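
To make the first bullet above concrete: component references are label-driven, so the objects a test run will adopt can be inspected directly with a label selector (key copied from the failing run above; a sketch using standard kubectl selectors):

    # Namespaced candidate components carrying the operator's label:
    kubectl get serviceaccounts --all-namespaces -l 'operators.coreos.com/kiali.ns-67rrb'
    # Cluster-scoped candidates:
    kubectl get crd -l 'operators.coreos.com/kiali.ns-67rrb'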

@timflannagan
Member

@fgiloux Nice, I've seen similar issues locally when debugging these tests further. I had opened #2518, which aimed to fix that AfterEach issue, but I haven't had time to revisit it recently. That Operator controller has some known caching issues due to the way we track the last resourceVersion that we processed, so it's possible to delete all the requisite underlying resources (e.g. SA, CRDs, etc.) and still have the Operator resource be created. I tried to work around that behavior (without fixing the root cause) by firing off a DeleteCollection call at the top-level e2e spec.

If I remember correctly, with the last c-r bump we were running into constant e2e test failures in the adoption and operator controllers, where both of those controllers weren't able to adopt resources (it was the ServiceAccount resource nearly 100% of the time) and propagate them to the Operator resource's status sub-resource. That ServiceAccount resource was registered as a metadata-only watch when instantiating both controllers, and I was under the impression that we might've been missing cache events at the informer level and that the new upstream fix in the v0.11.x release didn't work for us.

We don't have to fix the test setup issues in this PR, but I'd like to avoid OLM's existing e2e suite, which is already extremely flaky, becoming even more flaky due to a dependency bump.
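
For context, a minimal sketch of what a metadata-only watch looks like in controller-runtime (the reconciler and wiring are illustrative, not OLM's actual adoption controller):

    package adoption

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/builder"
        "sigs.k8s.io/controller-runtime/pkg/reconcile"
    )

    // Illustrative reconciler; OLM's real adoption controller is more involved.
    type adoptionReconciler struct{}

    func (r *adoptionReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
        // A metadata-only watch delivers PartialObjectMetadata events, so only
        // labels, annotations, and ownerRefs are available, not the full object.
        return reconcile.Result{}, nil
    }

    func setup(mgr ctrl.Manager) error {
        // builder.OnlyMetadata makes the cache watch and store ServiceAccount
        // metadata only; the missed-event suspicion above is tied to this mode.
        return ctrl.NewControllerManagedBy(mgr).
            For(&corev1.ServiceAccount{}, builder.OnlyMetadata).
            Complete(&adoptionReconciler{})
    }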

@fgiloux
Contributor Author

fgiloux commented Jan 27, 2022

@timflannagan I forgot one point:

  • The kiali CSV is full of RBAC policies referencing resources that don't exist on a KinD cluster; an extract:
I0127 05:54:06.485179       1 rbac.go:118] RBAC DENY: user "system:serviceaccount:ns-4vxkx:kiali-operator" groups [] cannot "create" resource "fluentds.config.istio.io" cluster-wide
I0127 05:54:06.485232       1 rbac.go:118] RBAC DENY: user "system:serviceaccount:ns-4vxkx:kiali-operator" groups [] cannot "create" resource "sidecars.networking.istio.io" cluster-wide
I0127 05:54:06.485268       1 rbac.go:118] RBAC DENY: user "system:serviceaccount:ns-4vxkx:kiali-operator" groups [] cannot "watch" resource "meshpolicies.authentication.istio.io" cluster-wide

There are many points in #2518. Too bad that it has not been merged yet.

For my own education: what is the point of tracking the last resourceVersion and having different logic based on it? It does not seem aligned with general practice. What am I missing?
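
For readers without the code at hand, the pattern being questioned looks roughly like the following paraphrased sketch (not OLM's actual implementation). The general-practice concern: resourceVersion is an opaque string that is only safe to compare for equality, and in-memory state like this outlives the objects it describes:

    package operators

    // Paraphrased sketch of a resourceVersion-gating anti-pattern.
    type operatorTracker struct {
        lastResourceVersion map[string]string
    }

    func (t *operatorTracker) shouldProcess(name, rv string) bool {
        // Equality-based dedupe is the only safe use of resourceVersion;
        // anything resembling ordering logic on it is undefined, and this
        // map entry lingers after the object is deleted from the cluster.
        if t.lastResourceVersion[name] == rv {
            return false
        }
        t.lastResourceVersion[name] = rv
        return true
    }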

That ServiceAccount resource was registered as a metadata-only watch when instantiating both controllers, and I was under the impression we might've been missing cache events at the informer level,

It could be, but when I checked, the SA did not have the label. If the controller logic for the Operator resource were working, it should add the reference to the Operator when the label is added to the SA, but it did not in my test.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2022
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 28, 2022
@awgreene
Member

awgreene commented Feb 3, 2022

The following test is failing locally for me; I am looking into it now.

[Fail] Catalog represents a store of bundles which OLM can use to install Operators [It] registry polls on the correct interval

Commit message from the force-pushed commit:

- logr updated from v0.4.0 to v1.2.0 (cascading implications)
- Bumped to go 1.17 in Dockerfile
- Amended the github action for unit tests:
  - the envtest binaries were very old (k8s 1.16)
  - the distribution of kubebuilder binaries and envtest changed in the meantime
  - kubebuilder is not really needed for the tests and not part of the envtest installation -> check disabled in Makefile
- Used `kubectl create` instead of `kubectl apply` to avoid too-long annotations for CRDs

Signed-off-by: Frederic Giloux <[email protected]>
@awgreene
Member

Closed in favor of #2617

@awgreene awgreene closed this Feb 22, 2022
Linked issue: Bumping operator-framework API dependency to v0.11.1 breaks OLM (#2573)