Bump k8s dependencies to 1.23 to align with o-f/api #2574
Hi @fgiloux. Thanks for your PR. I'm waiting for an operator-framework member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: fgiloux
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@timflannagan as discussed, here is the PR for the dependency bump. I am not sure why the build fails.
Building locally succeeds. This is most likely the culprit (Go 1.17 should be used):
|
I have amended the Dockerfile so that go 1.17 gets installed (not available in Fedora 34) |
836686d
to
fbcb096
/ok-to-test |
I had to amend the github action for unit tests:
|
Next issue:
It seems that the main change is the addition of a grpc field, which increases the size of the CSV CRD. I am missing some background. I will have a look but any hint is welcome |
I don't think this is a grpc problem. This seems to be a part of the install script where we use |
If you don't want the annotation |
*crds.OperatorGroup(), | ||
*crds.Operator(), | ||
*crds.OperatorCondition(), | ||
CRDs: []*apiextensionsv1.CustomResourceDefinition{ |
Interesting - it looks like this was changed in kubernetes-sigs/controller-runtime@a14a68c, but if I'm reading that commit tagging right, it's only available in v0.11.0 versions of c-r, while we still have a replace pin to my fork's v0.10.x version. Any idea why we need to make this change? Is it to make CI happy, since we're now pulling in the latest envtest binary and there may be some skew between our c-r version and envtest?
I may have missed something:
require (
[...]
sigs.k8s.io/controller-runtime v0.11.0
[...]
replace (
sigs.k8s.io/controller-runtime v0.10.0 => github.com/timflannagan/controller-runtime v0.10.1-0.20211210161403-6756a4203e70
The replace directive targets v0.10.0, which does not match the required version v0.11.0, so the replacement was never applied and upstream v0.11.0 was in use. With your version of controller-runtime there are incompatibilities with the logr version brought in by o-f/api, for instance:
vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:79:12: undefined: logr.WithCallDepth
vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:163:2: cannot use res (type *DelegatingLogger) as type logr.Logger in return argument
I am looking at: #2353
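As an illustration of the mismatch described in this comment, here is a minimal sketch of how a versioned replace directive is matched. It is a simplified model of the documented go.mod rule (a replace with a version on the left applies only to that exact version, while one without a version applies to all versions), not the Go toolchain's actual code. `moduleVersion` and `resolve` are hypothetical names introduced for this example.

```go
package main

import "fmt"

// moduleVersion identifies a module at a version. When used as a replace
// key, an empty Version stands for "any version of this module".
type moduleVersion struct {
	Path    string
	Version string
}

// resolve (hypothetical) returns what would be built for req: an exact
// path+version replacement wins, then a versionless one, otherwise req itself.
func resolve(req moduleVersion, replace map[moduleVersion]moduleVersion) moduleVersion {
	if r, ok := replace[req]; ok {
		return r
	}
	if r, ok := replace[moduleVersion{Path: req.Path}]; ok {
		return r
	}
	return req
}

func main() {
	fork := moduleVersion{"github.com/timflannagan/controller-runtime", "v0.10.1-0.20211210161403-6756a4203e70"}
	required := moduleVersion{"sigs.k8s.io/controller-runtime", "v0.11.0"}

	// The pin from this PR's go.mod: it only matches a v0.10.0 requirement.
	pinned := map[moduleVersion]moduleVersion{
		{"sigs.k8s.io/controller-runtime", "v0.10.0"}: fork,
	}
	fmt.Println(resolve(required, pinned) == required) // prints: true

	// A versionless pin would apply to v0.11.0 as well.
	unversioned := map[moduleVersion]moduleVersion{
		{Path: "sigs.k8s.io/controller-runtime"}: fork,
	}
	fmt.Println(resolve(required, unversioned) == fork) // prints: true
}
```

Under this model the fork is silently bypassed as soon as the required version moves off v0.10.0, which is consistent with the behaviour reported above.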
I'm somewhat blanking on the overall context as to why that fork was necessary when bumping the c-r dependency in OLM to v0.10.x at the time. I remember seeing e2e constantly failing despite the changes being fairly innocuous, and we were under the impression that Joe's upstream fix solved the issue we originally had to hack around. Let me see if I can dig up any more context today, but given that the current e2e checks are passing, it's possible we misdiagnosed the root cause at the time.
@fgiloux I think we're getting closer to the finish line with these changes. Apologies for the slow turnaround over the past week, but it seems the GitHub UI still doesn't handle large changesets well, so overall review time increases substantially when pages become unresponsive. I think we need to do the following to push this along:
Is there anything that I'm missing? |
Besides the Kube and Go versions, which are covered by a fair number of tests, the biggest change is to the logging stack. It may be good to run a sanity check to ensure that the logs have not degraded. |
@timflannagan @perdasilva I have rebased the PR. It would be good to get it merged now. |
I think this is largely ready to go but I haven't verified the controller-runtime changes yet. If we're bumping the c-r version, we should also remove the replace pin from the root go.mod. |
I have done that as well. Let's see how the tests go. |
IIRC, we ran into these two specific test case failures last time we tried to bump the c-r version (hence why the replace pin to my fork was needed): https://github.com/operator-framework/operator-lifecycle-manager/runs/4942485265?check_suite_focus=true |
@fgiloux I'm going to respin the e2e tests again to double-check whether we're running into the same test case failures as the last time we tried bumping the c-r version to v0.11.x. |
@timflannagan I have been investigating locally. "OperatorCondition Upgradeable type and overrides " passed but "Operator API when a subscription to a package exists [It] should automatically adopt components" consistently failed.
Looking at the logs after failure I can see:
|
@timflannagan I spent more time debugging the e2e test mentioned in my previous comment. I got it to pass locally, but quite a few things don't look right. Apologies if I am missing something. Here are the points that are not what I would expect:
The e2e test could be tweaked to pass by ensuring a clean environment before it runs, but it seems to highlight something that is genuinely suspicious. I doubt that this is directly due to the dependency bump. It may only exacerbate an existing behavior. |
@fgiloux Nice, I've seen similar issues locally when debugging these tests further. I had opened #2518, which aimed to fix that AfterEach issue, but I haven't had the time to revisit it recently. That Operator controller has some known caching issues due to the way we track the last resourceVersion that we processed, so it's possible that you can delete all the requisite underlying resources (e.g. SA, CRDs, etc.) and the Operator resource will still be created. I tried to fix that behavior (without fixing the root cause) by firing off a DeleteCollection call at the top-level e2e spec. If I remember correctly, with the last c-r bump we were running into constant e2e test failures in the adoption and operator controllers, where both of those controllers weren't able to adopt resources (note: it was the ServiceAccount resource nearly 100% of the time) and propagate them to the operator resource's status sub-resource. That ServiceAccount resource was registered as a metadata-only watch when instantiating both controllers, and I was under the impression we might've been missing cache events at the informer level, and the new upstream fix in the v0.11.x release didn't work for us. We don't have to fix the test setup issues in this PR, but I'd like to avoid a situation where OLM's existing suite is already extremely flaky and becomes more flaky due to a dependency bump. |
@timflannagan I forgot one point:
There are many points in #2518. Too bad that it has not been merged yet. For my own understanding, what is the point of tracking the last resourceVersion and having different logic based on it? It does not seem aligned with general practices. What am I missing?
It could be, but when I checked, the SA did not have the label. If the operator's controller logic were working, it should add the reference to the operator when the label is added to the SA, but it did not in my test. |
The following test is failing locally for me, looking into it now.
|
- logr updated from v0.4.0 to v1.2.0 (cascading implications)
- Bumped to go 1.17 in Dockerfile
- Amended the github action for unit tests:
  - the envtest binaries were very very old (K8 1.16)
  - the distribution of kubebuilder binaries and envtest changed in the meantime
  - kubebuilder is not really needed for the tests and not part of the envtest installation -> check disabled in Makefile
- Used `kubectl create` instead of `kubectl apply` to avoid too long annotations for CRDs

Signed-off-by: Frederic Giloux <[email protected]>
Closed in favor of #2617 |
Signed-off-by: Frederic Giloux [email protected]
Description of the change:
Bump k8s dependencies to 1.23 to align with o-f/api
Also as a side effect: newer logr version: v0.4 => v1.2
Motivation for the change:
Alignment with o-f/api, alignment with newer k8 versions.
Reviewer Checklist
/doc
Closes: #2573