Slow reconciliations with JOSDK 5.0.3 release and failed reconciliation count in JOSDK metrics #2709
Comments
Would you happen to have the associated stacktraces? |
As I mentioned in the ticket, there is no error or stacktrace in the logs, as all the reconciliation loops complete without any issue. Might it be an internal error generated by JOSDK or the Fabric8 client? Would it be useful if I increased the JOSDK log level to DEBUG and attached the log? |
@afalhambra-hivemq yes, please turn on debug-level logging and see if there is anything useful |
Also, is this operator open source? |
It is quite strange, effectively only this changed: |
well, cloning is a potentially costly operation so it's not that big of a surprise… |
The exception below is the only stacktrace I noticed when setting the log level to DEBUG.
It is thrown when there is a rolling restart of the pods. Our reconciler is set to use SSA explicitly:
.withUseSSAToPatchPrimaryResource(true)
.withSSABasedCreateUpdateMatchForDependentResources(true));
It's also odd that the reconciliation loops are taking longer than with version 5.0.2. |
If there are exceptions and retries in place, then it makes sense that the reconciliations take longer, as JOSDK will probably need multiple attempts to perform the same operation that previously worked the first time. |
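To illustrate why a retried operation can push a reconciliation past the one-second mark, here is a minimal sketch of how waiting time adds up under exponential backoff. The class name, the 2000 ms initial interval and the 1.5 multiplier are illustrative assumptions, not values taken from this thread; check the retry configuration of your own operator for the real ones.

```java
// Hypothetical helper, not part of JOSDK: sums the waiting time of an
// exponential-backoff retry policy for a given number of failed attempts.
class RetryBackoffSketch {

    /**
     * Returns the total time spent waiting between attempts when the first
     * {@code failedAttempts} attempts fail and each wait is {@code multiplier}
     * times longer than the previous one.
     */
    static long totalBackoffMillis(int failedAttempts, long initialIntervalMillis, double multiplier) {
        long total = 0;
        double interval = initialIntervalMillis;
        for (int i = 0; i < failedAttempts; i++) {
            total += Math.round(interval); // wait before the next attempt
            interval *= multiplier;        // the next wait is longer
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed policy: 2000 ms initial interval, multiplier 1.5.
        for (int failures = 0; failures <= 3; failures++) {
            System.out.println(failures + " failed attempt(s): "
                    + totalBackoffMillis(failures, 2000, 1.5) + " ms spent waiting");
        }
    }
}
```

Under these assumed values, even a single failed attempt adds a wait that is longer than the one-second threshold mentioned in the report.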
@afalhambra-hivemq what changed is that with SSA, since I guess you are not passing a fresh resource, the resourceVersion is now set on the resource, so it now performs optimistic locking (before it did not). So what you can do is set Will add this into the blog post, but see also: https://javaoperatorsdk.io/blog/2025/02/25/from-legacy-approach-to-server-side-apply/ |
see also: #2710 |
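The optimistic-locking behaviour described above can be sketched with a small in-memory stand-in for an API server. This is not JOSDK or Fabric8 code: the class, its methods and the numeric version counter are hypothetical simplifications (real Kubernetes resourceVersions are opaque strings). It only shows that a patch carrying a stale resourceVersion is rejected with a conflict, while a patch carrying no resourceVersion is applied unconditionally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for an API server, for illustration only.
class OptimisticLockingSketch {

    /** Signals a version mismatch, comparable to an HTTP 409 Conflict response. */
    static class ConflictException extends RuntimeException {
        ConflictException(String message) {
            super(message);
        }
    }

    private final Map<String, Long> versions = new HashMap<>();
    private final Map<String, String> statuses = new HashMap<>();

    void create(String name) {
        versions.put(name, 1L);
        statuses.put(name, "");
    }

    long currentVersion(String name) {
        return versions.get(name);
    }

    /**
     * Patches the status of a resource and returns its new version.
     * A non-null resourceVersion enables optimistic locking: the patch is
     * rejected unless it matches the stored version. A null resourceVersion
     * skips the check.
     */
    long patchStatus(String name, Long resourceVersion, String newStatus) {
        long stored = versions.get(name);
        if (resourceVersion != null && resourceVersion != stored) {
            throw new ConflictException(
                    "conflict: sent version " + resourceVersion + ", stored version " + stored);
        }
        statuses.put(name, newStatus);
        long next = stored + 1;
        versions.put(name, next);
        return next;
    }

    public static void main(String[] args) {
        OptimisticLockingSketch server = new OptimisticLockingSketch();
        server.create("platform");
        long cached = server.currentVersion("platform");

        // Something else updates the resource, e.g. during a rolling restart.
        server.patchStatus("platform", null, "restarting");

        try {
            server.patchStatus("platform", cached, "ready"); // stale version
        } catch (ConflictException e) {
            System.out.println("Rejected: " + e.getMessage());
        }

        // Without a resourceVersion the same patch goes through.
        System.out.println("New version: " + server.patchStatus("platform", null, "ready"));
    }
}
```

In this sketch the caller decides whether locking applies simply by sending or omitting the version, which mirrors the point that a status patch rarely needs the check.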
To give some context here, this is happening during a rolling restart in the main reconciliation loop. We have some managed dependent resources, like in this case a StatefulSet:
@KubernetesDependent(informer = @Informer(labelSelector = LABEL_SELECTOR))
public class StatefulSetResource extends CRUDKubernetesDependentResource<StatefulSet, HiveMQPlatform> {
For that DR, reconciliation is skipped; no action is needed, as it matches the existing one:
Then the main reconciliation is invoked, and when patching the status of the custom resource there, this exception is thrown. But no further changes are made to the custom resource, so it should be a fresh and up-to-date instance, or am I wrong? |
Well, unless something modifies it in the background it should be, but this is clearly about the conflict. For a status patch there is usually no reason to do optimistic locking, so I would just adjust it so that the resource version is not set. |
Ok, I confirm that setting the But I am still curious, as this issue should have come up in the 5.0.2 release as well. I am not sure how this PR for the 5.0.3 release can affect that now. |
Before that PR, the resourceVersion was explicitly set to null for SSA too; after that PR it no longer is. |
but mainly glad it helped, will close this issue if no objections |
Bug Report
What did you do?
We already migrated to the 5.0.2 JOSDK release with no issues but since the 5.0.3 release we are hitting some performance issues and degradation with slow reconciliations (greater than 1 second) on a rolling restart, along with some unexpected error metrics from the JOSDK:
What did you expect to see?
No slow reconciliations (greater than 1 second) and no error metrics when the reconciliation completes successfully.
What did you see instead? Under which circumstances?
Slow reconciliations (greater than 1 second) and unexpected error count metrics with same KubernetesClientException when reconciling on a rolling restart.
Environment
Kubernetes cluster type:
K3S
$ Mention java-operator-sdk version from pom.xml file
5.0.3
$ java -version
openjdk 21.0.3 2024-04-16 LTS
$ kubectl version
Client Version: v1.32.2
Kustomize Version: v5.5.0
Server Version: v1.32.0
Possible Solution
Additional context