| 1 | +--- |
| 2 | +title: How to guarantee allocated values for next reconciliation |
| 3 | +date: 2025-05-22 |
| 4 | +author: >- |
| 5 | + [Attila Mészáros](https://github.com/csviri) and [Chris Laprun](https://github.com/metacosm) |
| 6 | +--- |
| 7 | + |
| 8 | +We recently released v5.1 of Java Operator SDK (JOSDK). One of the highlights of this release concerns the topic of
| 9 | +so-called
| 10 | +[allocated values](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#representing-allocated-values |
| 11 | +). |
| 12 | + |
| 13 | +To describe the problem, let's say that our controller needs to create a resource that has a generated identifier, i.e.
| 14 | +a resource whose identifier cannot be directly derived from the custom resource's desired state as specified in its
| 15 | +`spec` field. To record the fact that the resource was successfully created, and to avoid attempting to
| 16 | +recreate the resource in subsequent reconciliations, it is typical for this type of controller to store the
| 17 | +generated identifier in the custom resource's `status` field.
| 18 | + |
| 19 | +The Java Operator SDK relies on the informers' cache to retrieve resources. These caches, however, are only guaranteed
| 20 | +to be eventually consistent. Suppose some other event triggers a new reconciliation **before** the update made to our
| 21 | +resource status has had the chance to propagate first to the cluster and then back to the informer cache.
| 22 | +In that case, the resource in the informer cache does **not** yet contain the latest
| 23 | +version as modified by the reconciler. The new reconciliation would then find the generated identifier
| 24 | +missing from the resource status and, therefore, attempt to create the resource again, which is not
| 25 | +what we'd like.
| 26 | + |
| 27 | +Java Operator SDK now provides a utility class [ |
| 28 | +`PrimaryUpdateAndCacheUtils`](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/PrimaryUpdateAndCacheUtils.java) |
| 29 | +to handle this particular use case. It maintains an additional (overlay) cache on top of the informer's cache, so that
| 30 | +your reconciler is guaranteed to see the most up-to-date version of the resource on the next reconciliation:
| 31 | + |
| 32 | +```java |
| 33 | + |
| 34 | +@Override |
| 35 | +public UpdateControl<StatusPatchCacheCustomResource> reconcile( |
| 36 | + StatusPatchCacheCustomResource resource, |
| 37 | + Context<StatusPatchCacheCustomResource> context) { |
| 38 | + |
| 39 | + // omitted code |
| 40 | + |
| 41 | + var freshCopy = createFreshCopy(resource); // a fresh copy is needed only because we use the SSA (server-side apply) variant of the update
| 42 | + freshCopy
| 43 | + .getStatus()
| 44 | + .setValue(statusWithAllocatedValue());
| 45 | +
| 46 | + // use the utility, rather than UpdateControl, to patch the resource status and cache the result
| 47 | + var updated =
| 48 | + PrimaryUpdateAndCacheUtils.ssaPatchStatusAndCacheResource(resource, freshCopy, context);
| 49 | + return UpdateControl.noUpdate(); // the status was already patched above
| 50 | +} |
| 51 | +``` |
| 52 | + |
| 53 | +How does `PrimaryUpdateAndCacheUtils` work? |
| 54 | +There are multiple ways to solve this problem, but ultimately, we only provide the solution described below. If you
| 55 | +want to dig deeper into the alternatives, see
| 56 | +this [PR](https://github.com/operator-framework/java-operator-sdk/pull/2800/files).
| 57 | + |
| 58 | +The trick is to intercept the resource that the reconciler updated and cache that version in an additional cache on top |
| 59 | +of the informer's cache. Subsequently, if the reconciler needs to read the resource, the SDK will first check if it is |
| 60 | +in the overlay cache and read it from there if present, otherwise read it from the informer's cache. If the informer |
| 61 | +receives an event with a fresh resource, we always remove the resource from the overlay cache, since the event carries
| 62 | +a more recent resource. However, this **only works** if the reconciler updates the resource using **optimistic locking**.
| 63 | +If the update fails on conflict, because the resource has already been updated on the cluster before we got
| 64 | +the chance to get our update in, we simply wait and poll the informer cache until the new resource version from the
| 65 | +server appears in the informer's cache.
| 66 | +We then try to apply our updates again, based on the updated version from the server and again with optimistic
| 67 | +locking.
| 68 | + |
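The two-level lookup described above can be illustrated with a small, self-contained sketch. It is not the SDK's actual implementation: `OverlayCacheSketch`, `Resource`, and all method names are hypothetical, and the resource is reduced to a name, a numeric version, and a status value purely for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the overlay cache idea; not a JOSDK class.
final class OverlayCacheSketch {

  // Minimal stand-in for a Kubernetes resource (assumption: a numeric version for readability).
  static final class Resource {
    final String name;
    final long resourceVersion;
    final String status;

    Resource(String name, long resourceVersion, String status) {
      this.name = name;
      this.resourceVersion = resourceVersion;
      this.status = status;
    }
  }

  private final Map<String, Resource> informerCache = new ConcurrentHashMap<>();
  private final Map<String, Resource> overlayCache = new ConcurrentHashMap<>();

  // Called after the reconciler's own update succeeded: remember the updated version.
  void cacheOwnUpdate(Resource updated) {
    overlayCache.put(updated.name, updated);
  }

  // Called when the informer receives an event: store the fresh resource
  // and evict the overlay entry, since the event carries a more recent resource.
  void onInformerEvent(Resource fromServer) {
    informerCache.put(fromServer.name, fromServer);
    overlayCache.remove(fromServer.name);
  }

  // Read path: the overlay cache is checked first, the informer cache is the fallback.
  Optional<Resource> get(String name) {
    Resource overlaid = overlayCache.get(name);
    return Optional.ofNullable(overlaid != null ? overlaid : informerCache.get(name));
  }
}
```

In this sketch, a read that happens between `cacheOwnUpdate` and the next `onInformerEvent` returns the reconciler's own version, which is the window in which the informer cache alone would still be stale.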
| 69 | +So why is optimistic locking required? We hinted at it above, but the gist of it is that if another party updates the
| 70 | +resource before we get a chance to, we wouldn't be able to handle the resulting situation correctly in all
| 71 | +cases. The informer would receive that new event before our own update would get a chance to propagate. Without
| 72 | +optimistic locking, there wouldn't be a fail-proof way to determine which update should prevail (i.e. which occurred |
| 73 | +first), in particular in the event of the informer losing the connection to the cluster or other edge cases (the joys of |
| 74 | +distributed computing!). |
| 75 | + |
| 76 | +Optimistic locking simplifies the situation and provides us with stronger guarantees: if the update succeeds, then we |
| 77 | +can be sure we have the proper resource version in our caches. The next event will contain our update in all cases. |
| 78 | +Because we know that, we can also be sure that we can evict the cached resource in the overlay cache whenever we receive |
| 79 | +a new event. The overlay cache is only used if the SDK detects that the original resource (i.e. the one before we |
| 80 | +applied our status update in the example above) is still in the informer's cache. |
| 81 | + |
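To make the role of optimistic locking more concrete, here is a minimal sketch of an update that is rejected on a version mismatch and retried against a newer base. Everything in it is an assumption made for illustration: `FakeServer` is an in-memory stand-in for the API server, the numeric `resourceVersion` replaces the opaque string Kubernetes actually uses, and re-reading from the fake server stands in for polling the informer cache.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of update-with-lock plus retry; not JOSDK code.
final class OptimisticUpdateSketch {

  static final class Resource {
    final long resourceVersion;
    final String status;

    Resource(long resourceVersion, String status) {
      this.resourceVersion = resourceVersion;
      this.status = status;
    }
  }

  // In-memory stand-in for the API server.
  static final class FakeServer {
    private final AtomicReference<Resource> stored;

    FakeServer(Resource initial) {
      this.stored = new AtomicReference<>(initial);
    }

    Resource read() {
      return stored.get();
    }

    // Applies the update only if the caller's base version matches the stored one.
    // Returns the new resource, or null on conflict (a real server answers HTTP 409).
    Resource updateWithLock(Resource base, String newStatus) {
      Resource current = stored.get();
      if (current.resourceVersion != base.resourceVersion) {
        return null;
      }
      Resource next = new Resource(current.resourceVersion + 1, newStatus);
      return stored.compareAndSet(current, next) ? next : null;
    }
  }

  // Retries the locked update, refreshing the base version after each conflict.
  static Resource updateWithRetry(FakeServer server, Resource cached, String newStatus, int maxAttempts) {
    Resource base = cached;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      Resource updated = server.updateWithLock(base, newStatus);
      if (updated != null) {
        return updated; // success: this version is safe to place in an overlay cache
      }
      // Conflict: real code would wait for the informer cache to catch up;
      // the sketch simply re-reads the current version from the fake server.
      base = server.read();
    }
    throw new IllegalStateException("Update still conflicting after " + maxAttempts + " attempts");
  }
}
```

Because a successful call implies that no other writer got in between the base version and the update, the returned resource is known to be the successor of the version we read, which is the guarantee the text above relies on.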
| 82 | +The following diagram sums up the process: |
| 83 | + |
| 84 | +```mermaid |
| 85 | +flowchart TD |
| 86 | + A["Update Resource with Lock"] --> B{"Is the update successful?"}
| 87 | + B -- Fails on conflict --> D["Poll the Informer cache until resource updated"] |
| 88 | + D --> A |
| 89 | + B -- Yes --> n2{"Original resource still in informer cache?"} |
| 90 | + n2 -- Yes --> C["Cache the resource in overlay cache"] |
| 91 | + n2 -- No --> n3["Informer cache already contains up-to-date version, do not use overlay cache"] |
| 92 | +``` |