Commit 9891064

csviri, xstefank, and metacosm authored
blog: primary resource caching (#2815)
Signed-off-by: Attila Mészáros <[email protected]> Signed-off-by: Chris Laprun <[email protected]> Co-authored-by: Martin Stefanko <[email protected]> Co-authored-by: Chris Laprun <[email protected]>
1 parent 125206e commit 9891064

1 file changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
---
title: How to guarantee allocated values for next reconciliation
date: 2025-05-22
author: >-
  [Attila Mészáros](https://github.com/csviri) and [Chris Laprun](https://github.com/metacosm)
---

We recently released v5.1 of Java Operator SDK (JOSDK). One of the highlights of this release relates to the topic of
so-called
[allocated values](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#representing-allocated-values).

To describe the problem, let's say that our controller needs to create a resource that has a generated identifier, i.e.
a resource whose identifier cannot be directly derived from the custom resource's desired state as specified in its
`spec` field. To record the fact that the resource was successfully created, and to avoid attempting to
create it again in subsequent reconciliations, it is typical for this type of controller to store the
generated identifier in the custom resource's `status` field.

The Java Operator SDK relies on the informers' cache to retrieve resources. These caches, however, are only guaranteed
to be eventually consistent. Suppose some other event triggers a new reconciliation **before** the update made to our
resource's status has had the chance to propagate first to the cluster and then back to the informer cache. In that
case, the resource in the informer cache does **not** contain the latest version as modified by the reconciler. The new
reconciliation would then find the generated identifier missing from the resource status and would attempt to create
the resource again, which is not what we'd like.
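
The race can be reproduced with a toy model. The sketch below is plain Java and does not use any JOSDK API:
`Resource`, `reconcile`, and `created` are hypothetical names introduced only for illustration, and the "informer
cache" is simply an object that is never refreshed between two reconciliations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Toy model (not JOSDK code) of a reconciler that records a generated id in the status. */
public class StaleCacheSketch {

  /** Hypothetical custom resource, reduced to the status field holding the allocated id. */
  static final class Resource {
    String allocatedId; // null until the external resource has been created
  }

  /** Ids of the external resources created so far. */
  static final List<String> created = new ArrayList<>();

  /** Creates the external resource only when the status does not yet record an id. */
  static Resource reconcile(Resource fromCache) {
    if (fromCache.allocatedId != null) {
      return fromCache; // already created, nothing to do
    }
    Resource updated = new Resource();
    updated.allocatedId = UUID.randomUUID().toString();
    created.add(updated.allocatedId);
    return updated; // stands in for the status update sent to the cluster
  }

  public static void main(String[] args) {
    Resource cached = new Resource(); // what the informer cache currently holds
    reconcile(cached);                // first reconciliation creates the resource
    // The status update has not made it back into the cache yet, so the next
    // reconciliation receives the same stale object and creates a duplicate.
    reconcile(cached);
    System.out.println("external resources created: " + created.size()); // prints 2
  }
}
```

Had the second call received the object returned by the first one, the early return would have prevented the duplicate.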

Java Operator SDK now provides a utility class,
[`PrimaryUpdateAndCacheUtils`](https://github.com/operator-framework/java-operator-sdk/blob/main/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/api/reconciler/PrimaryUpdateAndCacheUtils.java),
to handle this particular use case. It keeps the updated resource in an overlay cache on top of the informer's cache,
so your reconciler is guaranteed to see the most up-to-date version of the resource on the next reconciliation:

```java
@Override
public UpdateControl<StatusPatchCacheCustomResource> reconcile(
    StatusPatchCacheCustomResource resource,
    Context<StatusPatchCacheCustomResource> context) {

  // omitted code

  // need fresh copy just because we use the SSA version of update
  var freshCopy = createFreshCopy(resource);
  freshCopy
      .getStatus()
      .setValue(statusWithAllocatedValue());

  // using the utility instead of update control to patch the resource status
  var updated =
      PrimaryUpdateAndCacheUtils.ssaPatchStatusAndCacheResource(resource, freshCopy, context);
  return UpdateControl.noUpdate();
}
```
52+
53+
How does `PrimaryUpdateAndCacheUtils` work?
54+
There are multiple ways to solve this problem, but ultimately, we only provide the solution described below. If you
55+
want to dig deep in alternatives, see
56+
this [PR](https://github.com/operator-framework/java-operator-sdk/pull/2800/files).

The trick is to intercept the resource that the reconciler updated and cache that version in an additional cache on top
of the informer's cache. Subsequently, if the reconciler needs to read the resource, the SDK first checks whether it is
in the overlay cache and reads it from there if present; otherwise it reads it from the informer's cache. Whenever the
informer receives an event with a fresh resource, we remove the resource from the overlay cache, since the version
delivered by the informer is the more recent one. But this **works only** if the reconciler updates the resource using
**optimistic locking**.
If the update fails on conflict, because the resource has already been updated on the cluster before we got
the chance to get our update in, we simply wait and poll the informer cache until the new resource version from the
server appears there, and then try to apply our updates again on top of that version, again with optimistic
locking.
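
The conflict-and-retry behaviour can be sketched in plain Java. This is only an illustration under stated
assumptions, not the SDK's implementation: `FakeServer`, `Versioned`, and `updateWithRetry` are hypothetical names, the
server is modelled as a single compare-and-set cell, and re-reading from the fake server stands in for polling the
informer cache until it has caught up.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

/** Toy model (not JOSDK code) of an update guarded by optimistic locking, retried on conflict. */
public class OptimisticRetrySketch {

  /** Hypothetical resource: a version number plus the status value we want to set. */
  record Versioned(long resourceVersion, String statusValue) {}

  /** Stand-in for the API server: accepts an update only if it is based on the stored version. */
  static final class FakeServer {
    private final AtomicReference<Versioned> stored =
        new AtomicReference<>(new Versioned(1, null));

    /** Returns the newly stored object, or null when the update conflicts. */
    Versioned update(Versioned desired) {
      Versioned current = stored.get();
      if (current.resourceVersion() != desired.resourceVersion()) {
        return null; // conflict: someone else updated the resource first
      }
      Versioned next = new Versioned(current.resourceVersion() + 1, desired.statusValue());
      return stored.compareAndSet(current, next) ? next : null;
    }

    Versioned read() {
      return stored.get();
    }
  }

  /** Applies the change on top of the latest known version until the server accepts it. */
  static Versioned updateWithRetry(
      FakeServer server, Versioned cached, UnaryOperator<Versioned> change, int maxAttempts) {
    Versioned base = cached;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      Versioned result = server.update(change.apply(base));
      if (result != null) {
        return result;
      }
      // Assumption: a real implementation would wait for the informer cache here.
      base = server.read();
    }
    throw new IllegalStateException("update still conflicting after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) {
    FakeServer server = new FakeServer();
    Versioned stale = server.read();         // version 1, as seen by our reconciler
    server.update(new Versioned(1, null));   // another party updates first: now version 2
    Versioned result = updateWithRetry(
        server, stale, v -> new Versioned(v.resourceVersion(), "allocated-42"), 3);
    System.out.println(result.resourceVersion() + " " + result.statusValue()); // prints "3 allocated-42"
  }
}
```

The first attempt is rejected because it is based on version 1; the second is applied on top of version 2 and succeeds.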

So why is optimistic locking required? We hinted at it above, but the gist of it is that if another party updates the
resource before we get a chance to, we wouldn't be able to handle the resulting situation correctly in all
cases. The informer would receive that new event before our own update had a chance to propagate. Without
optimistic locking, there wouldn't be a fail-proof way to determine which update should prevail (i.e. which occurred
first), in particular in the event of the informer losing the connection to the cluster or other edge cases (the joys of
distributed computing!).

Optimistic locking simplifies the situation and provides us with stronger guarantees: if the update succeeds, then we
can be sure we have the proper resource version in our caches. The next event will contain our update in all cases.
Because we know that, we can also be sure that we can evict the cached resource in the overlay cache whenever we receive
a new event. The overlay cache is only used if the SDK detects that the original resource (i.e. the one before we
applied our status update in the example above) is still in the informer's cache.
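
As a rough illustration of these rules, here is a minimal overlay cache in plain Java. It is a sketch under
assumptions rather than the SDK's actual code: `Snapshot`, `cacheAfterUpdate`, and `onInformerEvent` are hypothetical
names, and both caches are modelled as simple maps keyed by resource name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model (not JOSDK code) of an overlay cache sitting on top of an informer cache. */
public class OverlayCacheSketch {

  /** Hypothetical resource snapshot. */
  record Snapshot(String name, String resourceVersion, String allocatedId) {}

  private final Map<String, Snapshot> informerCache = new HashMap<>();
  private final Map<String, Snapshot> overlay = new HashMap<>();

  /** Read path: prefer the overlay entry, otherwise fall back to the informer cache. */
  Optional<Snapshot> get(String name) {
    Snapshot fromOverlay = overlay.get(name);
    return Optional.ofNullable(fromOverlay != null ? fromOverlay : informerCache.get(name));
  }

  /** Called with the server's response after an update with optimistic locking succeeded. */
  void cacheAfterUpdate(Snapshot original, Snapshot updated) {
    Snapshot cached = informerCache.get(updated.name());
    // Use the overlay only while the informer still holds the version we started from;
    // otherwise the informer cache is assumed to already be up to date.
    if (cached != null && cached.resourceVersion().equals(original.resourceVersion())) {
      overlay.put(updated.name(), updated);
    }
  }

  /** Any event from the informer is at least as recent as our update, so evict the overlay entry. */
  void onInformerEvent(Snapshot fresh) {
    informerCache.put(fresh.name(), fresh);
    overlay.remove(fresh.name());
  }

  public static void main(String[] args) {
    OverlayCacheSketch caches = new OverlayCacheSketch();
    Snapshot v1 = new Snapshot("my-cr", "1", null);
    caches.onInformerEvent(v1); // the informer knows version 1

    Snapshot v2 = new Snapshot("my-cr", "2", "id-123"); // server response to our status update
    caches.cacheAfterUpdate(v1, v2);
    System.out.println(caches.get("my-cr").orElseThrow().allocatedId()); // prints "id-123"

    caches.onInformerEvent(v2); // the informer catches up, the overlay entry is evicted
    System.out.println(caches.overlay.containsKey("my-cr")); // prints "false"
  }
}
```

Eviction on every informer event is safe here only because of the optimistic-locking guarantee: a successful update means that the next event necessarily includes it.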

The following diagram sums up the process:

```mermaid
flowchart TD
    A["Update Resource with Lock"] --> B{"Is Successful"}
    B -- Fails on conflict --> D["Poll the Informer cache until resource updated"]
    D --> A
    B -- Yes --> n2{"Original resource still in informer cache?"}
    n2 -- Yes --> C["Cache the resource in overlay cache"]
    n2 -- No --> n3["Informer cache already contains up-to-date version, do not use overlay cache"]
```
