Handling ResourceVersion wrap around #516

Closed
chriskinsman opened this issue Sep 30, 2020 · 10 comments

Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@chriskinsman
Contributor

I have a watch set up for which I use the resourceVersion as a high watermark. Once the cluster has been up for quite a while, the resourceVersion will eventually wrap around and restart at 0.

What strategy do folks use to catch this and reset the high watermark to zero? Normally you only advance the high watermark when the resourceVersion of the current API object is greater than the current high watermark.

In this special case you need to reset it to a lower value.
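
Roughly the pattern in question, as a minimal sketch only (assuming @kubernetes/client-node's Watch class against pods in the default namespace; the numeric comparison is exactly the part that breaks if the counter ever restarts):

```ts
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const watch = new k8s.Watch(kc);

// Treat resourceVersion as a numeric high watermark (the API actually documents
// it as an opaque string, which is where this scheme gets into trouble).
let highWaterMark = 0;

watch.watch(
  '/api/v1/namespaces/default/pods',
  {},
  (phase: string, pod: any) => {
    const rv = Number(pod.metadata?.resourceVersion);
    if (rv > highWaterMark) {
      highWaterMark = rv;
      // ...process the event, since it looks "newer" than anything seen so far
    }
    // If the cluster's counter ever restarted near 0, every subsequent event
    // would be silently skipped here -- the wrap-around problem being asked about.
  },
  (err: any) => {
    // Watch ended: restart it, carrying the old high watermark forward.
  },
);
```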

Will a status message get sent into the watch? Will the watch end with a particular error that can be caught to trigger the reset?

This is hard to test, since it can take a while to get a k8s cluster's resourceVersion to wrap around!

Thanks!

Chris

@brendandburns
Contributor

A watch should always be paired with a timeout and a list in a loop (that's what the informer does).

Either the API server terminates the watch or it times out; in either case you re-list all objects, and the new list object (e.g. PodList) contains the resourceVersion that is the next "high water mark".

Note that calling it a "high water mark" is something of a misnomer, because, as you note, you should not assume that resourceVersion is monotonically increasing.

So the correct strategy is not to maintain any "high water mark" at all, but rather to always pair list and watch, and always use the resourceVersion returned by the list as the starting point for the watch.
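
A minimal sketch of that list-then-watch loop (assuming a pre-1.0 @kubernetes/client-node where list calls resolve to { body }; the resource path, namespace, and retry delay are illustrative):

```ts
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const core = kc.makeApiClient(k8s.CoreV1Api);
const watch = new k8s.Watch(kc);

async function listAndWatch(): Promise<void> {
  // 1. List: the PodList's own metadata.resourceVersion identifies the snapshot.
  const list = await core.listNamespacedPod('default');
  const startRv = list.body.metadata?.resourceVersion;

  for (const pod of list.body.items) {
    // ...reconcile against the full current state
  }

  // 2. Watch from the list's resourceVersion. When the watch ends (timeout,
  //    410 Gone, network error), start over with a fresh list.
  await watch.watch(
    '/api/v1/namespaces/default/pods',
    { resourceVersion: startRv },
    (phase: string, pod: any) => {
      // ...handle ADDED / MODIFIED / DELETED events incrementally
    },
    () => setTimeout(listAndWatch, 5000),
  );
}

listAndWatch().catch((err) => console.error(err));
```

The informer in this client packages up essentially the same loop, plus an in-memory cache of the objects it has seen.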

@chriskinsman
Contributor Author

The problem with that is that the watch timeouts are relatively short (i.e. < 60 minutes), and on each restart you then see a large number of objects over again. Using resourceVersion as a high water mark is the only way I can see to prevent reprocessing objects multiple times.

Do most people just implement state machines on top of the watches, using a pattern of: read the current state, compare the new object to the current state, and determine whether it is allowed to transition to the new state?

Thanks!

@brendandburns
Contributor

Well, for each resource you can keep track of the latest resourceVersion that you've seen in a hashtable,

e.g. a map<resourceName, resourceVersion>, and quickly check each incoming object to see whether it has changed.

resourceVersions are not guaranteed to be monotonically increasing when delivered in an event stream, so you really can't track a high water mark.
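
A minimal sketch of that bookkeeping (illustrative names; the point is that only string equality is checked, with no ordering assumptions):

```ts
// Last resourceVersion seen for each object, keyed by "namespace/name".
const lastSeen = new Map<string, string>();

function hasChanged(obj: any): boolean {
  const key = `${obj.metadata?.namespace}/${obj.metadata?.name}`;
  const rv = obj.metadata?.resourceVersion ?? '';
  if (lastSeen.get(key) === rv) {
    return false; // same version already processed; nothing to do
  }
  lastSeen.set(key, rv); // string equality only; no ordering assumed
  return true;
}
```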

@chriskinsman
Contributor Author

> Well, for each resource you can keep track of the latest resourceVersion that you've seen in a hashtable

That works well until you have a container/controller restart. Persisting it is fairly expensive in a cluster with a lot of changing resources, e.g. jobs that are constantly spinning up, completing, etc.

> resourceVersions are not guaranteed to be monotonically increasing when delivered in an event stream so you really can't track a high water mark.

This is the first time I have seen anyone say this. I know they are not guaranteed to increment by 1, but in my experience they are always increasing unless the event is a repeat of a resourceVersion that has already been sent.

Do you have a reference for this? I couldn't find anything in the Kubernetes docs in my multiple read-throughs.

Thanks for your input on this; I really do appreciate you taking the time. I haven't seen much discussion of this anywhere.

@brendandburns
Contributor

There's a big discussion here:

kubernetes-client/python#609

Note that this is all pretty academic, because the resourceVersion in etcd is an int64, so it will take a long time before it overflows.
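
As a rough back-of-the-envelope check: 2^63 ≈ 9.2 × 10^18, so even at a sustained one million revisions per second it would take on the order of 290,000 years to exhaust that range.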

@brendandburns
Contributor

Regarding persisting it: you shouldn't persist it. When your controller restarts it has to re-list all the objects it's interested in anyway, so you will rebuild the table when the controller restarts.

(at least that's the general pattern I've seen in most places)
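
A minimal sketch of that pattern using this client's informer, which re-lists and rebuilds its cache in memory every time it (re)starts (assuming a pre-1.0 @kubernetes/client-node; the path, namespace, and restart delay are illustrative):

```ts
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const core = kc.makeApiClient(k8s.CoreV1Api);

// The informer pairs a list with a watch and keeps its cache purely in memory,
// so nothing needs to be persisted across controller restarts.
const informer = k8s.makeInformer(
  kc,
  '/api/v1/namespaces/default/pods',
  () => core.listNamespacedPod('default'),
);

informer.on('add', (pod: k8s.V1Pod) => { /* ...new object seen */ });
informer.on('update', (pod: k8s.V1Pod) => { /* ...existing object changed */ });
informer.on('delete', (pod: k8s.V1Pod) => { /* ...object removed */ });
informer.on('error', (err: any) => {
  // On error the informer stops; restart it after a delay and it re-lists,
  // rebuilding the set of objects it has seen.
  setTimeout(() => informer.start(), 5000);
});

informer.start();
```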

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 1, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 1, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-contributor-experience at kubernetes/community.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
