Handling ResourceVersion wrap around #516

Closed
chriskinsman opened this issue Sep 30, 2020 · 10 comments

Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@chriskinsman
Contributor

I have a watch set up for which I use the resourceVersion as a high watermark. Once the cluster has been up for quite a while, the resourceVersion will eventually wrap around and restart at 0.

What strategy do folks use to catch this and reset the high watermark to zero? Normally you only advance the high watermark when the resourceVersion of the current API object is greater than the current high watermark.

In this special case you need to reset it to a lower value.
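
Roughly the pattern in question, as a minimal sketch only (assuming @kubernetes/client-node's Watch class against pods in the default namespace; the numeric comparison is exactly the part that breaks if the counter ever restarts):

```ts
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const watch = new k8s.Watch(kc);

// Treat resourceVersion as a numeric high watermark (the API actually documents
// it as an opaque string, which is where this scheme gets into trouble).
let highWaterMark = 0;

watch.watch(
  '/api/v1/namespaces/default/pods',
  {},
  (phase: string, pod: any) => {
    const rv = Number(pod.metadata?.resourceVersion);
    if (rv > highWaterMark) {
      highWaterMark = rv;
      // ...process the event, since it looks "newer" than anything seen so far
    }
    // If the cluster's counter ever restarted near 0, every subsequent event
    // would be silently skipped here -- the wrap-around problem being asked about.
  },
  (err: any) => {
    // Watch ended: restart it, carrying the old high watermark forward.
  },
);
```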

Will a status message get sent into the watch? Will the watch end with a particular error that can be caught to trigger the reset?

This is hard to test, since it can take a while to get a k8s cluster's resourceVersion to wrap around!

Thanks!

Chris

@brendandburns
Contributor

A watch should always be paired with a timeout and a list in a loop (that's what the informer does).

Either the API server terminates the watch or it times out; in either case you re-list all objects, and the new list object (e.g. PodList) contains the resourceVersion that is the next "high water mark".

Note that calling it a "high water mark" is something of a misnomer, because, as you note, you should not assume that resourceVersion is monotonically increasing.

So the correct strategy is not to maintain any "high water mark" at all, but rather to always pair list and watch, and always use the resourceVersion returned by the list as the starting point for the watch.
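
A minimal sketch of that list-then-watch loop (assuming a pre-1.0 @kubernetes/client-node where list calls resolve to { body }; the resource path, namespace, and retry delay are illustrative):

```ts
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const core = kc.makeApiClient(k8s.CoreV1Api);
const watch = new k8s.Watch(kc);

async function listAndWatch(): Promise<void> {
  // 1. List: the PodList's own metadata.resourceVersion identifies the snapshot.
  const list = await core.listNamespacedPod('default');
  const startRv = list.body.metadata?.resourceVersion;

  for (const pod of list.body.items) {
    // ...reconcile against the full current state
  }

  // 2. Watch from the list's resourceVersion. When the watch ends (timeout,
  //    410 Gone, network error), start over with a fresh list.
  await watch.watch(
    '/api/v1/namespaces/default/pods',
    { resourceVersion: startRv },
    (phase: string, pod: any) => {
      // ...handle ADDED / MODIFIED / DELETED events incrementally
    },
    () => setTimeout(listAndWatch, 5000),
  );
}

listAndWatch().catch((err) => console.error(err));
```

The informer in this client packages up essentially the same loop, plus an in-memory cache of the objects it has seen.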

@chriskinsman
Contributor Author

The problem with that is that the watch timeouts are relatively short (i.e. < 60 minutes), and on each restart you then see a large number of objects over again. Using resourceVersion as a high water mark is the only way I can see to prevent reprocessing objects multiple times.

Do most people just implement state machines on top of the watches, using a pattern of: read the current state, compare the new object to the current state, and determine whether it is allowed to transition to the new state?

Thanks!

@brendandburns
Contributor

Well, for each resource you can keep track of the latest resourceVersion that you've seen in a hashtable,

e.g. a map<resourceName, resourceVersion>, and quickly check each incoming object to see whether it has changed.

resourceVersions are not guaranteed to be monotonically increasing when delivered in an event stream, so you really can't track a high water mark.
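
A minimal sketch of that bookkeeping (illustrative names; the point is that only string equality is checked, with no ordering assumptions):

```ts
// Last resourceVersion seen for each object, keyed by "namespace/name".
const lastSeen = new Map<string, string>();

function hasChanged(obj: any): boolean {
  const key = `${obj.metadata?.namespace}/${obj.metadata?.name}`;
  const rv = obj.metadata?.resourceVersion ?? '';
  if (lastSeen.get(key) === rv) {
    return false; // same version already processed; nothing to do
  }
  lastSeen.set(key, rv); // string equality only; no ordering assumed
  return true;
}
```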

@chriskinsman
Contributor Author

> Well, for each resource you can keep track of the latest resourceVersion that you've seen in a hashtable

That works well until you have a container/controller restart. Persisting it is fairly expensive in a cluster with a lot of changing resources, e.g. jobs that are constantly spinning up, completing, etc.

> resourceVersions are not guaranteed to be monotonically increasing when delivered in an event stream so you really can't track a high water mark.

This is the first time I have seen anyone say this. I know they are not guaranteed to increment by 1, but in my experience they are always increasing unless the event is a repeat of a resourceVersion that has already been sent.

Do you have a reference for this? I couldn't find anything in the Kubernetes docs in my multiple read-throughs.

Thanks for your input on this; I really do appreciate you taking the time. I haven't seen much discussion of this anywhere.

@brendandburns
Contributor

There's a big discussion here:

kubernetes-client/python#609

Note that this is all pretty academic, because the resourceVersion in etcd is an int64, so it will take a long time before it overflows.
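
As a rough back-of-the-envelope check: 2^63 ≈ 9.2 × 10^18, so even at a sustained one million revisions per second it would take on the order of 290,000 years to exhaust that range.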

@brendandburns
Contributor

Regarding persisting it: you shouldn't persist it. When your controller restarts it has to re-list all the objects it's interested in anyway, so you will rebuild the table when the controller restarts.

(at least that's the general pattern I've seen in most places)
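
A minimal sketch of that pattern using this client's informer, which re-lists and rebuilds its cache in memory every time it (re)starts (assuming a pre-1.0 @kubernetes/client-node; the path, namespace, and restart delay are illustrative):

```ts
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const core = kc.makeApiClient(k8s.CoreV1Api);

// The informer pairs a list with a watch and keeps its cache purely in memory,
// so nothing needs to be persisted across controller restarts.
const informer = k8s.makeInformer(
  kc,
  '/api/v1/namespaces/default/pods',
  () => core.listNamespacedPod('default'),
);

informer.on('add', (pod: k8s.V1Pod) => { /* ...new object seen */ });
informer.on('update', (pod: k8s.V1Pod) => { /* ...existing object changed */ });
informer.on('delete', (pod: k8s.V1Pod) => { /* ...object removed */ });
informer.on('error', (err: any) => {
  // On error the informer stops; restart it after a delay and it re-lists,
  // rebuilding the set of objects it has seen.
  setTimeout(() => informer.start(), 5000);
});

informer.start();
```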

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 1, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 1, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-contributor-experience at kubernetes/community.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
