Handling ResourceVersion wrap around #516
I have a watch set up for which I use the resourceVersion as a high watermark. The resourceVersion will eventually wrap around and restart at 0 once the cluster has been up for quite a while.
What strategy do folks use to catch this and reset the high watermark? Normally you only advance the high watermark when the resourceVersion of the current API object is greater than the current high watermark; in this special case you need to reset it to a lower value.
Will a status message get sent into the watch? Will the watch end with a particular error that can be caught to trigger the reset?
This is hard to test, since it can take a while to get a k8s cluster's resourceVersion to wrap around!
Thanks!
Chris
Comments
So a watch should always be paired with a timeout and a list, in a loop (that's what the informer does). In this case either the API server terminates the watch or it times out, at which point you re-list all objects, and this new list object (e.g. PodList) contains the resourceVersion that is the next "high water mark". Note that calling it a "high water mark" is something of a misnomer, because, as you note, you should not assume that resourceVersion is monotonically increasing. So the correct strategy is not to maintain any "high water mark" or such, but rather to always pair list and watch and always use the resourceVersion returned by the list as the starting point for the watch.
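For concreteness, here is a minimal sketch of that list-then-watch loop, assuming Go and client-go (the informer machinery wraps exactly this pattern); the kubeconfig wiring, the "default" namespace, and the choice of Pods are illustrative, not from this thread:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// listWatchLoop pairs a full list with a watch that starts from the
// list's resourceVersion, and re-lists whenever the watch ends.
func listWatchLoop(ctx context.Context, clientset *kubernetes.Clientset, namespace string) error {
	for {
		// The list gives a consistent snapshot plus the collection-level
		// resourceVersion to start the watch from.
		pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
		if err != nil {
			return err
		}
		// ... reconcile your state against pods.Items here ...

		// Start the watch at the list's resourceVersion, never at a
		// per-object "high water mark" tracked across events.
		w, err := clientset.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
			ResourceVersion: pods.ResourceVersion,
		})
		if err != nil {
			return err
		}
		for event := range w.ResultChan() {
			fmt.Printf("event: %s %T\n", event.Type, event.Object)
		}
		// The channel closed: the server terminated the watch or it
		// timed out, so loop around and re-list.
	}
}

func main() {
	// Hypothetical kubeconfig wiring; any clientset works here.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	if err := listWatchLoop(context.Background(), kubernetes.NewForConfigOrDie(config), "default"); err != nil {
		panic(err)
	}
}
```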
The problem with that is that the timeouts for the watch are relatively short, i.e. < 60 minutes, and on restart you then see a large number of objects over again. Using resourceVersion as a high water mark is really the only way I can see to prevent reprocessing objects multiple times. Do most people just implement state machines on top of the watches, using a pattern of reading the current state, comparing the new object to the current state, and determining whether a transition to the new state is allowed? Thanks!
Well, for each resource you can keep track of the latest version you processed, e.g. a map<resourceName, resourceVersion>, and quickly check each incoming object to see whether it has changed. resourceVersions are not guaranteed to be monotonically increasing when delivered in an event stream, so you really can't track a high water mark.
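A sketch of that per-object check, continuing the same hypothetical Go setup (the changeTracker type and its method names are mine, not from this thread); note the map is keyed by object name, and resourceVersions are only ever compared for equality:

```go
// changeTracker remembers the last-seen resourceVersion per object
// name. Equality, not ordering, decides whether an object changed:
// resourceVersions are opaque strings and must never be compared
// with < or >.
type changeTracker struct {
	lastSeen map[string]string // object name -> last resourceVersion
}

func newChangeTracker() *changeTracker {
	return &changeTracker{lastSeen: make(map[string]string)}
}

// changed reports whether this exact version was already processed,
// recording it if not.
func (t *changeTracker) changed(name, resourceVersion string) bool {
	if prev, ok := t.lastSeen[name]; ok && prev == resourceVersion {
		return false
	}
	t.lastSeen[name] = resourceVersion
	return true
}
```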
That works well until you have a container/controller restart. Persisting the map is fairly expensive in a cluster with a lot of changing resources, e.g. jobs that are constantly spinning up, completing, etc.
This is the first time I have seen anyone say this. I know they are not guaranteed to increment by 1, but in my experience they are always increasing unless they are a repeat of a resourceVersion that has already been sent. Do you have a reference for this? I couldn't find anything in the Kubernetes docs in my multiple read-throughs. Thanks for your input on this; I really do appreciate you taking the time. I haven't seen much discussion of this anywhere.
There's a big discussion here:
Note that this is all pretty academic, because the resourceVersion in etcd is an int64, so it will take a long time before it overflows.
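As rough back-of-the-envelope arithmetic behind "a long time", assuming a (hypothetical) sustained one million etcd revisions per second:

$$ \frac{2^{63}}{10^{6}\,\text{s}^{-1}} \approx 9.2 \times 10^{12}\ \text{s} \approx 292{,}000\ \text{years} $$

so overflow of the underlying counter is not a practical concern.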
Regarding the "persisting it": you shouldn't persist it. When your controller restarts it has to re-list all the objects it's interested in anyway, so you will rebuild the map when the controller restarts. (At least that's the general pattern I've seen in most places.)
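Continuing the hypothetical changeTracker sketch from above, rebuilding after a restart is just seeding the map from the fresh list before the watch starts:

```go
// In the same hypothetical file as the changeTracker above.
import corev1 "k8s.io/api/core/v1"

// rebuild seeds the tracker from a fresh list after a controller
// restart; the map lives only in memory and is never persisted.
func rebuild(pods *corev1.PodList) *changeTracker {
	t := newChangeTracker()
	for i := range pods.Items {
		t.lastSeen[pods.Items[i].Name] = pods.Items[i].ResourceVersion
	}
	return t
}
```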
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community.
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.