timeout and retry logic #40

Open · mMerlin opened this issue Aug 1, 2024 · 3 comments · May be fixed by #41

Comments
mMerlin (Contributor) commented Aug 1, 2024

I would like to discuss the way the NTP library handles updating the time synchronization. I have a sketch that repeatedly gets the current time using NTP.datetime(). When the update interval (.next_sync) expires, the library attempts to get a new NTP-based time. When that fails (ETIMEDOUT is common in my environment), it is normally valid to continue using the previous monotonic_start_ns and retry the update later.

I have a local patch that does that naively. It just traps the timeout and returns, allowing the calling (datetime) method to continue with the existing offset information. The problem with that is that the update is reattempted every time datetime is referenced, often resulting in a whole string of timeouts until the referenced server catches up again. That probably hits the server far too often, even though no packet is successfully retrieved.

Should it be the responsibility of the NTP library to manage retry logic? That would seem to require updating next_sync on failures, or other logic to prevent additional update attempts immediately after a failure.

Alternatively, without my patch logic, the caller (of NTP.datetime) could trap the exception and choose how to recover. A method in NTP to increase next_sync by _cache_seconds would facilitate that.
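
As a sketch of what that caller-side recovery could look like (only ntp.datetime is existing adafruit_ntp API here; defer_sync() is the hypothetical "push next_sync out by cache_seconds" method described above):

```python
# Caller-side recovery sketch. defer_sync() is hypothetical, everything else
# is the existing public API. Assumes at least one earlier successful sync.
import errno

def tolerant_datetime(ntp):
    try:
        return ntp.datetime
    except OSError as exc:
        if exc.errno != errno.ETIMEDOUT:
            raise
        ntp.defer_sync()      # hypothetical: next_sync += cache_seconds
        return ntp.datetime   # served from the previous offset, no network hit
```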

With any failure/wait/retry logic, it would probably be good to have a way for callers to check how stale the current synchronization is: a public 'last_synchronized' timestamp, as a struct_time, a monotonic_ns value, or a delta time in either format.

Additional retry options could include using a list of servers, sequencing through the list until one works. Once an update is successful, that server could either become the permanent choice, or the code could fall back to the initial (default) server for the next update attempt. That would need to account for the possibly different cache time of each server: servers that have been queried (successfully or not) within their published cache interval should be skipped, although that interval will not be known until the first successful packet retrieval from that server.
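
A minimal sketch of the server-list idea, built outside the library with only the existing public constructor (names and structure are just illustrative; per-server cache tracking is left out):

```python
# Illustrative multi-server fallback built on the public adafruit_ntp API.
import adafruit_ntp

SERVERS = ("0.pool.ntp.org", "1.pool.ntp.org", "2.pool.ntp.org")

def first_working_ntp(pool, tz_offset=0):
    """Return an NTP instance for the first server that synchronizes."""
    last_exc = None
    for server in SERVERS:
        ntp = adafruit_ntp.NTP(pool, server=server, tz_offset=tz_offset)
        try:
            _ = ntp.datetime  # force an initial synchronization attempt
            return ntp
        except OSError as exc:
            last_exc = exc    # typically ETIMEDOUT; try the next server
    raise last_exc
```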

I think the NTP library should be responsible for managing retries at some level:
-- return time values even when an update fails (if a previous offset value is stored)
-- prevent repeat hits on an NTP server after a timeout (honor any known cache_seconds value)
-- provide staleness indicators: ns since last update, cache_seconds

More extreme would be to allow an 'Iterable' as well as a simple str for the server, and provide logic to use them for retries. Potentially that would need matching iterables for port and cache_seconds as well. Prioritizing the servers based on their individual cache_seconds values would be an option.
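
To make the first three points concrete, here is a rough sketch done as a wrapper over the existing public API rather than a patch to the library itself (all wrapper names, and the 30 second back-off, are placeholders, not a proposed API):

```python
# Sketch of stale-tolerant behaviour layered over the public adafruit_ntp API.
import time
import adafruit_ntp

class RetryingNTP:
    """Serve time from the last good sync on failure; back off before retrying."""

    def __init__(self, pool, server="0.adafruit.pool.ntp.org", retry_seconds=30):
        self._ntp = adafruit_ntp.NTP(pool, server=server)
        self._retry_ns = retry_seconds * 1_000_000_000
        self._epoch = None        # seconds value at the last good sync
        self._synced_ns = None    # monotonic_ns at the last good sync
        self._next_try_ns = 0     # no retries before this monotonic_ns

    @property
    def datetime(self):
        now_ns = time.monotonic_ns()
        if now_ns >= self._next_try_ns:
            try:
                result = self._ntp.datetime          # may hit the network
                self._epoch = time.mktime(result)
                self._synced_ns = now_ns
                return result
            except OSError:
                if self._epoch is None:
                    raise                            # nothing to fall back on yet
                self._next_try_ns = now_ns + self._retry_ns  # back off
        # Serve time derived from the last successful synchronization.
        elapsed = (now_ns - self._synced_ns) // 1_000_000_000
        return time.localtime(self._epoch + elapsed)

    @property
    def seconds_since_sync(self):
        """Staleness indicator: whole seconds since the last successful sync."""
        if self._synced_ns is None:
            return None
        return (time.monotonic_ns() - self._synced_ns) // 1_000_000_000
```

Doing the equivalent inside the library (keeping monotonic_start_ns and adjusting next_sync on failure) would avoid duplicating the offset bookkeeping.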

I think the first 2 are 'needed', or some alternative to them. The rest is potentially useful, but risks being feature bloat. I can write the code, but some idea of what will be 'acceptable' is needed.

Discussion?

tannewt (Member) commented Aug 5, 2024

It makes sense to add retry logic into this library. I think your plan sounds good.

mMerlin (Contributor, Author) commented Aug 12, 2024

I am working on code to manage retry logic for NTP synchronization. The documentation I have looked at for the existing adafruit_ntp module indicates that setting cache_seconds to zero (the default value) uses the polling interval specified by the server. I did some research on that field, and found that this does not make sense. The polling interval in the packet is not a 'suggestion' of what interval to use. It is a record of what the client's interval has been, from the view of the server: what the last interval 'was', not what the next interval 'should be'. Here is the result of a query to pi.ai:

Although the poll interval value is present in the NTP packet sent by the server, it is not strictly a suggestion or recommendation for the client. The primary purpose of the poll interval field in the packet is to allow the server to track and manage the client's polling behavior and load.

The NTP client uses its own algorithms to dynamically adjust the poll interval based on various factors, such as network conditions, synchronization performance, and stability. The poll interval value in the NTP packet from the server serves as a record of the client's poll interval during the previous communication rather than a direct suggestion for the client's next poll interval.

In summary, the poll interval value in the server's NTP packet is mainly used for informational purposes and tracking client behavior, while the client's local algorithms primarily determine the actual poll interval used for subsequent requests.

"Initially" using that value to set the interval to the next request does not make sense. I expect that most applications on microcontrollers, using circuitpython, do not need to maintain sub-millisecond accuracy with 'real' time. Using short retry intervals (for failures) seems appropriate. Once a successful synchronization has been done, the intervals for updates should be able to pushed way up. To something like once a day. If an update fails, only then change to shorter intervals, long enough to get a new time sync.

The maximum value that the polling interval in the packet can show is 2^17 seconds (about 36 hours).

Other perspectives?

cinchcircuits commented
This pull request does appear to be a very nice quality of life change.
