feat(crons): Implement "High Volume" consumer driven clock task dispatch #54204
Conversation
Force-pushed from 7ee4d83 to 0a88749
Codecov Report
@@ Coverage Diff @@
## master #54204 +/- ##
===========================================
+ Coverage 69.50% 79.61% +10.11%
===========================================
Files 4976 4982 +6
Lines 210984 211118 +134
Branches 35952 35964 +12
===========================================
+ Hits 146644 168090 +21446
+ Misses 59169 37860 -21309
+ Partials 5171 5168 -3
This is a partial implementation of GH-53661.

This implements the "High Volume" mode, where check-ins flowing through the consumer are essentially the 'clock pulses' used to dispatch the monitor tasks each minute.

This change does NOT actually dispatch the tasks, but it does send telemetry each time the tasks would be dispatched, and captures an error condition when the clock skips a minute.
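As a rough illustration of the skip-detection idea described above (not the PR's exact code; the helper name and parameters are assumptions), the consumer can compare the previous and current minute boundaries and report when more than one minute elapsed between ticks:

```python
import sentry_sdk

def check_clock_skip(last_ts: int | None, reference_ts: int) -> None:
    # Illustrative: both values are minute boundaries as unix timestamps.
    # If more than one minute elapsed between consecutive ticks, the
    # consumer-driven clock skipped a minute, which we surface as an error.
    if last_ts is not None and reference_ts > last_ts + 60:
        sentry_sdk.capture_message("Monitor task dispatch minute skipped")
```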
Force-pushed from 0a88749 to 30f80b7
sentry_sdk.capture_message("Monitor task dispatch minute skipped")

_dispatch_tasks(ts)
metrics.incr("monitors.tasks.triggered_via_high_volume_clock")
Probably useful to also record the difference between the actual time and the reference time. Ideally we want this to be no more than a few seconds late.
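A hedged sketch of what recording that difference could look like, assuming `ts` is the tz-aware reference datetime passed to `_dispatch_tasks`; the helper and metric name `monitors.tasks.clock_tick_delay` are illustrative, not part of this PR:

```python
from datetime import datetime, timezone

from sentry.utils import metrics

def record_clock_tick_delay(ts: datetime) -> None:
    # Illustrative: `ts` is the reference minute being ticked for; record how
    # far behind wall-clock time the consumer-driven tick fired. Ideally this
    # stays within a few seconds.
    delay = datetime.now(timezone.utc).timestamp() - ts.timestamp()
    metrics.timing("monitors.tasks.clock_tick_delay", delay)
```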
This is a follow up to GH-54204 as suggested by @fpacifici in #53661 (comment):

> Can we just have the pulse message running in both modes and treat
> everything as high volume mode?

Instead of having two modes, we can simply always use the same logic for dispatching the monitor tasks on the minute roll-over, using the consumer as a clock.

Previously the worry was that in low-volume check-in situations nothing would drive the clock, so we would need an external clock with a different way to dispatch the tasks. But there is no need for a different dispatch path: we can have an external clock that pulses messages into the topic and reuse the logic already implemented to treat the topic messages as a clock.

This change removes the concept of "high volume" / "low volume" and adds the concept of a "clock_pulse" message to the consumer. In a follow-up PR we will introduce the celery beat task which produces the clock_pulse messages.
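A rough sketch of how the consumer might distinguish pulse messages from real check-ins; the `ts` and `message_type` fields are assumptions for illustration, and `try_monitor_tasks_trigger` refers to the tick helper (a sketch of one appears below):

```python
from datetime import datetime, timezone

def process_message(wrapper: dict) -> None:
    # Every message, check-in or pulse, advances the consumer-driven clock.
    ts = datetime.fromtimestamp(wrapper["ts"], tz=timezone.utc)
    try_monitor_tasks_trigger(ts)

    # A clock_pulse message carries no check-in payload; its only purpose is
    # to keep the clock ticking when check-in volume is low.
    if wrapper.get("message_type") == "clock_pulse":
        return

    # ... existing check-in processing continues here ...
```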
This fixes an issue with the original implementation of GH-54204 when processing messages in a non-monotonic order. Typically kafka messages will be in order, like so:

12:59:58
12:59:59
01:00:00
01:00:01
01:00:01
01:00:02

However, because of how messages are sharded into the kafka partitions, we may end up with a scenario that looks like this:

partitions
#1         #2         #3
12:59:58   01:00:00   01:00:01
12:59:59   01:00:01   01:00:02

With one consumer reading from each partition sequentially, we would read these out as:

12:59:58
01:00:00
01:00:01
12:59:59  <-- problematic skip backwards in time
01:00:01
01:00:02

Prior to this change, when we processed the task-trigger clock tick for the timestamp `12:59:59` after `01:00:01`, our `GETSET` would update the key with an OLDER timestamp. When the next tick happened at `01:00:01` we would incorrectly tick for the `01:00:00` minute boundary again.

This change corrects this by first looking at the last timestamp value stored in redis: if that value is newer than the reference timestamp we're about to tick for, do nothing and do not store the older reference timestamp.
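A hedged sketch of the corrected tick logic under those assumptions; the redis key name, the `redis_client` parameter, and the helper signature are illustrative, and `_dispatch_tasks` stands in for the dispatch helper from this PR:

```python
from datetime import datetime

# Illustrative key name for the last-ticked minute boundary.
MONITOR_TASKS_LAST_TRIGGERED_KEY = "sentry.monitors.last_tasks_ts"

def try_monitor_tasks_trigger(ts: datetime, redis_client) -> None:
    # Clock ticks work on minute boundaries, compared as unix timestamps.
    reference_ts = int(ts.replace(second=0, microsecond=0).timestamp())

    # Guard against out-of-order messages: if we already ticked for this
    # minute (or a newer one), an older reference timestamp must not move
    # the clock backwards, otherwise the next tick would re-fire for a
    # minute boundary that was already dispatched.
    precheck_last_ts = redis_client.get(MONITOR_TASKS_LAST_TRIGGERED_KEY)
    if precheck_last_ts is not None and int(precheck_last_ts) >= reference_ts:
        return

    # GETSET atomically swaps in the new reference timestamp and returns the
    # previous value, so only the consumer that crosses into a new minute
    # actually dispatches the tasks.
    last_ts = redis_client.getset(MONITOR_TASKS_LAST_TRIGGERED_KEY, reference_ts)
    if last_ts is None or int(last_ts) < reference_ts:
        _dispatch_tasks(ts)
```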