feat(crons): Implement "High Volume" consumer driven clock task dispatch #54204

evanpurkhiser · 2023-08-04T18:59:08Z

This is a partial implementation of GH-53661.

This implements the "High Volume" mode, where check-ins from the consumer are essentially the 'clock pulses' that are used to dispatch the monitor tasks each minute.

This change does NOT actually dispatch the tasks, but it does send some telemetry each time the tasks would be dispatched, as well as capture error conditions when the clock skips.

tests/sentry/monitors/test_monitor_consumer.py

src/sentry/monitors/consumers/monitor_consumer.py

codecov · 2023-08-04T19:34:50Z

Codecov Report

Merging #54204 (0a88749) into master (388b444) will increase coverage by 10.11%.
Report is 3 commits behind head on master.
The diff coverage is 32.43%.

❗ Current head 0a88749 differs from pull request most recent head 30f80b7. Consider uploading reports for the commit 30f80b7 to get more accurate results

Additional details and impacted files

@@             Coverage Diff             @@
##           master   #54204       +/-   ##
===========================================
+ Coverage   69.50%   79.61%   +10.11%     
===========================================
  Files        4976     4982        +6     
  Lines      210984   211118      +134     
  Branches    35952    35964       +12     
===========================================
+ Hits       146644   168090    +21446     
+ Misses      59169    37860    -21309     
+ Partials     5171     5168        -3

Files Changed	Coverage Δ
src/sentry/monitors/consumers/monitor_consumer.py	`80.80% <28.57%> (-9.88%)`	⬇️
src/sentry/conf/server.py	`91.60% <100.00%> (+0.02%)`	⬆️

... and 1093 files with indirect coverage changes

This is a partial implementation of GH-53661. This implements the "High Volume" mode, where check-ins from the consumer are essentially the 'clock pulses' that are used to dispatch the monitor tasks each minute. This change does NOT actually dispatch the tasks, but it does send some telemetry each time the tasks would be dispatched, as well as capture error conditions when the clock skips.

evanpurkhiser · 2023-08-04T20:21:58Z

src/sentry/monitors/consumers/monitor_consumer.py

+                    sentry_sdk.capture_message("Monitor task dispatch minute skipped")
+
+            _dispatch_tasks(ts)
+            metrics.incr("monitors.tassk.triggered_via_high_volume_clock")


Probably useful to also record the difference between the actual time, and what the reference time is. We want this to ideally be no more than a few seconds late.

src/sentry/monitors/consumers/monitor_consumer.py

@fpacifici

This is a follow up to GH-54204 as suggested by @fpacifici #53661 (comment) > Can we just have the pulse message running in both modes and treat > everything as high volume mode? Instead of having two modes, we can simply always use the same logic for dispatching the monitor tasks on the minute roll-over, using the consumer as a clock. Previously the worry here was that in low-volume check-in situations nothing would drive the clock and we would need to have an external clock, with a different way to dispatch the tasks. But there is no need for a different way to dispatch the tasks, we can have an external clock that pulses messages into the topic and we can simply use the same logic already implemented to use the topic messages as a clock. This change removes the concept of "high volume" / "low volume" and adds the concept of a "clock_pulse" message to the consumer. In a follow up PR we will introduce the celery beat task which produces the clock_pulse messages.

This fixes an issue with the original implementation of GH-54204 when processing messages in a non-monotonic order. Typically kafka messages will be in order like such 12:59:58 12:59:59 01:00:00 01:00:01 01:00:01 01:00:02 However, because of how messages are shared into the kafka partitions we may end up with a secnario that looks like this partitions #1 #2 #3 12:59:58 01:00:00 01:00:01 12:59:59 01:00:01 01:00:02 With one consumer reading from each partition sequentially we would read these out as 12:59:58 01:00:00 01:00:01 12:59:59 <-- problematic skip backwards in time 01:00:01 01:00:02 Prior to this change, when we would process the task_trigger clock tick for the timestamp `12:59:59` after `01:00:01` our `GETSET` would update the key with an OLDER timestamps. When the next tick happens at `01:00:01` we would now tick for the `01:00:00` minute boundary again incorrectly. This change corrects this by first looking at the existing last timestamp value stored in redis, if that value is smaller than the reference timestamp we're about to tick for, do nothing, do not store the older reference timestamp.

@fpacifici

This is a follow up to GH-54204 as suggested by @fpacifici #53661 (comment) > Can we just have the pulse message running in both modes and treat > everything as high volume mode? Instead of having two modes, we can simply always use the same logic for dispatching the monitor tasks on the minute roll-over, using the consumer as a clock. Previously the worry here was that in low-volume check-in situations nothing would drive the clock and we would need to have an external clock, with a different way to dispatch the tasks. But there is no need for a different way to dispatch the tasks, we can have an external clock that pulses messages into the topic and we can simply use the same logic already implemented to use the topic messages as a clock. This change removes the concept of "high volume" / "low volume" and adds the concept of a "clock_pulse" message to the consumer. In a follow up PR we will introduce the celery beat task which produces the clock_pulse messages.

@fpacifici

This is a follow up to GH-54204 as suggested by @fpacifici #53661 (comment) > Can we just have the pulse message running in both modes and treat > everything as high volume mode? Instead of having two modes, we can simply always use the same logic for dispatching the monitor tasks on the minute roll-over, using the consumer as a clock. Previously the worry here was that in low-volume check-in situations nothing would drive the clock and we would need to have an external clock, with a different way to dispatch the tasks. But there is no need for a different way to dispatch the tasks, we can have an external clock that pulses messages into the topic and we can simply use the same logic already implemented to use the topic messages as a clock. This change removes the concept of "high volume" / "low volume" and adds the concept of a "clock_pulse" message to the consumer. In a follow up PR we will introduce the celery beat task which produces the clock_pulse messages.

evanpurkhiser requested a review from a team as a code owner August 4, 2023 18:59

github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Aug 4, 2023

vercel bot deployed to Preview August 4, 2023 19:01 View deployment

evanpurkhiser force-pushed the evanpurkhiser/feat-crons-implement-high-volume-consumer-driven-clock-task-dispatch branch from 7ee4d83 to 0a88749 Compare August 4, 2023 19:02

vercel bot deployed to Preview August 4, 2023 19:05 View deployment

rjo100 reviewed Aug 4, 2023

View reviewed changes

tests/sentry/monitors/test_monitor_consumer.py Show resolved Hide resolved

tests/sentry/monitors/test_monitor_consumer.py Show resolved Hide resolved

src/sentry/monitors/consumers/monitor_consumer.py Show resolved Hide resolved

evanpurkhiser force-pushed the evanpurkhiser/feat-crons-implement-high-volume-consumer-driven-clock-task-dispatch branch from 0a88749 to 30f80b7 Compare August 4, 2023 19:42

vercel bot deployed to Preview August 4, 2023 19:45 View deployment

evanpurkhiser commented Aug 4, 2023

View reviewed changes

rjo100 reviewed Aug 4, 2023

View reviewed changes

src/sentry/monitors/consumers/monitor_consumer.py Show resolved Hide resolved

rjo100 approved these changes Aug 4, 2023

View reviewed changes

evanpurkhiser merged commit a1c6ee4 into master Aug 4, 2023

evanpurkhiser deleted the evanpurkhiser/feat-crons-implement-high-volume-consumer-driven-clock-task-dispatch branch August 4, 2023 20:50

This was referenced Aug 7, 2023

Crons: Run monitor tasks check_missing + check_timeout via topic boundaries OR clock pulses #53661

Closed

ref(crons): Simplify monitor clock implementation #54388

Merged

evanpurkhiser mentioned this pull request Aug 9, 2023

ref(crons): Guard clock ticks against desynced-partitions #54489

Merged

github-actions bot locked and limited conversation to collaborators Aug 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(crons): Implement "High Volume" consumer driven clock task dispatch #54204

feat(crons): Implement "High Volume" consumer driven clock task dispatch #54204

evanpurkhiser commented Aug 4, 2023

codecov bot commented Aug 4, 2023 •

edited

Loading

evanpurkhiser Aug 4, 2023

feat(crons): Implement "High Volume" consumer driven clock task dispatch #54204

feat(crons): Implement "High Volume" consumer driven clock task dispatch #54204

Conversation

evanpurkhiser commented Aug 4, 2023

codecov bot commented Aug 4, 2023 • edited Loading

Codecov Report

evanpurkhiser Aug 4, 2023

Choose a reason for hiding this comment

codecov bot commented Aug 4, 2023 •

edited

Loading