Skip to content

feat(crons): Implement "High Volume" consumer driven clock task dispatch #54204

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

evanpurkhiser
Copy link
Member

This is a partial implementation of GH-53661.

This implements the "High Volume" mode, where check-ins from the consumer are essentially the 'clock pulses' that are used to dispatch the monitor tasks each minute.

This change does NOT actually dispatch the tasks, but it does send some telemetry each time the tasks would be dispatched, as well as capture error conditions when the clock skips.

@evanpurkhiser evanpurkhiser requested a review from a team as a code owner August 4, 2023 18:59
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Aug 4, 2023
@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-crons-implement-high-volume-consumer-driven-clock-task-dispatch branch from 7ee4d83 to 0a88749 Compare August 4, 2023 19:02
@codecov
Copy link

codecov bot commented Aug 4, 2023

Codecov Report

Merging #54204 (0a88749) into master (388b444) will increase coverage by 10.11%.
Report is 3 commits behind head on master.
The diff coverage is 32.43%.

❗ Current head 0a88749 differs from pull request most recent head 30f80b7. Consider uploading reports for the commit 30f80b7 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #54204       +/-   ##
===========================================
+ Coverage   69.50%   79.61%   +10.11%     
===========================================
  Files        4976     4982        +6     
  Lines      210984   211118      +134     
  Branches    35952    35964       +12     
===========================================
+ Hits       146644   168090    +21446     
+ Misses      59169    37860    -21309     
+ Partials     5171     5168        -3     
Files Changed Coverage Δ
src/sentry/monitors/consumers/monitor_consumer.py 80.80% <28.57%> (-9.88%) ⬇️
src/sentry/conf/server.py 91.60% <100.00%> (+0.02%) ⬆️

... and 1093 files with indirect coverage changes

This is a partial implementation of GH-53661.

This implements the "High Volume" mode, where check-ins from the
consumer are essentially the 'clock pulses' that are used to dispatch
the monitor tasks each minute.

This change does NOT actually dispatch the tasks, but it does send some
telemetry each time the tasks would be dispatched, as well as capture
error conditions when the clock skips.
@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-crons-implement-high-volume-consumer-driven-clock-task-dispatch branch from 0a88749 to 30f80b7 Compare August 4, 2023 19:42
sentry_sdk.capture_message("Monitor task dispatch minute skipped")

_dispatch_tasks(ts)
metrics.incr("monitors.tassk.triggered_via_high_volume_clock")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably useful to also record the difference between the actual time, and what the reference time is. We want this to ideally be no more than a few seconds late.

@evanpurkhiser evanpurkhiser merged commit a1c6ee4 into master Aug 4, 2023
@evanpurkhiser evanpurkhiser deleted the evanpurkhiser/feat-crons-implement-high-volume-consumer-driven-clock-task-dispatch branch August 4, 2023 20:50
evanpurkhiser added a commit that referenced this pull request Aug 8, 2023
This is a follow up to GH-54204 as suggested by @fpacifici

#53661 (comment)

> Can we just have the pulse message running in both modes and treat
> everything as high volume mode?

Instead of having two modes, we can simply always use the same logic for
dispatching the monitor tasks on the minute roll-over, using the
consumer as a clock.

Previously the worry here was that in low-volume check-in situations
nothing would drive the clock and we would need to have an external
clock, with a different way to dispatch the tasks. But there is no need
for a different way to dispatch the tasks, we can have an external clock
that pulses messages into the topic and we can simply use the same logic
already implemented to use the topic messages as a clock.

This change removes the concept of "high volume" / "low volume" and adds
the concept of a "clock_pulse" message to the consumer.

In a follow up PR we will introduce the celery beat task which produces
the clock_pulse messages.
evanpurkhiser added a commit that referenced this pull request Aug 9, 2023
This fixes an issue with the original implementation of GH-54204 when
processing messages in a non-monotonic order.

Typically kafka messages will be in order like such

  12:59:58
  12:59:59
  01:00:00
  01:00:01
  01:00:01
  01:00:02

However, because of how messages are shared into the kafka partitions we
may end up with a secnario that looks like this

  partitions

  #1        #2        #3
  12:59:58  01:00:00  01:00:01
  12:59:59  01:00:01  01:00:02

With one consumer reading from each partition sequentially we would read
these out as

  12:59:58
  01:00:00
  01:00:01
  12:59:59 <-- problematic skip backwards in time
  01:00:01
  01:00:02

Prior to this change, when we would process the task_trigger clock tick
for the timestamp `12:59:59` after `01:00:01` our `GETSET` would update
the key with an OLDER timestamps.

When the next tick happens at `01:00:01` we would now tick for the
`01:00:00` minute boundary again incorrectly.

This change corrects this by first looking at the existing last
timestamp value stored in redis, if that value is smaller than the
reference timestamp we're about to tick for, do nothing, do not store
the older reference timestamp.
evanpurkhiser added a commit that referenced this pull request Aug 9, 2023
This fixes an issue with the original implementation of GH-54204 when
processing messages in a non-monotonic order.

Typically kafka messages will be in order like such

	  12:59:58
	  12:59:59
	  01:00:00
	  01:00:01
	  01:00:01
	  01:00:02

However, because of how messages are shared into the kafka partitions we
may end up with a secnario that looks like this

	  partitions
	
	  #1        #2        #3
	  12:59:58  01:00:00  01:00:01
	  12:59:59  01:00:01  01:00:02

With one consumer reading from each partition sequentially we would read
these out as

	  12:59:58
	  01:00:00
	  01:00:01
	  12:59:59 <-- problematic skip backwards in time
	  01:00:01
	  01:00:02

Prior to this change, when we would process the task_trigger clock tick
for the timestamp `12:59:59` after `01:00:01` our `GETSET` would update
the key with an OLDER timestamps.

When the next tick happens at `01:00:01` we would now tick for the
`01:00:00` minute boundary again incorrectly.

This change corrects this by first looking at the existing last
timestamp value stored in redis, if that value is smaller than the
reference timestamp we're about to tick for, do nothing, do not store
the older reference timestamp.
evanpurkhiser added a commit that referenced this pull request Aug 10, 2023
This is a follow up to GH-54204 as suggested by @fpacifici

#53661 (comment)

> Can we just have the pulse message running in both modes and treat
> everything as high volume mode?

Instead of having two modes, we can simply always use the same logic for
dispatching the monitor tasks on the minute roll-over, using the
consumer as a clock.

Previously the worry here was that in low-volume check-in situations
nothing would drive the clock and we would need to have an external
clock, with a different way to dispatch the tasks. But there is no need
for a different way to dispatch the tasks, we can have an external clock
that pulses messages into the topic and we can simply use the same logic
already implemented to use the topic messages as a clock.

This change removes the concept of "high volume" / "low volume" and adds
the concept of a "clock_pulse" message to the consumer.

In a follow up PR we will introduce the celery beat task which produces
the clock_pulse messages.
evanpurkhiser added a commit that referenced this pull request Aug 10, 2023
This is a follow up to GH-54204 as suggested by @fpacifici

#53661 (comment)

> Can we just have the pulse message running in both modes and treat
> everything as high volume mode?

Instead of having two modes, we can simply always use the same logic for
dispatching the monitor tasks on the minute roll-over, using the
consumer as a clock.

Previously the worry here was that in low-volume check-in situations
nothing would drive the clock and we would need to have an external
clock, with a different way to dispatch the tasks. But there is no need
for a different way to dispatch the tasks, we can have an external clock
that pulses messages into the topic and we can simply use the same logic
already implemented to use the topic messages as a clock.

This change removes the concept of "high volume" / "low volume" and adds
the concept of a "clock_pulse" message to the consumer.

In a follow up PR we will introduce the celery beat task which produces
the clock_pulse messages.
@github-actions github-actions bot locked and limited conversation to collaborators Aug 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Scope: Backend Automatically applied to PRs that change backend components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants