Crons: Add locking mechanisms for crons clock driven fan-out tasks #58410
Comments
Since we're using a multi-proc executor, doesn't this introduce the possibility of running things out of order again?
Can't you simply dispatch the jobs to a Kafka topic partitioned by monitor_id? This would guarantee that check_ticks and check_timeout logic for the same monitor is always executed strictly in order, so you would never run into a scenario where 12:01 and 12:02 are executed concurrently or out of order. There are several reasons why trying to synchronize one consumer with another through shared storage is problematic.
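To make the suggestion concrete: producing with the monitor_id as the message key makes Kafka route every message for a given monitor to the same partition, which is exactly what gives per-monitor ordering. The sketch below is illustrative only — Kafka's default partitioner uses murmur2 rather than this hash, and `NUM_PARTITIONS`/`partition_for` are made-up names, not Sentry's producer code.

```python
import hashlib

NUM_PARTITIONS = 16  # illustrative partition count


def partition_for(monitor_id: str) -> int:
    """Deterministically map a monitor_id to a partition.

    Kafka's default partitioner does the same job (with murmur2); the only
    point here is that the mapping is stable, so every message keyed by the
    same monitor_id lands on the same partition and is consumed strictly in
    the order it was produced.
    """
    digest = hashlib.sha256(monitor_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


# Ticks 12:01 and 12:02 for the same monitor share a partition, so they can
# never be consumed concurrently or out of order.
assert partition_for("monitor-a") == partition_for("monitor-a")
assert 0 <= partition_for("monitor-b") < NUM_PARTITIONS
```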
I think you're right. There is a caveat that needs to be explained and well-documented though. Each clock tick is responsible for figuring out which monitors are past their `next_checkin_latest`. If we consume clock ticks as fast as we can, in a scenario where the clock ticks quickly (eg, ingest-monitors backlog burndown), we will continually find the same monitors and check-ins from earlier clock ticks that have not actually had the work done yet to mark them missed and to mark the check-ins as timed out. Originally when I was thinking about this, I made the mistaken assumption that the work for one clock tick MUST happen before the next clock tick, but as @fpacifici said, we only need to make sure the work happens in order for each monitor. It will be important that each task verifies that the missed check-in should still be created and that the check-in has not already been marked as timed out.
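A sketch of the per-task verification described above, using simplified stand-in models — the real `MonitorEnvironment` and check-in models live in Sentry's codebase and carry much more state than this:

```python
from dataclasses import dataclass
from datetime import datetime

TIMED_OUT = "timed_out"
IN_PROGRESS = "in_progress"


@dataclass
class MonitorEnvironment:
    next_checkin_latest: datetime


@dataclass
class CheckIn:
    status: str


def should_mark_missed(env: MonitorEnvironment, tick: datetime) -> bool:
    # An earlier task may have already created the missed check-in and
    # advanced next_checkin_latest; re-verify before creating another.
    return env.next_checkin_latest <= tick


def should_mark_timeout(checkin: CheckIn) -> bool:
    # Skip check-ins another task has already marked as timed out.
    return checkin.status != TIMED_OUT
```

With these guards, a fast-ticking clock that enqueues the same monitor for several consecutive ticks only does the work once.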
When we detect a clock tick it is possible that we may have skipped a tick, when a monitor ingest partition is very slow and does not contain a message for an entire minute (while other partitions have already moved multiple minutes forward). Previously we would log this and avoid producing a clock tick for these skipped minute(s), since we were using celery to dispatch the check_missing and check_timeout tasks; because the celery tasks would be produced back-to-back it wasn't unlikely they would be processed out of order. Since the completion of GH-58410 we are now guaranteed that clock tick tasks are processed in order.
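To make the skipped-minute behavior concrete, here is a minimal sketch (not Sentry's actual clock code) of emitting a tick for every skipped minute, always in order:

```python
from datetime import datetime, timedelta


def ticks_between(last_tick: datetime, new_tick: datetime):
    """Yield every minute boundary after last_tick, up to and including
    new_tick, so skipped minutes still produce a clock tick and the ticks
    always come out in ascending order."""
    t = last_tick + timedelta(minutes=1)
    while t <= new_tick:
        yield t
        t += timedelta(minutes=1)


# A clock that jumped from 12:00 to 12:03 still emits 12:01, 12:02, 12:03.
ticks = list(ticks_between(datetime(2023, 10, 1, 12, 0), datetime(2023, 10, 1, 12, 3)))
```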
With the completion of GH-58410 we should be able to support increasing the task load for marking muted monitors as missed.
There is a problem with the monitor tasks triggered by the clock ticks when the monitor check-in Kafka topic is in a backlogged state.
Here is a diagram of what each clock tick does:
Each tick triggers the `check_missing` and `check_timeout` celery tasks. These tasks both fan out tasks to mark monitors as missed and check-ins as timed out. Typically each tick happens about a minute apart. However, these ticks are driven by the check-in Kafka topic, so a backlog in this topic may cause ticks to be produced in rapid succession (always in order). In this scenario we may publish multiple of these tasks into the celery task queue.
How does this affect each task?
check_missing
See the task code for context: `sentry/src/sentry/monitors/tasks.py`, lines 224 to 248 at `a3264a8`.
When both tasks are put into celery, if the second tick (12:02) happens to run at the same time as the first tick (12:01):
We may mark monitors as missed for the SECOND minute and not the first. This is because we use the `next_checkin_latest` column on the table to determine which monitors are missed. If we mark monitors in the second minute as missed before the first, the first minute will never get marked as missed since `next_checkin_latest` will have already moved forward. When we skip minutes this will affect threshold calculations for creating monitor incidents.
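The failure mode can be reproduced with a toy simulation. This is simplified: the real task advances `next_checkin_latest` from the monitor's schedule, not by a fixed minute, and `check_missing` here is a stand-in, not the actual task code.

```python
from datetime import datetime, timedelta


def check_missing(tick, env, missed):
    # Simplified model of the task: a monitor is missed once
    # next_checkin_latest <= tick; marking it missed advances the column.
    if env["next_checkin_latest"] <= tick:
        missed.append(tick)
        env["next_checkin_latest"] = tick + timedelta(minutes=1)


env = {"next_checkin_latest": datetime(2023, 10, 1, 12, 1)}
missed = []

# Ticks run out of order: 12:02 first, then 12:01.
check_missing(datetime(2023, 10, 1, 12, 2), env, missed)
check_missing(datetime(2023, 10, 1, 12, 1), env, missed)

# The 12:01 miss is lost: next_checkin_latest already moved past it.
assert missed == [datetime(2023, 10, 1, 12, 2)]
```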
check_timeout
This task appears to be unaffected. Since this task simply looks for check-ins that need to be marked as timed out, it's mostly idempotent. If one tick marks an earlier tick's check-ins as timed out out of order, it does not matter, since marking a check-in as timed out is idempotent.
Proposed solution
Instead of having the ticks dispatch celery tasks, we can introduce a Kafka topic for the clock ticks and a consumer of those clock ticks.
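A minimal sketch of such a consumer loop, with hypothetical `dispatch_check_missing`/`dispatch_check_timeout` helpers and a duck-typed fake consumer (not Sentry's actual consumer infrastructure):

```python
def run_clock_tick_consumer(consumer, dispatch_check_missing, dispatch_check_timeout):
    """Consume clock ticks one at a time; all fan-out work for a tick must
    finish before its offset is committed and the next tick is read."""
    for message in consumer:
        tick = message.value["ts"]
        # Fan out tasks for monitor environments past next_checkin_latest.
        dispatch_check_missing(tick)
        # Fan out tasks for check-ins past timeout_at.
        dispatch_check_timeout(tick)
        # Committing only after both steps preserves tick ordering, even
        # across consumer restarts.
        consumer.commit(message)


# Tiny in-memory stand-ins to show the ordering guarantee.
class FakeMessage:
    def __init__(self, ts):
        self.value = {"ts": ts}


class FakeConsumer:
    def __init__(self, ticks):
        self.messages = [FakeMessage(t) for t in ticks]
        self.committed = []

    def __iter__(self):
        return iter(self.messages)

    def commit(self, message):
        self.committed.append(message.value["ts"])


calls = []
consumer = FakeConsumer(["12:01", "12:02"])
run_clock_tick_consumer(
    consumer,
    dispatch_check_missing=lambda t: calls.append(("missing", t)),
    dispatch_check_timeout=lambda t: calls.append(("timeout", t)),
)
# All of 12:01's work happens (and is committed) before 12:02 is touched.
```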
The clock tick consumer will be responsible for doing the following work for each clock tick. This work MUST be completed before consuming the next tick.

- Dispatch tasks for monitors that are past their `next_checkin_latest` timestamp on a monitor environment.
- Dispatch tasks for check-ins that are past their `timeout_at` timestamp.

Monitors past their `next_checkin_latest` and check-ins that need to be timed out may have been detected by later clock ticks, but will have the necessary work done in earlier tasks from earlier clock ticks.

Work required to complete this task