Crons: Add locking mechanisms for crons clock driven fan-out tasks #58410

Closed
11 tasks
evanpurkhiser opened this issue Oct 18, 2023 · 3 comments
@evanpurkhiser (Member) commented Oct 18, 2023

There is a problem with the monitor tasks triggered by clock ticks when the monitor check-in kafka topic is backlogged.

Here is a diagram of what each clock tick does:

[diagram: a clock tick fanning out to the check_missing and check_timeout tasks]

Each tick triggers the check_missing and check_timeout celery tasks. Both tasks fan out sub-tasks that mark monitors as missed and mark check-ins as timed out.

Typically ticks happen about a minute apart. However, the ticks are driven by the check-in kafka topic, so a backlog in that topic may cause ticks to be produced in rapid succession (though always in order). In that scenario we may publish multiple of these tasks into the celery task queue.
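Roughly, the tick production looks like the sketch below. This is illustrative only, not the actual consumer code; the maybe_tick helper name, the import path, and the exact task signatures are assumptions.

from datetime import datetime

from sentry.monitors.tasks import check_missing, check_timeout  # assumed import path

_last_tick = None


def maybe_tick(message_ts: datetime) -> None:
    """Emit one tick per minute of *message* time seen on the check-in topic."""
    global _last_tick
    minute = message_ts.replace(second=0, microsecond=0)
    if _last_tick is None or minute > _last_tick:
        _last_tick = minute
        # During a backlog burn-down, many minutes of message time pass in a few
        # seconds of wall-clock time, so these dispatches pile up in celery.
        check_missing.delay(current_datetime=minute)
        check_timeout.delay(current_datetime=minute)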

How does this affect each task?

check_missing

See the task code for context:

qs = (
    # Monitors that have reached the latest checkin time
    MonitorEnvironment.objects.filter(
        monitor__type__in=[MonitorType.CRON_JOB],
        next_checkin_latest__lte=current_datetime,
    )
    .exclude(
        status__in=[
            MonitorStatus.DISABLED,
            MonitorStatus.PENDING_DELETION,
            MonitorStatus.DELETION_IN_PROGRESS,
        ]
    )
    .exclude(
        monitor__status__in=[
            ObjectStatus.DISABLED,
            ObjectStatus.PENDING_DELETION,
            ObjectStatus.DELETION_IN_PROGRESS,
        ]
    )[:MONITOR_LIMIT]
)
metrics.gauge("sentry.monitors.tasks.check_missing.count", qs.count(), sample_rate=1.0)
for monitor_environment in qs:
    mark_environment_missing.delay(monitor_environment.id, current_datetime)

When the tasks for both ticks are put into celery and the second tick's task (12:02) happens to run at the same time as the first tick's (12:01):

  1. We may mark monitors as missed for the SECOND minute and not the first. This is because we use the next_checkin_latest column on the table to determine which monitors are missed. If we mark monitors in the second minute as missed before the first, the first minute will never get marked as missed, since next_checkin_latest will have already moved forward.

  2. Skipping minutes like this will also affect the threshold calculations used to create monitor incidents.
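To make scenario 1 above concrete, here is a small, self-contained simulation. This is toy code only; the MonitorEnv class and the fixed one-minute schedule are assumptions for illustration.

from datetime import datetime, timedelta


class MonitorEnv:
    """Toy stand-in for a MonitorEnvironment on a one-minute schedule."""

    def __init__(self, next_checkin_latest: datetime) -> None:
        self.next_checkin_latest = next_checkin_latest
        self.missed = []


def check_missing(env: MonitorEnv, tick: datetime) -> None:
    # Mirrors the real query: only environments past next_checkin_latest match.
    if env.next_checkin_latest <= tick:
        env.missed.append(tick)
        # Marking the environment missed advances the expected check-in window.
        env.next_checkin_latest = tick + timedelta(minutes=1)


# Processed in order: misses are recorded for both 12:01 and 12:02.
in_order = MonitorEnv(datetime(2023, 10, 18, 12, 1))
check_missing(in_order, datetime(2023, 10, 18, 12, 1))
check_missing(in_order, datetime(2023, 10, 18, 12, 2))
assert len(in_order.missed) == 2

# Processed out of order: the 12:02 tick advances next_checkin_latest past
# 12:01, so the 12:01 tick becomes a no-op and that minute is never recorded.
out_of_order = MonitorEnv(datetime(2023, 10, 18, 12, 1))
check_missing(out_of_order, datetime(2023, 10, 18, 12, 2))
check_missing(out_of_order, datetime(2023, 10, 18, 12, 1))
assert len(out_of_order.missed) == 1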

check_timeout

This task appears to be unaffected. Since it simply looks for check-ins that need to be marked as timed out, it is mostly idempotent.

If one tick marks an earlier tick's check-ins as timed out out of order, it does not matter, since the resulting state of those check-ins is the same.

Proposed solution

Instead of having the ticks dispatch celery tasks we can introduce a kafka topic for the clock ticks and a consumer of those clock ticks.

The clock tick consumer will be responsible for doing the following work for each clock tick. This work MUST be completed before consuming the next tick.

  1. Determine which monitor environments should have missed check-ins generated. This is determined based on the next_checkin_latest timestamp on a monitor environment.
  2. Determine which existing check-ins are past their timeout_at timestamp.
  3. Fan out tasks via another kafka topic to a consumer responsible for actually doing the work of generating missed check-ins and updating timed-out check-ins. This MUST be partitioned by the monitor environment, since this work needs to happen in order (see the sketch after this list).
  4. Importantly, each task must verify that it still needs to do the expected work (mark missing, mark timeout). Since clock ticks may happen faster than the work can be completed, a later clock tick may detect the same past-due next_checkin_latest environments and timed-out check-ins that an earlier tick already detected; the necessary work will already be done (or be in flight) in the tasks produced for the earlier ticks.
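A minimal sketch of the fan-out in step 3, assuming confluent_kafka as the producer library and a hypothetical monitors-clock-tasks topic name and message schema (the real implementation may use different tooling):

import json
from datetime import datetime

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
CLOCK_TASKS_TOPIC = "monitors-clock-tasks"  # hypothetical topic name


def fan_out(tick: datetime, missed_env_ids, timeout_checkins) -> None:
    # Key every message by monitor_environment_id so that all work for a given
    # environment lands on the same partition and is processed strictly in order.
    for env_id in missed_env_ids:
        payload = {"type": "mark_missing", "monitor_environment_id": env_id, "ts": tick.isoformat()}
        producer.produce(CLOCK_TASKS_TOPIC, key=str(env_id).encode(), value=json.dumps(payload).encode())

    for env_id, checkin_ids in timeout_checkins.items():
        payload = {"type": "mark_timeout", "checkin_ids": checkin_ids, "ts": tick.isoformat()}
        producer.produce(CLOCK_TASKS_TOPIC, key=str(env_id).encode(), value=json.dumps(payload).encode())

    producer.flush()

With keyed messages, Kafka itself guarantees per-environment ordering within a partition, so no cross-consumer locking is required for that ordering.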

Work required to complete this task

@wedamija (Member) commented:

Since we're using a multi proc executor, doesn't this introduce the possibility of running things out of order again?

@fpacifici (Contributor) commented:

> The clock tick consumer will be responsible for doing the following work for each clock tick. This work MUST be completed before consuming the next tick.

Can't you simply dispatch the jobs to a Kafka topic partitioned by monitor_id? This would guarantee that the check_missing and check_timeout logic for the same monitor is always executed strictly in order. So you would never run into a scenario where 12:01 and 12:02 are executed concurrently or out of order.

There are several reasons why trying to synchronize a consumer with another one through a shared storage is problematic:

  • Redis becomes another component on the critical path that can fail and block the entire pipeline.
  • If executing one tick fails and the lock is not removed from Redis, the entire scheduler is blocked. This can easily happen (an OOM, for example).
  • It slows you down when you are trying to burn down a backlog.

@evanpurkhiser (Member, Author) commented:

I think you're right.

There is a caveat that needs to be explained and well-documented though.

Each clock tick is responsible for figuring out which monitors are past their next_checkin_latest as well as which check-ins are past their timeout_at.

If we consume clock ticks as fast as we can, then in a scenario where the clock ticks quickly (e.g., an ingest-monitors backlog burn-down), we will continually find the same monitors and check-ins from earlier clock ticks that have not yet had the work done to mark them missing or to mark the check-ins as timed out. Originally I made the mistaken assumption that the work for one clock tick MUST happen before the next clock tick, but as @fpacifici said, we only need to make sure the work happens in order for each monitor.

It will be important that each task verifies that the missed check-in should still be created, and that the check-in being timed out has not already been marked as timed out.
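A minimal sketch of that verification step, assuming the model and status names from the Sentry codebase; the import path and the task signatures here are assumptions and may differ from the real tasks:

from datetime import datetime

# Assumed import path for the monitors models.
from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorEnvironment


def mark_environment_missing(monitor_environment_id: int, ts: datetime) -> None:
    env = MonitorEnvironment.objects.get(id=monitor_environment_id)
    # Re-check before doing any work: a task produced by a later clock tick may
    # have already created the missed check-in and advanced next_checkin_latest
    # past this tick's timestamp.
    if env.next_checkin_latest is None or env.next_checkin_latest > ts:
        return
    # ... create the missed check-in and advance next_checkin_latest ...


def mark_checkin_timeout(checkin_id: int, ts: datetime) -> None:
    checkin = MonitorCheckIn.objects.get(id=checkin_id)
    # Skip check-ins that a task from an earlier tick already closed out, or
    # whose timeout has not actually elapsed as of this tick.
    if (
        checkin.status != CheckInStatus.IN_PROGRESS
        or checkin.timeout_at is None
        or checkin.timeout_at > ts
    ):
        return
    # ... mark the check-in as timed out ...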

evanpurkhiser added a commit that referenced this issue May 6, 2024
When we detect a clock tick it is possible that we may have skipped a tick, when a monitor ingest partition is very slow and does not contain a message for an entire minute (while other partitions have already moved multiple minutes forward).

Previously we would log this and avoid producing a clock tick for these skipped minute(s), since we were using celery to dispatch the check_missing and check_timeout tasks, and because the celery tasks would be produced back-to-back they could easily be processed out of order.

Since the completion of GH-58410 we are now guaranteed that clock tick
tasks are processed in order.
evanpurkhiser added a commit that referenced this issue May 7, 2024
With the completion of GH-58410 we should be able to support increasing
the task load for marking muted monitors as missing.
github-actions bot locked and limited conversation to collaborators May 23, 2024