Crons: Add locking mechanisms for crons clock driven fan-out tasks #58410

Closed
11 tasks
evanpurkhiser opened this issue Oct 18, 2023 · 3 comments
@evanpurkhiser (Member) commented Oct 18, 2023

There is a problem with the monitor tasks triggered by clock ticks when the monitor check-in kafka topic is backlogged.

Here is a diagram of what each clock tick does:

[diagram: a clock tick fanning out to the check_missing and check_timeout tasks]

Each tick triggers the check_missing and check_timeout celery tasks. Both tasks fan out sub-tasks that mark monitors as missed and mark check-ins as timed out.

Typically ticks happen about a minute apart. However, the ticks are driven by the check-in kafka topic, so a backlog in that topic may cause ticks to be produced in rapid succession (though always in order). In that scenario we may publish multiple of these tasks into the celery task queue.
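Roughly, the tick production looks like the sketch below. This is illustrative only, not the actual consumer code; the maybe_tick helper name, the import path, and the exact task signatures are assumptions.

from datetime import datetime

from sentry.monitors.tasks import check_missing, check_timeout  # assumed import path

_last_tick = None


def maybe_tick(message_ts: datetime) -> None:
    """Emit one tick per minute of *message* time seen on the check-in topic."""
    global _last_tick
    minute = message_ts.replace(second=0, microsecond=0)
    if _last_tick is None or minute > _last_tick:
        _last_tick = minute
        # During a backlog burn-down, many minutes of message time pass in a few
        # seconds of wall-clock time, so these dispatches pile up in celery.
        check_missing.delay(current_datetime=minute)
        check_timeout.delay(current_datetime=minute)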

How does this affect each task?

check_missing

See the task code for context:

qs = (
    # Monitors that have reached the latest checkin time
    MonitorEnvironment.objects.filter(
        monitor__type__in=[MonitorType.CRON_JOB],
        next_checkin_latest__lte=current_datetime,
    )
    .exclude(
        status__in=[
            MonitorStatus.DISABLED,
            MonitorStatus.PENDING_DELETION,
            MonitorStatus.DELETION_IN_PROGRESS,
        ]
    )
    .exclude(
        monitor__status__in=[
            ObjectStatus.DISABLED,
            ObjectStatus.PENDING_DELETION,
            ObjectStatus.DELETION_IN_PROGRESS,
        ]
    )[:MONITOR_LIMIT]
)
metrics.gauge("sentry.monitors.tasks.check_missing.count", qs.count(), sample_rate=1.0)
for monitor_environment in qs:
    mark_environment_missing.delay(monitor_environment.id, current_datetime)

When the tasks for both ticks are put into celery and the second tick's task (12:02) happens to run at the same time as the first tick's (12:01):

  1. We may mark monitors as missed for the SECOND minute and not the first. This is because we use the next_checkin_latest column on the table to determine which monitors are missed. If we mark monitors in the second minute as missed before the first, the first minute will never get marked as missed, since next_checkin_latest will have already moved forward.

  2. Skipping minutes like this will also affect the threshold calculations used to create monitor incidents.
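To make scenario 1 above concrete, here is a small, self-contained simulation. This is toy code only; the MonitorEnv class and the fixed one-minute schedule are assumptions for illustration.

from datetime import datetime, timedelta


class MonitorEnv:
    """Toy stand-in for a MonitorEnvironment on a one-minute schedule."""

    def __init__(self, next_checkin_latest: datetime) -> None:
        self.next_checkin_latest = next_checkin_latest
        self.missed = []


def check_missing(env: MonitorEnv, tick: datetime) -> None:
    # Mirrors the real query: only environments past next_checkin_latest match.
    if env.next_checkin_latest <= tick:
        env.missed.append(tick)
        # Marking the environment missed advances the expected check-in window.
        env.next_checkin_latest = tick + timedelta(minutes=1)


# Processed in order: misses are recorded for both 12:01 and 12:02.
in_order = MonitorEnv(datetime(2023, 10, 18, 12, 1))
check_missing(in_order, datetime(2023, 10, 18, 12, 1))
check_missing(in_order, datetime(2023, 10, 18, 12, 2))
assert len(in_order.missed) == 2

# Processed out of order: the 12:02 tick advances next_checkin_latest past
# 12:01, so the 12:01 tick becomes a no-op and that minute is never recorded.
out_of_order = MonitorEnv(datetime(2023, 10, 18, 12, 1))
check_missing(out_of_order, datetime(2023, 10, 18, 12, 2))
check_missing(out_of_order, datetime(2023, 10, 18, 12, 1))
assert len(out_of_order.missed) == 1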

check_timeout

This task appears to be unaffected. Since it simply looks for check-ins that need to be marked as timed out, it is mostly idempotent.

If one tick marks an earlier tick's check-ins as timed out out of order, it does not matter, since the resulting state of those check-ins is the same.

Proposed solution

Instead of having the ticks dispatch celery tasks we can introduce a kafka topic for the clock ticks and a consumer of those clock ticks.

The clock tick consumer will be responsible for doing the following work for each clock tick. This work MUST be completed before consuming the next tick.

  1. Determine which monitor environments should have missed check-ins generated. This is determined based on the next_checkin_latest timestamp on a monitor environment.
  2. Determine which existing check-ins are past their timeout_at timestamp.
  3. Fan out tasks via another kafka topic to a consumer responsible for actually doing the work of generating missed check-ins and updating timed-out check-ins. This MUST be partitioned by the monitor environment, since this work needs to happen in order (see the sketch after this list).
  4. Importantly, each task must verify that it still needs to do the expected work (mark missing, mark timeout). Since clock ticks may happen faster than the work can be completed, a later clock tick may detect the same past-due next_checkin_latest environments and timed-out check-ins that an earlier tick already detected; the necessary work will already be done (or be in flight) in the tasks produced for the earlier ticks.
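A minimal sketch of the fan-out in step 3, assuming confluent_kafka as the producer library and a hypothetical monitors-clock-tasks topic name and message schema (the real implementation may use different tooling):

import json
from datetime import datetime

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
CLOCK_TASKS_TOPIC = "monitors-clock-tasks"  # hypothetical topic name


def fan_out(tick: datetime, missed_env_ids, timeout_checkins) -> None:
    # Key every message by monitor_environment_id so that all work for a given
    # environment lands on the same partition and is processed strictly in order.
    for env_id in missed_env_ids:
        payload = {"type": "mark_missing", "monitor_environment_id": env_id, "ts": tick.isoformat()}
        producer.produce(CLOCK_TASKS_TOPIC, key=str(env_id).encode(), value=json.dumps(payload).encode())

    for env_id, checkin_ids in timeout_checkins.items():
        payload = {"type": "mark_timeout", "checkin_ids": checkin_ids, "ts": tick.isoformat()}
        producer.produce(CLOCK_TASKS_TOPIC, key=str(env_id).encode(), value=json.dumps(payload).encode())

    producer.flush()

With keyed messages, Kafka itself guarantees per-environment ordering within a partition, so no cross-consumer locking is required for that ordering.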

Work required to complete this task

@wedamija (Member) commented:

Since we're using a multi proc executor, doesn't this introduce the possibility of running things out of order again?

@fpacifici (Contributor) commented:

> The clock tick consumer will be responsible for doing the following work for each clock tick. This work MUST be completed before consuming the next tick.

Can't you simply dispatch the jobs to a Kafka topic partitioned by monitor_id? This would guarantee that the check_missing and check_timeout logic for the same monitor is always executed strictly in order. So you would never run into a scenario where 12:01 and 12:02 are executed concurrently or out of order.

There are several reasons why trying to synchronize a consumer with another one through a shared storage is problematic:

  • Redis becomes another component on the critical path that can fail and block the entire pipeline.
  • If executing one tick fails and the lock is not removed from Redis, the entire scheduler is blocked. This can easily happen (an OOM, for example).
  • It slows you down when you are trying to burn down a backlog.

@evanpurkhiser (Member, Author) commented:

I think you're right.

There is a caveat that needs to be explained and well-documented though.

Each clock tick is responsible for figuring out which monitors are past their next_checkin_latest as well as which check-ins are past their timeout_at.

If we consume clock ticks as fast as we can, then in a scenario where the clock ticks quickly (e.g., an ingest-monitors backlog burn-down), we will continually find the same monitors and check-ins from earlier clock ticks that have not yet had the work done to mark them missing or to mark the check-ins as timed out. Originally I made the mistaken assumption that the work for one clock tick MUST happen before the next clock tick, but as @fpacifici said, we only need to make sure the work happens in order for each monitor.

It will be important that each task verifies that the missed check-in should still be created, and that the check-in being timed out has not already been marked as timed out.
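A minimal sketch of that verification step, assuming the model and status names from the Sentry codebase; the import path and the task signatures here are assumptions and may differ from the real tasks:

from datetime import datetime

# Assumed import path for the monitors models.
from sentry.monitors.models import CheckInStatus, MonitorCheckIn, MonitorEnvironment


def mark_environment_missing(monitor_environment_id: int, ts: datetime) -> None:
    env = MonitorEnvironment.objects.get(id=monitor_environment_id)
    # Re-check before doing any work: a task produced by a later clock tick may
    # have already created the missed check-in and advanced next_checkin_latest
    # past this tick's timestamp.
    if env.next_checkin_latest is None or env.next_checkin_latest > ts:
        return
    # ... create the missed check-in and advance next_checkin_latest ...


def mark_checkin_timeout(checkin_id: int, ts: datetime) -> None:
    checkin = MonitorCheckIn.objects.get(id=checkin_id)
    # Skip check-ins that a task from an earlier tick already closed out, or
    # whose timeout has not actually elapsed as of this tick.
    if (
        checkin.status != CheckInStatus.IN_PROGRESS
        or checkin.timeout_at is None
        or checkin.timeout_at > ts
    ):
        return
    # ... mark the check-in as timed out ...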

evanpurkhiser added a commit that referenced this issue May 6, 2024
When we detect a clock tick it is possible that we may have skipped a tick, when a monitor ingest partition is very slow and does not contain a message for an entire minute (while other partitions have already moved multiple minutes forward).

Previously we would log this and avoid producing a clock tick for these skipped minute(s), since we were using celery to dispatch the check_missing and check_timeout tasks, and because the celery tasks would be produced back-to-back they could easily be processed out of order.

Since the completion of GH-58410 we are now guaranteed that clock tick
tasks are processed in order.
evanpurkhiser added a commit that referenced this issue May 7, 2024
With the completion of GH-58410 we should be able to support increasing
the task load for marking muted monitors as missing.
github-actions bot locked and limited conversation to collaborators May 23, 2024