fix(crons): Enable clock tick partition syncing #58026

Conversation

Member

@evanpurkhiser evanpurkhiser commented Oct 12, 2023

Part of #55821

This is a follow-up to GH-58003.

The clock which dispatches tasks will now only tick forward once all
partitions have been read up to the synchronized time.

@evanpurkhiser evanpurkhiser requested a review from a team as a code owner October 12, 2023 21:47
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Oct 12, 2023
Comment on lines -144 to +143

```diff
-# The scenario where the reference_ts is older is likely due to a partition
-# being slightly behind another partition that we've already read from
-if precheck_last_ts is not None and precheck_last_ts >= reference_ts:
+# The scenario where the slowest_part_ts is older may happen when our
+# MONITOR_TASKS_PARTITION_CLOCKS set did not know about every partition the
+# topic is responsible for. Older check-ins may be processed after newer
+# ones in different topics. This should only happen if redis loses state.
+if precheck_last_ts is not None and precheck_last_ts >= slowest_part_ts:
```
Member Author

The case where the reference time (now slowest_part_ts) is BEFORE the last time has changed. The comment now describes when this can happen.

The previous comment described what was essentially a symptom of partition desynchronization.

Member

It can also happen if the consumer crashes without being able to save the offset; replaying messages is a common Kafka case.

Member Author

Ah yeah good catch. I'll update the PR separately

```diff
 sentry_sdk.capture_message("Monitor task dispatch minute skipped")

-_dispatch_tasks(ts)
+_dispatch_tasks(tick)
```
Member Author

@evanpurkhiser evanpurkhiser Oct 12, 2023

This is subtly different.

Prior to this we were using the ts of when the Kafka message was read, which was precise down to the microsecond. However, both tasks run from _dispatch_tasks were ALREADY trimming off the second and microsecond parts of the datetime, so it was essentially datetime.fromtimestamp(reference_ts) before.

reference_ts is now slowest_part_ts and tick is datetime.fromtimestamp(slowest_part_ts), so this now matches.
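A quick standalone check of that equivalence (the timestamp value is arbitrary but minute-aligned; the timezone argument is added here for determinism and is not from the PR):

```python
from datetime import datetime, timezone

# A minute-aligned timestamp, as the synced clock now produces.
slowest_part_ts = 1_697_148_420

tick = datetime.fromtimestamp(slowest_part_ts, tz=timezone.utc)

# What the tasks dispatched from _dispatch_tasks were already doing:
# trimming off the second and microsecond parts of the datetime.
trimmed = tick.replace(second=0, microsecond=0)

# For a minute-aligned tick the trimming is a no-op, so behavior matches.
assert tick == trimmed
```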

@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/fix-crons-enable-clock-tick-partition-syncing branch from 0c3f863 to 93e5ea0 Compare October 12, 2023 21:54
```python
# produce missed check-ins since the monitor may have already
# checked-in after and moved the `next_checkin_latest` forward.
#
# In practice this should almost never happen since we have a high volume of
```
Member Author

We may want to deal with this in the future. But I don't think this is going to pose a major problem.

Though in backlog situations I do wonder if we'll read WAY more from one partition?

codecov bot commented Oct 12, 2023

Codecov Report

Merging #58026 (93e5ea0) into master (f33a3f7) will decrease coverage by 0.01%.
Report is 9 commits behind head on master.
The diff coverage is 100.00%.

```
@@            Coverage Diff             @@
##           master   #58026      +/-   ##
==========================================
- Coverage   79.03%   79.02%   -0.01%
==========================================
  Files        5130     5130
  Lines      223044   223044
  Branches    37560    37560
==========================================
- Hits       176276   176269       -7
- Misses      41135    41138       +3
- Partials     5633     5637       +4
```

Files                          Coverage Δ
src/sentry/monitors/tasks.py   82.40% <100.00%> (ø)

... and 8 files with indirect coverage changes

@evanpurkhiser evanpurkhiser merged commit bfb24f6 into master Oct 13, 2023
@evanpurkhiser evanpurkhiser deleted the evanpurkhiser/fix-crons-enable-clock-tick-partition-syncing branch October 13, 2023 18:16
@github-actions github-actions bot locked and limited conversation to collaborators Oct 29, 2023
Labels
Scope: Backend Automatically applied to PRs that change backend components