fix(crons): Enable clock tick partition syncing #58026

Conversation

Member

@evanpurkhiser evanpurkhiser commented Oct 12, 2023

Part of #55821

This is a follow-up to GH-58003.

The clock which dispatches tasks will now only tick forward once all
partitions have been read up to the synchronized time.

@evanpurkhiser evanpurkhiser requested a review from a team as a code owner October 12, 2023 21:47
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Oct 12, 2023
Comment on lines -144 to +143

```diff
-# The scenario where the reference_ts is older is likely due to a partition
-# being slightly behind another partition that we've already read from
-if precheck_last_ts is not None and precheck_last_ts >= reference_ts:
+# The scenario where the slowest_part_ts is older may happen when our
+# MONITOR_TASKS_PARTITION_CLOCKS set did not know about every partition the
+# topic is responsible for. Older check-ins may be processed after newer
+# ones in different topics. This should only happen if redis loses state.
+if precheck_last_ts is not None and precheck_last_ts >= slowest_part_ts:
```
Member Author

The case where the reference time (now slowest_part_ts) is BEFORE the last time has changed. The comment now describes when this can happen.

The previous comment described what was essentially a symptom of partition desynchronization.

Member

It can also happen if the consumer crashes without being able to save the offset; replaying messages is a common Kafka case.

Member Author

Ah yeah good catch. I'll update the PR separately

```diff
 sentry_sdk.capture_message("Monitor task dispatch minute skipped")

-_dispatch_tasks(ts)
+_dispatch_tasks(tick)
```
Member Author

@evanpurkhiser evanpurkhiser Oct 12, 2023

This is subtly different.

Prior to this we were using the ts of when the Kafka message was read, which was precise down to the microsecond. However, both tasks run from _dispatch_tasks were ALREADY trimming off the second and microsecond parts of the datetime, so it was essentially datetime.fromtimestamp(reference_ts) before.

reference_ts is now slowest_part_ts and tick is datetime.fromtimestamp(slowest_part_ts), so this now matches.
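A quick standalone check of that equivalence (the timestamp value is arbitrary but minute-aligned; the timezone argument is added here for determinism and is not from the PR):

```python
from datetime import datetime, timezone

# A minute-aligned timestamp, as the synced clock now produces.
slowest_part_ts = 1_697_148_420

tick = datetime.fromtimestamp(slowest_part_ts, tz=timezone.utc)

# What the tasks dispatched from _dispatch_tasks were already doing:
# trimming off the second and microsecond parts of the datetime.
trimmed = tick.replace(second=0, microsecond=0)

# For a minute-aligned tick the trimming is a no-op, so behavior matches.
assert tick == trimmed
```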

@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/fix-crons-enable-clock-tick-partition-syncing branch from 0c3f863 to 93e5ea0 Compare October 12, 2023 21:54
```python
# produce missed check-ins since the monitor may have already
# checked-in after and moved the `next_checkin_latest` forward.
#
# In practice this should almost never happen since we have a high volume of
```
Member Author

We may want to deal with this in the future. But I don't think this is going to pose a major problem.

Though in backlog situations I do wonder if we'll read WAY more from one partition?

codecov bot commented Oct 12, 2023

Codecov Report

Merging #58026 (93e5ea0) into master (f33a3f7) will decrease coverage by 0.01%.
Report is 9 commits behind head on master.
The diff coverage is 100.00%.

```
@@            Coverage Diff             @@
##           master   #58026      +/-   ##
==========================================
- Coverage   79.03%   79.02%   -0.01%
==========================================
  Files        5130     5130
  Lines      223044   223044
  Branches    37560    37560
==========================================
- Hits       176276   176269       -7
- Misses      41135    41138       +3
- Partials     5633     5637       +4
```

Files                          Coverage Δ
src/sentry/monitors/tasks.py   82.40% <100.00%> (ø)

... and 8 files with indirect coverage changes

@evanpurkhiser evanpurkhiser merged commit bfb24f6 into master Oct 13, 2023
@evanpurkhiser evanpurkhiser deleted the evanpurkhiser/fix-crons-enable-clock-tick-partition-syncing branch October 13, 2023 18:16
@github-actions github-actions bot locked and limited conversation to collaborators Oct 29, 2023
Labels
Scope: Backend Automatically applied to PRs that change backend components