fix(crons): Enable clock tick partition syncing #58026
Conversation
-  # The scenario where the reference_ts is older is likely due to a partition
-  # being slightly behind another partition that we've already read from
-  if precheck_last_ts is not None and precheck_last_ts >= reference_ts:
+  # The scenario where the slowest_part_ts is older may happen when our
+  # MONITOR_TASKS_PARTITION_CLOCKS set did not know about every partition the
+  # topic is responsible for. Older check-ins may be processed after newer
+  # ones in different topics. This should only happen if redis loses state.
+  if precheck_last_ts is not None and precheck_last_ts >= slowest_part_ts:
The case where the reference time (now slowest_part_time) is BEFORE the last time has changed. The comment now describes when this can happen.
The previous case was basically a symptom of partition desynchronization.
It can also happen if the consumer crashes without being able to save the offset; replaying messages is a common Kafka case.
Ah yeah good catch. I'll update the PR separately
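To make the guard under discussion concrete, here is a minimal sketch of the precheck. The Redis key name and the surrounding function are hypothetical stand-ins, not the actual Sentry implementation:

```python
from datetime import datetime, timezone

import redis

client = redis.StrictRedis()

# Hypothetical key holding the timestamp of the last clock tick.
LAST_TICK_KEY = "monitors:last-tick-ts"


def try_monitor_tasks_trigger(slowest_part_ts: int) -> None:
    precheck_last_ts = client.get(LAST_TICK_KEY)
    if precheck_last_ts is not None:
        precheck_last_ts = int(precheck_last_ts)

    # The scenario where slowest_part_ts is older may happen when the partition
    # clock set did not know about every partition, or when a consumer crashed
    # before committing its offset and Kafka replays messages. Either way the
    # clock must never tick backwards, so we bail out.
    if precheck_last_ts is not None and precheck_last_ts >= slowest_part_ts:
        return

    client.set(LAST_TICK_KEY, slowest_part_ts)
    tick = datetime.fromtimestamp(slowest_part_ts, tz=timezone.utc)
    print(f"dispatching monitor tasks for minute {tick}")
```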
  sentry_sdk.capture_message("Monitor task dispatch minute skipped")

- _dispatch_tasks(ts)
+ _dispatch_tasks(tick)
This is subtly different. Prior to this we were using the ts of when the Kafka message was read, which was precise down to the microsecond. However, both tasks run from _dispatch_tasks were ALREADY trimming off the second and microsecond parts of the datetime, so it was essentially datetime.fromtimestamp(reference_ts) before.
reference_ts is now slowest_part_ts and tick is datetime.fromtimestamp(slowest_part_ts), so this now matches.
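As a small, self-contained illustration of the equivalence described above (the variable names mirror the PR, the values are made up):

```python
from datetime import datetime, timezone

# Previously: the wall-clock time the Kafka message was read, microsecond precision.
ts = datetime(2023, 10, 12, 15, 30, 42, 123456, tzinfo=timezone.utc)

# Both dispatched tasks already trimmed the datetime down to the minute.
trimmed = ts.replace(second=0, microsecond=0)

# The synced clock now hands the tasks a whole-minute timestamp directly.
slowest_part_ts = int(trimmed.timestamp())
tick = datetime.fromtimestamp(slowest_part_ts, tz=timezone.utc)

# Both forms land on the same minute boundary.
assert tick == trimmed
```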
This is a follow-up to GH-58003. The clock which dispatches tasks will now only tick forward once all partitions have been read up to the synchronized time.
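A hedged sketch of how the synchronized tick can be derived from per-partition read timestamps. MONITOR_TASKS_PARTITION_CLOCKS is taken from the diff above; the sorted-set layout, the key value, and the function name are assumptions for illustration only:

```python
import redis

client = redis.StrictRedis()

# Assumed layout: member = partition index, score = last timestamp read from it.
MONITOR_TASKS_PARTITION_CLOCKS = "monitors:partition-clocks"  # hypothetical key value


def update_partition_clock(partition: int, message_ts: int) -> int | None:
    # Record how far this partition has been read.
    client.zadd(MONITOR_TASKS_PARTITION_CLOCKS, {str(partition): message_ts})

    # The clock may only tick up to the slowest partition: the minimum score
    # across every partition we know about. Until all partitions have been read
    # past a minute boundary, that boundary is not dispatched.
    slowest = client.zrange(MONITOR_TASKS_PARTITION_CLOCKS, 0, 0, withscores=True)
    if not slowest:
        return None

    _member, slowest_part_ts = slowest[0]
    return int(slowest_part_ts)
```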
Force-pushed from 0c3f863 to 93e5ea0
# produce missed check-ins since the monitor may have already
# checked-in after and moved the `next_checkin_latest` forward.
#
# In practice this should almost never happen since we have a high volume of
We may want to deal with this in the future, but I don't think it's going to pose a major problem. Though in backlog situations I do wonder if we'll read WAY more from one partition?
Codecov Report
@@            Coverage Diff             @@
##           master   #58026      +/-   ##
==========================================
- Coverage   79.03%   79.02%   -0.01%
==========================================
  Files        5130     5130
  Lines      223044   223044
  Branches    37560    37560
==========================================
- Hits       176276   176269       -7
- Misses      41135    41138       +3
- Partials     5633     5637       +4
Part of #55821