Got tx with plan step less than last heartbeat in one topic partition stream #6985


Closed
dcherednik opened this issue Jul 23, 2024 · 4 comments · Fixed by #9104 or #9113

Labels: bug (Something isn't working)

Comments

@dcherednik (Member) commented Jul 23, 2024

The update below has ts [1721729950551,281475194075732], i.e. a plan step less than the preceding resolved heartbeat [1721729955000,0]:

{"resolved":[1721729882000,0]}
{"resolved":[1721729924000,0]}
{"resolved":[1721729955000,0]}
{"update":{"v":"1MkPAAAAAEjUyQ8AAAAASNTJDwAAAABI1MkPAAAAAA=="},"key":[4899916394580082192],"ts":[1721729950551,281475194075732]}
{"update":{"v":"wqMAAAAAAJDCowAAAAAA"},"key":[4899916394579553682],"ts":[1721729950556,281475194075747]}
{"resolved":[1721729985000,0]}
{"resolved":[1721730001000,0]}

@dcherednik changed the title from "Tx with plan step greater than last heartbeat" to "Tx with plan step greater than last heartbeat in topic partition" on Jul 23, 2024
@dcherednik (Member Author)

The same happens with VolatileTx disabled; again the update's plan step 1721744215418 is less than the last heartbeat 1721744218000:

{"resolved":[1721744218000,0]}

{"update":{"v":"xSoIAAAAAOTFKggAAAAA5MUqCAAAAADkxSoIAAAAAOTFKggAAAAA"},"key":[3458764513821096247],"ts":[1721744215418,844425147299155]}

@dcherednik added the bug label on Jul 23, 2024
@dcherednik changed the title to "Got tx with plan step greater than last heartbeat in one topic partition stream" on Jul 23, 2024
@dcherednik changed the title to "Got tx with plan step less than last heartbeat in one topic partition stream" on Jul 23, 2024
@dcherednik (Member Author)

https://github.com/ydb-platform/ydb/pull/7051/files
works for the case of disabled volatile tx.

@snaury (Member) commented Sep 5, 2024

I tried adding a changefeed to jepsen tables and validating the resulting log, and can easily see the issue. It reproduces both with and without volatile transactions. It looks somehow related to splits (more splits -> better chance to reproduce), but the mechanism is unclear so far.

@snaury (Member) commented Sep 5, 2024

Reproduced with a build that logs per-shard heartbeat messages (89d27b2):

https://gist.github.com/snaury/9245b47b09e7f6c943bfeea134c0111b

It is clearly visible that when 72075186224037898 split into 72075186224037901 and 72075186224037902, both children started streaming their updates and heartbeats. However, while the pq partition waited for new heartbeats from the second shard, that shard appears to be ignored when the resolved messages are actually produced:

{"shard_id":72075186224037898,"resolved_private":[1725549507000,0]}
{"resolved":[1725549507000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549507000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549508000,0]}
...
{"shard_id":72075186224037901,"resolved_private":[1725549514000,0]}
{"shard_id":72075186224037902,"resolved_private":[1725549507000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549515000,0]}
{"shard_id":72075186224037902,"resolved_private":[1725549508000,0]}
{"resolved":[1725549515000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549516000,0]}
{"shard_id":72075186224037902,"resolved_private":[1725549509000,0]}
{"resolved":[1725549516000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549517000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549518000,0]}
{"shard_id":72075186224037902,"resolved_private":[1725549510000,0]}
{"resolved":[1725549518000,0]}

It is very suspicious that a new resolved message is always generated right after a new resolved_private from 72075186224037902, yet the actual value comes from the last resolved_private of 72075186224037901. The same kept happening after 72075186224037902 split into 72075186224037903 and 72075186224037904.
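
Presumably the intended aggregation is a monotonic minimum over the latest resolved_private of each shard: a partition-level resolved message is emitted only when that minimum advances, and after a split every child must report before the minimum may move past the parent's last value. The sketch below models those expected semantics; the class and method names are hypothetical, and this is not YDB's actual implementation.

```python
class ResolvedAggregator:
    """Models the expected pq-partition behavior: emit "resolved" as the
    minimum over the latest per-shard resolved_private, monotonically.
    Hypothetical sketch, not YDB code."""

    def __init__(self):
        self.per_shard = {}       # shard_id -> latest (plan_step, tx_id)
        self.last_emitted = None

    def on_shard_heartbeat(self, shard_id, ts):
        # Per-shard heartbeats are monotonic; keep the latest one.
        prev = self.per_shard.get(shard_id)
        self.per_shard[shard_id] = ts if prev is None else max(prev, ts)
        # Emit only when the minimum across *all* known shards advances,
        # so the partition-level resolved value can never go backwards.
        candidate = min(self.per_shard.values())
        if self.last_emitted is None or candidate > self.last_emitted:
            self.last_emitted = candidate
            return candidate      # caller writes {"resolved": candidate}
        return None

    def on_split(self, parent_id, child_ids):
        # The parent stops reporting; seed each child with the parent's
        # last value so the minimum stays put until every child reports.
        last = self.per_shard.pop(parent_id, None)
        if last is not None:
            for child in child_ids:
                self.per_shard.setdefault(child, last)
```

Under these semantics the resolved messages in the trace above would track the slower shard 72075186224037902 rather than the last value from 72075186224037901.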

It gets even weirder when it backtracks:

{"resolved":[1725549546000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549547000,0]}
{"shard_id":72075186224037903,"resolved_private":[1725549548000,0]}
{"shard_id":72075186224037901,"resolved_private":[1725549548000,0]}
{"shard_id":72075186224037904,"resolved_private":[1725549534000,0]}
{"resolved":[1725549548000,0]}
{"shard_id":72075186224037904,"resolved_private":[1725549535000,0]}
{"resolved":[1725549547000,0]}

Notice how it first produced [1725549548000,0], but then a new heartbeat from 72075186224037904 appeared and it produced [1725549547000,0], a previous value!
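
Replaying the per-shard heartbeats from this fragment through the hypothetical aggregator sketched above illustrates the expected behavior:

```python
# Replay the per-shard heartbeats from the fragment above (model only).
agg = ResolvedAggregator()
trace = [
    (72075186224037901, (1725549547000, 0)),
    (72075186224037903, (1725549548000, 0)),
    (72075186224037901, (1725549548000, 0)),
    (72075186224037904, (1725549534000, 0)),
    (72075186224037904, (1725549535000, 0)),
]
for shard_id, ts in trace:
    emitted = agg.on_shard_heartbeat(shard_id, ts)
    if emitted is not None:
        print({"resolved": list(emitted)})
# Prints [1725549547000, 0] and then [1725549548000, 0]; once
# 72075186224037904 reports its lower values, nothing further is emitted,
# so the sequence never decreases.
```

Since the fragment starts mid-stream, the model only learns about 72075186224037904 from its first heartbeat; had the split been fed through on_split, the child would have been seeded with the parent's last value and the minimum pinned even earlier. Either way the emitted sequence never goes backwards, unlike the observed [1725549548000,0] followed by [1725549547000,0].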
