Skip to content

Racy responsemanager tests (diagnosis, needing a soluton) #273

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rvagg opened this issue Nov 18, 2021 · 2 comments
Closed

Racy responsemanager tests (diagnosis, needing a soluton) #273

rvagg opened this issue Nov 18, 2021 · 2 comments
Assignees
Labels
need/triage Needs initial labeling and prioritization

Comments

@rvagg
Copy link
Member

rvagg commented Nov 18, 2021

Two racy tests showed up in #244:

  • TestValidationAndExtensions/test_update_hook_processing/can_send_extension_data/when_paused
  • TestValidationAndExtensions/hooks_can_alter_the_loader

The thing these have in common is this pattern:

  1. ProcessRequests
  2. .. some "completion" condition involving response checking
  3. ProcessRequests
  4. expect certain outcome

In the case of the first test above, it processes 3 blocks and during the last block submits a "paused" update. The "completion" condition is receiving those 3 blocks. In the second test above, it runs an empty request that fails and the "completion" condition is a certain response error code. The both then go on to register hooks and run ProcessRequests again and assert that the new hook did a thing.

But the important bit is the completion condition doesn't quite get to actual completion, because they both involve looking at the responses, which are not quite the end of a response execution.

In the first test, we can watch for 3 blocks being sent, but then queryexecutor goes on to push through a FinishTask with the ErrPaused error that switches response.state = paused which is the required condition for the rest of the test to pass. But occasionally, "receivedNBlocks" returns and the test continues quick enough that we get to the state quick enough to have a proper paused condition. I haven't looked into exactly why the timing is a problem for the second test but my bet is that it's a similar situation where the time between response received and continuing the test allows an occasional race where the setup doesn't quite make it.

I'm not sure what to do about this, but it seems to me that we need a better "completion" condition than watching responses, we want to get through to the complete end of a queryexecutor execution as well.

@rvagg rvagg added the need/triage Needs initial labeling and prioritization label Nov 18, 2021
@ipfs ipfs deleted a comment from welcome bot Nov 18, 2021
@rvagg
Copy link
Member Author

rvagg commented Nov 26, 2021

I thought #284 would fix this but it doesn't, I've tried inserting it before the failing assertions get involved to no avail. Still rare, but they seem to be easier to repro in CI than locally, probably to do with hardware differences.

What I've now noticed, is that these tests are failing @ ~10 seconds, and the context involved here has a timeout of 10 seconds. So this is a failure to execute, and perhaps does indicate problems beyond the scope of just testing, maybe we have a condition whereby a request can get stuck in the queue?

@rvagg
Copy link
Member Author

rvagg commented Nov 30, 2021

closed by #287

@rvagg rvagg closed this as completed Nov 30, 2021
Repository owner moved this from Ready to Done in Project Thunder (Interop) Nov 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/triage Needs initial labeling and prioritization
Projects
None yet
Development

No branches or pull requests

2 participants