Racy responsemanager tests (diagnosis, needing a solution) #273

rvagg · 2021-11-18T06:09:44Z

Two racy tests showed up in #244:

TestValidationAndExtensions/test_update_hook_processing/can_send_extension_data/when_paused
TestValidationAndExtensions/hooks_can_alter_the_loader

The thing these have in common is this pattern:

ProcessRequests
.. some "completion" condition involving response checking
ProcessRequests
expect certain outcome

In the case of the first test above, it processes 3 blocks and during the last block submits a "paused" update. The "completion" condition is receiving those 3 blocks. In the second test above, it runs an empty request that fails, and the "completion" condition is a certain response error code. They both then go on to register hooks, run ProcessRequests again, and assert that the new hook did a thing.

But the important bit is the completion condition doesn't quite get to actual completion, because they both involve looking at the responses, which are not quite the end of a response execution.

In the first test, we can watch for 3 blocks being sent, but queryexecutor then goes on to push through a FinishTask with the ErrPaused error, which switches response.state to paused; that is the required condition for the rest of the test to pass. Occasionally, "receivedNBlocks" returns and the test continues quickly enough that we reach the state check before the response has properly paused. I haven't looked into exactly why the timing is a problem for the second test, but my bet is that it's a similar situation, where the short time between receiving the response and continuing the test allows an occasional race in which the setup doesn't quite complete.

I'm not sure what to do about this, but it seems to me that we need a better "completion" condition than watching responses: we want to get through to the complete end of a queryexecutor execution as well.
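To make the idea concrete, here is a minimal, self-contained sketch of a completion condition that waits for the executor to finish rather than only for the blocks to arrive. Everything in it (`taskTracker`, `response`, `execute`, `waitForNoActiveTasks`) is a hypothetical stand-in written for illustration; it is not graphsync's actual queryexecutor or task queue code.

```go
package main

import (
	"fmt"
	"sync"
)

// taskTracker is a hypothetical stand-in for an executor's task queue.
// It only counts active tasks and lets a test block until there are none.
type taskTracker struct {
	mu     sync.Mutex
	idle   *sync.Cond
	active int
}

func newTaskTracker() *taskTracker {
	t := &taskTracker{}
	t.idle = sync.NewCond(&t.mu)
	return t
}

// start records that a task has begun. Call it before launching the goroutine.
func (t *taskTracker) start() {
	t.mu.Lock()
	t.active++
	t.mu.Unlock()
}

// finish records that a task has ended and wakes waiters when none remain.
func (t *taskTracker) finish() {
	t.mu.Lock()
	t.active--
	if t.active == 0 {
		t.idle.Broadcast()
	}
	t.mu.Unlock()
}

// waitForNoActiveTasks blocks until every started task has finished.
func (t *taskTracker) waitForNoActiveTasks() {
	t.mu.Lock()
	for t.active > 0 {
		t.idle.Wait()
	}
	t.mu.Unlock()
}

// response is a hypothetical response record with a mutex-guarded state.
type response struct {
	mu    sync.Mutex
	state string
}

func (r *response) setState(s string) {
	r.mu.Lock()
	r.state = s
	r.mu.Unlock()
}

func (r *response) getState() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.state
}

// execute simulates one execution: it sends n blocks, then does the
// bookkeeping (switching to "paused") that the rest of the test depends on.
func execute(t *taskTracker, r *response, n int, blocks chan<- int) {
	defer t.finish() // runs after setState, so "idle" implies "paused"
	for i := 0; i < n; i++ {
		blocks <- i
	}
	r.setState("paused")
}

func main() {
	tracker := newTaskTracker()
	resp := &response{state: "running"}
	blocks := make(chan int)

	tracker.start()
	go execute(tracker, resp, 3, blocks)

	// Weak completion condition: all 3 blocks have been received.
	for i := 0; i < 3; i++ {
		<-blocks
	}
	// The state may still be "running" here; asserting now would be racy.

	// Stronger completion condition: the execution itself has ended.
	tracker.waitForNoActiveTasks()
	fmt.Println(resp.getState()) // prints "paused"
}
```

The point of the sketch is the ordering: receiving the last block only shows that the executor reached its final send, while waiting for the task count to drop to zero shows that the bookkeeping after that send has also run.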

rvagg · 2021-11-26T07:02:39Z

I thought #284 would fix this, but it doesn't. I've tried inserting the new wait before the failing assertions run, to no avail. The failures are still rare, but they seem to be easier to reproduce in CI than locally, probably because of hardware differences.

What I've now noticed is that these tests are failing at ~10 seconds, and the context involved here has a timeout of 10 seconds. So this is a failure to execute, and it perhaps indicates problems beyond the scope of just testing: maybe we have a condition whereby a request can get stuck in the queue?

rvagg · 2021-11-30T05:56:16Z

closed by #287

rvagg added the need/triage Needs initial labeling and prioritization label Nov 18, 2021

rvagg assigned hannahhoward Nov 18, 2021

ipfs deleted a comment from welcome bot Nov 18, 2021

hannahhoward added this to Project Thunder (Interop) Nov 18, 2021

hannahhoward moved this to Ready in Project Thunder (Interop) Nov 18, 2021

rvagg mentioned this issue Nov 19, 2021

update to context datastores #275

Merged

rvagg mentioned this issue Nov 26, 2021

feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests #284

Merged

rvagg closed this as completed Nov 30, 2021

Repository owner moved this from Ready to Done in Project Thunder (Interop) Nov 30, 2021

marten-seemann pushed a commit that referenced this issue Mar 2, 2023

fix: clear error message on channel open after restart (#273)

10c4092
