StreamManager: retry with get result request on already exist errors #6345

Merged
verult merged 3 commits into quantumlib:master from stream-client/retry-on-job-exists
Nov 15, 2023

Conversation

verult
Collaborator

@verult verult commented Nov 14, 2023

This PR fixes a race condition that occurred roughly every 10-15 minutes. The fix adds a retry with GetQuantumResultRequest when StreamManager receives a "program already exists" or "job already exists" error. The race unfolds as follows:

  1. The client sends a CreateProgramAndJobRequest
  2. The client's stream disconnects
  3. The client retries with a new stream and a GetResultRequest
  4. The job doesn't exist yet, and the client receives a "job not found" error
  5. Scheduler creates the program and job.
  6. The client retries with a CreateJobRequest and fails with a "job already exists" error
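
The six steps above can be replayed against a toy in-memory server. This is only a sketch: `FakeScheduler`, `replay_race`, the job ID, and the string error messages are hypothetical stand-ins for illustration, not the Quantum Engine API or the StreamManager implementation.

```python
class FakeScheduler:
    """Toy stand-in for the server side (hypothetical, for illustration only)."""

    def __init__(self):
        self.jobs = set()
        self.pending = []  # create requests received but not yet applied

    def create_program_and_job(self, job_id):
        # The request is accepted, but the job only appears once
        # apply_pending() runs (step 5 in the sequence).
        self.pending.append(job_id)

    def apply_pending(self):
        self.jobs.update(self.pending)
        self.pending.clear()

    def get_result(self, job_id):
        return "result" if job_id in self.jobs else "job not found"

    def create_job(self, job_id):
        if job_id in self.jobs:
            return "job already exists"
        self.jobs.add(job_id)
        return "ok"


def replay_race(retry_on_already_exists):
    """Replay steps 1-6 and return the responses the client observes."""
    server = FakeScheduler()
    log = []
    server.create_program_and_job("job-0")   # steps 1-2: sent, then the stream breaks
    log.append(server.get_result("job-0"))   # steps 3-4: job is not there yet
    server.apply_pending()                   # step 5: scheduler creates the job
    outcome = server.create_job("job-0")     # step 6: conflicts with the job from step 5
    log.append(outcome)
    if outcome == "job already exists" and retry_on_already_exists:
        # The extra retry described in this PR: ask for the result instead of failing.
        log.append(server.get_result("job-0"))
    return log
```

Without the extra retry the replay ends on the "job already exists" error; with it, the final response is the job's result.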

This retry can cause problems when a user specifies a program ID or job ID in Engine.run_sweep() or EngineProcessor.run_sweep() instead of letting the client generate the ID: the "already exists" error may then reflect a genuine ID conflict, which is hard to distinguish from the race condition. However, the recommended path, ProcessorSampler.run_sweep(), does not specify IDs, and we're considering deprecating the ability to specify them.

This is now the error handling logic after a stream breakage:

stateDiagram-v2
    [*] --> GetResult
    CreateJob --> GetResult: J
    GetResult --> CreateJob: !J
    CreateJob --> CreateProgramAndJob: !P
    CreateProgramAndJob --> GetResult: P
    CreateProgramAndJob --> GetResult: J

where

  • P = Program already exists
  • !P = Program does not exist
  • J = Job already exists
  • !J = Job does not exist

and the dot indicates the starting state.
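
The diagram can also be read as a lookup table from (current request, error) to the next request. The sketch below encodes one entry per arrow; the enum and function names are hypothetical and are not taken from the StreamManager code.

```python
from enum import Enum, auto


class Request(Enum):
    CREATE_PROGRAM_AND_JOB = auto()
    CREATE_JOB = auto()
    GET_RESULT = auto()


class Error(Enum):
    PROGRAM_ALREADY_EXISTS = auto()  # P
    PROGRAM_DOES_NOT_EXIST = auto()  # !P
    JOB_ALREADY_EXISTS = auto()      # J
    JOB_DOES_NOT_EXIST = auto()      # !J


# After a stream breakage, the first retry is always a GetResult request.
INITIAL_RETRY = Request.GET_RESULT

# One entry per arrow in the diagram: (current request, error) -> retry request.
_TRANSITIONS = {
    (Request.CREATE_JOB, Error.JOB_ALREADY_EXISTS): Request.GET_RESULT,
    (Request.GET_RESULT, Error.JOB_DOES_NOT_EXIST): Request.CREATE_JOB,
    (Request.CREATE_JOB, Error.PROGRAM_DOES_NOT_EXIST): Request.CREATE_PROGRAM_AND_JOB,
    (Request.CREATE_PROGRAM_AND_JOB, Error.PROGRAM_ALREADY_EXISTS): Request.GET_RESULT,
    (Request.CREATE_PROGRAM_AND_JOB, Error.JOB_ALREADY_EXISTS): Request.GET_RESULT,
}


def next_request(current, error):
    """Return the request to retry with, or raise if the diagram has no such arrow."""
    try:
        return _TRANSITIONS[(current, error)]
    except KeyError:
        raise ValueError(
            f"no retry defined for {error.name} after {current.name}"
        ) from None
```

In the race described earlier, the client starts at `INITIAL_RETRY`, moves to `CREATE_JOB` on !J, and returns to `GET_RESULT` on J.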

cc @senecameeks

@verult verult requested a review from wcourtney November 14, 2023 23:10
@verult verult requested review from vtomole, cduck and a team as code owners November 14, 2023 23:10
@verult verult requested a review from pavoljuhas November 14, 2023 23:10
@verult verult force-pushed the stream-client/retry-on-job-exists branch from bafde82 to a29fcc1 Compare November 14, 2023 23:25

codecov bot commented Nov 14, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (0e288a7) 97.84% compared to head (a29fcc1) 97.84%.

❗ Current head a29fcc1 differs from pull request most recent head b60d41f. Consider uploading reports for the commit b60d41f to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6345   +/-   ##
=======================================
  Coverage   97.84%   97.84%           
=======================================
  Files        1110     1110           
  Lines       96597    96648   +51     
=======================================
+ Hits        94516    94567   +51     
  Misses       2081     2081           


Collaborator

@wcourtney wcourtney left a comment

LGTM. @verult noted that he tested this and demonstrated success using this PR locally.

@@ -319,7 +313,8 @@ def _get_retry_request_or_raise(
     error: quantum.StreamError,
     current_request,
     create_program_and_job_request,
-    create_job_request: quantum.QuantumRunStreamRequest,
+    create_job_request,
Collaborator

Can you keep the type hint here?

Collaborator Author

It's still there, just moved down by 1 param

Collaborator

That's not (typically) how type annotations work in Python. Each parameter needs its own annotation.
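
To illustrate the point about annotations: an annotation attaches only to the parameter it follows, so annotating the next parameter leaves the previous one unannotated. A minimal sketch, where `Request` and `retry` are hypothetical stand-ins for `quantum.QuantumRunStreamRequest` and the function under review:

```python
import inspect


class Request:
    """Hypothetical stand-in for quantum.QuantumRunStreamRequest."""


def retry(create_job_request, get_result_request: Request):
    return create_job_request, get_result_request


params = inspect.signature(retry).parameters

# The first parameter has no annotation at all.
first_is_unannotated = params["create_job_request"].annotation is inspect.Parameter.empty

# Only the second parameter carries the Request annotation.
second_is_request = params["get_result_request"].annotation is Request
```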

)
# If the program already exists and is created as part of the stream client, the job
# should also exist because they are created at the same time.
# If the job is missing, a `CreateQuantumJobRequest` will be issued after a
Collaborator

How does the program get created w/o the job if they're created together in a CreateProgramAndJobRequest? Does a closed stream kill the server-side handling?

Collaborator Author

Clarified a bit in the comment - it's for the unlikely case that the program is created outside StreamManager. The logic here doesn't explicitly try to solve this case, but it just happens to do the right thing.

@verult verult enabled auto-merge (squash) November 15, 2023 00:46
@verult verult merged commit 392083b into quantumlib:master Nov 15, 2023