sql: add timeout for PCR reader catalog lease acquisition #143669

Merged: 1 commit, Apr 9, 2025

Conversation

Collaborator

@fqazi fqazi commented Mar 28, 2025

Previously, the logic to determine if a PCR reader catalog was in use could become stuck if an availability issue occurred with the leasing subsystem. This was because we could end up waiting indefinitely for the lease in failure scenarios like TestUnavailableZipDir, and the statement_timeout is not active this early. To address this, this patch adds a 30-second timeout for obtaining a lease on the system database when detecting PCR reader catalogs.

Fixes: #141565

Release note: None

@fqazi fqazi requested a review from a team March 28, 2025 18:36
@fqazi fqazi requested review from a team as code owners March 28, 2025 18:36
@fqazi fqazi requested review from angles-n-daemons, arjunmahishi, aa-joshi and Abhinav1299 and removed request for a team March 28, 2025 18:36
@fqazi fqazi force-pushed the fixDebugZipAvailbilityBug branch 3 times, most recently from ae0c904 to 0f3354b on March 29, 2025 20:53
Collaborator

@rafiss rafiss left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @Abhinav1299, @angles-n-daemons, and @arjunmahishi)


pkg/sql/conn_executor.go line 3837 at r1 (raw file):

// we are connecting to a PCR reader catalog, if this has not been attempted
// before.
func (ex *connExecutor) maybeInitPCRReaderCatalog(ctx context.Context) {

is this safe to call concurrently?

i'm wondering if the init needs to happen inside of a sync.Once, so that way if there are concurrent calls that all try to initialize it, they will block until the init is done.

@fqazi fqazi force-pushed the fixDebugZipAvailbilityBug branch from 0f3354b to f1c965a on April 8, 2025 01:29
Collaborator Author

@fqazi fqazi left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @Abhinav1299, @angles-n-daemons, @arjunmahishi, and @rafiss)


pkg/sql/conn_executor.go line 3837 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

is this safe to call concurrently?

i'm wondering if the init needs to happen inside of a sync.Once, so that way if there are concurrent calls that all try to initialize it, they will block until the init is done.

Done.

Good point, changed this to a sync.Once.
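
As an illustration of the `sync.Once` approach discussed in this thread, here is a minimal sketch in which concurrent callers all block until a single initialization has finished. The type `catalogState`, its fields, and the `detect` callback are hypothetical and are not taken from `conn_executor.go`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// catalogState is a hypothetical holder for a lazily computed flag.
type catalogState struct {
	once      sync.Once
	isReader  bool
	initCalls atomic.Int32 // counts how many times the init body ran
}

// isPCRReader runs detect at most once. Concurrent callers wait inside
// once.Do until the first call has completed, then read the cached result.
func (s *catalogState) isPCRReader(detect func() bool) bool {
	s.once.Do(func() {
		s.initCalls.Add(1)
		s.isReader = detect()
	})
	return s.isReader
}

// countInits calls isPCRReader from many goroutines and reports how many
// times the initialization body actually executed.
func countInits(goroutines int) int32 {
	var s catalogState
	var wg sync.WaitGroup
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.isPCRReader(func() bool { return true })
		}()
	}
	wg.Wait()
	return s.initCalls.Load()
}

func main() {
	fmt.Println("init ran", countInits(8), "time(s)")
}
```

Note that `sync.Once` also records a failed or timed-out attempt as done, so a design like this has to decide what later callers should see when the one-time initialization does not succeed.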

@fqazi fqazi requested a review from rafiss April 8, 2025 01:37
Collaborator

@rafiss rafiss left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @Abhinav1299, @angles-n-daemons, and @arjunmahishi)


pkg/sql/conn_executor.go line 3857 at r2 (raw file):

		// unless there is some availability issue.
		const initPCRReaderCatalogTimeout = 30 * time.Second
		err := timeutil.RunWithTimeout(ctx, "detect-pcr-reader-catalog", initPCRReaderCatalogTimeout,

one more thing i wanted to ask: what if we keep the initialization code in newConnExecutor, but add the 30 second timeout there. does that resolve the test issue as well?

Collaborator Author

@fqazi fqazi left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @Abhinav1299, @angles-n-daemons, @arjunmahishi, and @rafiss)


pkg/sql/conn_executor.go line 3857 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

one more thing i wanted to ask: what if we keep the initialization code in newConnExecutor, but add the 30 second timeout there. does that resolve the test issue as well?

Yeah that also resolves the issue, I'll move it back there

Collaborator

@rafiss rafiss left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @Abhinav1299, @angles-n-daemons, @arjunmahishi, and @fqazi)


pkg/sql/conn_executor.go line 3857 at r2 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Yeah that also resolves the issue, I'll move it back there

thanks, i'd be more comfortable with that since lazy initialization usually leads to added complexity

@fqazi fqazi force-pushed the fixDebugZipAvailbilityBug branch from f1c965a to 8e4bb15 on April 9, 2025 12:11
@fqazi fqazi changed the title from "sql: lazily determine if a PCR reader catalog is in use" to "sql: add timeout for PCR reader catalog lease acquisition" Apr 9, 2025
@fqazi fqazi requested a review from rafiss April 9, 2025 12:12
Collaborator Author

@fqazi fqazi left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @Abhinav1299, @angles-n-daemons, @arjunmahishi, and @rafiss)


pkg/sql/conn_executor.go line 3857 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

thanks, i'd be more comfortable with that since lazy initialization usually leads to added complexity

Done.

I also reduced the timeout to 10 seconds.

@@ -3852,6 +3843,34 @@ func (ex *connExecutor) initEvalCtx(ctx context.Context, evalCtx *extendedEvalCo
evalCtx.copyFromExecCfg(ex.server.cfg)
}

// maybeInitPCRReaderCatalog leases the system database to determine if
Collaborator

nit: the comment should say initPCRReaderCatalog, not maybeInitPCRReaderCatalog

Previously, the logic to determine if a PCR reader catalog was in use
could become stuck if an availability issue occurred with the leasing
subsystem. This was because we could end up waiting indefinitely for the
lease in failure scenarios like TestUnavailableZipDir, and the
statement_timeout is not active this early. To address this, this patch
adds a 10-second timeout for obtaining a lease on the system database
when detecting PCR reader catalogs.

Fixes: cockroachdb#141565

Release note: None
@fqazi fqazi force-pushed the fixDebugZipAvailbilityBug branch from 8e4bb15 to 59f3557 on April 9, 2025 17:46
@fqazi
Collaborator Author

fqazi commented Apr 9, 2025

@rafiss TFTR!

bors r+

@craig craig bot merged commit 0ecd2ec into cockroachdb:master Apr 9, 2025
24 checks passed
Development

Successfully merging this pull request may close these issues.

cli: TestUnavailableZip failed
3 participants