Postgres takes 5 minutes to start up and incorrectly reports readiness #3798
Comments
Hello @mausch! I have not been able to reproduce the behavior you are seeing... Are you waiting for the initial backup to complete before scaling down? How are you scaling the cluster down to 0 instances? What version of PGO are you using (the image tag you listed is for a postgres image)? Can you send your postgrescluster spec? Can you reproduce the behavior again and send the resulting PGO logs?
No idea how to check this, or why it would matter tbh 🙂
kubernetes dashboard or
Already included in my initial message
@mausch just to clarify - when you say you are scaling down to 0, what exactly are you scaling down to 0? The StatefulSet for the PG instance? What is the purpose for doing so? As you said, PGO will scale it back to 1 when you do so. This appears to be forcing PG into recovery (which is clear based on the logs you provided), which could be the reason you're seeing a slower startup time. What behavior do you see when you do not scale down to 0, effectively allowing for a normal startup? Startup time should be very fast for a brand new cluster (e.g. no need to wait for initial backup or anything like that before connecting - you can connect when the initial primary/leader is ready), assuming you allow a clean bootstrap & start. I'll also note that we are leveraging the HA system for readiness:
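For context, a rough sketch of what delegating readiness to Patroni can look like on an instance pod; the `/readiness` path and port 8008 are Patroni's defaults, and the exact probe PGO configures may differ:

```yaml
# Sketch only: a Kubernetes readiness probe that defers to Patroni's
# REST API rather than a plain TCP/SQL check. Patroni answers /readiness
# with 200 once it considers the node the leader or a running replica.
readinessProbe:
  httpGet:
    path: /readiness
    port: 8008      # Patroni REST API default port
    scheme: HTTPS   # assumes the Patroni API is served over TLS; plain Patroni defaults to HTTP
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```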
Apologies for the very late reply!
Yes
Simulating a crash, or any other operation that replaces the pod, e.g. bumping up memory limits.
This takes several minutes though. I just had a case of this where I bumped up memory limits and it took 4 minutes to start up. Since this is still an issue, please reopen.
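For reference, a memory-limit bump of the kind described here is a change to the instance set in the PostgresCluster spec; the fragment below is a hedged sketch with placeholder names and values:

```yaml
# Fragment of a PostgresCluster spec (other required fields omitted).
# Instance-set name and sizes are placeholders; raising the limit causes
# the operator to roll out a new instance pod, which is the restart
# being described above.
spec:
  instances:
    - name: instance1
      replicas: 1
      resources:
        limits:
          memory: 2Gi   # bumped, e.g. from 1Gi
        requests:
          memory: 1Gi
```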
@mausch per my comment above, it sounds like you are simply dealing with the typical Postgres recovery process. I'll also note that this is the exact reason for CPK's HA functionality. In other words, if you want your database to remain available for writes immediately following the failure of the current primary, you should simply add another replica. Focusing on the startup time here seems misguided, since you're relying on Postgres to go through its typical recovery process (and again, this is exactly why additional replicas can be added to create a highly available cluster).

As for your note around readiness, this is again likely tied to the recovery process. More specifically, during recovery Postgres will likely be accepting connections as a hot standby. Therefore, if you attempt to connect to the DB during this time, you'll likely find that you are able to do so, but only via a read-only connection (since the database will still be in recovery). So again, this also sounds normal.

Additionally, going through this thread I see that the one thing we haven't taken a look at is the Postgres logs, specifically the Postgres logs at the time you're seeing Patroni log the …
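As a rough illustration of the suggestion above, adding a second member to the instance set is a one-line change to the same spec fragment (names are placeholders):

```yaml
# Fragment of a PostgresCluster spec: a second replica gives the cluster
# a standby that can take over while the former primary runs recovery.
spec:
  instances:
    - name: instance1
      replicas: 2
```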
Thanks for the quick reply! If I understand correctly, this means (please correct me if I'm wrong):
Here are the logs after a pod restart:
@mausch those are actually the Patroni logs, which are different from the actual Postgres logs. Can you provide the Postgres logs found under …?
Oh, sorry about the confusion.
Overview
Create a trivial PostgresCluster with a single instance. Scale it down to 0. Scale it back up to 1 (actually I think the operator automatically brings it back to 1).
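For reproduction, something close to the quickstart manifest should be enough; a hedged sketch, with the cluster name, instance/repo names, Postgres version, and storage sizes as placeholders:

```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo            # placeholder name
spec:
  postgresVersion: 14    # placeholder version
  instances:
    - name: instance1
      replicas: 1
      dataVolumeClaimSpec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 1Gi
```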
Logs show that the instance takes 5 minutes to actually be ready. During those 5 minutes the pod reports as ready and live, so applications try to connect to it. Because it's not actually ready, all connections fail.
So the issues here are:
Logs:
Environment
Please provide the following details: