errors when reading from the mq, possibly blocking '-m fast' restart #2

Closed
tvondra opened this issue Nov 23, 2016 · 4 comments

tvondra commented Nov 23, 2016

Hi Alexander,

I did a review and a bit of testing of the extension today, and I've run into some strange issues in high-concurrency environments. Essentially, I have two pgbench tests running at the same time:

  1. a regular pgbench with 72 clients, using the standard workload (so "pgbench -c 72 ...")

  2. a pgbench reading the collected wait data, essentially running this custom SQL script (16 clients)

    select count(*) from pg_wait_sampling_current;
    select count(*) from pg_wait_sampling_history;
    select count(*) from pg_wait_sampling_profile;
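
For reference, the two concurrent runs described above could be sketched roughly as the shell script below. Only the client counts (72 and 16) and the three queries come from the report; the database name, the run duration, the script file name, and the `RUN_PGBENCH` guard variable are assumptions made for illustration, and the database is assumed to have been initialized with `pgbench -i` beforehand.

```shell
#!/bin/sh
# Sketch of the stress test: a standard pgbench workload running alongside
# a second pgbench that reads the pg_wait_sampling views.
# Assumed (not from the report): DB, DURATION, SCRIPT, RUN_PGBENCH.
DB="${DB:-bench}"                          # hypothetical database name
DURATION="${DURATION:-600}"                # hypothetical run length, seconds
SCRIPT="${SCRIPT:-wait_sampling_read.sql}" # hypothetical script file name

# Custom script executed by the reader clients.
cat > "$SCRIPT" <<'EOF'
select count(*) from pg_wait_sampling_current;
select count(*) from pg_wait_sampling_history;
select count(*) from pg_wait_sampling_profile;
EOF

# Launch the benchmarks only when explicitly requested and pgbench exists.
if [ -n "${RUN_PGBENCH:-}" ] && command -v pgbench >/dev/null 2>&1; then
    # 1. Regular workload with 72 clients.
    pgbench -c 72 -T "$DURATION" "$DB" &
    # 2. Reader workload with 16 clients; -n skips the pre-run vacuum.
    pgbench -n -c 16 -T "$DURATION" -f "$SCRIPT" "$DB" &
    # Wait for both background runs to finish.
    wait
fi
```

Running it as `RUN_PGBENCH=1 sh stress.sh` (the file name is arbitrary) starts both workloads in the background; without the variable it only writes the SQL script.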

After a short while, I get these errors in the second pgbench:

client 13 aborted in state 1: ERROR: Error reading mq.
client 4 aborted in state 1: ERROR: Error reading mq.

What's worse, running "pg_ctl restart" on the cluster times out. There's no CPU or I/O activity, so the cluster should restart without any issue, but I suppose there are some locking issues caused by the mq read failures.

Regarding the code - I'm not sure what the purpose of setup_gucs() is. Why not simply define the GUC variables? If anything, get_guc_variables() is only meant to be used in help_config.c (per the comment in guc.c).

Also, should the bgworker main method really do proc_exit(1) instead of proc_exit(0)? At least that's what the other workers I've seen do.

@akorotkov

Hi Tomas,

thank you very much for reporting!
I'm quite busy now, but I hope to fix this next week.

@akorotkov

Hi Tomas,

I'm sorry for returning to this so late.

> client 13 aborted in state 1: ERROR: Error reading mq.
> client 4 aborted in state 1: ERROR: Error reading mq.

This seems to be a concurrency issue that should be resolved by 4fdf032. Could you please recheck?

> Regarding the code - I'm not sure what the purpose of setup_gucs() is. Why not simply define the GUC variables? If anything, get_guc_variables() is only meant to be used in help_config.c (per the comment in guc.c).

This is a kind of black magic used to place GUCs into shared memory and make them work correctly (normally Postgres doesn't allow this). In the future that magic should be removed, but it works for now.

> Also, should the bgworker main method really do proc_exit(1) instead of proc_exit(0)? At least that's what the other workers I've seen do.

The collector process does proc_exit(0) only on SIGTERM, i.e. when the worker is shut down by an external request. proc_exit(0) seems to be the correct behavior in this case.

tvondra commented Jan 30, 2017

Will check. I'm running some other tests on the machine now, but hopefully I'll be able to do some testing next week.

tvondra commented Feb 21, 2017

I repeated the stress test and can't reproduce the original issue anymore, even after running it for 8 hours.
