errors when reading from the mq, possibly blocking '-m fast' restart #2

tvondra · 2016-11-23T07:31:53Z

Hi Alexander,

I've done a review and a bit of testing of the extension today, and I've run into some strange issues in high-concurrency environments. Essentially, I have two pgbench tests running at the same time:

- a regular pgbench with 72 clients, using the standard workload (so "pgbench -c 72 ...")
- a pgbench with 16 clients reading the collected wait data, essentially running this custom SQL script:

select count(*) from pg_wait_sampling_current;
select count(*) from pg_wait_sampling_history;
select count(*) from pg_wait_sampling_profile;
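
For anyone who wants to reproduce this, the two concurrent runs can be sketched roughly as below. This is only a sketch: the report elides the remaining pgbench arguments, so the database name, the duration, and the DRY_RUN switch are placeholders rather than the settings actually used.

```shell
#!/bin/sh
# Sketch of the two concurrent pgbench runs described above.
# Assumptions (NOT from the report): DB, DURATION and DRY_RUN are placeholders.
DB="${DB:-bench}"            # hypothetical database, already initialised with `pgbench -i`
DURATION="${DURATION:-600}"  # seconds; the report does not state a duration
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the commands, 0 = actually run them

SCRIPT="${TMPDIR:-/tmp}/wait_sampling_read.sql"

# Custom script for the reader clients: one query per pg_wait_sampling view.
cat > "$SCRIPT" <<'EOF'
select count(*) from pg_wait_sampling_current;
select count(*) from pg_wait_sampling_history;
select count(*) from pg_wait_sampling_profile;
EOF

# Print the command in dry-run mode, otherwise start it in the background.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@" &
    fi
}

# 72 clients on the standard (TPC-B-like) workload.
run pgbench -c 72 -T "$DURATION" "$DB"
# 16 clients reading the collected wait data; -n skips vacuuming the pgbench tables.
run pgbench -n -c 16 -T "$DURATION" -f "$SCRIPT" "$DB"

# Wait for both background runs to finish (returns immediately in dry-run mode).
wait
```

Running it with DRY_RUN=0 starts both workloads against a live cluster, so that the reader clients compete with the 72 regular clients.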

After a short while, I get these errors in the second pgbench:

client 13 aborted in state 1: ERROR: Error reading mq.
client 4 aborted in state 1: ERROR: Error reading mq.

What's worse, running "pg_ctl restart" on the cluster times out - there's no CPU or I/O activity, the cluster should restart without any issue, but I suppose there are some locking issues or so, caused by the mq read failures.

Regarding the code: I'm not sure what the purpose of setup_gucs() is. Why not simply define the GUC variables? If anything, get_guc_variables() is only meant to be used in help_config.c (per the comment in guc.c).

Also, should the bgworker main method really do proc_exit(1) instead of proc_exit(0)? At least that's what the other workers I've seen do.


akorotkov · 2016-12-01T17:01:35Z

Hi Tomas,

Thank you very much for reporting!
I'm quite busy right now, but I hope to fix this next week.

akorotkov · 2017-01-29T14:16:20Z

Hi Tomas,

I'm sorry for returning to this so late.

> client 13 aborted in state 1: ERROR: Error reading mq.
> client 4 aborted in state 1: ERROR: Error reading mq.

This seems to be a concurrency issue that should be resolved by 4fdf032. Could you please recheck?

> Regarding the code: I'm not sure what the purpose of setup_gucs() is. Why not simply define the GUC variables? If anything, get_guc_variables() is only meant to be used in help_config.c (per the comment in guc.c).

This is a kind of black magic used to place GUCs into shared memory and make them work correctly (normally PostgreSQL doesn't allow this). That magic should be removed in the future, but it works for now.

> Also, should the bgworker main method really do proc_exit(1) instead of proc_exit(0)? At least that's what the other workers I've seen do.

The collector process calls proc_exit(0) only on SIGTERM, i.e. when the worker is shut down by an external request. proc_exit(0) seems to be the correct behavior in that case.

tvondra · 2017-01-30T22:47:21Z

Will check. I'm running some other tests on the machine now, but hopefully I'll be able to do some testing next week.

tvondra · 2017-02-21T12:20:37Z

I repeated the stress test and can't reproduce the original issue anymore, even after running it for 8 hours.

tvondra closed this as completed Feb 21, 2017

banlex73 mentioned this issue Dec 11, 2020

pg_wait_sampling process blocks select * FROM pg_wait_sampling_profile ; When database was dropped from the cluster #29

Closed
