
profile_pid (and profile_queries) GUCs are reset to defaults when a parallel query is run #85


Closed
mbanck-cd opened this issue Jan 24, 2025 · 9 comments

@mbanck-cd

One of our customers complained that POWA's web interface was getting very slow to respond. They tracked it down to slow queries being run against the POWA database, which in turn were correlated with very large amounts of pg_wait_sampling data. POWA recommends turning off pg_wait_sampling.profile_pid (see https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_wait_sampling.html#configuration), which they did.

What they noticed, though, was that even after turning off profile_pid, it was later set back to on. They also mentioned that they ran the pg_wait_sampling_reset_profile() function multiple times to keep the POWA web interface responsive.

We could reproduce the problem that profile_pid is reset to on (the default) directly after calling pg_wait_sampling_reset_profile(), but only once: after reloading the PostgreSQL configuration, the GUC was back to off, but it would then again be set to on when pg_wait_sampling_reset_profile() was called. However, we have not managed to find a way to reliably reproduce this from an empty instance or database.

We had a look at the code but so far did not see an obvious issue with how the configuration parameters are set, so we are opening this issue in case it rings a bell somewhere, or somebody knows what is going on and can fix it. We will continue to diagnose the problem next week.

@mbanck-cd
Author

mbanck-cd commented Jan 24, 2025

I think I can reproduce it well enough using this dump: https://github.com/credativ/omdb-postgresql/releases/download/2022-10-18/omdb.dump and those pgbench scripts: https://github.com/credativ/omdb-postgresql/tree/master/pgbench

Preparation (on Debian, using the postgresql-common tools):

pg_virtualenv -v 17
createdb omdb
pg_restore -d omdb omdb.dump
cd omdb-postgresql/pgbench
pg_conftool set shared_preload_libraries 'pg_wait_sampling'
pg_conftool set pg_wait_sampling.profile_pid off
pg_ctlcluster 17 regress restart
./pgbench.sh --time 3600 --client 2 &

Now if I run pg_ctlcluster 17 regress reload and then query the value of profile_pid, most of the time it is back to on after 5 seconds:

$ pg_ctlcluster 17 regress reload; psql -c "SHOW pg_wait_sampling.profile_pid"; sleep 5; psql -c "SHOW pg_wait_sampling.profile_pid;"
 pg_wait_sampling.profile_pid 
------------------------------
 off
(1 row)

 pg_wait_sampling.profile_pid 
------------------------------
 on
(1 row)

@shinderuk
Contributor

@mbanck-ntap Thank you for the report!

I've reproduced the bug using your instructions. It's not related to pg_wait_sampling_reset_profile(). It has to do with parallel workers and pg_wait_sampling storing GUCs in shared memory. Storing GUC variables in shared memory looked weird, but thanks to your report, we now know that it's completely broken.

During initialization, a parallel worker temporarily resets GUCs to their defaults and then sets them to the values received from the leader (see RestoreGUCState). So there is a window during which the pg_wait_sampling GUCs in shared memory are reset to their defaults by a parallel worker. That's already a problem. Further, if another session launches a parallel worker during this window, it will receive the wrong current values (the defaults) from its leader and will set them permanently.
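For illustration, here is a minimal sketch of the problematic pattern (the names myext.profile_pid and SharedState are hypothetical, not the actual pg_wait_sampling identifiers): the variable handed to DefineCustomBoolVariable() lives in shared memory, so every assignment, including the transient reset to the boot value in RestoreGUCState(), is immediately visible to all backends:

#include "postgres.h"
#include "utils/guc.h"

/* Hypothetical shared struct; not the actual pg_wait_sampling layout. */
typedef struct SharedState
{
	bool	profile_pid;	/* GUC storage shared by all backends */
} SharedState;

static SharedState *shared_state;	/* points into a shared memory segment */

static void
define_gucs(void)
{
	DefineCustomBoolVariable("myext.profile_pid",
				 "Sample PID together with the wait event.",
				 NULL,
				 &shared_state->profile_pid,	/* shared storage! */
				 true,	/* boot value: the default a parallel worker
					 * briefly restores in RestoreGUCState() */
				 PGC_SUSET,
				 0,
				 NULL, NULL, NULL);
}

Because all backends write through the same pointer, a parallel worker's temporary reset flips the value for every session at once.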

If I set max_parallel_workers to 0, I no longer observe the bug. Could you confirm this?

To check my reasoning I added a delay in RestoreGUCState and reproduced the bug with just two sessions concurrently launching parallel workers.

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0831d45aba3..1f9bf94ef5f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -6303,6 +6303,8 @@ RestoreGUCState(void *gucstate)
 		InitializeOneGUCOption(gconf);
 	}
 
+	pg_usleep(3000000L);
+
 	/* First item is the length of the subsequent data */
 	memcpy(&len, gucstate, sizeof(len));
 

repro.sh:

psql -c 'show pg_wait_sampling.profile_pid'
psql -c 'set debug_parallel_query = on; select count(*) from pg_class' &
sleep 1
psql -c 'set debug_parallel_query = on; select count(*) from pg_class'
wait
psql -c 'show pg_wait_sampling.profile_pid'

Output:

$ ./repro.sh 
 pg_wait_sampling.profile_pid 
------------------------------
 off
(1 row)

SET
 count 
-------
   415
(1 row)

SET
 count 
-------
   415
(1 row)

 pg_wait_sampling.profile_pid 
------------------------------
 on
(1 row)

It's an ancient bug, possibly there from the very beginning. We'll try to fix it. And I think we should just stop using shared memory for the GUCs.

Thanks again!

@mbanck-cd mbanck-cd changed the title profile_pid (and profile_queries) GUCs are sometimes reset to default after e.g. running pg_wait_sampling_reset_profile() profile_pid (and profile_queries) GUCs are reset to defaults when a parallel query is run Jan 27, 2025
@mbanck-cd
Author

Ah yes, with debug_parallel_query = on this is trivially reproducible with a minimal pgbench run. I tried that first, but as pgbench's queries are too simple to be parallelized, I could not reproduce it that way.

Also, setting max_parallel_workers_per_gather = 0 makes the problem go away, i.e., profile_pid stays off.

Regarding pg_wait_sampling_reset_profile(): I initially saw the parameter change just after running it, so I believed there was a correlation, but I have disregarded that now and have updated the issue title to reflect that.

@Medvecrab
Contributor

We have discussed this issue, and the only way we found to remove this behaviour with parallel workers is to move the GUC variables from shared memory to local process memory, as they are everywhere else.

This will bring the following changes in behaviour:

  1. All GUC variables will become PGC_SIGHUP instead of PGC_SUSET, meaning that their values can only be changed on SIGHUP (i.e., when the postmaster re-reads the config file). SET pg_wait_sampling.profile_pid from a superuser will no longer affect any other process (as it should be).
  2. GUCs will no longer change when parallel workers are launched.

Having GUCs in shared memory is fun, but we prefer robust solutions that do not deviate from the standard usage rules of GUCs.
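For comparison, a minimal sketch of the conventional, process-local pattern (again with hypothetical names, not the actual patch): the GUC variable is an ordinary static in backend-local memory and is registered as PGC_SIGHUP, so each process picks up changes only when the configuration file is re-read:

#include "postgres.h"
#include "utils/guc.h"

/* Backend-local GUC storage, as in most extensions. */
static bool profile_pid = true;

static void
define_gucs(void)
{
	DefineCustomBoolVariable("myext.profile_pid",
				 "Sample PID together with the wait event.",
				 NULL,
				 &profile_pid,	/* process-local storage */
				 true,
				 PGC_SIGHUP,	/* changes only on config reload */
				 0,
				 NULL, NULL, NULL);
}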

If you have any objections (e.g., you actively SET pg_wait_sampling GUCs in your scripts), feel free to join the discussion.

@mbanck-cd
Author

mbanck-cd commented Jan 27, 2025

Sounds good; at least we don't need to SET it directly. And I think the way setting this affects other sessions as well (the pg_wait_sampling tables/views are shared between all sessions, I believe?) means it should be set centrally and changed on SIGHUP.

@Medvecrab
Contributor

I think the way setting this affects other sessions as well

Right now, even using SET, you change the collector's behaviour for ALL sessions: for the views and for the collected statistics. I think even SET LOCAL affects everything right now. With the proposed change, the GUC values will change globally only in "global" cases (re-reading the config file) and not in the most "local" ones (using SET [LOCAL]).

@mbanck-cd
Author

Well, in that case, changing it to PGC_SIGHUP sounds like the right approach to me.

Medvecrab pushed a commit that referenced this issue Jan 28, 2025
Medvecrab pushed a commit that referenced this issue Jan 28, 2025
Medvecrab pushed a commit that referenced this issue Feb 4, 2025
Medvecrab pushed a commit that referenced this issue Feb 4, 2025
Medvecrab pushed a commit that referenced this issue Feb 4, 2025
shinderuk pushed a commit that referenced this issue Feb 4, 2025
@mbanck-cd
Author

Thanks for merging the PR! I built pg_wait_sampling packages locally with the two commits 8f6f7e6 and 4826caa, and they work fine.

Will there be a new (point) release of pg_wait_sampling with those fixes soon?

@shinderuk
Contributor

@mbanck-ntap Thank you for testing! I've tagged a release: https://github.com/postgrespro/pg_wait_sampling/releases/tag/v1.1.7
