pg_wait_sampling process blocks select * FROM pg_wait_sampling_profile ; When database was dropped from the cluster #29
Hi, I am not sure whether it is a bug or some initial architecture problem. @banlex73, the non-existing database Oid looks weird, of course. Is it blocked forever by the collector, or is it still possible to select from pg_wait_sampling_profile from time to time? @PavelSorokin, if you have active PostgresPro support, you can contact them as well. Anyway, we will try to take a closer look at this issue.
Hi Alexey @ololobus, I've just checked one of my clusters and can confirm that the LOCK is PERMANENT. I tried to select * from pg_wait_sampling_profile and it was blocked by the collector.
I have tried today to reproduce this issue with pg_wait_sampling as per the stable branch state and REL_12_STABLE, and cannot reproduce it. What I have tried so far is to repeat the following steps (~a couple of dozen times):
1. Put pg_wait_sampling into shared_preload_libraries.
2. Do select * from pg_wait_sampling_profile;.
3. Create an extra test database.
4. Do select * from pg_wait_sampling_profile; again, with or without additional payload (checkpoints, create table, etc).
5. Drop the test database.
Am I missing something? @banlex73, maybe you (or @PavelSorokin) can provide more specific steps, so I can reproduce this issue? P.S. I accidentally closed this issue while typing this comment with some weird keyboard shortcut, but I reopened it immediately, do not worry.
Alexey, thank you for trying. What I can do is set up the environment and try to reproduce this issue. ETA: the next couple of days.
Quick update: another session is periodically trying to select data from pg_wait_sampling_profile.
Hi, I think we're having the exact same problem: we're waiting on a lock and cannot use the pg_wait_sampling_profile view anymore. We cannot use pg_wait_sampling_history either; pg_wait_sampling_current is still accessible. No link with DROP DATABASE though, as we don't drop databases. For now, we have a stack backtrace from when this occurs. If you'd rather have me open another issue, please tell me.
FYI, I'm pretty sure we already hit this bug in the past (~2 years ago) on the Powa demo.
OK, I have spent some time today digging through the code and I can now reproduce this issue reliably (I hope). To reproduce the same collector hang with 100% certainty, one has to:
1. Reduce the shm_mq size COLLECTOR_QUEUE_SIZE to e.g. 64 bytes.
2. Put a sleep just before send_profile(); I have used pg_usleep(1000L*1000L*60);.
3. Recompile and install the pg_wait_sampling extension.
Then just open a psql session, do select * from pg_wait_sampling_get_profile();, wait a couple of seconds, and cancel this query. That is all: the collector will hang forever with the same stack trace as provided by @marco44:

* thread #1, name = 'postgres', stop reason = signal SIGSTOP
  * frame #0: 0x00007f50e83b87b7 libc.so.6`epoll_wait(epfd=3, events=0x000055d2eca369b8, maxevents=1, timeout=-1) at epoll_wait.c:30
    frame #1: 0x000055d2eb296483 postgres`WaitEventSetWaitBlock(set=0x000055d2eca36940, cur_timeout=-1, occurred_events=0x00007fffc6f05a60, nevents=1) at latch.c:1080
    frame #2: 0x000055d2eb29635c postgres`WaitEventSetWait(set=0x000055d2eca36940, timeout=-1, occurred_events=0x00007fffc6f05a60, nevents=1, wait_event_info=134217755) at latch.c:1032
    frame #3: 0x000055d2eb295a72 postgres`WaitLatchOrSocket(latch=0x00007f50e7d0e254, wakeEvents=33, sock=-1, timeout=-1, wait_event_info=134217755) at latch.c:407
    frame #4: 0x000055d2eb2958d9 postgres`WaitLatch(latch=0x00007f50e7d0e254, wakeEvents=33, timeout=0, wait_event_info=134217755) at latch.c:347
    frame #5: 0x000055d2eb29eb06 postgres`shm_mq_send_bytes(mqh=0x000055d2eca37bc0, nbytes=8, data=0x00007fffc6f05c38, nowait=false, bytes_written=0x00007fffc6f05b88) at shm_mq.c:976
    frame #6: 0x000055d2eb29dfb5 postgres`shm_mq_sendv(mqh=0x000055d2eca37bc0, iov=0x00007fffc6f05c00, iovcnt=1, nowait=false) at shm_mq.c:478
    frame #7: 0x000055d2eb29db40 postgres`shm_mq_send(mqh=0x000055d2eca37bc0, nbytes=8, data=0x00007fffc6f05c38, nowait=false) at shm_mq.c:328
    frame #8: 0x00007f50e8649bec pg_wait_sampling.so`send_profile(profile_hash=0x000055d2eca7ac58, mqh=0x000055d2eca37bc0) at collector.c:258
    frame #9: 0x00007f50e864a215 pg_wait_sampling.so`collector_main(main_arg=0) at collector.c:464

and holding the same lock as reported by @banlex73 and @PavelSorokin:

=# select d.datname, l.locktype, l."database", l.transactionid, l."mode", l.pid
   from pg_catalog.pg_locks l
   left join pg_catalog.pg_database d on d."oid" = l."database";
 datname | locktype   | database   | transactionid | mode            | pid
---------+------------+------------+---------------+-----------------+-------
 alexk   | relation   | 16384      | (null)        | AccessShareLock | 18054
 (null)  | virtualxid | (null)     | (null)        | ExclusiveLock   | 18054
 (null)  | relation   | 0          | (null)        | AccessShareLock | 18054
 (null)  | userlock   | 3398742279 | (null)        | ExclusiveLock   | 17890

Without cancelling the query it works well; you just have to wait out the extra sleep time, so these hacks do not break the collector logic. These repro steps may look purely synthetic, but I think the same state can be reached in the wild under the following conditions:
1. The profile size is really high, so it does not fit into the shm_mq at once (~16 KB).
2. The query requesting the profile is cancelled almost immediately after being issued.
Not sure that 1) is absolutely required, but without it the race window is much tighter. Eventually, the collector wants to put more data into the queue, but nobody is listening out there anymore. Anyway, this is the best hypothesis I have right now.
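
To make the hypothesis concrete, here is a minimal sketch of a collector-side send loop built on the core shm_mq API (illustrative only: the function and variable names are invented, this is not the extension's actual code). With nowait = false, shm_mq_send() blocks inside shm_mq_send_bytes() until the receiver frees queue space or detaches, so a backend that is cancelled without detaching leaves the sender waiting on its latch forever:

#include "postgres.h"
#include "storage/shm_mq.h"

/*
 * Illustrative sketch: push profile items into the shared message queue.
 * This mirrors frames #5..#8 of the stack trace above: with nowait = false
 * the send blocks until the receiver reads or detaches, and a receiver that
 * went away without detaching blocks it indefinitely.
 */
static void
send_items_sketch(shm_mq_handle *mqh, const uint64 *items, int nitems)
{
	for (int i = 0; i < nitems; i++)
	{
		shm_mq_result res;

		res = shm_mq_send(mqh, sizeof(uint64), &items[i], false);

		if (res == SHM_MQ_DETACHED)
			break;				/* receiver detached cleanly; stop sending */
	}
}

A clean fix therefore has to guarantee that the receiving backend always detaches from the queue, even on error or cancellation, so the blocked shm_mq_send() returns SHM_MQ_DETACHED instead of sleeping forever.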
Thank you Alexey. I have this on all my affected clusters:
pg_stat_statements.max = '10000'
pg_stat_statements.track = 'all'
Probably I need to reduce it to make the pg_wait_sampling extension work.
That's probably what happened to us too. I was trying to debug another issue in powa and must have Ctrl-C'ed a query (our view is really big); I have a tendency to realize I did something stupid the exact moment I've done it, and to immediately cancel what I've just asked. :)
I think I have managed to fix this issue. At least I have added a proper cleanup (shm_mq detach) on backend ERROR, FATAL, and Ctrl+C, so the collector can continue its operation in those cases. It is still not clear to me whether this was the original issue, so testing of the issue#29 branch is very welcome from everyone in this thread. UPD: I have tested it with PG 11-13 on my own.
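
For readers following along, here is a minimal sketch of that kind of cleanup (an assumption-laden illustration, not necessarily the actual patch; the function names are hypothetical). It uses the core PG_ENSURE_ERROR_CLEANUP machinery from storage/ipc.h so the backend detaches from the queue even when the query errors out or is cancelled:

#include "postgres.h"
#include "storage/ipc.h"
#include "storage/shm_mq.h"

/*
 * Hypothetical cleanup callback: detach from the collector's queue on
 * ERROR/FATAL. Detaching wakes the collector blocked in shm_mq_send(),
 * which then sees SHM_MQ_DETACHED instead of waiting forever.
 */
static void
profile_receive_cleanup(int code, Datum arg)
{
	shm_mq_handle *mqh = (shm_mq_handle *) DatumGetPointer(arg);

	shm_mq_detach(mqh);
}

/* Reader side: guarantee the detach even if the query is interrupted. */
static void
receive_profile_sketch(shm_mq_handle *mqh)
{
	PG_ENSURE_ERROR_CLEANUP(profile_receive_cleanup, PointerGetDatum(mqh));
	{
		/* ... shm_mq_receive() loop reading profile items ... */
	}
	PG_END_ENSURE_ERROR_CLEANUP(profile_receive_cleanup, PointerGetDatum(mqh));

	shm_mq_detach(mqh);			/* normal-path detach */
}

With something like this in place, a Ctrl+C in the backend turns the collector's blocked send into a SHM_MQ_DETACHED result, and it can move on to serving the next request.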
Thank you Alexey. I will try to test it by the end of the week.
Hi @ololobus
@banlex73, thanks for trying! OK, then I am going to merge the PR, since it looks like the right change anyway, and release a new version with PG13 support and these fixes.
This issue was closed automatically by merging the PR, but feel free to reopen it if the problem remains.
Thanks!
@ololobus, correct me if I am wrong: the latest release, 1.1.2, doesn't have this fix, correct?
Yes
@banlex73, I am still waiting for internal review. I really hope it will be done during the next week, so I will be able to roll out a new release.
Ok, thanks for the update.
LGTM. Sorry for the delay, let's produce the release now.
Thanks! Just released https://github.com/postgrespro/pg_wait_sampling/releases/tag/v1.1.3. Also added info about our release process (and
Great! Thanks!
@df7cb I pushed b91bd81 on the debian branch, but it seems the Debian tracker is looking inside master instead of debian: https://qa.debian.org/cgi-bin/vcswatch?package=pg-wait-sampling
Git: https://github.com/postgrespro/pg_wait_sampling.git
Branch: master
Path: debian/changelog
I don't know if I am right or how to change this. :-/
Thanks @anayrat!
and also added the missing
Great news! By the way, I am on CentOS 7 and yum cannot find 1.1.3 yet:
yum list pg_wait_sampling_12
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Installed Packages
pg_wait_sampling_12.x86_64    1.1.2-1.rhel7
@banlex73 can you try again? It seems it is available now: https://twitter.com/DevrimGunduz/status/1354589225937727490
Thanks a lot, I can see it now:
yum list pg_wait_sampling_12
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Available Packages
pg_wait_sampling_12.x86_64    1.1.3-1.rhel7
What: any session that tries to select from pg_wait_sampling_profile is blocked by the pg_wait_sampling collector.
Example:
  pid  | usename  | blocked_by | blocking_duration | blocked_query
-------+----------+------------+-------------------+------------------------------------------
 30120 | postgres | {8966}     | 7.279804          | select * FROM pg_wait_sampling_profile ;

ps -ef | grep 8966
postgres  8966  4118  5 Jul14 ?  6-05:23:43 postgres: nanp2: pg_wait_sampling collector

List of all locks that pid 8966 holds:
 datname | locktype | database   | transactionid | mode          | pid
---------+----------+------------+---------------+---------------+------
         | userlock | 3398742279 |               | ExclusiveLock | 8966
(1 row)
As you can see, there is no database with oid=3398742279.
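
A side note on the "missing" database: for userlock entries, the database column of pg_locks reports locktag_field1 of the lock tag rather than a real database OID, so 3398742279 (0xCA94B107) is presumably just a magic constant the extension uses to tag its lock. A hypothetical sketch of taking such a lock (the constant and the function name are invented for illustration, not copied from the extension's code):

#include "postgres.h"
#include "storage/lock.h"

/*
 * Hypothetical: acquire a cluster-wide userlock tagged with a magic number.
 * pg_locks shows locktag_field1 in its "database" column for userlocks,
 * which is why a value like 3398742279 looks like a bogus database OID.
 */
static void
acquire_collector_lock_sketch(void)
{
	LOCKTAG		tag;

	memset(&tag, 0, sizeof(tag));
	tag.locktag_field1 = 0xCA94B107;	/* invented magic for illustration */
	tag.locktag_type = LOCKTAG_USERLOCK;
	tag.locktag_lockmethodid = USER_LOCKMETHOD;

	(void) LockAcquire(&tag, ExclusiveLock, false, false);
}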
Environment:
pg_wait_sampling version 1.1
PostgreSQL 12.3
CentOS 7
Please let me know what else I need to share with you to help confirm and fix this (if it's a bug).
Thank you in advance.