
pg_wait_sampling collector blocks "select * FROM pg_wait_sampling_profile;" after a database was dropped from the cluster #29

Closed
banlex73 opened this issue Nov 6, 2020 · 30 comments · Fixed by #31


@banlex73

banlex73 commented Nov 6, 2020

What: any session that tries to select from pg_wait_sampling_profile is blocked by the pg_wait_sampling collector.
Example:
pid | usename | blocked_by | blocking_duration | blocked_query
-------+----------+------------+-------------------+-----------------------------------------------------
30120 | postgres | {8966} | 7.279804 | select * FROM pg_wait_sampling_profile ;

ps -ef|grep 8966
postgres 8966 4118 5 Jul14 ? 6-05:23:43 postgres: nanp2: pg_wait_sampling collector

List all locks that pid 8966 holds:

select d.datname , l.locktype, l."database", l.transactionid , l."mode", l.pid  from pg_catalog.pg_locks l 
left join pg_catalog.pg_database d on d."oid" = l."database" 
where l."database" is not null and l."database" <>0
and pid=8966;

datname | locktype | database | transactionid | mode | pid
---------+----------+------------+---------------+---------------+------
| userlock | 3398742279 | | ExclusiveLock | 8966
(1 row)
As you can see, there is no database with oid=3398742279.
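
As an aside, that value may not be a database oid at all: for userlock entries the "database" column of pg_locks appears to carry the first field of the lock tag rather than a real database oid, and 3398742279 is 0xca94b107 in hex, which looks like an extension magic constant. That interpretation is an assumption, not verified against the pg_wait_sampling sources; the hex conversion itself is trivial to check:

#include <stdio.h>

int
main(void)
{
    /* 3398742279 decimal prints as 0xca94b107 */
    printf("%#x\n", 3398742279u);
    return 0;
}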

Environment:
pg_wait_sampling version 1.1
Postgres 12.3
CentOS 7

Please let me know what else I need to share with you to help confirm and fix this (if it's a bug).

Thank you in advance.

@PavelSorokin

PavelSorokin commented Nov 6, 2020

Good afternoon, I've run into the same problem with Postgres Pro Standard 11.9 on RHEL 7 (on two different servers), pg_wait_sampling version 1.1. All queries from mamonsu are being blocked.
(screenshot of the blocked queries)

@ololobus
Contributor

Hi, I am not sure whether it is a bug or some initial architecture problem.

@banlex73, the non-existent database Oid looks weird, of course. Is it blocked forever by the collector? Or is it still possible to select from pg_wait_sampling_profile from time to time?

@PavelSorokin, if you have an active PostgresPro support, you can contact them as well.

Anyway, we will try to look into this issue more closely.

@banlex73
Author

Hi Alexey @ololobus, I've just checked one of my clusters and can confirm that the LOCK is PERMANENT. I tried to select * from pg_wait_sampling_profile and it was blocked by the collector.
Hope it helps.
PS: Please feel free to contact me if you need anything else to investigate this issue.

@ololobus
Contributor

ololobus commented Nov 30, 2020

I tried today to reproduce this issue with pg_wait_sampling at its stable branch state, on Postgres REL_12_STABLE, and I cannot reproduce it.

What I have tried so far is to repeat the following steps (a couple dozen times):

  1. Put pg_wait_sampling into shared_preload_libraries and create extension.
  2. Do select * from pg_wait_sampling_profile;.
  3. Create extra test database and pg_wait_sampling extension there.
  4. Do select * from pg_wait_sampling_profile; again with or without additional payload (checkpoints, create table, etc).
  5. Drop test database.
  6. Do select * from pg_wait_sampling_profile; a couple of times again. IIUC, it should get stuck, but it works for me.

Am I missing something? @banlex73, maybe you (or @PavelSorokin) can provide more specific steps so I can reproduce this issue?

P.S. I accidentally closed this issue with some weird keyboard shortcut while typing this comment, but I reopened it immediately, do not worry.

@ololobus ololobus reopened this Nov 30, 2020
@banlex73
Author

banlex73 commented Dec 1, 2020 via email

@banlex73
Author

banlex73 commented Dec 9, 2020

Quick update
I've been running a script to reproduce this issue for 5 days already in a loop, no luck.
What I am doing:
In a loop,

  • create database
  • generate load (bulk insert into a table, several heavy selects)
  • drop database

Another session periodically tries to select data from pg_wait_sampling_profile.
Everything works fine so far, nothing gets blocked.
I will run my test for 2 more days and then give up.

@marco44

marco44 commented Dec 10, 2020

Hi, I think we're having the exact same problem: we're waiting on a lock and cannot use the pg_wait_sampling_profile view anymore. We cannot use pg_wait_sampling_history either. pg_wait_sampling_current is still accessible. Here is the lock we are stuck on:

 locktype |  database  | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |     mode      | granted | fastpath 
----------+------------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------+---------+----------
 userlock | 3398742279 |          |      |       |            |               |       1 |     0 |        0 | 2/0                | 30754 | ExclusiveLock | t       | f

No link with drop database though, as we don't drop databases. For now, we have a stack backtrace of when this occurs.

(gdb) bt
#0  0x00007fbe92b9d7b7 in epoll_wait (epfd=3, events=0x55f4a34f8d00, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x000055f4a2cda169 in WaitEventSetWait ()
#2  0x000055f4a2cda609 in WaitLatchOrSocket ()
#3  0x000055f4a2cdf8bf in ?? ()
#4  0x000055f4a2cdfc10 in shm_mq_sendv ()
#5  0x000055f4a2cdfdb0 in shm_mq_send ()
#6  0x00007fbe8ff1c51f in send_profile (mqh=0x55f4a34fae70, profile_hash=0x55f4a3554f70) at collector.c:266
#7  collector_main (main_arg=<optimized out>) at collector.c:459
#8  0x000055f4a2c74fbe in StartBackgroundWorker ()
#9  0x000055f4a2c82328 in ?? ()
#10 0x000055f4a2c8261f in ?? ()
#11 <signal handler called>
#12 0x00007fbe92b94ff7 in __GI___select (nfds=6, readfds=0x7ffc19870b00, writefds=0x0, exceptfds=0x0, timeout=0x7ffc19870a60) at ../sysdeps/unix/sysv/linux/select.c:41
#13 0x000055f4a2c83426 in ?? ()
#14 0x000055f4a2c84faf in PostmasterMain ()
#15 0x000055f4a29eeeed in main ()

If you'd rather have me open another issue, please tell me.

@anayrat
Collaborator

anayrat commented Dec 11, 2020

FYI, I'm pretty sure we already hit this bug in the past (~2 years ago) on the Powa demo.

@ololobus
Contributor

ololobus commented Dec 11, 2020

OK, I have spent some time today digging into the code, and I can now reliably reproduce this issue (I hope).

To reproduce the same collector hang with 100% certainty, one has to:

  1. Reduce shm_mq size COLLECTOR_QUEUE_SIZE to e.g. 64 bytes
  2. Put a sleep just before send_profile(); I used pg_usleep(1000L*1000L*60); (see the sketch after this list)
  3. Recompile and install pg_wait_sampling extension.
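
A rough sketch of what those two local hacks could look like in collector.c; the constant name and sleep value come from the steps above, the exact placement is an assumption, and none of this is committed code:

/* 1. Shrink the collector's shm_mq so a whole profile cannot be sent in one pass. */
#define COLLECTOR_QUEUE_SIZE    64          /* bytes, instead of the usual ~16 KB */

/* 2. In collector_main(), right before the existing send_profile() call,
 *    sleep long enough to cancel the client query by hand. */
pg_usleep(1000L * 1000L * 60);              /* 60 seconds */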

Then just open a psql session, do select * from pg_wait_sampling_get_profile();, wait a couple of seconds, and cancel this query. That is all: the collector will hang forever with the same stack trace as provided by @marco44:

* thread #1, name = 'postgres', stop reason = signal SIGSTOP
  * frame #0: 0x00007f50e83b87b7 libc.so.6`epoll_wait(epfd=3, events=0x000055d2eca369b8, maxevents=1, timeout=-1) at epoll_wait.c:30
    frame #1: 0x000055d2eb296483 postgres`WaitEventSetWaitBlock(set=0x000055d2eca36940, cur_timeout=-1, occurred_events=0x00007fffc6f05a60, nevents=1) at latch.c:1080
    frame #2: 0x000055d2eb29635c postgres`WaitEventSetWait(set=0x000055d2eca36940, timeout=-1, occurred_events=0x00007fffc6f05a60, nevents=1, wait_event_info=134217755) at latch.c:1032
    frame #3: 0x000055d2eb295a72 postgres`WaitLatchOrSocket(latch=0x00007f50e7d0e254, wakeEvents=33, sock=-1, timeout=-1, wait_event_info=134217755) at latch.c:407
    frame #4: 0x000055d2eb2958d9 postgres`WaitLatch(latch=0x00007f50e7d0e254, wakeEvents=33, timeout=0, wait_event_info=134217755) at latch.c:347
    frame #5: 0x000055d2eb29eb06 postgres`shm_mq_send_bytes(mqh=0x000055d2eca37bc0, nbytes=8, data=0x00007fffc6f05c38, nowait=false, bytes_written=0x00007fffc6f05b88) at shm_mq.c:976
    frame #6: 0x000055d2eb29dfb5 postgres`shm_mq_sendv(mqh=0x000055d2eca37bc0, iov=0x00007fffc6f05c00, iovcnt=1, nowait=false) at shm_mq.c:478
    frame #7: 0x000055d2eb29db40 postgres`shm_mq_send(mqh=0x000055d2eca37bc0, nbytes=8, data=0x00007fffc6f05c38, nowait=false) at shm_mq.c:328
    frame #8: 0x00007f50e8649bec pg_wait_sampling.so`send_profile(profile_hash=0x000055d2eca7ac58, mqh=0x000055d2eca37bc0) at collector.c:258
    frame #9: 0x00007f50e864a215 pg_wait_sampling.so`collector_main(main_arg=0) at collector.c:464

and holding the same lock as reported by @banlex73 and @PavelSorokin:

=# select d.datname , l.locktype, l."database", l.transactionid , l."mode", l.pid  from pg_catalog.pg_locks l 
left join pg_catalog.pg_database d on d."oid" = l."database";
 datname |  locktype  |  database  | transactionid |      mode       |  pid  
---------+------------+------------+---------------+-----------------+-------
 alexk   | relation   |      16384 |        (null) | AccessShareLock | 18054
 (null)  | virtualxid |     (null) |        (null) | ExclusiveLock   | 18054
 (null)  | relation   |          0 |        (null) | AccessShareLock | 18054
 (null)  | userlock   | 3398742279 |        (null) | ExclusiveLock   | 17890

Without cancelling the query it works well; you just have to wait out the extra sleep time. So these hacks do not break the collector logic by themselves.

These repro steps may look purely synthetic, but I think the same situation can be reached in the wild under the following conditions:

  1. Profile size is really high, so it does not fit into the shm_mq size at once (~16 KB).
  2. The query requesting the profile was cancelled almost immediately after being issued.

I am not sure that 1) is absolutely required, but without it the race window is much tighter. Eventually, the collector wants to put more data into the queue, but nobody is listening on the other side anymore.
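
If that hypothesis holds, the hang is just a blocking send with no reader left. A minimal sketch of the collector-side failure mode, assuming a blocking shm_mq_send() as the stack traces show (mqh, item and item_len are placeholder names, and this is not the actual collector.c code):

shm_mq_result res;

/* nowait = false: block until the whole item has been written to the queue */
res = shm_mq_send(mqh, item_len, item, false);

/*
 * With nowait = false this returns SHM_MQ_SUCCESS once the receiver has
 * consumed enough of the queue for the message to fit, or SHM_MQ_DETACHED
 * if the receiver detached.  A backend that was cancelled without detaching
 * does neither, so the collector sleeps on its latch inside
 * shm_mq_send_bytes() forever, which is exactly the frames above.
 */
if (res != SHM_MQ_SUCCESS)
    return;     /* the reader is gone; drop the rest of the profile */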

Anyway, this is the best hypothesis I have right now.

@banlex73
Author

banlex73 commented Dec 11, 2020 via email

@marco44

marco44 commented Dec 12, 2020

That's probably what happened to us too. I was trying to debug another issue in powa and must have Ctrl-C'ed a query (our view is really big); I have a tendency to realize I did something stupid the exact moment I've done it, and to immediately cancel what I've just asked for :)

@ololobus
Contributor

ololobus commented Dec 15, 2020

I think I have managed to fix this issue. At least I have added proper cleanup (shm_mq detach) on backend ERROR, FATAL, and Ctrl+C, so the collector can continue its operation in this case.

It is still not clear to me whether this was the original issue, so testing of the issue#29 branch is very welcome from everyone in this thread.

UPD: I have tested it with PG11-13 on my own.
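
For anyone curious what such a cleanup can look like on the backend side, here is a minimal sketch built on PostgreSQL's PG_ENSURE_ERROR_CLEANUP machinery and shm_mq_detach(); the function name is illustrative and this is not necessarily the exact patch in the issue#29 branch:

#include "postgres.h"
#include "storage/ipc.h"
#include "storage/shm_mq.h"

/* Detach our side of the queue so the collector's blocking send gives up. */
static void
pgws_cleanup_callback(int code, Datum arg)
{
    shm_mq_handle *mqh = (shm_mq_handle *) DatumGetPointer(arg);

    shm_mq_detach(mqh);
}

/* ... in the backend function that reads the profile from the collector ... */
PG_ENSURE_ERROR_CLEANUP(pgws_cleanup_callback, PointerGetDatum(mqh));
{
    /* existing shm_mq_receive() loop goes here */
}
PG_END_ENSURE_ERROR_CLEANUP(pgws_cleanup_callback, PointerGetDatum(mqh));

With the detach in place, the collector's shm_mq_send() returns SHM_MQ_DETACHED instead of waiting forever, so it can drop the unread profile and carry on.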

@banlex73
Author

banlex73 commented Dec 15, 2020 via email

@banlex73
Author

Hi @ololobus
I've been running this build for 2 weeks without any issues, so we are probably good now.
Thank you for your effort.

@ololobus
Contributor

@banlex73, thanks for trying! OK, then I am going to merge the PR, since it looks like the right change anyway, and release a new version with PG13 support and these fixes.

@ololobus
Contributor

This issue was closed automatically by merging the PR, but feel free to reopen it if the problem remains.

@banlex73
Author

banlex73 commented Jan 12, 2021 via email

@banlex73
Author

@ololobus, correct me if I am wrong: the latest release, 1.1.2, doesn't have this fix, correct?

@anayrat
Collaborator

anayrat commented Jan 13, 2021

Yes

@banlex73
Author

@anayrat @ololobus do you have an ETA for when a new version will be released?

@ololobus
Contributor

@banlex73, I am still waiting for internal review. I really hope it will be done during the next week, so I will be able to roll out a new release.

@banlex73
Author

banlex73 commented Jan 22, 2021 via email

@arssher

arssher commented Jan 27, 2021

LGTM. Sorry for the delay, let's produce the release now.

@ololobus
Contributor

Thanks!

Just released https://github.com/postgrespro/pg_wait_sampling/releases/tag/v1.1.3. I have also added info about our release process (and Debian packaging) to README.md.

And merged stable into debian. FYI, @anayrat, @df7cb

@anayrat
Collaborator

anayrat commented Jan 27, 2021

Great! thanks!

@anayrat
Collaborator

anayrat commented Jan 27, 2021

@df7cb I pushed b91bd81 to the debian branch, but it seems the Debian tracker is looking at master instead of debian:
https://qa.debian.org/cgi-bin/vcswatch?package=pg-wait-sampling

Git: https://github.com/postgrespro/pg_wait_sampling.git
Branch: master
Path: debian/changelog

I don't know if I am right or how to change this :-/

@df7cb
Collaborator

df7cb commented Jan 27, 2021

Thanks @anayrat!
I fixed the branch name via debian/control(.in):

Vcs-Git: https://github.com/postgrespro/pg_wait_sampling.git -b debian

and also added the missing -1 to the version number in debian/changelog.
Uploads to apt.postgresql.org and Debian are on the way.

@banlex73
Author

banlex73 commented Jan 27, 2021 via email

@anayrat
Collaborator

anayrat commented Jan 28, 2021

@banlex73 can you try again? It seems it is available now: https://twitter.com/DevrimGunduz/status/1354589225937727490

@banlex73
Author

banlex73 commented Jan 28, 2021 via email
