php-fpm trying to kill another user's pool results in an infinite loop and 99% cpu usage #8072

Open
iamsyh opened this issue Feb 9, 2022 · 5 comments

iamsyh commented Feb 9, 2022

Description

Hi,

We've come across an interesting issue where a php-fpm pool process run by one user gets stuck restarting and tries to kill another user's php-fpm pool processes. A similar issue has been described here:

https://bugs.php.net/bug.php?id=74709

Can you please help us fix this? Here's a snippet of strace output for the process in question:

strace: Process 34545 attached
kill(32328, SIGTERM) = -1 EPERM (Operation not permitted)
fcntl(3, F_GETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1, l_len=1, l_pid=32328}) = 0
kill(32328, SIGTERM) = -1 EPERM (Operation not permitted)
fcntl(3, F_GETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=1, l_len=1, l_pid=32328}) = 0
[... the same kill()/fcntl() pair repeats for the rest of the captured trace ...]

The process traced above is owned by user "xyz", while the process it is trying to kill (process ID 32328) is owned by user "anc".

I would also like to add that this causes all of our websites that use the same PHP version to time out. For example, the process that is currently stuck and reported here is running PHP 8.0.15, so all of the websites on my server using PHP 8.0.15 are timing out.

PHP Version

8.0.15

Operating System

CloudLinux v7.9.0 (CentOS 7)

@cmb69
Member

cmb69 commented Feb 14, 2022

@bukka, could you please check this?

@bukka
Member

bukka commented Feb 15, 2022

I'm aware that there are issues with using opcache when there are pools with different users.

I think an ideal solution would be to separate the opcache shared memory for each pool, but that might be a bit tricky and is probably not applicable as a bug fix. It could potentially be done in FPM by having some pool process manager that would do MINIT for each pool, but that's not a small thing to do. I'm not sure whether opcache can currently handle it: we would probably have to make it aware of all pools when allocating shared memory, and then it would somehow need to select the right one in the child. That seems even more complicated and probably not optimal, but I'm not really an expert on opcache, so I might be wrong.

There might be other things that could be done to handle this specific case, but I think it would need to be done in opcache, as I don't think the process kill comes from FPM but from opcache. FPM kills processes from the master but not from the child (at least I can't remember any place where it would do so), so you would not likely see such an error, because the master will most likely be root in your case. The specific place where I think this might be happening is here:

static inline void kill_all_lockers(struct flock *mem_usage_check)
{
    int success, tries;
    /* so that other process won't try to force while we are busy cleaning up */
    ZCSG(force_restart_time) = 0;
    while (mem_usage_check->l_pid > 0) {
        /* Try SIGTERM first, switch to SIGKILL if not successful. */
        int signal = SIGTERM;
        errno = 0;
        success = 0;
        tries = 10;
        while (tries--) {
            zend_accel_error(ACCEL_LOG_WARNING, "Attempting to kill locker %d", mem_usage_check->l_pid);
            if (kill(mem_usage_check->l_pid, signal)) {
                if (errno == ESRCH) {
                    /* Process died before the signal was sent */
                    success = 1;
                    zend_accel_error(ACCEL_LOG_WARNING, "Process %d died before SIGKILL was sent", mem_usage_check->l_pid);
                } else if (errno != 0) {
                    zend_accel_error(ACCEL_LOG_WARNING, "Failed to send SIGKILL to locker %d: %s", mem_usage_check->l_pid, strerror(errno));
                }
                break;
            }
            /* give it a chance to die */
            usleep(20000);
            if (kill(mem_usage_check->l_pid, 0)) {
                if (errno == ESRCH) {
                    /* successfully killed locker, process no longer exists */
                    success = 1;
                    zend_accel_error(ACCEL_LOG_WARNING, "Killed locker %d", mem_usage_check->l_pid);
                } else if (errno != 0) {
                    zend_accel_error(ACCEL_LOG_WARNING, "Failed to check locker %d: %s", mem_usage_check->l_pid, strerror(errno));
                }
                break;
            }
            usleep(10000);
            /* If SIGTERM was not sufficient, use SIGKILL. */
            signal = SIGKILL;
        }
        if (!success) {
            /* errno is not ESRCH or we ran out of tries to kill the locker */
            ZCSG(force_restart_time) = time(NULL); /* restore forced restart request */
            /* cannot kill the locker, bail out with error */
            zend_accel_error_noreturn(ACCEL_LOG_ERROR, "Cannot kill process %d!", mem_usage_check->l_pid);
        }
        mem_usage_check->l_type = F_WRLCK;
        mem_usage_check->l_whence = SEEK_SET;
        mem_usage_check->l_start = 1;
        mem_usage_check->l_len = 1;
        mem_usage_check->l_pid = -1;
        if (fcntl(lock_file, F_GETLK, mem_usage_check) == -1) {
            zend_accel_error(ACCEL_LOG_DEBUG, "KLockers: %s (%d)", strerror(errno), errno);
            break;
        }
        if (mem_usage_check->l_type == F_UNLCK || mem_usage_check->l_pid <= 0) {
            break;
        }
    }
}
That's the only place where I see the kill, and judging by the code that function is called only on a forced restart. Do you by any chance use the opcache.force_restart_timeout directive?

As I said, I'm not really an expert on opcache, but @dstogov is, so he might have some ideas on how to address this.
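The quoted loop treats every kill() error other than ESRCH the same way, while the strace in the report shows EPERM specifically. As an illustration only, the sketch below separates the three outcomes so that a caller could tell "target is gone" apart from "target belongs to someone else". The enum and the `classify_kill` function are hypothetical names, not opcache code, and this is not a proposed patch.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical result type for a single kill() attempt. */
enum kill_result {
    KILL_SENT,       /* kill() returned 0 */
    KILL_GONE,       /* ESRCH: no such process */
    KILL_FORBIDDEN,  /* EPERM: process exists, caller may not signal it */
    KILL_FAILED      /* any other error */
};

/* Sends sig to pid and maps the outcome onto enum kill_result.
 * Passing sig == 0 only performs the existence and permission checks. */
static enum kill_result classify_kill(pid_t pid, int sig)
{
    if (kill(pid, sig) == 0) {
        return KILL_SENT;
    }
    switch (errno) {
    case ESRCH:
        return KILL_GONE;
    case EPERM:
        return KILL_FORBIDDEN;
    default:
        return KILL_FAILED;
    }
}
```

A caller that sees `KILL_FORBIDDEN` knows that retrying from the same unprivileged process cannot succeed, which is the situation an "xyz" worker is in when the locker is owned by "anc".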

@dstogov
Member

dstogov commented Feb 18, 2022

I'm afraid there is no easy way to fix this. opcache was designed long before FPM, and at that time we couldn't imagine that different workers might be owned by different users. Maybe it's possible to fork another "manager" process owned by root during MINIT and then kill children through it, but this definitely won't be merged into old PHP versions.
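To make the "manager" idea more concrete, here is a rough sketch of the general pattern: a helper process is forked early and later sends signals on behalf of workers that ask it to over a pipe. Everything here (`struct kill_request`, `manager_loop`, `start_manager`, `request_kill`) is hypothetical and not part of PHP. A real implementation would also have to fork the helper before privileges are dropped and validate that a requested PID really is an opcache locker, neither of which this sketch does.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical request sent from a worker to the manager. */
struct kill_request {
    pid_t pid;
    int sig;
};

/* Manager side: read requests until every writer has closed the pipe,
 * sending the requested signal for each one. No validation is done here. */
static void manager_loop(int request_fd)
{
    struct kill_request req;
    while (read(request_fd, &req, sizeof req) == (ssize_t)sizeof req) {
        if (req.pid > 0) {
            kill(req.pid, req.sig);
        }
    }
}

/* Forks the manager. Returns its PID and stores the write end of the
 * request pipe in *write_fd, or returns -1 on failure. */
static pid_t start_manager(int *write_fd)
{
    int fds[2];
    pid_t pid;

    if (pipe(fds) == -1) {
        return -1;
    }
    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[1]);          /* manager only reads */
        manager_loop(fds[0]);
        _exit(0);
    }
    close(fds[0]);              /* caller only writes */
    *write_fd = fds[1];
    return pid;
}

/* Worker side: ask the manager to send sig to pid. Returns 0 on success.
 * The request is far smaller than PIPE_BUF, so the write is atomic. */
static int request_kill(int write_fd, pid_t pid, int sig)
{
    struct kill_request req = { pid, sig };
    return write(write_fd, &req, sizeof req) == (ssize_t)sizeof req ? 0 : -1;
}
```

Because the signal is sent by the manager rather than by the requesting worker, the permission check in kill() is made against the manager's credentials, which is the point of the suggestion above.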

@gersonfs

Same problem here but with PHP 7.4.33

@henri9813

Hello,

Any updates here?

I have the same problem. How can we solve this issue?
