Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation Fault caused on PHP #9446

Closed
JimmyBanks opened this issue Jan 27, 2022 · 31 comments · Fixed by #9995
Closed

Segmentation Fault caused on PHP #9446

JimmyBanks opened this issue Jan 27, 2022 · 31 comments · Fixed by #9995
Assignees

Comments

@JimmyBanks
Copy link

What version of protobuf and what language are you using?
Protobuf 3.19.3 installed using pecl
PHP8.1.2 using PHP-FPM

What operating system (Linux, Windows, ...) and version?
Ubuntu 18.04

What runtime / compiler are you using (e.g., python version or gcc version)
c version installed with pecl

What did you do?

  1. Install:
Steps to reproduce the behavior:
sudo pecl install grpc
sudo pecl install protobuf
echo "extension=grpc.so" | sudo tee -a /etc/php/8.1/fpm/conf.d/grpc.ini
echo "extension=protobuf.so" | sudo tee -a /etc/php/8.1/fpm/conf.d/protobuf.ini
  1. Release to production and higher traffic causes the segmentation fault. All appeared functional on staging with limited traffic

What did you expect to see
No segfault as this breaks the request

What did you see instead?
Seg faults listed in php8.1-fpm.log

[21-Jan-2022 16:17:11] WARNING: [pool www] child 3379 exited on signal 11 (SIGSEGV - core dumped) after 430.745632 seconds from start
[21-Jan-2022 16:17:11] NOTICE: [pool www] child 3603 started
[21-Jan-2022 16:17:17] WARNING: [pool www] child 3120 exited on signal 11 (SIGSEGV - core dumped) after 706.382257 seconds from start

Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).

Anything else we should know about your project / environment
Stack: Ubuntu + Nginx + PHP-FPM

Backtraces:

Core was generated by `php-fpm: pool www                                                            '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f72134fa743 in Arena_Get (val=val@entry=0x7f7226552c18)
    at /tmp/pear/temp/protobuf/arena.c:72
72      /tmp/pear/temp/protobuf/arena.c: No such file or directory.
[Current thread is 1 (Thread 0x7f722c618500 (LWP 1938))]
(gdb) bt
#0  0x00007f72134fa743 in Arena_Get (val=val@entry=0x7f7226552c18)
    at /tmp/pear/temp/protobuf/arena.c:72
#1  0x00007f7213506064 in zim_Message___construct (
    execute_data=0x7f72264148c0, return_value=<optimized out>)
    at /tmp/pear/temp/protobuf/message.c:579
#2  0x000055ce3ea58db2 in execute_ex ()
#3  0x000055ce3ea5920d in zend_execute ()
#4  0x000055ce3e9e69c4 in zend_execute_scripts ()
#5  0x000055ce3e982791 in php_execute_script ()
#6  0x000055ce3e81dbbd in ?? ()
#7  0x00007f72299a9bf7 in __libc_start_main (main=0x55ce3e81cda0, argc=4,
    argv=0x7ffe9219a1e8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7ffe9219a1d8)
    at ../csu/libc-start.c:310
#8  0x000055ce3e81e9ea in _start ()
Core was generated by `php-fpm: pool www                                                            '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  _upb_arenahas (a=<optimized out>) at /tmp/pear/temp/protobuf/php-upb.h:448
448     /tmp/pear/temp/protobuf/php-upb.h: No such file or directory.
[Current thread is 1 (Thread 0x7f722c618500 (LWP 1959))]
(gdb) bt
#0  _upb_arenahas (a=<optimized out>) at /tmp/pear/temp/protobuf/php-upb.h:448
#1  upb_arena_malloc (size=<optimized out>, a=<optimized out>)
    at /tmp/pear/temp/protobuf/php-upb.h:456
#2  _upb_msg_new_inl (a=0x78, l=0x55ce40ce37b0)
    at /tmp/pear/temp/protobuf/php-upb.h:1211
#3  _upb_msg_new (l=0x55ce40ce37b0, a=0x78)
    at /tmp/pear/temp/protobuf/php-upb.c:1493
#4  0x00007f7213513435 in upb_msg_new (m=<optimized out>, a=<optimized out>)
    at /tmp/pear/temp/protobuf/php-upb.c:7047
#5  0x00007f72134ff9c5 in Message_Initialize (
    intern=intern@entry=0x7f7226402fc8, desc=desc@entry=0x7f72264728c0)
    at /tmp/pear/temp/protobuf/message.c:565
#6  0x00007f72135060a3 in zim_Message___construct (
    execute_data=0x7f7226413e40, return_value=<optimized out>)
    at /tmp/pear/temp/protobuf/message.c:602
#7  0x000055ce3ea58db2 in execute_ex ()
#8  0x000055ce3ea5920d in zend_execute ()
#9  0x000055ce3e9e69c4 in zend_execute_scripts ()
#10 0x000055ce3e982791 in php_execute_script ()
#11 0x000055ce3e81dbbd in ?? ()
#12 0x00007f72299a9bf7 in __libc_start_main (main=0x55ce3e81cda0, argc=4,
    argv=0x7ffe9219a1e8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7ffe9219a1d8)
@haberman
Copy link
Member

I wonder if this is fixed by: #9370

Do you have any way of trying that out?

@JimmyBanks
Copy link
Author

I wonder if this is fixed by: #9370

Do you have any way of trying that out?

No unfortunately I don't.

I assume those commits will be in the next point release, is there a timeline in mind, I can test once it's on packagist

@radykal-com
Copy link

Hey @haberman ,

we've compiled the version from master and the segfault is still there.

Our runtime:
Ubuntu 20.04 PHP8.1.2 using PHP-FPM

@averges
Copy link

averges commented Feb 4, 2022

I wonder if this is fixed by: #9370

Do you have any way of trying that out?

I'm afraid it is not fixed yet. Tested #9370 on Ubuntu 20.04.3 LTS, PHP 8.1 (php-fpm)

@JimmyBanks
Copy link
Author

@haberman it's been a month since the bug was posted, and three weeks since it's been identified that #9370 does not resolve the issue.

Can you shed some light on if there is a fix in the works?

@haberman
Copy link
Member

haberman commented Mar 4, 2022

Sorry for the delay on this.

So far it sounds like all of the reports are for PHP 8.1, and only occur under heavy load. Is that correct?

I will see if any memory errors show up in a Valgrind run with PHP 8.1.

@haberman
Copy link
Member

haberman commented Mar 4, 2022

It also looks like all reports so far are using php-fpm. Does this reproduce without php-fpm? Is that feasible to test?

So far this does not trigger any Valgrind warnings or errors under any of our tests. I'm not sure how to reproduce this.

@k2w4t4h
Copy link

k2w4t4h commented Mar 4, 2022

This problem occurs only when opcache is enabled and php is under heavy load.
It looks having no difference between JIT is enabled or disabled.
So, disabling opcache is a simple quick fix.

@haberman
Copy link
Member

haberman commented Mar 4, 2022

If this only occurs on 8.1, and only with opcache enabled, it seems like something changed on the PHP side.

It seems like perhaps the message is being destroyed between Message_create() and zim_Message___construct(). Or perhaps it's being constructed without being created first.

@k2w4t4h
Copy link

k2w4t4h commented Mar 5, 2022

As you have mentioned, PHP 8.1 introduced some opcache features, such as inheritance cache.
They might affect Protocol Buffers extension.

In my case, only constructing a new proto object without any parameters under a lot of concurrent access produces a crash.

I'm trying to make a simple example project, but it is difficult.

@JimmyBanks
Copy link
Author

JimmyBanks commented Mar 25, 2022

Hi @haberman, I see [v3.20.0-rc3] has a huge list of included fixes. I've read through it and see no reference of this issue, so I assume 3.20 will not resolve this segmentation fault on 8.1?

@haberman
Copy link
Member

That is correct, there has been no movement on this issue.

We have not received a repro for this issue, and the fact that the issue only occurs with opcache enabled suggests to me that something has changed on the PHP side -- either a bug or an intended semantic change that breaks protobuf somehow.

Unless PHP 8.1 has introduced a known/documented change in the C API semantics, I'm inclined to think this is a bug on the PHP side.

The only somewhat unusual thing we are doing (that I know of) is this. If the PHP 8.1 opcache breaks this behavior somehow, perhaps we need to stop doing this:

static void Message_SuppressDefaultProperties(zend_class_entry *class_type) {
// We suppress all default properties, because all our properties are handled
// by our read_property handler.
//
// This also allows us to put our zend_object member at the beginning of our
// struct -- instead of putting it at the end with pointer fixups to access
// our own data, as recommended in the docs -- because Zend won't add any of
// its own storage directly after the zend_object if default_properties_count
// == 0.
//
// This is not officially supported, but since it simplifies the code, we'll
// do it for as long as it works in practice.
class_type->default_properties_count = 0;
}

@remicollet may have ideas here.

@remicollet
Copy link

// This also allows us to put our zend_object member at the beginning of our

What is the benefit of keeping suck hack ?
zend_object should be the last entry since 7.0, and minimal supported version is 7.0... (according to package.xml)

@haberman
Copy link
Member

If we think that this SEGV might possibly be related to the zend_object thing, I'm happy to change it.

In the past, putting zend_object first has helped us detect and disable cases where PHP was adding unnecessary and unwanted storage into our objects, eg. property storage that we do not need or want.

@CxDevLead
Copy link

CxDevLead commented May 6, 2022

My team cannot update to 8.1 until this issue has been fixed. It's now over 6 months since the release of 8.1. What progress has been made, should we expect an update soon?

We will migrate off protobuf if there is no plans to resolve the issue. Deactivating OPCache is not feasible.

Thank you. @haberman @remicollet

@haberman
Copy link
Member

haberman commented May 6, 2022

To date we have still not received a repro. Without a repro, the best we can do is guess. The best way to help accelerate a fix is to provide us with a self-contained repro.

We can try making the change I outlined in #9446 (comment), but that idea is based on guesswork. If it does not fix the problem, we are back to square one.

I will work on implementing the idea in #9446 (comment) and hope it helps. Hopefully we can get it into the release that will be going out in the next week or two.

@haberman
Copy link
Member

haberman commented May 6, 2022

After starting the implementation of this, I feel some more confidence that this theory is correct, and that the idea mentioned above will fix these SEGVs.

On the downside, the simple fix will also increase memory usage, because we will be allocating storage for message fields at both the PHP and the C levels. This is a side effect of the fact that we use the same generated code for both pure-PHP and C-accelerated PHP, so the generated classes always have PHP-level properties for all message fields.

Previously we were suppressing the PHP-level properties when using the C extension, but this hack appears to be incompatible with OPCache. So we will have to allocate memory for the PHP properties even when using the C extension, which will increase memory usage until we can change the generated code.

haberman added a commit to haberman/protobuf that referenced this issue May 6, 2022
There is a chance this may fix some SEGVs seen on PHP 8.1 when
OPCache is enabled:
  protocolbuffers#9446 (comment)

However this also will increase memory usage of PHP protos, and
will likely introduce CPU overhead also.
@haberman
Copy link
Member

haberman commented May 6, 2022

I have implemented the change I described above in #9930

Can anybody who is experiencing this problem try this out to verify whether it fixes the SEGVs? Unfortunately it will probably have some CPU/memory overhead also, so it could cause regressions if deployed to prod. But if there is a test cluster that reliably reproduces the behavior, a test there would be very helpful.

If this indeed addresses the root cause of the issue, we should rework the generated code so that we get the fix without the overhead.

@radykal-com
Copy link

@haberman we can give it a try next week. We were experiencing the segfaults in a sustained way when using PHP 8.1. I'll come back with news.

@radykal-com
Copy link

We've tried it today and the segmentation faults are still there.

We are going to try to create a reproducible scenario in a docker-compose stack.

@averges
Copy link

averges commented May 10, 2022

To date we have still not received a repro. Without a repro, the best we can do is guess. The best way to help accelerate a fix is to provide us with a self-contained repro.

We can try making the change I outlined in #9446 (comment), but that idea is based on guesswork. If it does not fix the problem, we are back to square one.

I will work on implementing the idea in #9446 (comment) and hope it helps. Hopefully we can get it into the release that will be going out in the next week or two.

We manage to reproduce the segmentation fault in a small docker environment. https://github.com/stayforlong/protobuf_sigsegv_poc

Hope this helps.

@haberman
Copy link
Member

@radykal-com Thanks for giving it a try. Unfortunately it looks like it didn't fix the problem.

@averges thank you for the repro! I was unfortunately not able to reproduce the SEGV. On my machine I get:

Successfully tagged protobuf_sigsegv_poc_ab:latest
WARNING: Image for service ab was built because it did not already exist. To rebuild this image you must use `docker-compose build` or `docker-compose up --build`.
Creating protobuf_sigsegv_poc_php-fpm_1 ... done
Creating protobuf_sigsegv_poc_nginx_1   ... done

ERROR: for ab  Container "ec55d3f684db" is unhealthy.
ERROR: Encountered errors while bringing up the project.
make: *** [Makefile:2: run] Error 1
$ make log
docker-compose logs php-fpm | grep SIGSEGV
make: *** [Makefile:5: log] Error 1
$ docker-compose logs php-fpm
Attaching to protobuf_sigsegv_poc_php-fpm_1
php-fpm_1  | [10-May-2022 23:02:02] NOTICE: fpm is running, pid 7
php-fpm_1  | [10-May-2022 23:02:02] NOTICE: ready to handle connections
$

The full logs do not seem to mention a segv or any other error.

@radykal-com
Copy link

radykal-com commented May 11, 2022

@haberman I've tried the repro from @averges and even is not happening 100% of the time, it does reproduce the SEGV most of the times. I did not get the error you showed (AB container unhealthy error) once.

From 8 executions, 5 executions did reproduce the error and 3 not. Maybe if the ab runs longer or with more concurrency will trigger the error more often. It seems like some kind of race condition, so a 100% reproducible case is gonna be imposible to create.

$ make run && make log && make clean
docker-compose up -d
protobuf_sigsegv_poc_php-fpm_1 is up-to-date
protobuf_sigsegv_poc_nginx_1 is up-to-date
Starting protobuf_sigsegv_poc_ab_1 ... done
docker-compose logs php-fpm | grep SIGSEGV
php-fpm_1  | [11-May-2022 06:55:46] WARNING: [pool www] child 10 exited on signal 11 (SIGSEGV) after 3.598312 seconds from start
php-fpm_1  | [11-May-2022 06:55:46] WARNING: [pool www] child 9 exited on signal 11 (SIGSEGV) after 3.647975 seconds from start
php-fpm_1  | [11-May-2022 06:55:49] WARNING: [pool www] child 11 exited on signal 11 (SIGSEGV) after 2.871727 seconds from start
docker-compose down
Stopping protobuf_sigsegv_poc_nginx_1   ... done
Stopping protobuf_sigsegv_poc_php-fpm_1 ... done
Removing protobuf_sigsegv_poc_ab_1      ... done
Removing protobuf_sigsegv_poc_nginx_1   ... done
Removing protobuf_sigsegv_poc_php-fpm_1 ... done
Removing network protobuf_sigsegv_poc_default

$ make run && make log && make clean
docker-compose up -d
Creating network "protobuf_sigsegv_poc_default" with the default driver
Creating protobuf_sigsegv_poc_nginx_1   ... done
Creating protobuf_sigsegv_poc_php-fpm_1 ... done
Creating protobuf_sigsegv_poc_ab_1      ... done
docker-compose logs php-fpm | grep SIGSEGV
make: *** [log] Error 1

$ make run && make log && make clean
docker-compose up -d
protobuf_sigsegv_poc_nginx_1 is up-to-date
protobuf_sigsegv_poc_php-fpm_1 is up-to-date
Starting protobuf_sigsegv_poc_ab_1 ... done
docker-compose logs php-fpm | grep SIGSEGV
make: *** [log] Error 1

$ make run && make log; make clean
docker-compose up -d
protobuf_sigsegv_poc_nginx_1 is up-to-date
protobuf_sigsegv_poc_php-fpm_1 is up-to-date
Starting protobuf_sigsegv_poc_ab_1 ... done
docker-compose logs php-fpm | grep SIGSEGV
php-fpm_1  | [11-May-2022 06:56:09] WARNING: [pool www] child 9 exited on signal 11 (SIGSEGV) after 15.395452 seconds from start
php-fpm_1  | [11-May-2022 06:56:09] WARNING: [pool www] child 11 exited on signal 11 (SIGSEGV) after 0.040299 seconds from start
docker-compose down
Stopping protobuf_sigsegv_poc_nginx_1   ... done
Stopping protobuf_sigsegv_poc_php-fpm_1 ... done
Removing protobuf_sigsegv_poc_ab_1      ... done
Removing protobuf_sigsegv_poc_nginx_1   ... done
Removing protobuf_sigsegv_poc_php-fpm_1 ... done
Removing network protobuf_sigsegv_poc_default

$ make run && make log; make clean
docker-compose up -d
Creating network "protobuf_sigsegv_poc_default" with the default driver
Creating protobuf_sigsegv_poc_nginx_1   ... done
Creating protobuf_sigsegv_poc_php-fpm_1 ... done
Creating protobuf_sigsegv_poc_ab_1      ... done
docker-compose logs php-fpm | grep SIGSEGV
make: *** [log] Error 1
docker-compose down
Stopping protobuf_sigsegv_poc_nginx_1   ... done
Stopping protobuf_sigsegv_poc_php-fpm_1 ... done
Removing protobuf_sigsegv_poc_ab_1      ... done
Removing protobuf_sigsegv_poc_nginx_1   ... done
Removing protobuf_sigsegv_poc_php-fpm_1 ... done
Removing network protobuf_sigsegv_poc_default

$ make run && make log; make clean
docker-compose up -d
Creating network "protobuf_sigsegv_poc_default" with the default driver
Creating protobuf_sigsegv_poc_nginx_1   ... done
Creating protobuf_sigsegv_poc_php-fpm_1 ... done
Creating protobuf_sigsegv_poc_ab_1      ... done
docker-compose logs php-fpm | grep SIGSEGV
php-fpm_1  | [11-May-2022 06:56:24] WARNING: [pool www] child 10 exited on signal 11 (SIGSEGV) after 3.446749 seconds from start
php-fpm_1  | [11-May-2022 06:56:24] WARNING: [pool www] child 11 exited on signal 11 (SIGSEGV) after 0.046907 seconds from start
docker-compose down
Stopping protobuf_sigsegv_poc_php-fpm_1 ... done
Stopping protobuf_sigsegv_poc_nginx_1   ... done
Removing protobuf_sigsegv_poc_ab_1      ... done
Removing protobuf_sigsegv_poc_php-fpm_1 ... done
Removing protobuf_sigsegv_poc_nginx_1   ... done
Removing network protobuf_sigsegv_poc_default

$ make run && make log; make clean
docker-compose up -d
Creating network "protobuf_sigsegv_poc_default" with the default driver
Creating protobuf_sigsegv_poc_nginx_1   ... done
Creating protobuf_sigsegv_poc_php-fpm_1 ... done
Creating protobuf_sigsegv_poc_ab_1      ... done
docker-compose logs php-fpm | grep SIGSEGV
php-fpm_1  | [11-May-2022 06:56:24] WARNING: [pool www] child 10 exited on signal 11 (SIGSEGV) after 3.446749 seconds from start
php-fpm_1  | [11-May-2022 06:56:24] WARNING: [pool www] child 11 exited on signal 11 (SIGSEGV) after 0.046907 seconds from start
docker-compose down
Stopping protobuf_sigsegv_poc_php-fpm_1 ... done
Stopping protobuf_sigsegv_poc_nginx_1   ... done
Removing protobuf_sigsegv_poc_ab_1      ... done
Removing protobuf_sigsegv_poc_php-fpm_1 ... done
Removing protobuf_sigsegv_poc_nginx_1   ... done
Removing network protobuf_sigsegv_poc_default

$ make run && make log; make clean
docker-compose up -d
Creating network "protobuf_sigsegv_poc_default" with the default driver
Creating protobuf_sigsegv_poc_nginx_1   ... done
Creating protobuf_sigsegv_poc_php-fpm_1 ... done
Creating protobuf_sigsegv_poc_ab_1      ... done
docker-compose logs php-fpm | grep SIGSEGV
php-fpm_1  | [11-May-2022 06:56:32] WARNING: [pool www] child 9 exited on signal 11 (SIGSEGV) after 3.386541 seconds from start
docker-compose down
Stopping protobuf_sigsegv_poc_php-fpm_1 ... done
Stopping protobuf_sigsegv_poc_nginx_1   ... done
Removing protobuf_sigsegv_poc_ab_1      ... done
Removing protobuf_sigsegv_poc_php-fpm_1 ... done
Removing protobuf_sigsegv_poc_nginx_1   ... done
Removing network protobuf_sigsegv_poc_default

@averges
Copy link

averges commented May 11, 2022

I've increased ab concurrency to improve the SIGSEGV hit ratio and I've also added a new command "flood".
Try the repro like @radykal-com said or with the new flood command.

make run && make log
make flood && make log
make flood && make log
make flood && make log

The easiest way to reproduce the segmentation fault is using apache ab (or other benchmarking tool) directly on a server with php-fpm + opcache.

@haberman
Copy link
Member

I can't seem to get the Docker image to come up successfully, but I will use the html/PHP files from the repro and set up php-fpm manually on my desktop.

@haberman
Copy link
Member

I have managed to reproduce this locally!

Thanks everyone for your help in reaching this point. This will make it much easier to get to a root cause.

@haberman
Copy link
Member

A few early results that rule out possible causes:

  • Crash occurs even with opcache.optimization_level=0, so we can rule out opcache optimizations as a root cause.
  • Crash occurs even with opcache.jit=disable, so we can rule out the opcache JIT as a root cause.
  • Crash occurs regardless of whether opcache.preferred_memory_model is set to mmap, shm, or posix.

The crashes appear to be specific to the shared memory operation under concurrency:

  • Crash does not occur with opcache.file_cache_only=1, so it requires that shared memory is activated.
  • Crash does not occur unless we make concurrent requests

@haberman
Copy link
Member

I have a fix that appears to work for me: #9995

If anyone can confirm the fix, that would be very helpful to ensure that this fixes the bug for others.

@radykal-com
Copy link

radykal-com commented May 19, 2022

We've tried the fixed version for ~1h and it seems fixed.

Thanks you so much @haberman

When is the next release scheduled?

@JimmyBanks
Copy link
Author

Spectacular, thank you so much @haberman 🙏🙏

@haberman
Copy link
Member

We've tried the fixed version for ~1h and it seems fixed.

That's great to hear!

When is the next release scheduled?

I've cherry-picked this into the 21.x branch: #9996

It will go out in the 3.21.0 release, which should go out in the next week or two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants