Just add a retry before really throwing a FastCgiCommunicationFailed exception. #1758

wysow · 2024-03-15T12:08:43Z

Hello there!

This PR is only meant to start a discussion and try to find a solution to this kind of random error:

Error communicating with PHP-FPM to read the HTTP response. Bref will restart PHP-FPM now. Original exception message: hollodotme\FastCGI\Exceptions\ReadFailedException Stream got blocked, or terminated.

I'm pretty sure something smarter can be achieved, but I have been manually testing this for a few minutes now and only got 200 responses and no more 500s...
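To make the idea under discussion concrete, here is a minimal, language-agnostic sketch of "retry once before really failing", written in Python rather than Bref's PHP. The names `CommunicationFailed`, `send_with_retry`, `send`, and `restart` are hypothetical stand-ins chosen for illustration; this is not the code from the PR.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class CommunicationFailed(Exception):
    """Hypothetical stand-in for a FastCGI communication failure."""


def send_with_retry(
    send: Callable[[], T],
    restart: Callable[[], None],
    max_attempts: int = 2,
) -> T:
    """Call send(); on CommunicationFailed, restart the backend and try again.

    The last failure is re-raised once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except CommunicationFailed:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            restart()  # assumed hook: stop and start the backend process
    raise AssertionError("unreachable")
```

Note that a wrapper like this re-sends the request unconditionally, so it is only harmless when the first attempt never reached application code.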

wysow · 2024-03-15T12:30:41Z

Here is something interesting about this: hollodotme/fast-cgi-client#68 (comment)

mnapoli · 2024-03-17T15:22:55Z

What if the request is updating something in the database, or sending emails for example. That could run the same request/action twice, which might not be a good thing 🤔

How often do you get these errors?

GrahamCampbell

Re-trying seems like a bad idea to me. It's usual for clients to retry safe requests, not for proxy layers to do it, unless the error is for sure retriable such as an edge load balancer retrying a TLS handshake.
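As an illustration of the "safe requests" distinction, here is a minimal Python sketch of a check a client might run before retrying. It assumes the HTTP definition of safe methods (GET, HEAD, OPTIONS, TRACE); the function name `is_safe_to_retry` is hypothetical and not part of Bref or any client library.

```python
# HTTP methods defined as "safe" (read-only) by the HTTP semantics spec.
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE"})


def is_safe_to_retry(method: str) -> bool:
    """Return True only for methods that should not change server state."""
    return method.strip().upper() in SAFE_METHODS
```

A POST that sends an email or writes to the database would be rejected by this check, which is the double-execution risk raised earlier in the thread.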

wysow · 2024-03-18T09:08:40Z

@mnapoli I can't really tell how often it appears in the long run, but when I do some manual testing I sometimes get a 50% error rate... So that's a lot...

I know that retrying is a bit crappy, but the fact is that in all the errors I see in the logs, our code is never executed at all; everything happens on the Bref side... So for us there is no problem with doing a retry, and then we have a 0% error rate with manual testing.

GrahamCampbell · 2024-03-18T09:11:23Z

50% error rate smells like something else is borked. Is it always failing after the first invoke?

wysow · 2024-03-18T09:16:15Z

That's only my own feeling but yes I'm pretty sure it's always after the first invoke...

wysow · 2024-03-18T09:17:11Z

And to add more context, we did NOT see this behavior on workers or console lambdas (working with Symfony)

wysow · 2024-03-18T09:21:53Z

Just looked at the numbers for the last few days with the Bref dashboard and I can see a 6-7% error rate over entire days.

mnapoli · 2024-03-18T10:47:32Z

That is really weird, something else must be at play here. A 6-7% error rate would be affecting all Bref users if that was a global Bref problem.

I'd start looking at ways to pinpoint the problem:

* extra PHP extensions?
* out of memory?
* spawning sub-processes from PHP?
* timing out?
* try to see if it happens on a specific HTTP route?
* trying to reproduce with an empty project?

wysow · 2024-03-18T14:55:52Z

Here is a first list of answers; I'll keep you posted with the others when I get them:

* extra PHP extensions?

-> Yes, on this project we have redis and mongodb

* out of memory?

-> I'm pretty sure this is not the case, as the error occurs very early in the execution (within a few milliseconds). We are using a 2048 MB Lambda memory size on this project.

* spawning sub-processes from PHP?

-> This project is an API using the Bref 8.2 FPM layer and Symfony, so this is not the case for me here.

* timing out?

-> As I said, the error occurs very early in the execution, so this is not the case either...

* try to see if it happens on a specific HTTP route?

-> will do more testing but I saw it on every HTTP route (GET mainly)

* trying to reproduce with an empty project?

-> Will try and keep you posted.

mnapoli · 2024-03-18T15:08:00Z

Would be interesting to see too if this happens on cold starts. If not, is the request before successful? Times out? Could fill the memory? (or any other reason it could leave the environment in a broken state)

Also nothing specific/exotic, like using Symfony Runtime, setting a non-standard handler, etc.

wysow · 2024-03-18T15:50:49Z

> Would be interesting to see too if this happens on cold starts. If not, is the request before successful? Times out? Could fill the memory? (or any other reason it could leave the environment in a broken state)
>
> Also nothing specific/exotic, like using Symfony Runtime, setting a non-standard handler, etc.

As far as my manual testing shows, this is not happening on cold starts, and the previous request is always successful. No timeout, no full memory... Nothing visible at least...

wysow · 2024-03-18T15:52:12Z

This kind of problem only happens in API mode. There is nothing fancy beyond classic Symfony; Symfony Runtime is NOT used in this project.

wysow · 2024-03-19T15:20:41Z

We have a custom php.ini file in our project with this content:

extension=intl

wysow · 2024-03-19T15:22:38Z

Here is the raw log we get when this problem occurs:

Error communicating with PHP-FPM to read the HTTP response. Bref will restart PHP-FPM now. Original exception message: hollodotme\FastCGI\Exceptions\ReadFailedException Stream got blocked, or terminated.

WARNING: [pool default] child 19 exited on signal 11 (SIGSEGV) after 424.432414 seconds from start

The 424 seconds here is really weird, as the behavior seen from an API client is really fast...
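For anyone triaging similar logs, here is a minimal Python sketch that pulls the fields out of the PHP-FPM warning line quoted above. The regular expression is written against that single sample line, so treating it as a general parser for all PHP-FPM messages is an assumption; the names `ChildExit` and `parse_child_exit` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ChildExit(NamedTuple):
    pool: str
    child: int
    signal_number: int
    signal_name: str
    uptime_seconds: float


# Pattern modeled on the sample line; anything after the signal name
# inside the parentheses is tolerated and ignored.
_CHILD_EXIT = re.compile(
    r"\[pool (?P<pool>[^\]]+)\] child (?P<child>\d+) exited on signal "
    r"(?P<signum>\d+) \((?P<signame>\w+)[^)]*\) "
    r"after (?P<seconds>\d+(?:\.\d+)?) seconds from start"
)


def parse_child_exit(line: str) -> Optional[ChildExit]:
    """Return the parsed fields, or None if the line does not match."""
    match = _CHILD_EXIT.search(line)
    if match is None:
        return None
    return ChildExit(
        pool=match["pool"],
        child=int(match["child"]),
        signal_number=int(match["signum"]),
        signal_name=match["signame"],
        uptime_seconds=float(match["seconds"]),
    )
```

Going by the wording "seconds from start", the duration appears to measure how long the child process had been alive rather than how long a single request took, which may explain why it looks large next to a fast API call.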

wysow · 2024-03-19T17:23:20Z

> We have a custom php.ini file in our project with this content:
> extension=intl

@mnapoli sorry, this is not the right php.ini file... Here is the correct one:

extension=mongodb
extension=redis
opcache.enable_cli=0

So I'm trying to delete the opcache.enable_cli=0 line right now, will keep you posted.

Just add a retry before really throwing a FastCgiCommunicationFailed exception. (c32ba78)

lguilbert approved these changes Mar 15, 2024

Add stop/start in retry to be sure PHPFPM is started correctly (7a8f1b4)

GrahamCampbell suggested changes Mar 17, 2024
