Just add a retry before really throwing a FastCgiCommunicationFailed exception. #1758

Open
wants to merge 2 commits into
base: master

@wysow wysow commented Mar 15, 2024

Hello there!

This PR is only meant to start a discussion and try to find a solution to this kind of random error:

Error communicating with PHP-FPM to read the HTTP response. Bref will restart PHP-FPM now. Original exception message: hollodotme\FastCGI\Exceptions\ReadFailedException Stream got blocked, or terminated.

I'm pretty sure something smarter can be achieved, but I have been manually testing this for a few minutes now and only got 200 responses, no more 500s...

@wysow
Author

wysow commented Mar 15, 2024

Here is something interesting about this: hollodotme/fast-cgi-client#68 (comment)

@mnapoli
Member

mnapoli commented Mar 17, 2024

What if the request is updating something in the database, or sending emails, for example? A retry could run the same request/action twice, which might not be a good thing 🤔

How often do you get these errors?

Contributor

@GrahamCampbell GrahamCampbell left a comment

Re-trying seems like a bad idea to me. It's usual for clients to retry safe requests, not for proxy layers to do it, unless the error is for sure retriable such as an edge load balancer retrying a TLS handshake.
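
To make the concern above concrete, here is a minimal sketch of a retry that is restricted to safe HTTP methods, so that a request such as a POST is never replayed. It is written in Python purely for illustration and is not Bref's code: `CommunicationFailed`, `forward_with_retry`, and the `send` callable are hypothetical names standing in for the FastCGI round trip discussed in this PR.

```python
from typing import Callable

# Methods defined as "safe" (read-only) by RFC 9110.
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE"})


class CommunicationFailed(Exception):
    """Hypothetical stand-in for a FastCGI communication failure."""


def forward_with_retry(method: str, send: Callable[[], str], max_attempts: int = 2) -> str:
    """Call send(); retry on CommunicationFailed only for safe HTTP methods."""
    # Unsafe methods (POST, PUT, ...) get exactly one attempt.
    attempts = max(1, max_attempts) if method.upper() in SAFE_METHODS else 1
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except CommunicationFailed:
            if attempt == attempts:
                raise  # out of attempts: surface the original failure
    raise AssertionError("unreachable")
```

Even with this restriction, the sketch assumes the application treats safe methods as side-effect free, which the proxy layer cannot verify on its own.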

@wysow
Author

wysow commented Mar 18, 2024

@mnapoli I can't really tell how often it appears in the long run, but when I do some manual testing I sometimes get a 50% error rate... So that's a lot...

I know that retrying is a bit crappy, but the fact is that in all the errors I see in the logs, our code is never executed at all; everything happens on the Bref side... So for us it is no problem to do a retry, and then we have a 0% error rate with manual testing.

@GrahamCampbell
Contributor

50% error rate smells like something else is borked. Is it always failing after the first invoke?

@wysow
Author

wysow commented Mar 18, 2024

That's only my impression, but yes, I'm pretty sure it's always after the first invoke...

@wysow
Author

wysow commented Mar 18, 2024

And to add more context, we did NOT see this behavior on workers or console lambdas (working with Symfony)

@wysow
Author

wysow commented Mar 18, 2024

I just looked at the numbers for the last few days in the Bref Dashboard, and I can see a 6-7% error rate over entire days.

@mnapoli
Member

mnapoli commented Mar 18, 2024

That is really weird, something else must be at play here. A 6-7% error rate would be affecting all Bref users if that was a global Bref problem.

I'd start looking at ways to pinpoint the problem:

  • extra PHP extensions?
  • out of memory?
  • spawning sub-processes from PHP?
  • timing out?
  • try to see if it happens on a specific HTTP route?
  • trying to reproduce with an empty project?

@wysow
Author

wysow commented Mar 18, 2024

Here is a first list of answers; I will keep you posted with the other answers when I get them:

* extra PHP extensions?

-> Yes, on this project we have redis and mongodb

* out of memory?

-> I'm pretty sure this is not the case, as the error occurs very quickly at the start of execution (within a few milliseconds). We are using a Lambda memory size of 2048 MB on this project.

* spawning sub-processes from PHP?

-> This project is an API using the Bref PHP 8.2 FPM layer and Symfony, so this is not the case for me here.

* timing out?

-> Like I said, the error occurs very quickly at the start of execution, so this is not the case either...

* try to see if it happens on a specific HTTP route?

-> will do more testing but I saw it on every HTTP route (GET mainly)

* trying to reproduce with an empty project?

-> Will try and keep you posted.

@mnapoli
Member

mnapoli commented Mar 18, 2024

Would be interesting to see too if this happens on cold starts. If not, is the request before successful? Times out? Could fill the memory? (or any other reason it could leave the environment in a broken state)

Also nothing specific/exotic, like using Symfony Runtime, setting a non-standard handler, etc.

@wysow
Author

wysow commented Mar 18, 2024

> Would be interesting to see too if this happens on cold starts. If not, is the request before successful? Times out? Could fill the memory? (or any other reason it could leave the environment in a broken state)
>
> Also nothing specific/exotic, like using Symfony Runtime, setting a non-standard handler, etc.

As far as I have manually tested, this is not happening on cold starts, and the previous request is always successful. No timeout, no full memory... Nothing visible at least...

@wysow
Author

wysow commented Mar 18, 2024

This kind of problem only happens in API mode, so there is nothing fancy beyond classic Symfony; Symfony Runtime is NOT used in this project.

@wysow
Author

wysow commented Mar 19, 2024

We have a custom php.ini file in our project with this content:

extension=intl

@wysow
Author

wysow commented Mar 19, 2024

Here is the raw log we get when this problem occurs:

Error communicating with PHP-FPM to read the HTTP response. Bref will restart PHP-FPM now. Original exception message: hollodotme\FastCGI\Exceptions\ReadFailedException Stream got blocked, or terminated.

WARNING: [pool default] child 19 exited on signal 11 (SIGSEGV) after 424.432414 seconds from start

The 424 seconds here is really weird, as the response seen from an API client is really fast...
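
For anyone digging through similar logs, the following is a small sketch that pulls the child PID, signal, and elapsed time out of a PHP-FPM warning line like the one above. As the wording "from start" suggests, the elapsed figure appears to be measured from when the child process started, not from when the failing request began, which could explain a large value next to a fast failure. The names `CHILD_EXIT`, `ChildExit`, and `parse_child_exit` are made up for this illustration, and the pattern is only fitted to the single log line quoted in this thread.

```python
import re
from typing import NamedTuple, Optional

# Pattern fitted to the warning quoted in this thread; other PHP-FPM
# versions or log formats may need adjustments.
CHILD_EXIT = re.compile(
    r"child (?P<pid>\d+) exited on signal (?P<signo>\d+) \((?P<signame>\w+)\) "
    r"after (?P<uptime>\d+\.\d+) seconds from start"
)


class ChildExit(NamedTuple):
    pid: int
    signal_number: int
    signal_name: str
    uptime_seconds: float  # time since the child started, per the log wording


def parse_child_exit(line: str) -> Optional[ChildExit]:
    """Return the parsed fields, or None if the line is not a child-exit warning."""
    m = CHILD_EXIT.search(line)
    if m is None:
        return None
    return ChildExit(int(m["pid"]), int(m["signo"]), m["signame"], float(m["uptime"]))
```

Counting how many such lines report SIGSEGV, and how long each child had been running, may help show whether the crashes cluster after a first invocation.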

@wysow
Author

wysow commented Mar 19, 2024

> We have a custom php.ini file in our project with this content:
>
> extension=intl

@mnapoli sorry, this is not the right php.ini file... Here is the correct one:

extension=mongodb
extension=redis
opcache.enable_cli=0

So I'm trying to delete the opcache.enable_cli=0 line right now, will keep you posted.
