Handle timeouts more gracefully by allowing the application to shutdown #895

Nyholm · 2021-03-25T14:17:25Z

This PR builds on the work started by @t-geindre in #55.

If we are 1 second away from Lambda shutting down, then thrown an exception to let the application shutdown and (more importantly) flush the logs.

There are two things Im not super happy with:

~~This does not work for PHP-FPM since it the application run in a child process~~
I get a segmentation fault when SIGALRM is triggers while the application is waiting for IO (https://bugs.php.net/bug.php?id=80907)

The question is: do we want to continue down this path?

It would solve issues like #873 and #862.

TODO

Add docs
Do some manual testing to make sure this works as expected

deleugpn · 2021-03-25T14:47:43Z

I get a segmentation fault when SIGALRM is triggers while the application is waiting for IO

Isn't every connection timeout an IO operation that PHP is waiting?

This does not work for PHP-FPM since it the application run in a child process

This seems to me like a strong enough limitation that may indicate it's not worth it to support it? It does bring a nice idea to the table though: what if we offer a Debugging Layer where you can write a basic function invocation and run the code that's getting timeout from it so that you might get a better idea where it's stopping?
By making it an explicit debugging layer, we don't need to guarantee production-ready quality and perhaps we can even try to include default debugging statements that will get logged? In the end people will use this in order to figure out why their Lambda can't connect somewhere (VPC, Firewall, Security Group, NAT Gateway, Internet Gateway, NACL, etc) which has nothing to do with the code itself.

mnapoli · 2021-03-25T15:02:39Z

Hey, I wonder if something similar could be used in the PHP-FPM worker (the process handling the request)?

That could be done as a Symfony/Laravel package?

I'm not familiar enough with signals like that to know if that could work.

Also another solution is to find a way for logs to be streamed correctly. Or else, find a way to force shorter timeouts on all connections in all frameworks, but I have no idea if that's even remotely possible.

Nyholm · 2021-03-25T15:10:42Z

Isn't every connection timeout an IO operation that PHP is waiting?

Yeah.. Each curl request, file load or sleep is IO operations. But this is a PHP bug that I assume will be fixed soon. (I updated the PR description with a bug report).

This seems to me like a strong enough limitation that may indicate it's not worth it to support it

Im not sure.. I strongly believe in a future were all PHP applications are using a runtime. See #889.

I wonder if something similar could be used in the PHP-FPM worker

Probably, yes.

That could be done as a Symfony/Laravel package?

Yes, it is called runtime/bref =)

Also another solution is to find a way for logs to be streamed correctly.

I dont think that is possible. Or.. there is a difference between application logs and FPM logs. For the FPM logs => I have no idea.
For the application logs: Monolog buffers the logs and we need to tell the logger to flush the buffer (ie, destruct the logger). So it needs an exception to be thrown.

Debug layer

I like the idea, But it does not solve the problem where 1/10000 request to my live application times out and I want to find out what happens. If the "debug layer" is not production ready, the I would not be happy adding it to my production environment.

Nyholm · 2021-03-25T21:16:55Z

I reorganized things.

I've updated LambdaRuntime to get a bool parameter to enable throwing TimeoutException. This is enabled in the bootstrap files in php-xx and console. It is not enabled in the php-fpm. For FPM the user needs to update their application by adding Bref\Timeout\Timeout::enable(). This will read the environment variables and set the timeout.

I think this is a good way forward. I would like to get another review before I add tests and docs.

Nyholm · 2021-03-25T22:38:36Z

You may also control this behavior with BREF_TIMEOUT.
Value 0 is "auto". Value "-1" is disabled. Positive values sets the timeout.

src/Runtime/LambdaRuntime.php

tests/Timeout/TimeoutTest.php

Nyholm · 2021-03-26T07:39:13Z

A PHP-FPM user would have to write something like:

if (isset($_SERVER['LAMBDA_TASK_ROOT'])) {
    Bref\Timeout\Timeout::enable();
}

Preferably as soon as possible, ie Kernel::handle() or index.php.

src/Timeout/Timeout.php

src/Runtime/LambdaRuntime.php

src/Timeout/Timeout.php

mnapoli · 2021-03-26T15:32:44Z

Thanks, following our discussions I've opened Nyholm#3 to remove the option of configuring the duration timeout via an environment variable.

@Nyholm

* Always base the timeout duration on the configured Lambda timeout * Make sure that we can never enable the Bref timeout in the FPM bootstrap process * Remove unnecessary private method `timeoutAfter()` * Allow using falsy values for `BREF_FEATURE_TIMEOUT`, e.g. `0` or `''`, to allow disabling the feature * Cleanup global state * Improve tests following @Nyholm's feedback

Nyholm · 2021-03-30T09:43:48Z

The support PR is merged. This PR is now ready for a final rounds of review.

Nyholm

Im 👍

We don't mention that the logs from php-fpm also will be flushed, but that is a technical detail I think it is okey to skip.

mnapoli · 2021-03-30T11:41:06Z

src/Timeout/Timeout.php

+                error_log('Original stack trace');
+                error_log(Timeout::$stackTrace);
+                Timeout::$stackTrace = null;
+                exit(1);


Should we force a 500 status code before exit?

That would only be taken into account in the FPM layer, so that wouldn't really impact the php-XX layer.

How?
And wouldn't that effect my AWS::ApiGateway::GatewayResponse that triggers on timeouts or 500?

http_response_code(500);

And wouldn't that effect my AWS::ApiGateway::GatewayResponse that triggers on timeouts or 500?

🤔 not sure about that, maybe. But in any case, the same would happen if your app returns a pretty 500 error page.

Does that make sense? What I mean is that, when using FPM, there would be no difference between:

Symfony catches any exception and displays an error page with a 500 status code

Bref timeout kicks in and forces a 500 status code

In both cases, API Gateway would behave the same way.

Oh, no that is not true.

Symfony catches any exception and displays an error page with a 500 status code

This will show you a nice error page. API Gateway does nothing

Bref timeout kicks in and forces a 500 status code

No content(HTML) is provided. I would like the custom error page in API Gateway to be used.

Im not sure if http_response_code(500) would make any difference.

Also note. We are not sure it is a HTTP context. It could be any other event.. So Im not sure about sending a HTTP status code.

src/Timeout/Timeout.php

mnapoli · 2021-03-30T14:16:02Z

src/Timeout/Timeout.php

+                error_log('Lambda timed out');
+                error_log((new LambdaTimeout)->getTraceAsString());
+                error_log('Original stack trace');
+                error_log(Timeout::$stackTrace);


For the record, I'm testing and notice that error_log adds noise in FPM:

NOTICE: PHP message: Lambda timed out NOTICE: PHP message: #0 /var/task/demo/http.php(16): Bref\Timeout\Timeout::Bref\Timeout\{closure}() #1 {main} NOTICE: PHP message: Original stack trace NOTICE: PHP message: #0 /var/task/demo/http.php(9): Bref\Timeout\Timeout::Bref\Timeout\{closure}() #1 {main}

I'm changing that to:

$stderr = fopen('php://stderr', 'w'); fwrite($stderr, 'Lambda timed out' . PHP_EOL); fwrite($stderr, (new LambdaTimeout)->getTraceAsString() . PHP_EOL); fwrite($stderr, 'Original stack trace' . PHP_EOL); fwrite($stderr, Timeout::$stackTrace . PHP_EOL); fclose($stderr);

Which gives a cleaner log:

Lambda timed out #0 /var/task/demo/http.php(16): Bref\Timeout\Timeout::Bref\Timeout\{closure}() #1 {main} Original stack trace #0 /var/task/demo/http.php(9): Bref\Timeout\Timeout::Bref\Timeout\{closure}() #1 {main}

No need to update the PR

Oh, I didnt know.

Thank you for testing.

mnapoli · 2021-03-30T15:30:08Z

OK following discussion through Slack, we were happy with the PR.

However I tested and I'm getting weird results in FPM too:

1st request: the timeouts work as expected
following requests: pcntl_alarm(...) is triggered, but our custom handler isn't called. That causes pending I/O (or sleep) to be interrupted, but without any exception being triggered

I've been testing locally with PHP-FPM to pinpoint this, but my guess currently is that pcntl signals may not work well with FPM and its worker pool. Maybe we first set a signal handler on the process, and it's not correctly cleaned up by FPM on the next request cycle?

I tried cleaning up manually the handlers, no changes. I tried putting the cleanup in a __destruct() to be sure it's executed, no changes. I'm not sure what to do anymore…

How to reproduce locally:

docker-compose up
add \Bref\Timeout\Timeout::enableInFpm(); at the top of demo/http.php
open http://localhost:8000/?sleep=1
play with demo/http.php in the sleep block (e.g. catch the exception)

E.g. in my case:

if (isset($_GET['sleep'])) {
    echo 'Sleeping';
    try {
        sleep(6);
    } catch (Throwable $e) {
        echo 'CAUGHT!';
    }
    echo 'Sleeping';
    sleep(2);
}

first request: I see CAUGHT!
next requests: the catch is not executed, yet the sleep() calls are interrupted (without errors/warnings)

When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug. With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which: - flushes the logs (logs end up in CloudWatch, which is great) - restarts a clean FPM worker, without doing a full FPM restart (which may take longer) Follow up of #770, #772, #895 May address some of #862 Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.

mnapoli · 2022-01-04T16:20:16Z

Since this wasn't working with FPM I went with a different route: #1133

This PR might still be worth following for the Function layer, so I'll leave it open.

When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug. With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which: - flushes the logs (logs end up in CloudWatch, which is great) - restarts a clean FPM worker, without doing a full FPM restart (which may take longer) Follow up of #770, #772, #895 May address some of #862 Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.

Nyholm changed the title ~~POC handle timeouts~~ Handle timeouts more gracefully by allowing the application to shutdown Mar 25, 2021

Nyholm commented Mar 25, 2021

View reviewed changes

src/Runtime/LambdaRuntime.php Outdated Show resolved Hide resolved

GrahamCampbell suggested changes Mar 26, 2021

View reviewed changes

src/Runtime/LambdaRuntime.php Outdated Show resolved Hide resolved

Nyholm commented Mar 26, 2021

View reviewed changes

tests/Timeout/TimeoutTest.php Outdated Show resolved Hide resolved

tests/Timeout/TimeoutTest.php Outdated Show resolved Hide resolved

Nyholm force-pushed the timeouts branch from bc1ca82 to becb781 Compare March 26, 2021 08:08

mnapoli reviewed Mar 26, 2021

View reviewed changes

src/Timeout/Timeout.php Outdated Show resolved Hide resolved

Nyholm added 2 commits March 26, 2021 10:38

Handle timeouts more gracefully by throwing exception near the hard end

e78d2c6

Added documentation for Timeouts

5a57044

mnapoli reviewed Mar 26, 2021

View reviewed changes

src/Runtime/LambdaRuntime.php Outdated Show resolved Hide resolved

Nyholm force-pushed the timeouts branch from d5022ba to 5a57044 Compare March 26, 2021 09:38

mnapoli reviewed Mar 26, 2021

View reviewed changes

src/Timeout/Timeout.php Outdated Show resolved Hide resolved

Nyholm added 4 commits March 26, 2021 10:53

Adding a note about debugging

b89aa04

Make Timeout::init() private

52f8d65

Make this feature opt-in

37a5759

Make sure the shutdown process has at least 1 second.

a2e46ca

mnapoli mentioned this pull request Mar 28, 2021

Added Lambda and API Gateway request IDs to Cloudwatch logs when there's a PHP-FPM issue #865

Closed

Nyholm commented Mar 30, 2021

View reviewed changes

mnapoli reviewed Mar 30, 2021

View reviewed changes

src/Timeout/Timeout.php Outdated Show resolved Hide resolved

Do 2 alarms

9356e4a

Nyholm force-pushed the timeouts branch from 5f322a7 to 9356e4a Compare March 30, 2021 12:12

fixed tests

2efd449

mnapoli reviewed Mar 30, 2021

View reviewed changes

mnapoli added the enhancement label May 11, 2021

This was referenced Sep 19, 2021

Add docs about applications with Symfony Runtime #1032

Closed

Adding Layer to support Symfony Runtime component #889

Closed

Nyholm mentioned this pull request Oct 6, 2021

[Bref] Added support for handle timeouts php-runtime/runtime#95

Open

mnapoli mentioned this pull request Jan 4, 2022

Flush FPM logs in case of timeouts #1133

Merged

mnapoli mentioned this pull request Mar 7, 2023

Queue timeouts should trigger failed() call cachewerk/bref-laravel-bridge#24

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle timeouts more gracefully by allowing the application to shutdown #895

Handle timeouts more gracefully by allowing the application to shutdown #895

Nyholm commented Mar 25, 2021 •

edited

deleugpn commented Mar 25, 2021 •

edited

mnapoli commented Mar 25, 2021

Nyholm commented Mar 25, 2021

Nyholm commented Mar 25, 2021

Nyholm commented Mar 25, 2021

Nyholm commented Mar 26, 2021

mnapoli commented Mar 26, 2021

Nyholm commented Mar 30, 2021

Nyholm left a comment

mnapoli Mar 30, 2021

Nyholm Mar 30, 2021

mnapoli Mar 30, 2021

Nyholm Mar 30, 2021

mnapoli Mar 30, 2021

Nyholm Mar 30, 2021

mnapoli commented Mar 30, 2021 •

edited

mnapoli commented Jan 4, 2022

Handle timeouts more gracefully by allowing the application to shutdown #895

Are you sure you want to change the base?

Handle timeouts more gracefully by allowing the application to shutdown #895

Conversation

Nyholm commented Mar 25, 2021 • edited

deleugpn commented Mar 25, 2021 • edited

mnapoli commented Mar 25, 2021

Nyholm commented Mar 25, 2021

Nyholm commented Mar 25, 2021

Nyholm commented Mar 25, 2021

Nyholm commented Mar 26, 2021

mnapoli commented Mar 26, 2021

Nyholm commented Mar 30, 2021

Nyholm left a comment

Choose a reason for hiding this comment

mnapoli Mar 30, 2021

Choose a reason for hiding this comment

Nyholm Mar 30, 2021

Choose a reason for hiding this comment

mnapoli Mar 30, 2021

Choose a reason for hiding this comment

Nyholm Mar 30, 2021

Choose a reason for hiding this comment

mnapoli Mar 30, 2021

Choose a reason for hiding this comment

Nyholm Mar 30, 2021

Choose a reason for hiding this comment

mnapoli commented Mar 30, 2021 • edited

mnapoli commented Jan 4, 2022

Nyholm commented Mar 25, 2021 •

edited

deleugpn commented Mar 25, 2021 •

edited

mnapoli commented Mar 30, 2021 •

edited