Our Lambda functions are using the Node.js 18.x runtime and x86_64 architecture.
Describe what happened:
On multiple occasions, we have seen “Task timed out” error messages in Datadog that we believe are incorrect.
The pattern we have observed starts with a Lambda function actually timing out (i.e., the execution time exceeds the function’s configured time-out). Subsequent invocations of that same function then appear to execute successfully, and well under the 30-second time-out. Still, they are reported as time-out errors in Datadog. More peculiar still, the duration reported in those time-out messages is well under 30 seconds.
A specific occurrence of this happened this Thursday in one of our applications. Here are the corresponding Datadog logs. As you can see, the first execution did exceed 30s (31.16s). But all subsequent executions did not.
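To pick out the questionable entries from an exported set of log lines, the duration in each message can be compared against the configured time-out. The following is a minimal sketch, not part of the Datadog agent. The helper name `isSuspiciousTimeout`, the sample timestamps, the request IDs, and the 0.45-second value are all hypothetical; only the message wording and the 30-second time-out come from this report.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// timeoutPattern matches the message text quoted in this issue,
// e.g. "Task timed out after 31.16 seconds".
var timeoutPattern = regexp.MustCompile(`Task timed out after (\d+(?:\.\d+)?) seconds`)

// isSuspiciousTimeout (hypothetical helper) reports whether a log line claims
// a time-out whose duration is shorter than the configured time-out.
func isSuspiciousTimeout(line string, configuredTimeoutSeconds float64) bool {
	m := timeoutPattern.FindStringSubmatch(line)
	if m == nil {
		return false // not a time-out message at all
	}
	reported, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return false
	}
	return reported < configuredTimeoutSeconds
}

func main() {
	// Hypothetical sample lines; timestamps and request IDs are made up.
	lines := []string{
		"2023-01-01T00:00:00Z req-1 Task timed out after 31.16 seconds",
		"2023-01-01T00:01:00Z req-2 Task timed out after 0.45 seconds",
	}
	for _, line := range lines {
		fmt.Printf("%t  %s\n", isSuspiciousTimeout(line, 30), line)
	}
}
```

A line is flagged only when it is a time-out message and its duration is below the threshold, so the genuine 31.16-second time-out is not flagged.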
Further evidence that this is likely a Datadog issue is that we do not see these logs in CloudWatch. Here is a screenshot from CloudWatch for the same time range as the Datadog query above; it shows only the first instance of the “Task timed out” error message.
The same logs go to both CloudWatch and Datadog, but the incorrect time-out logs appear only in Datadog, which leads me to believe that this behavior is caused by Datadog.
I dug into the code for the Datadog agent, and there is one place in that code where it is explicitly sending a “Task timed out” message:
    // createStringRecordForTimeoutLog returns the `Task timed out` log using the platform.report message
    func createStringRecordForTimeoutLog(l *LambdaLogAPIMessage) string {
        durationMs := l.objectRecord.reportLogItem.durationMs
        durationSeconds := durationMs / 1000
        timeStr := l.time.Format(time.RFC3339Nano)
        return fmt.Sprintf(`%s %s Task timed out after %.2f seconds`, timeStr, l.objectRecord.requestID, durationSeconds)
    }
This function appears to be invoked in exactly one place, but I am not sure under what circumstances that code path is executed. It feels like the state of the log messages becomes “tainted” after a time-out, so that subsequent log events are erroneously categorized as time-outs.
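To illustrate why the reported durations could end up below 30 seconds, here is a small self-contained sketch that runs the quoted function against stand-in types. The struct definitions are assumptions that contain only the fields the function reads, and `durationMs` is assumed to be a `float64` because it is formatted with `%.2f`. The request IDs, timestamp, and the 450 ms value are made up. Because the duration is taken from the report record, calling this function for a report of a successful invocation would yield a “Task timed out” line carrying that invocation’s real, short duration, which would be consistent with what we see in Datadog.

```go
package main

import (
	"fmt"
	"time"
)

// Stand-in types (assumptions): only the fields read by the quoted function.
type reportLogItem struct {
	durationMs float64 // assumed float64, since it is printed with %.2f
}

type objectRecord struct {
	requestID     string
	reportLogItem reportLogItem
}

type LambdaLogAPIMessage struct {
	time         time.Time
	objectRecord objectRecord
}

// createStringRecordForTimeoutLog is copied from the snippet quoted above.
func createStringRecordForTimeoutLog(l *LambdaLogAPIMessage) string {
	durationMs := l.objectRecord.reportLogItem.durationMs
	durationSeconds := durationMs / 1000
	timeStr := l.time.Format(time.RFC3339Nano)
	return fmt.Sprintf(`%s %s Task timed out after %.2f seconds`, timeStr, l.objectRecord.requestID, durationSeconds)
}

// newReport builds a hypothetical report message for illustration.
func newReport(requestID string, durationMs float64) *LambdaLogAPIMessage {
	return &LambdaLogAPIMessage{
		time: time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC),
		objectRecord: objectRecord{
			requestID:     requestID,
			reportLogItem: reportLogItem{durationMs: durationMs},
		},
	}
}

func main() {
	// A genuine time-out: the report duration exceeds 30 s.
	fmt.Println(createStringRecordForTimeoutLog(newReport("req-1", 31160)))
	// A successful invocation: the same function still prints "Task timed out",
	// only with the short duration from that report.
	fmt.Println(createStringRecordForTimeoutLog(newReport("req-2", 450)))
}
```

This does not show where the agent decides to call the function; it only shows that the function itself does not check the duration against the configured time-out.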
Describe what you expected:
We only expect to see a Task timed out message in our logs when a Lambda function actually times out.
Steps to reproduce the issue:
Have a Lambda function exceed its configured time-out value, then invoke that same function again.
Agent Environment
We are using the following Datadog Lambda layers:
arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Node18-x:96
arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Extension:46
Additional environment details (Operating System, Cloud provider, etc):
See the information provided above regarding our Lambda function configuration.