Agent possibly reporting Lambda function time outs incorrectly #22536

ryan-breidenbach · 2024-02-03T17:46:01Z

Agent Environment

We are using the following Datadog Lambda layers:

arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Node18-x:96
arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Extension:46

Our Lambda functions are using the Node.js 18.x runtime and x86_64 architecture.

Describe what happened:

On multiple occasions, we have seen error messages that look like this that we believe may not be correct.

The pattern we have observed is that there is an actual instance of a Lambda function timing out (i.e., the execution time exceeding the Lambda function’s configured time out). What happens next is subsequent invocations of that same Lambda function appear to execute successfully, and well under the 30 time out. Still, they are reported as time-out errors in Datadog. What is more peculiar is the time-out value that is reported is well under 30 seconds:

A specific occurrence of this happened this Thursday in one of our applications. Here are the corresponding Datadog logs. As you can see, the first execution did exceed 30s (31.16s). But all subsequent executions did not.

Further evidence that this is likely an issue attributed to Datadog is we do not see these logs in CloudWatch. Here is a screenshot from CloudWatch for that exact time range as the Datadog query above, and we only see the first instance of the “Task timed out” error message.

The fact that we are also sending these logs to Datadog, but the incorrect time-out logs only appear in Datadog leads me to believe that this incorrect behavior is something caused by Datadog.

I dug into the code for the Datadog agent, and there is one place in that code where it is explicitly sending a “Task timed out” message:

// createStringRecordForTimeoutLog returns the `Task timed out` log using the platform.report message
func createStringRecordForTimeoutLog(l *LambdaLogAPIMessage) string {
	durationMs := l.objectRecord.reportLogItem.durationMs
	durationSeconds := durationMs / 1000
	timeStr := l.time.Format(time.RFC3339Nano)
	return fmt.Sprintf(`%s %s Task timed out after %.2f seconds`, timeStr, l.objectRecord.requestID, durationSeconds)
}

And it looks like this function is invoked in exactly one place:

isTimeoutLog := message.logType == logTypePlatformReport && message.objectRecord.status == timeoutStatus
if isTimeoutLog {
    lc.out <- logConfig.NewChannelMessageFromLambda([]byte(createStringRecordForTimeoutLog(&message)), message.time, lc.arn, message.objectRecord.requestID, isTimeoutLog)
}

I am not sure in what circumstances this block of code will be executed. It feels like the state of the log messages becomes “tainted” after a time-out, and subsequent log events are erroneously categorized as time-outs.

Describe what you expected:

We only expect to see a Task timed out message in our logs when a Lambda function actually times out.

Steps to reproduce the issue:

Have a Lambda function exceed its configured time-out value, then invoke that same function again.

Additional environment details (Operating System, Cloud provider, etc):

See the information provided above regarding our Lambda function configuration.

The text was updated successfully, but these errors were encountered:

ryan-breidenbach added the team/triage label Feb 3, 2024

sgnn7 added team/serverless and removed team/triage labels Feb 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agent possibly reporting Lambda function time outs incorrectly #22536

Agent possibly reporting Lambda function time outs incorrectly #22536

ryan-breidenbach commented Feb 3, 2024

Agent possibly reporting Lambda function time outs incorrectly #22536

Agent possibly reporting Lambda function time outs incorrectly #22536

Comments

ryan-breidenbach commented Feb 3, 2024