Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agent possibly reporting Lambda function time outs incorrectly #22536

Open
ryan-breidenbach opened this issue Feb 3, 2024 · 0 comments
Open

Comments

@ryan-breidenbach
Copy link

Agent Environment

We are using the following Datadog Lambda layers:

  • arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Node18-x:96
  • arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Extension:46

Our Lambda functions are using the Node.js 18.x runtime and x86_64 architecture.

Describe what happened:

On multiple occasions, we have seen error messages that look like this that we believe may not be correct.

image

The pattern we have observed is that there is an actual instance of a Lambda function timing out (i.e., the execution time exceeding the Lambda function’s configured time out). What happens next is subsequent invocations of that same Lambda function appear to execute successfully, and well under the 30 time out. Still, they are reported as time-out errors in Datadog. What is more peculiar is the time-out value that is reported is well under 30 seconds:

image

A specific occurrence of this happened this Thursday in one of our applications. Here are the corresponding Datadog logs. As you can see, the first execution did exceed 30s (31.16s). But all subsequent executions did not.

image

Further evidence that this is likely an issue attributed to Datadog is we do not see these logs in CloudWatch. Here is a screenshot from CloudWatch for that exact time range as the Datadog query above, and we only see the first instance of the “Task timed out” error message.

image

The fact that we are also sending these logs to Datadog, but the incorrect time-out logs only appear in Datadog leads me to believe that this incorrect behavior is something caused by Datadog.

I dug into the code for the Datadog agent, and there is one place in that code where it is explicitly sending a “Task timed out” message:

// createStringRecordForTimeoutLog returns the `Task timed out` log using the platform.report message
func createStringRecordForTimeoutLog(l *LambdaLogAPIMessage) string {
	durationMs := l.objectRecord.reportLogItem.durationMs
	durationSeconds := durationMs / 1000
	timeStr := l.time.Format(time.RFC3339Nano)
	return fmt.Sprintf(`%s %s Task timed out after %.2f seconds`, timeStr, l.objectRecord.requestID, durationSeconds)
}

And it looks like this function is invoked in exactly one place:

isTimeoutLog := message.logType == logTypePlatformReport && message.objectRecord.status == timeoutStatus
if isTimeoutLog {
    lc.out <- logConfig.NewChannelMessageFromLambda([]byte(createStringRecordForTimeoutLog(&message)), message.time, lc.arn, message.objectRecord.requestID, isTimeoutLog)
}

I am not sure in what circumstances this block of code will be executed. It feels like the state of the log messages becomes “tainted” after a time-out, and subsequent log events are erroneously categorized as time-outs.

Describe what you expected:

We only expect to see a Task timed out message in our logs when a Lambda function actually times out.

Steps to reproduce the issue:

Have a Lambda function exceed its configured time-out value, then invoke that same function again.

Additional environment details (Operating System, Cloud provider, etc):

See the information provided above regarding our Lambda function configuration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants