ddtrace/tracer: enable trace writer to optionally retry on failure. #1636

Merged
merged 12 commits into main from rey.abolofia/flush-retries on Feb 1, 2023

Conversation

purple4reina
Contributor

What does this PR do?

This pull request implements trace writer retries when there is a failure. It does so by cloning the payload before sending it. Retries are optional and are enabled by default when running in AWS Lambda.

Note that this change introduces an extra allocation for every send attempt. However, it does so only when the tracer is configured for retries, so the extra allocation will only occur when running in Lambda. I feel this is a reasonable trade-off.
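
For readers skimming the review, here is a rough standalone sketch of the retry-with-clone idea described above (illustrative only, not the PR's actual code; the package name, function name, URL, and content type are placeholders):

package retry

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

// sendWithRetries posts body to url, retrying up to `retries` additional times on
// failure. Each attempt wraps the same encoded bytes in a fresh bytes.Reader,
// which mirrors the clone-before-send approach and its small per-attempt
// allocation mentioned above.
func sendWithRetries(client *http.Client, url string, body []byte, retries int) error {
    var lastErr error
    for attempt := 0; attempt <= retries; attempt++ {
        resp, err := client.Post(url, "application/msgpack", bytes.NewReader(body))
        if err != nil {
            lastErr = err // e.g. connection reset, broken pipe, timeout
            continue
        }
        io.Copy(io.Discard, resp.Body)
        resp.Body.Close()
        if resp.StatusCode < 400 {
            return nil
        }
        lastErr = fmt.Errorf("unexpected status: %s", resp.Status)
    }
    return fmt.Errorf("all %d attempts failed: %w", retries+1, lastErr)
}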

Motivation

In very rare cases, we are seeing errors like

2022/12/19 21:09:41 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": read tcp 127.0.0.1:44108->127.0.0.1:8126: read: connection reset by peer ([send duration: 0.327196ms]) (occurred: 19 Dec 22 21:07 UTC)

2022/12/19 21:15:44 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": write tcp 127.0.0.1:45932->127.0.0.1:8126: write: broken pipe ([send duration: 0.225527ms]) (occurred: 19 Dec 22 21:14 UTC)

2022/12/19 21:17:54 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": context deadline exceeded (Client.Timeout exceeded while awaiting headers) ([send duration: 19.249s]) (occurred: 19 Dec 22 21:14 UTC)

While increasing the timeout helps, some of these failures happen before the timeout is hit. This is because the Datadog Lambda extension gets paused in the middle of the request, which abruptly closes the connection.

Describe how to test/QA your changes

These errors happen very infrequently, less than 0.1% of the time. Testing therefore means invoking a function in AWS Lambda at high volume and checking the logs for any errors.

Reviewer's Checklist

  • If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • Changed code has unit tests for its functionality.
  • If this interacts with the agent in a new way, a system test has been added.

@purple4reina purple4reina added this to the Triage milestone Dec 23, 2022
@purple4reina purple4reina requested a review from a team December 23, 2022 21:35
@purple4reina purple4reina force-pushed the rey.abolofia/flush-retries branch 2 times, most recently from cddc51b to f8e161a on January 3, 2023 at 17:14

pr-commenter bot commented Jan 3, 2023

Benchmarks

Comparing candidate commit 53138cf in PR branch rey.abolofia/flush-retries with baseline commit b5ebd0e in branch main.

Found 0 performance improvements and 1 performance regression. Performance is the same for 5 cases.

Contributor

@knusbaum knusbaum left a comment

Adding retries is OK in general, but we should add them in a way that isn't tied specifically to Serverless. I'd like to see retries get their own configuration option.

Comment on lines 99 to 101
if retries > 0 {
p = oldp.clone()
}
Contributor

Wondering why we would want to clone the payload. I think at this point it should be reusable for multiple send attempts.

Did you find any issues with reusing it?

log.Debug("Sending payload: size: %d traces: %d\n", size, count)
rc, err := h.config.transport.send(p)
if err != nil {
if retries > 0 && attempt != retries {
Contributor

This branch detects the loop's terminating condition. It might be better to have the loop itself perform the retries and handle the failure case after it, which would also reduce the nesting inside the loop.

Perhaps something like:

for retries {
    err := send()
    if err != nil {
        log()
        sleep()
        continue
    }
    // success
    // set stats ...
    return
}

// Loop reached end, meaning tracer dropped
// set stats ...
log()
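
In concrete Go terms, the suggested shape might look roughly like the following (illustrative only; it reuses the names from the hunks above, assumes transport.send returns an io.ReadCloser, treats retryInterval as a hypothetical pause between attempts, and elides the success-path stats):

var (
    rc  io.ReadCloser
    err error
)
for attempt := 0; attempt <= retries; attempt++ {
    rc, err = h.config.transport.send(p)
    if err != nil {
        log.Error("failure sending traces (attempt %d of %d): %v", attempt+1, retries+1, err)
        time.Sleep(retryInterval) // hypothetical pause before the next attempt
        continue
    }
    // success: record flush stats, consume the response via rc, and return
    rc.Close()
    return
}
// The loop ran out of attempts, so the traces are dropped.
h.statsd.Count("datadog.tracer.traces_dropped", int64(count), []string{"reason:send_failed"}, 1)
log.Error("lost %d traces: %v", count, err)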

size, count := p.size(), p.itemCount()
log.Debug("Sending payload: size: %d traces: %d\n", size, count)
rc, err := h.config.transport.send(p)
if err != nil {
Contributor

Perhaps we should check for client errors here. If we're getting 4xx errors (other than 429 Too Many Requests), then we probably don't want to try again. WDYT?
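
One way to express that check, assuming the send path can surface the HTTP status code alongside any transport error (sketch only; the function name is a placeholder and net/http is the only import needed):

// shouldRetry reports whether a failed send attempt is worth repeating.
// Transport-level errors (connection reset, broken pipe, timeout) and
// server-side failures are retried; client errors other than
// 429 Too Many Requests are not, since they are likely to recur.
func shouldRetry(statusCode int, err error) bool {
    if err != nil {
        return true
    }
    if statusCode == http.StatusTooManyRequests {
        return true
    }
    return statusCode >= 500
}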

@@ -540,6 +544,7 @@ func WithDebugMode(enabled bool) StartOption {
func WithLambdaMode(enabled bool) StartOption {
return func(c *config) {
c.logToStdout = enabled
c.sendRetries = 2
Contributor

Enabling this here will not do anything, since logToStdout causes the tracer to use the logTraceWriter instead of the agentTraceWriter.

Contributor Author

Sending data to stdout is now optional. By default we now use the regular agentTraceWriter. See https://github.com/DataDog/datadog-lambda-go/blob/6b73b08221845f90542e1ea7539f27da26715c6e/internal/trace/listener.go#L72.

Contributor

I see now that this can still work if you specify WithLambdaMode(false), but that seems confusing to me.

A separate option would be more appropriate, I think.

@knusbaum knusbaum changed the title [Serverless] Enable trace writer to optionally retry on failure. ddtrace/tracer: enable trace writer to optionally retry on failure. Jan 3, 2023
@purple4reina purple4reina force-pushed the rey.abolofia/flush-retries branch 2 times, most recently from de0aa5f to 1ac1705 on January 4, 2023 at 22:44
@purple4reina purple4reina requested a review from knusbaum January 6, 2023 15:34
Contributor

@knusbaum knusbaum left a comment

Nearly there, this is looking good.

func WithLambdaMode(enabled bool) StartOption {
return func(c *config) {
c.logToStdout = enabled
}
}

// WithSendRetries enables
Contributor

Maybe something like

// WithSendRetries enables re-sending payloads that are not successfully submitted to the agent.
// This will cause the tracer to retry the send at most `retries` times.

Just a suggestion.
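
For context, the option wired to the sendRetries field shown earlier would presumably look something like this (sketch only; the final name and signature are whatever the PR settles on):

// WithSendRetries enables re-sending payloads that are not successfully
// submitted to the agent. This will cause the tracer to retry the send at
// most `retries` times.
func WithSendRetries(retries int) StartOption {
    return func(c *config) {
        c.sendRetries = retries
    }
}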

Comment on lines 80 to 81
// reset sets up the payload to be read a second time. It maintains the
// underlying byte contents of the buffer.
Contributor

Let's actually not delete this comment. The original idea behind reset was to reuse payload for another set of traces after a successful delivery by resetting the buffer to empty. This (in principle) would save memory since the same backing byte buffer could be used, avoiding GC churn.

This note is here to remind everyone why we don't do this (after having tried). I think it should be updated and moved to the payload type documentation, along with the warning inside the function, since reset no longer prepares the payload for reuse. Basically, a note on the type explaining why we don't/can't reuse payload objects across multiple payloads.

Comment on lines 117 to 118
h.statsd.Count("datadog.tracer.traces_dropped", int64(count), []string{"reason:send_failed"}, 1)
log.Error("lost %d traces: %v", count, err)
Contributor

These 2 lines can be moved after the for loop, and we can get rid of this if statement.

// a memory leak when references to this object may still be kept by faulty transport
// implementations or the standard library. See dd-trace-go#976
p.buf = bytes.Buffer{}
p.reader = nil
Contributor

Let's make this a payload method. We don't want to be messing with the payload internals here.
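
A minimal version of such a method, using the fields from the hunk above (sketch only; the name clear is a placeholder, and bytes is the only import needed):

// clear drops the payload's underlying buffer and reader so their memory can
// be reclaimed even if a faulty transport implementation or the standard
// library keeps a reference to the payload itself. See dd-trace-go#976.
func (p *payload) clear() {
    p.buf = bytes.Buffer{}
    p.reader = nil
}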

@purple4reina purple4reina force-pushed the rey.abolofia/flush-retries branch from 004ca3b to 3cc6be4 on January 18, 2023 at 20:56
knusbaum previously approved these changes Jan 19, 2023

Contributor

@knusbaum knusbaum left a comment

Looks good!

@knusbaum knusbaum modified the milestones: Triage, v1.47.0 Jan 25, 2023
@knusbaum knusbaum merged commit 648c900 into main Feb 1, 2023
@knusbaum knusbaum deleted the rey.abolofia/flush-retries branch February 1, 2023 19:09
@knusbaum knusbaum modified the milestones: v1.47.0, v1.48.0 Feb 1, 2023
dianashevchenko pushed a commit that referenced this pull request Feb 9, 2023