ddtrace/tracer: enable trace writer to optionally retry on failure. #1636

Merged
merged 12 commits into main from rey.abolofia/flush-retries on Feb 1, 2023

Conversation

purple4reina
Contributor

What does this PR do?

This pull request implements trace writer retries when there is a failure. It does so by cloning the payload before sending it. Retries are optional and are enabled by default when running in AWS Lambda.

Note that this change introduces an extra allocation for every send attempt. However, it does so only when the tracer is configured for retries, so the extra allocation will only occur when running in Lambda. I feel this is a reasonable trade-off.
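
For readers skimming the review, here is a rough standalone sketch of the retry-with-clone idea described above (illustrative only, not the PR's actual code; the package name, function name, URL, and content type are placeholders):

package retry

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

// sendWithRetries posts body to url, retrying up to `retries` additional times on
// failure. Each attempt wraps the same encoded bytes in a fresh bytes.Reader,
// which mirrors the clone-before-send approach and its small per-attempt
// allocation mentioned above.
func sendWithRetries(client *http.Client, url string, body []byte, retries int) error {
    var lastErr error
    for attempt := 0; attempt <= retries; attempt++ {
        resp, err := client.Post(url, "application/msgpack", bytes.NewReader(body))
        if err != nil {
            lastErr = err // e.g. connection reset, broken pipe, timeout
            continue
        }
        io.Copy(io.Discard, resp.Body)
        resp.Body.Close()
        if resp.StatusCode < 400 {
            return nil
        }
        lastErr = fmt.Errorf("unexpected status: %s", resp.Status)
    }
    return fmt.Errorf("all %d attempts failed: %w", retries+1, lastErr)
}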

Motivation

In very rare cases, we are seeing errors like

2022/12/19 21:09:41 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": read tcp 127.0.0.1:44108->127.0.0.1:8126: read: connection reset by peer ([send duration: 0.327196ms]) (occurred: 19 Dec 22 21:07 UTC)

2022/12/19 21:15:44 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": write tcp 127.0.0.1:45932->127.0.0.1:8126: write: broken pipe ([send duration: 0.225527ms]) (occurred: 19 Dec 22 21:14 UTC)

2022/12/19 21:17:54 Datadog Tracer v1.45.1 ERROR: lost 1 traces: Post "http://localhost:8126/v0.4/traces": context deadline exceeded (Client.Timeout exceeded while awaiting headers) ([send duration: 19.249s]) (occurred: 19 Dec 22 21:14 UTC)

While increasing the timeout helps, some of these failures happen before the timeout is hit. This is because the Datadog Lambda extension gets paused in the middle of the request, which abruptly closes the connection.

Describe how to test/QA your changes

These errors happen very infrequently, less than 0.1% of the time. Testing therefore means invoking a function in AWS Lambda at high volume and checking the logs for any errors.

Reviewer's Checklist

  • If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • Changed code has unit tests for its functionality.
  • If this interacts with the agent in a new way, a system test has been added.

@purple4reina purple4reina added this to the Triage milestone Dec 23, 2022
@purple4reina purple4reina requested a review from a team December 23, 2022 21:35
@purple4reina purple4reina force-pushed the rey.abolofia/flush-retries branch 2 times, most recently from cddc51b to f8e161a on January 3, 2023 at 17:14

pr-commenter bot commented Jan 3, 2023

Benchmarks

Comparing candidate commit 53138cf in PR branch rey.abolofia/flush-retries with baseline commit b5ebd0e in branch main.

Found 0 performance improvements and 1 performance regression. Performance is the same for 5 cases.

Contributor

@knusbaum knusbaum left a comment

Adding retries is OK in general, but we should add them in a way that isn't tied specifically to Serverless. I'd like to see retries get their own configuration option.

Comment on lines 99 to 101
if retries > 0 {
p = oldp.clone()
}
Contributor

Wondering why we would want to clone the payload. I think at this point it should be reusable for multiple send attempts.

Did you find any issues with reusing it?

log.Debug("Sending payload: size: %d traces: %d\n", size, count)
rc, err := h.config.transport.send(p)
if err != nil {
if retries > 0 && attempt != retries {
Contributor

This branch detects the loop's terminating condition. It might be better to have the loop itself perform the retries and handle the failure case after it, which would also reduce the nesting inside the loop.

Perhaps something like:

for retries {
    err := send()
    if err != nil {
        log()
        sleep()
        continue
    }
    // success
    // set stats ...
    return
}

// Loop reached end, meaning tracer dropped
// set stats ...
log()
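
In concrete Go terms, the suggested shape might look roughly like the following (illustrative only; it reuses the names from the hunks above, assumes transport.send returns an io.ReadCloser, treats retryInterval as a hypothetical pause between attempts, and elides the success-path stats):

var (
    rc  io.ReadCloser
    err error
)
for attempt := 0; attempt <= retries; attempt++ {
    rc, err = h.config.transport.send(p)
    if err != nil {
        log.Error("failure sending traces (attempt %d of %d): %v", attempt+1, retries+1, err)
        time.Sleep(retryInterval) // hypothetical pause before the next attempt
        continue
    }
    // success: record flush stats, consume the response via rc, and return
    rc.Close()
    return
}
// The loop ran out of attempts, so the traces are dropped.
h.statsd.Count("datadog.tracer.traces_dropped", int64(count), []string{"reason:send_failed"}, 1)
log.Error("lost %d traces: %v", count, err)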

size, count := p.size(), p.itemCount()
log.Debug("Sending payload: size: %d traces: %d\n", size, count)
rc, err := h.config.transport.send(p)
if err != nil {
Contributor

Perhaps we should check for client errors here. If we're getting 4xx errors (other than 429 Too Many Requests), then we probably don't want to try again. WDYT?
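
One way to express that check, assuming the send path can surface the HTTP status code alongside any transport error (sketch only; the function name is a placeholder and net/http is the only import needed):

// shouldRetry reports whether a failed send attempt is worth repeating.
// Transport-level errors (connection reset, broken pipe, timeout) and
// server-side failures are retried; client errors other than
// 429 Too Many Requests are not, since they are likely to recur.
func shouldRetry(statusCode int, err error) bool {
    if err != nil {
        return true
    }
    if statusCode == http.StatusTooManyRequests {
        return true
    }
    return statusCode >= 500
}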

@@ -540,6 +544,7 @@ func WithDebugMode(enabled bool) StartOption {
func WithLambdaMode(enabled bool) StartOption {
return func(c *config) {
c.logToStdout = enabled
c.sendRetries = 2
Contributor

Enabling this here will not do anything, since logToStdout causes the tracer to use the logTraceWriter instead of the agentTraceWriter.

Contributor Author

Sending data to stdout is now optional. By default we now use the regular agentTraceWriter. See https://github.com/DataDog/datadog-lambda-go/blob/6b73b08221845f90542e1ea7539f27da26715c6e/internal/trace/listener.go#L72.

Contributor

I see now that this can still work if you specify WithLambdaMode(false), but that seems confusing to me.

A separate option would be more appropriate, I think.

@knusbaum knusbaum changed the title [Serverless] Enable trace writer to optionally retry on failure. ddtrace/tracer: enable trace writer to optionally retry on failure. Jan 3, 2023
@purple4reina purple4reina force-pushed the rey.abolofia/flush-retries branch 2 times, most recently from de0aa5f to 1ac1705 on January 4, 2023 at 22:44
@purple4reina purple4reina requested a review from knusbaum January 6, 2023 15:34
Contributor

@knusbaum knusbaum left a comment

Nearly there, this is looking good.

func WithLambdaMode(enabled bool) StartOption {
return func(c *config) {
c.logToStdout = enabled
}
}

// WithSendRetries enables
Contributor

Maybe something like

// WithSendRetries enables re-sending payloads that are not successfully submitted to the agent.
// This will cause the tracer to retry the send at most `retries` times.

Just a suggestion.
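
For context, the option wired to the sendRetries field shown earlier would presumably look something like this (sketch only; the final name and signature are whatever the PR settles on):

// WithSendRetries enables re-sending payloads that are not successfully
// submitted to the agent. This will cause the tracer to retry the send at
// most `retries` times.
func WithSendRetries(retries int) StartOption {
    return func(c *config) {
        c.sendRetries = retries
    }
}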

Comment on lines 80 to 81
// reset sets up the payload to be read a second time. It maintains the
// underlying byte contents of the buffer.
Contributor

Let's actually not delete this comment. The original idea behind reset was to reuse payload for another set of traces after a successful delivery by resetting the buffer to empty. This (in principle) would save memory since the same backing byte buffer could be used, avoiding GC churn.

This note is here to remind everyone why we don't do this (after having tried). I think it should be updated and moved to the payload type documentation, along with the warning inside the function, since reset no longer prepares the payload for reuse. Basically, a note on the type explaining why we don't/can't reuse payload objects across multiple payloads.

Comment on lines 117 to 118
h.statsd.Count("datadog.tracer.traces_dropped", int64(count), []string{"reason:send_failed"}, 1)
log.Error("lost %d traces: %v", count, err)
Contributor

These 2 lines can be moved after the for loop, and we can get rid of this if statement.

// a memory leak when references to this object may still be kept by faulty transport
// implementations or the standard library. See dd-trace-go#976
p.buf = bytes.Buffer{}
p.reader = nil
Contributor

Let's make this a payload method. We don't want to be messing with the payload internals here.
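
A minimal version of such a method, using the fields from the hunk above (sketch only; the name clear is a placeholder, and bytes is the only import needed):

// clear drops the payload's underlying buffer and reader so their memory can
// be reclaimed even if a faulty transport implementation or the standard
// library keeps a reference to the payload itself. See dd-trace-go#976.
func (p *payload) clear() {
    p.buf = bytes.Buffer{}
    p.reader = nil
}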

@purple4reina purple4reina force-pushed the rey.abolofia/flush-retries branch from 004ca3b to 3cc6be4 on January 18, 2023 at 20:56
knusbaum previously approved these changes Jan 19, 2023

Contributor

@knusbaum knusbaum left a comment

Looks good!

@knusbaum knusbaum modified the milestones: Triage, v1.47.0 Jan 25, 2023
@knusbaum knusbaum merged commit 648c900 into main Feb 1, 2023
@knusbaum knusbaum deleted the rey.abolofia/flush-retries branch February 1, 2023 19:09
@knusbaum knusbaum modified the milestones: v1.47.0, v1.48.0 Feb 1, 2023
dianashevchenko pushed a commit that referenced this pull request Feb 9, 2023