ddtrace/tracer: enable trace writer to optionally retry on failure. #1636
Conversation
Force-pushed from cddc51b to f8e161a
Adding retries is ok in general, and we should add it in a way that isn't specifically tied to Serverless only. I'd like to see retries have their own option to configure them.
ddtrace/tracer/writer.go (Outdated)
if retries > 0 {
	p = oldp.clone()
}
Wondering why we would want to clone the payload. I think at this point it should be reusable for multiple send attempts.
Did you find any issue reusing it?
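For context, here is a minimal stand-alone illustration of the reuse idea, assuming the encoded bytes sit behind a seekable reader; the real payload type is unexported, so all names below are stand-ins:

```go
// Illustration only: a retry can rewind the same buffered bytes via Seek
// instead of allocating a clone of the payload.
package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	// Stand-in for the payload's encoded msgpack body.
	body := bytes.NewReader([]byte("encoded traces"))

	// Stand-in for transport.send: it consumes the reader.
	send := func(r io.Reader) error {
		b, _ := io.ReadAll(r)
		fmt.Printf("sent %d bytes\n", len(b))
		return nil
	}

	_ = send(body) // first attempt drains the reader

	// A second attempt only needs the read offset rewound.
	if _, err := body.Seek(0, io.SeekStart); err == nil {
		_ = send(body)
	}
}
```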
ddtrace/tracer/writer.go (Outdated)
log.Debug("Sending payload: size: %d traces: %d\n", size, count)
rc, err := h.config.transport.send(p)
if err != nil {
	if retries > 0 && attempt != retries {
This is a branch that detects the terminating condition for the loop. Perhaps it would be better to have the loop do the retries and handle errors outside it; this would also reduce the nesting in the loop.
Perhaps something like:
for retries {
	err := send()
	if err != nil {
		log()
		sleep()
		continue
	}
	// success
	// set stats ...
	return
}
// Loop reached end, meaning traces were dropped
// set stats ...
log()
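A self-contained sketch of that shape, using a `sendFn` stand-in for `transport.send` and plain printing in place of the tracer's internal log and statsd calls:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// flushWithRetries keeps the retry handling inside the loop body and deals
// with the terminal "payload dropped" case once, after the loop.
func flushWithRetries(retries int, sendFn func() error) error {
	var err error
	for attempt := 0; attempt <= retries; attempt++ {
		if err = sendFn(); err != nil {
			fmt.Printf("send attempt %d failed: %v\n", attempt+1, err)
			time.Sleep(time.Millisecond) // brief pause between attempts
			continue
		}
		// Success: this is where the flush stats would be recorded.
		return nil
	}
	// Loop reached the end: the traces are dropped; record the
	// traces_dropped stat and log the loss here.
	return fmt.Errorf("payload dropped after %d attempts: %w", retries+1, err)
}

func main() {
	err := flushWithRetries(2, func() error { return errors.New("connection reset") })
	fmt.Println(err)
}
```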
ddtrace/tracer/writer.go
Outdated
size, count := p.size(), p.itemCount() | ||
log.Debug("Sending payload: size: %d traces: %d\n", size, count) | ||
rc, err := h.config.transport.send(p) | ||
if err != nil { |
Perhaps we should check for client errors here. If we're getting 400 errors (other than 429 Too Many Requests), then we probably don't want to try again. WDYT?
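A minimal sketch of that check, assuming the caller can get at the HTTP status code of a failed send; the helper name `shouldRetry` is hypothetical:

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetry reports whether a failed send is worth repeating: 429 and
// 5xx/transport-level failures may be transient, while other 4xx responses
// mean the payload itself was rejected and a retry will not help.
func shouldRetry(statusCode int) bool {
	if statusCode == http.StatusTooManyRequests {
		return true
	}
	if statusCode >= 400 && statusCode < 500 {
		return false
	}
	return true
}

func main() {
	for _, code := range []int{400, 429, 503} {
		fmt.Printf("status %d -> retry=%v\n", code, shouldRetry(code))
	}
}
```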
ddtrace/tracer/option.go
Outdated
@@ -540,6 +544,7 @@ func WithDebugMode(enabled bool) StartOption { | |||
func WithLambdaMode(enabled bool) StartOption { | |||
return func(c *config) { | |||
c.logToStdout = enabled | |||
c.sendRetries = 2 |
Enabling this here will not do anything, since `logToStdout` causes the tracer to use the `logTraceWriter` instead of the `agentTraceWriter`.
Sending data to stdout is now optional. By default we now use the regular `agentTraceWriter`. See https://github.com/DataDog/datadog-lambda-go/blob/6b73b08221845f90542e1ea7539f27da26715c6e/internal/trace/listener.go#L72.
I see now that this can still work if you specify `WithLambdaMode(false)`, but to me this is confusing. Another option would be more appropriate, I think.
Force-pushed from de0aa5f to 1ac1705
Nearly there, this is looking good.
ddtrace/tracer/option.go
Outdated
func WithLambdaMode(enabled bool) StartOption { | ||
return func(c *config) { | ||
c.logToStdout = enabled | ||
} | ||
} | ||
|
||
// WithSendRetries enables |
Maybe something like
// WithSendRetries enables re-sending payloads that are not successfully submitted to the agent.
// This will cause the tracer to retry the send at most `retries` times.
Just a suggestion.
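Put together, the option might read roughly as below. The `config` and `StartOption` declarations are stand-ins that only mimic the tracer's real (unexported) ones so the snippet compiles on its own; the merged code may differ:

```go
package main

import "fmt"

// Stand-ins for the tracer's unexported config and its StartOption type.
type config struct {
	sendRetries int
}

type StartOption func(c *config)

// WithSendRetries enables re-sending payloads that are not successfully
// submitted to the agent. This will cause the tracer to retry the send at
// most `retries` times.
func WithSendRetries(retries int) StartOption {
	return func(c *config) {
		c.sendRetries = retries
	}
}

func main() {
	var c config
	WithSendRetries(2)(&c)
	fmt.Println("sendRetries:", c.sendRetries)
}
```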
ddtrace/tracer/payload.go (Outdated)
// reset sets up the payload to be read a second time. It maintains the
// underlying byte contents of the buffer.
Let's actually not delete this comment. The original idea behind `reset` was to reuse `payload` for another set of traces after a successful delivery by resetting the buffer to empty. This (in principle) would save memory, since the same backing byte buffer could be used, avoiding GC churn.
This note is here to remind everyone why we don't do this (after having tried). I think this should be updated and moved to the `payload` type documentation, along with the warning inside the function, since `reset` is no longer a function that prepares the payload for reuse. Basically, a note on the type about why we don't/can't reuse `payload` objects for multiple payloads.
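One possible shape for that type-level note (the wording is only a suggestion, not the merged text):

```go
// payload holds the encoded list of traces that is flushed to the agent as a
// single request body.
//
// A payload must not be reused for a new set of traces after a successful
// delivery: recycling the backing buffer was tried and abandoned, because
// faulty transport implementations (or the standard library) may keep
// references to the payload and turn the saved allocation into a memory
// leak. See dd-trace-go#976.
type payload struct {
	// fields elided
}
```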
ddtrace/tracer/writer.go
Outdated
h.statsd.Count("datadog.tracer.traces_dropped", int64(count), []string{"reason:send_failed"}, 1) | ||
log.Error("lost %d traces: %v", count, err) |
These 2 lines can be moved after the `for` loop, and we can get rid of this `if` statement.
ddtrace/tracer/writer.go (Outdated)
// a memory leak when references to this object may still be kept by faulty transport
// implementations or the standard library. See dd-trace-go#976
p.buf = bytes.Buffer{}
p.reader = nil
Let's make this a `payload` method. We don't want to be messing with the payload internals here.
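For illustration, such a method could look like this; the name `clear` is a guess and not necessarily what the PR will end up with:

```go
// clear drops the payload's buffer and reader so the memory held by the
// encoded traces can be reclaimed even if a faulty transport or the standard
// library still keeps a reference to the payload. See dd-trace-go#976.
func (p *payload) clear() {
	p.buf = bytes.Buffer{}
	p.reader = nil
}
```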
Force-pushed from 004ca3b to 3cc6be4
Looks good!
Force-pushed from 1ef2ea0 to 59e52bd
What does this PR do?
This pull request implements trace writer retries when there is a failure. It does so by cloning the payload before sending it. Retries are optional and are enabled by default when running in AWS Lambda.
Note that this change introduces an extra allocation for every send attempt. However, it only does so when the tracer is configured for retries, so the extra allocation will only occur when running in Lambda. I feel this is a reasonable trade-off.
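As a usage sketch, assuming the `WithSendRetries` option lands as proposed in this PR:

```go
package main

import "gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"

func main() {
	// Retries can be enabled explicitly, independent of Lambda mode.
	tracer.Start(tracer.WithSendRetries(2))
	defer tracer.Stop()
}
```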
Motivation
In very rare cases, we are seeing send errors from the trace writer.
While increasing the timeout helps, some of these failures happen before the timeout is hit. This is because the Datadog Lambda extension has been paused in the middle of the request, and when that happens the connection is abruptly closed.
Describe how to test/QA your changes
These errors were happening very infrequently, less than 0.1% of the time. Therefore, testing means heavily executing a function in AWS Lambda and checking the logs for any errors.
Reviewer's Checklist
- Triage milestone is set.