Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/telemetry: track tracer init time metric #1896

Merged
merged 26 commits into from
Apr 18, 2023
Merged

Conversation

lievan
Copy link
Contributor

@lievan lievan commented Apr 12, 2023

What does this PR do?

Implements TimeGauge telemetry function and Distributions telemetry event to track tracer init time via the dd.instrumentation_telemetry_data.tracers.tracer_init_time metric.

Motivation

Describe how to test/QA your changes

Start a sample app with tracing and turn on the DD_INSTRUMENTATION_TELEMETRY_DEBUG flag.

  1. Search for dd.instrumentation_telemetry_data.tracers.tracer_init_time in the staging metrics explorer.

image

  1. Query for service:tracer-telemetry-intake @payload_dog_tag.request_type:distributions in staging logs

image

Reviewer's Checklist

  • Changed code has unit tests for its functionality.
  • If this interacts with the agent in a new way, a system test has been added.

Sorry, something went wrong.

@lievan lievan added this to the v1.50.0 milestone Apr 12, 2023
@lievan lievan requested a review from a team as a code owner April 12, 2023 19:35
@lievan lievan requested a review from a team April 12, 2023 19:35
@pr-commenter
Copy link

pr-commenter bot commented Apr 12, 2023

Benchmarks

Comparing candidate commit 160e661 in PR branch evan.li/time-gauge-2 with baseline commit 361d193 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 18 metrics, 0 unstable metrics.

@lievan lievan marked this pull request as draft April 12, 2023 20:21

// copy over requests so we can do the actual submission without holding
// the lock. Zero out the old stuff so we don't leak references
for i, r := range c.requests {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to populate the submissions slice before adding metrics events so that the sequence id's are in the right order. Sequence id represents the order in which telemetry events happened. newRequest increments the client's sequence id, and newRequest for metrics is only called inside the flush function.

system tests were failing because the submissions array had out-of-order sequence id's, for example:
[6, 1, 2, 3, 4, 5].
Telemetry system tests does not expect the events to arrive in a strict order, but it will assert that out-of-order events arrive within a certain timeframe of each other (0.3 seconds), and in this example event 6 was arriving way before event 5.

server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.Header.Get("DD-Telemetry-Request-Type") == string(RequestTypeAppClosing) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't rely on app-closing to flush the metrics anymore, since app-closing is sent before metrics. Now we'll just hard code the expected metric messages and wait until we've hit that count.

@lievan lievan marked this pull request as ready for review April 14, 2023 21:18
Copy link
Contributor

@ajgajg1134 ajgajg1134 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions and comments here and there, looking pretty good!

@@ -144,6 +144,8 @@ type client struct {
// metrics are sent
metrics map[Namespace]map[string]*metric
newMetrics bool
// flush interval incremented every time flush() is called
interval int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere? It looks like it's always 0 in the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not needed - I originally had this when I interpreted interval for Gauge to mean the flush interval - will remove it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add interval back somewhere appropriate when we discover a good use case for Gauge and Rate type metrics.

lievan and others added 10 commits April 18, 2023 09:55
Co-authored-by: Andrew Glaude <andrew.glaude@datadoghq.com>
Co-authored-by: Andrew Glaude <andrew.glaude@datadoghq.com>
Co-authored-by: Andrew Glaude <andrew.glaude@datadoghq.com>
Co-authored-by: Andrew Glaude <andrew.glaude@datadoghq.com>
Co-authored-by: Andrew Glaude <andrew.glaude@datadoghq.com>
@lievan lievan changed the base branch from main to shared/128-bit April 18, 2023 14:36
@lievan lievan requested review from a team as code owners April 18, 2023 14:36
@lievan lievan changed the base branch from shared/128-bit to main April 18, 2023 14:36
@lievan lievan removed request for a team April 18, 2023 14:37
Copy link
Contributor

@ajgajg1134 ajgajg1134 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there! just two small comments

client.Record(NamespaceTracers, MetricKindDist, "soobar", 1, nil, false)
client.Record(NamespaceTracers, MetricKindDist, "soobar", 3, nil, false)
client.mu.Lock()
client.flush()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't stop trigger a flush? So is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually meant to remove client.stop() here, thanks for pointing this out!
I wanted to replace stop with a direct call to flush so this unit test doesn't rely on the functionality of stop. Implementation of stop may change in the future because of our conversation about only sending app-closing when SIGINT and SIGQUIT are intercepted.

Copy link
Contributor

@ajgajg1134 ajgajg1134 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed QA notes!

@lievan lievan merged commit 58ba73d into main Apr 18, 2023
@lievan lievan deleted the evan.li/time-gauge-2 branch April 18, 2023 19:17
zARODz11z pushed a commit that referenced this pull request May 8, 2023
…ch as queuename tags

contrib: upgrade labstack/echo/v4 from v4.2.0 to v4.9.0 (#1891)

ci: fix flaky lint job (#1892)

contrib/elasticsearch: use naming schema (#1897)

ci: introduce golangci (#1898)

appsec: suspicious request blocking (#1797)

Co-authored-by: Julio Guerra <julio@datadog.com>

ci/golangci-lint: skip google.golang.org/grpc.v12 (#1899)

.github/workflows: run ASM and RC system-tests scenarios (#1900)

contrib/hashicorp/vault: use naming schema (#1868)

contrib/database/sql: add WithIgnoreQueryTypes option (#1823)

Co-authored-by: Zarir Hamza <zarir.hamza@datadoghq.com>
Co-authored-by: Rodrigo Argüello <rodrigo.arguello@datadoghq.com>

contrib/database/sql: use naming schema (#1895)

internal/appsec: add server.request.method address (#1893)

Signed-off-by: Eliott Bouhana <eliott.bouhana@datadoghq.com>
Co-authored-by: François Mazeau <francois.mazeau@datadoghq.com>

internal/appsec/dyngo: atomic instrumentation swapping (#1873)

Co-authored-by: François Mazeau <francois.mazeau@datadoghq.com>

go.mod: datadog-agent/pkg/remoteconfig/state 7.45.0-rc.1 (#1902)

internal/version: bump to v1.51.0 (#1912)

ddtrace/tracer: don't set empty tracestate propagation tag (#1910)

go.mod: github.com/DataDog/datadog-agent/pkg/obfuscate 7.45.0-rc.1 (#1916)

appsec: add blocking SDK body operation (#1901)

* Modifying the appsec api: adding appsec.MonitorParsedHTTPBody an error as return value
* Adding a call to the WAF to check for security event synchronously with a call to appsec.MonitorParsedHTTPBody on the body passed as parameter
* Removing the call to the WAF done on the body an the end of a request because we moved it.
* Refactoring the waf addresses storage and access

Signed-off-by: Eliott Bouhana <eliott.bouhana@datadoghq.com>

ddtrace/{opentelemetry,opentracer}: add telemetry (#1909)

internal/appsec: fix user ID event detection (#1918)

internal/telemetry: track tracer init time metric (#1896)

Co-authored-by: Andrew Glaude <andrew.glaude@datadoghq.com>

internal/appsec/remoteconfig: fix rules overrides (#1921)

contrib/mongodb: use naming schema (#1908)

contrib/syndtr/goleveldb/leveldb: use naming schema (#1914)

contrib/tidwall/buntdb: use naming schema (#1913)

internal/appsec: do not ignore the appsec events rate limiter (#1927)

remoteconfig: remove empty products and don't override appsec rules data (#1925)

contrib/kafka: refactor tests (#1907)

contrib/google.golang.org/grpc: use naming schema (#1919)

contrib/twitchtv/twirp: use naming schema (#1920)

contrib/http: use naming schema (#1929)

ddtrace/tracer: reset decision maker during fallback behavior of w3c header extraction (#1933)

contrib/cassandra: use naming schema (#1911)

Co-authored-by: Diana Shevchenko <40775148+dianashevchenko@users.noreply.github.com>

contrib/redis: use naming schema (#1906)

Co-authored-by: Andrew Glaude <andrew.glaude@datadoghq.com>

ci/system-tests: more scenarios with parallel jobs (#1938)

ci: update linter job and add bodyclose (#1942)

contrib/redis/go-redis.v9: support v9 (#1730)

Add support for new go-redis version v9.

It does 2 things:
Copy existing version 8 files to a new path, /redis/go-redis.v9.
Make changes to support version 9.

Fixes #1710

format and rerun go tidy

get rid of prints

add topLevelRegion assertions

remove confusing named return values and todo comment

ddtrace/tracer: ensure access to trace tags is concurrency-safe (#1948)

Spancontext marshaling was accessing tracer internal structures without a
lock, resulting in a data race and panic.

This commit adds a few methods to trace to allow safe access to the tags
and propagatingTags members of trace to the marshaling code.

Fixes #1944

ddtrace/tracer: mark context updated when SetUser is called (#1949)

Fixes a minor logic mistake when setting a user on a span

lint and add default switch case

refactor resourceNameKey and value assignments

restructure functions to be left aligned

use internal logger, be less verbose with function names

go back to normal switch type and format

Set keyTraceID128 on first span in the chunk only (#1946)

go.mod: upgrade go-libddwaf to v1.2.0 (#1953)

Co-authored-by: Julio Guerra <julio@datadog.com>

contrib/database/sql: fix bug where options were always overwritten by register options (#1904)

Co-authored-by: Diana Shevchenko <40775148+dianashevchenko@users.noreply.github.com>

ci/smoke-tests: update the go.sum file after go get -u (#1957)

contrib/net/http: don't set empty string values as span tags (#1956)

Do not set span fields when they are not configured so the tracer can put the defaults in.

use normal string then derefence

rever go.mod and go.sum changes

contrib/internal/httptrace: remove naming schema from init (#1960)

contrib/graphql: use naming schema (#1926)

internal/telemetry: trim the dependencies version prefix v (#1963)

contrib/aws: use naming schema (#1931)

contrib/cloud.google.com/go/pubsub.v1: use naming schema (#1937)

go mod tidy

lint and fix test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants