[CONTINT-4105] Support arbitrary container-ids to collect container metrics #25515

AliDatadog · 2024-05-10T16:04:37Z

What does this PR do?

This PR adds support for arbitrary container IDs with containerd. The agent can now collect container metrics and tags for these containers.

Motivation

Reduce the number of false negatives with a more robust solution to retrieve container-ids.

Additional Notes

RFC

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

1. Deploy the agent on a kind cluster.
2. Pull an image on the node with `docker exec <node-id> ctr i pull docker.io/library/redis:latest`.
3. Start a redis container on the node with `docker exec <node-id> ctr run docker.io/library/redis:latest redis`.
4. Make sure container metrics can be found with the right container-id (redis).
Notebook.

pr-commenter · 2024-05-10T17:11:58Z

Regression Detector

Regression Detector Results

Run ID: 4271b1b2-c1cf-4110-86f5-09a44468e8b3
Baseline: 5fa6bb3
Comparison: 7a2a229

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI
➖	tcp_syslog_to_blackhole	ingress throughput	+7.82	[-13.76, +29.40]
➖	basic_py_check	% cpu utilization	+1.49	[-0.94, +3.92]
➖	file_tree	memory utilization	+0.61	[+0.52, +0.71]
➖	otel_to_otel_logs	ingress throughput	+0.35	[-0.04, +0.73]
➖	idle	memory utilization	+0.03	[+0.00, +0.07]
➖	uds_dogstatsd_to_api	ingress throughput	+0.02	[-0.19, +0.22]
➖	trace_agent_json	ingress throughput	-0.00	[-0.01, +0.01]
➖	trace_agent_msgpack	ingress throughput	-0.01	[-0.01, -0.00]
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.02	[-0.05, +0.01]
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	-1.30	[-4.14, +1.54]
➖	pycheck_1000_100byte_tags	% cpu utilization	-2.82	[-7.59, +1.94]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
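The three criteria above can be sketched as a small predicate. This is only an illustration of the decision rule as described in this comment, not the Regression Detector's actual code; the `Experiment` type, the `IsRegression` name, and the assumption that neither sample row is marked erratic are all hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment is a hypothetical representation of one row of the
// "fine details" table.
type Experiment struct {
	Name      string
	DeltaMean float64 // Δ mean %
	CILow     float64 // lower bound of the Δ mean % CI
	CIHigh    float64 // upper bound of the Δ mean % CI
	Erratic   bool    // assumed flag: configuration marks the experiment erratic
}

// IsRegression reports whether all three criteria hold:
// the effect is at least `tolerance` percent in magnitude, the confidence
// interval excludes zero, and the experiment is not marked erratic.
func IsRegression(e Experiment, tolerance float64) bool {
	bigEnough := math.Abs(e.DeltaMean) >= tolerance
	ciExcludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && ciExcludesZero && !e.Erratic
}

func main() {
	// Two rows taken from the table above (erratic flag assumed false).
	rows := []Experiment{
		{Name: "tcp_syslog_to_blackhole", DeltaMean: 7.82, CILow: -13.76, CIHigh: 29.40},
		{Name: "file_tree", DeltaMean: 0.61, CILow: 0.52, CIHigh: 0.71},
	}
	for _, r := range rows {
		fmt.Printf("%s: flagged=%v\n", r.Name, IsRegression(r, 5.0))
	}
}
```

Under these assumptions neither row is flagged: tcp_syslog_to_blackhole exceeds the 5.00% tolerance but its interval contains zero, while file_tree has an interval that excludes zero but an effect far below the tolerance.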

AliDatadog · 2024-05-13T10:25:54Z

/trigger-ci --variable RUN_ALL_BUILDS=true --variable RUN_KITCHEN_TESTS=true --variable RUN_E2E_TESTS=on --variable RUN_UNIT_TESTS=on --variable RUN_KMT_TESTS=on

dd-devflow · 2024-05-13T10:26:26Z

🚂 Gitlab pipeline started

Started pipeline #34122753

pr-commenter · 2024-05-13T13:10:36Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=34930504 --os-family=ubuntu

buraizu

Approving with a minor suggestion to the release note

releasenotes/notes/support-arbitrary-container-id-cc0efdf7c156b7ad.yaml

comp/core/workloadmeta/collectors/internal/containerd/container_builder.go

comp/core/workloadmeta/types.go

vboulineau · 2024-05-14T10:44:10Z

pkg/util/containers/metrics/system/collector_linux.go

+	var w workloadmeta.Component
+	unwrapped, ok := wlm.Get()
+	if ok {
+		w = unwrapped


If not ok, you're passing nil to newContainerFilter. Perhaps you should fail and return an error instead.

In this case it's intentional: the start function returns nothing, and with a nil workloadmeta the ContainerFilter will only call cgroups.ContainerFilter.
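A minimal sketch of that fallback, assuming the filter tries the ID-extracting regex first and consults the metadata store only when one is available. Everything here is a hypothetical stand-in: `metadataStore` and `mapStore` replace workloadmeta, `regexFilter` replaces cgroups.ContainerFilter (whose real signature is not shown in this thread), and the 64-hex-character pattern is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// metadataStore is a hypothetical stand-in for the workloadmeta component.
type metadataStore interface {
	ContainerIDForCgroup(path string) (string, bool)
}

// mapStore is a toy store used only for this example.
type mapStore map[string]string

func (m mapStore) ContainerIDForCgroup(path string) (string, bool) {
	id, ok := m[path]
	return id, ok
}

// hexID is an assumed pattern for "regular" container IDs.
var hexID = regexp.MustCompile(`[0-9a-f]{64}`)

// regexFilter is a hypothetical stand-in for cgroups.ContainerFilter.
func regexFilter(path string) (string, bool) {
	id := hexID.FindString(path)
	return id, id != ""
}

type containerFilter struct {
	store metadataStore // nil when the optional component is absent
}

func newContainerFilter(store metadataStore) *containerFilter {
	return &containerFilter{store: store}
}

// ContainerID tries the regex first; the store is consulted only if it
// exists, so a nil store degrades to regex-only behavior.
func (cf *containerFilter) ContainerID(path string) (string, bool) {
	if id, ok := regexFilter(path); ok {
		return id, true
	}
	if cf.store == nil {
		return "", false
	}
	return cf.store.ContainerIDForCgroup(path)
}

func main() {
	hexPath := "/kubepods/pod1/" + strings.Repeat("a", 64)
	withoutStore := newContainerFilter(nil)
	withStore := newContainerFilter(mapStore{"/default/redis": "redis"})

	fmt.Println(withoutStore.ContainerID(hexPath))
	fmt.Println(withoutStore.ContainerID("/default/redis"))
	fmt.Println(withStore.ContainerID("/default/redis"))
}
```

One Go detail matters for this pattern: the nil check only works if the field holds a nil interface value, as with `var w workloadmeta.Component` in the diff. Wrapping a nil concrete pointer in the interface would make the check pass and the call panic.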

vboulineau · 2024-05-14T11:03:30Z

pkg/util/containers/metrics/system/filter_container.go

+			EventType: workloadmeta.EventTypeAll,
+		},
+	))
+	defer cf.wlm.Unsubscribe(evBundle)


This is actually never called

I thought that the channel could be closed by workloadmeta on shutdown. Removed the Unsubscribe for now.

vboulineau · 2024-05-14T11:04:27Z

pkg/util/containers/metrics/system/filter_container.go

+		return res, nil
+	}
+	cf.mutex.RLock()
+	res := cf.trie.Get(path)


As we have path matching, is the trie actually useful compared to a map?

As discussed, ContainerFilter is called with a full path while the workloadmeta object stores suffixes, so we need to do suffix matching. I'll improve the doc and split the files.

vboulineau

Note that the implementation is racy by nature, as we depend on the subscriber having done its work by the time ContainerFilter is called.
It will normally converge after a few seconds, but it should be noted.
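The race can be pictured with a small sketch, assuming an index that is filled by a goroutine consuming events while lookups come from another goroutine. The `index` and `event` types are hypothetical, and the `applied` channel exists only so the example is deterministic; it is not part of the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a hypothetical "container appeared" notification.
type event struct {
	cgroupPath  string
	containerID string
}

// index is filled asynchronously by consume and read by lookup.
type index struct {
	mu     sync.RWMutex
	byPath map[string]string
}

func newIndex() *index { return &index{byPath: make(map[string]string)} }

// consume applies events until the channel is closed. It signals on
// `applied` after each event (demo-only synchronization).
func (ix *index) consume(events <-chan event, applied chan<- struct{}) {
	for ev := range events {
		ix.mu.Lock()
		ix.byPath[ev.cgroupPath] = ev.containerID
		ix.mu.Unlock()
		applied <- struct{}{}
	}
}

// lookup only sees what the consumer has already applied.
func (ix *index) lookup(path string) (string, bool) {
	ix.mu.RLock()
	defer ix.mu.RUnlock()
	id, ok := ix.byPath[path]
	return id, ok
}

func main() {
	ix := newIndex()
	events := make(chan event)
	applied := make(chan struct{})
	go ix.consume(events, applied)

	// The container may already exist, but its event has not been applied.
	_, ok := ix.lookup("default/redis")
	fmt.Println("before event:", ok)

	events <- event{cgroupPath: "default/redis", containerID: "redis"}
	<-applied

	id, ok := ix.lookup("default/redis")
	fmt.Println("after event:", id, ok)
	close(events)
}
```

A caller that polls periodically simply misses on early cycles and succeeds on a later one, which is the convergence behavior described in the comment.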

cit-pr-commenter · 2024-05-14T12:52:39Z

Go Package Import Differences

Baseline: 5fa6bb3
Comparison: 7a2a229

binary	os	arch	change
agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
iot-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
iot-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
heroku-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
cluster-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
cluster-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
cluster-agent-cloudfoundry	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
cluster-agent-cloudfoundry	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
dogstatsd	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
dogstatsd	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
process-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
process-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
heroku-process-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
security-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
security-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
trace-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
trace-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie
heroku-trace-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/pkg/util/trie

codecov · 2024-05-23T12:58:00Z

Codecov Report

Attention: Patch coverage is 75.67568%, with 36 lines in your changes missing coverage. Please review.

Project coverage is 48.74%. Comparing base (5fa6bb3) to head (7a2a229).
Report is 29 commits behind head on main.

Files	Patch %	Lines
.../util/containers/metrics/system/collector_linux.go	6.25%	14 Missing and 1 partial ⚠️
...util/containers/metrics/system/filter_container.go	83.01%	7 Missing and 2 partials ⚠️
pkg/util/trie/trie.go	89.83%	4 Missing and 2 partials ⚠️
pkg/proto/pbgo/core/workloadmeta.pb.go	0.00%	5 Missing ⚠️
pkg/util/cgroups/reader.go	50.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main   #25515       +/-   ##
===========================================
+ Coverage   45.17%   48.74%    +3.57%     
===========================================
  Files        2314     1760      -554     
  Lines      266564   165312   -101252     
===========================================
- Hits       120430    80589    -39841     
+ Misses     136547    79658    -56889     
+ Partials     9587     5065     -4522

Flag	Coverage Δ
amzn_aarch64	`48.97% <75.67%> (+3.15%)`	⬆️
centos_x86_64	`48.85% <75.67%> (+3.12%)`	⬆️
ubuntu_aarch64	`48.97% <75.67%> (+3.15%)`	⬆️
ubuntu_x86_64	`48.96% <75.67%> (+3.15%)`	⬆️
windows	`?`
windows_amd64	`51.27% <85.71%> (+0.04%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Collect cgroup path in workloadmeta

7d1ec6b

AliDatadog added team/containers changelog/no-changelog labels May 10, 2024

AliDatadog added this to the 7.55.0 milestone May 10, 2024

AliDatadog force-pushed the ali/support-any-container branch 7 times, most recently from c7c47ae to 596d7ee on May 13, 2024 10:21

AliDatadog force-pushed the ali/support-any-container branch 2 times, most recently from 59bf823 to e702803 on May 13, 2024 11:53

AliDatadog added 4 commits May 13, 2024 14:32

Implement the container filter

13cea6f

Update TestDump

471f291

release note

be76d74

fix dogstatsd test using a mock workloadmeta with no implemented methods

f9e651b

AliDatadog force-pushed the ali/support-any-container branch from e702803 to f9e651b on May 13, 2024 12:32

AliDatadog marked this pull request as ready for review May 13, 2024 13:28

AliDatadog requested review from a team as code owners May 13, 2024 13:28

support the case of an empty cgroup path

1e3a1ea

buraizu approved these changes May 13, 2024

Update releasenotes/notes/support-arbitrary-container-id-cc0efdf7c156…

d3582ed

…b7ad.yaml Co-authored-by: Bryce Eadie <bryce.eadie@datadoghq.com>

vboulineau reviewed May 14, 2024

AliDatadog added 3 commits May 14, 2024 13:19

mention that the path is relative

96db666

Remove no-lint comments

75c985d

remove the unsubscribe

ef6801a

AliDatadog requested a review from a team as a code owner May 14, 2024 12:30

AliDatadog force-pushed the ali/support-any-container branch from 3fcd75e to 88671fe on May 14, 2024 12:38

AliDatadog requested a review from vboulineau May 14, 2024 13:18

AliDatadog force-pushed the ali/support-any-container branch 2 times, most recently from a83b759 to 315b4a0 on May 14, 2024 13:34

AliDatadog added 3 commits May 14, 2024 16:20

Move trie to its own package

1619b89

Add code comments to explain about suffix matching and make sure we o…

a1124ca

…nly populate the Trie if the regex does not match

add test cases for cgroupfs / systemd

ab12cf8

AliDatadog force-pushed the ali/support-any-container branch from 315b4a0 to ab12cf8 on May 14, 2024 14:20

hush-hush approved these changes May 15, 2024

vickenty approved these changes May 16, 2024

AliDatadog requested review from a team as code owners May 17, 2024 13:11

update protobuf

4defc85

AliDatadog force-pushed the ali/support-any-container branch from 556bb1a to 4defc85 on May 17, 2024 13:12

mellon85 approved these changes May 17, 2024

Merge branch 'main' into ali/support-any-container

7a2a229

robertjli approved these changes May 23, 2024

wdhif approved these changes May 23, 2024

ajgajg1134 approved these changes May 24, 2024
