Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CONTINT-4105] Support arbitrary container-ids to collect container metrics #25515

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

AliDatadog
Copy link
Contributor

@AliDatadog AliDatadog commented May 10, 2024

What does this PR do?

This PR adds support of arbitrary container-id for containerd. We now can collect their container metrics and tags.

Motivation

Reduce the number of false negatives with a more robust solution to retrieve container-ids.

Additional Notes

RFC

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Deploy the agent on a kind cluster.
Pull an image on the node with docker exec <node-id> ctr i pull docker.io/library/redis:latest
Start a redis container on the node with docker exec <node-id> ctr run docker.io/library/redis:latest redis
Make sure container metrics can be found with the right container-id (redis).
Notebook.

@pr-commenter
Copy link

pr-commenter bot commented May 10, 2024

Regression Detector

Regression Detector Results

Run ID: 4271b1b2-c1cf-4110-86f5-09a44468e8b3
Baseline: 5fa6bb3
Comparison: 7a2a229

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI
tcp_syslog_to_blackhole ingress throughput +7.82 [-13.76, +29.40]
basic_py_check % cpu utilization +1.49 [-0.94, +3.92]
file_tree memory utilization +0.61 [+0.52, +0.71]
otel_to_otel_logs ingress throughput +0.35 [-0.04, +0.73]
idle memory utilization +0.03 [+0.00, +0.07]
uds_dogstatsd_to_api ingress throughput +0.02 [-0.19, +0.22]
trace_agent_json ingress throughput -0.00 [-0.01, +0.01]
trace_agent_msgpack ingress throughput -0.01 [-0.01, -0.00]
tcp_dd_logs_filter_exclude ingress throughput -0.02 [-0.05, +0.01]
uds_dogstatsd_to_api_cpu % cpu utilization -1.30 [-4.14, +1.54]
pycheck_1000_100byte_tags % cpu utilization -2.82 [-7.59, +1.94]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

@AliDatadog AliDatadog force-pushed the ali/support-any-container branch 7 times, most recently from c7c47ae to 596d7ee Compare May 13, 2024 10:21
@AliDatadog
Copy link
Contributor Author

/trigger-ci --variable RUN_ALL_BUILDS=true --variable RUN_KITCHEN_TESTS=true --variable RUN_E2E_TESTS=on --variable RUN_UNIT_TESTS=on --variable RUN_KMT_TESTS=on

@dd-devflow
Copy link

dd-devflow bot commented May 13, 2024

🚂 Gitlab pipeline started

Started pipeline #34122753

@AliDatadog AliDatadog force-pushed the ali/support-any-container branch 2 times, most recently from 59bf823 to e702803 Compare May 13, 2024 11:53
@pr-commenter
Copy link

pr-commenter bot commented May 13, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv create-vm --pipeline-id=34930504 --os-family=ubuntu

@AliDatadog AliDatadog marked this pull request as ready for review May 13, 2024 13:28
@AliDatadog AliDatadog requested review from a team as code owners May 13, 2024 13:28
Copy link
Contributor

@buraizu buraizu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with a minor suggestion to the release note

…b7ad.yaml

Co-authored-by: Bryce Eadie <bryce.eadie@datadoghq.com>
comp/core/workloadmeta/types.go Show resolved Hide resolved
comp/core/workloadmeta/types.go Outdated Show resolved Hide resolved
var w workloadmeta.Component
unwrapped, ok := wlm.Get()
if ok {
w = unwrapped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not ok, you're passing nil to newContainerFilter. Perhaps you should fail and return an error instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it's intentional. The start function returns nothing and the ContainerFilter will only call cgroups.ContainerFilter

pkg/util/containers/metrics/system/collector_linux.go Outdated Show resolved Hide resolved
pkg/util/containers/metrics/system/filter_container.go Outdated Show resolved Hide resolved
EventType: workloadmeta.EventTypeAll,
},
))
defer cf.wlm.Unsubscribe(evBundle)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually never called

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that the channel could be closed by workloadmeta on shutdown. Removed the Unsubscribe for now.

return res, nil
}
cf.mutex.RLock()
res := cf.trie.Get(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we have path matching, is the trie actually useful compared to a map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, ContainerFilter is called with a full path while the workloadmeta object stores suffixes so we need to do suffix matching. I'll improve the doc and split the files.

pkg/util/containers/metrics/system/filter_container.go Outdated Show resolved Hide resolved
Copy link
Contributor

@vboulineau vboulineau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the implementation is racy by nature, as we depend on the subscriber to have done the work when ContainerFilter is called.
It will normally always converge after few seconds, but it should be noted.

@AliDatadog AliDatadog requested a review from a team as a code owner May 14, 2024 12:30
Copy link

cit-pr-commenter bot commented May 14, 2024

Go Package Import Differences

Baseline: 5fa6bb3
Comparison: 7a2a229

binaryosarchchange
agentlinuxamd64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
agentlinuxarm64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
iot-agentlinuxamd64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
iot-agentlinuxarm64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
heroku-agentlinuxamd64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
cluster-agentlinuxamd64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
cluster-agentlinuxarm64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
cluster-agent-cloudfoundrylinuxamd64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
cluster-agent-cloudfoundrylinuxarm64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
dogstatsdlinuxamd64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
dogstatsdlinuxarm64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
process-agentlinuxamd64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
process-agentlinuxarm64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
heroku-process-agentlinuxamd64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
security-agentlinuxamd64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
security-agentlinuxarm64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
trace-agentlinuxamd64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
trace-agentlinuxarm64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie
heroku-trace-agentlinuxamd64
+1, -0
+github.com/DataDog/datadog-agent/pkg/util/trie

@AliDatadog AliDatadog force-pushed the ali/support-any-container branch 2 times, most recently from a83b759 to 315b4a0 Compare May 14, 2024 13:34
@AliDatadog AliDatadog requested review from a team as code owners May 17, 2024 13:11
Copy link

codecov bot commented May 23, 2024

Codecov Report

Attention: Patch coverage is 75.67568% with 36 lines in your changes are missing coverage. Please review.

Project coverage is 48.74%. Comparing base (5fa6bb3) to head (7a2a229).
Report is 29 commits behind head on main.

Files Patch % Lines
.../util/containers/metrics/system/collector_linux.go 6.25% 14 Missing and 1 partial ⚠️
...util/containers/metrics/system/filter_container.go 83.01% 7 Missing and 2 partials ⚠️
pkg/util/trie/trie.go 89.83% 4 Missing and 2 partials ⚠️
pkg/proto/pbgo/core/workloadmeta.pb.go 0.00% 5 Missing ⚠️
pkg/util/cgroups/reader.go 50.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #25515       +/-   ##
===========================================
+ Coverage   45.17%   48.74%    +3.57%     
===========================================
  Files        2314     1760      -554     
  Lines      266564   165312   -101252     
===========================================
- Hits       120430    80589    -39841     
+ Misses     136547    79658    -56889     
+ Partials     9587     5065     -4522     
Flag Coverage Δ
amzn_aarch64 48.97% <75.67%> (+3.15%) ⬆️
centos_x86_64 48.85% <75.67%> (+3.12%) ⬆️
ubuntu_aarch64 48.97% <75.67%> (+3.15%) ⬆️
ubuntu_x86_64 48.96% <75.67%> (+3.15%) ⬆️
windows ?
windows_amd64 51.27% <85.71%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants