[BUG] Datadog agent causing RPM database get corrupted #24171

Open
rodehoed opened this issue Mar 28, 2024 · 11 comments
Comments

@rodehoed

rodehoed commented Mar 28, 2024

Description
OK, I'm not 100% confident that this is a Datadog issue, but it's the only clue I have right now. Since March 22nd we have seen 10 servers getting their RPM DB corrupted. The facts:

  • This only happens after the agent was upgraded to 7.52.0-1
  • We only see it on machines with the Datadog agent; the other 300 servers, which all have the same base configuration but no DD agent, don't show this behaviour.
  • It happens across different cloud providers (which rules out storage or a similar provider-specific cause)

Fixing the DB corruption will not prevent it from happening again. We have servers which have had this corruption multiple times now.

Agent Environment
The agent is running 7.52.0-1 on RHEL 8.9

Describe what happened:
The RPM database gets corrupted, and calling the rpm/dnf commands shows:

error: rpmdb: BDB0113 Thread/process 2421732/140117948610432 failed: BDB1507 Thread died in Berkeley DB library
error: db5 error(-30973) from dbenv->failchk: BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db5 -  (-30973)
error: cannot open Packages database in /var/lib/rpm
Error: Error: rpmdb open failed
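
For anyone wanting to watch a fleet for this symptom, the following is a minimal sketch, not anything shipped by Datadog or rpm. It scans the stderr of a read-only `rpm -qa` for substrings taken from the error output above. The function names, the marker list, and the timeout are assumptions made for illustration, and a query against a damaged database may itself hang, hence the timeout.

```python
import subprocess

# Substrings copied from the error output quoted in this issue.
CORRUPTION_MARKERS = (
    "DB_RUNRECOVERY",
    "Thread died in Berkeley DB library",
    "cannot open Packages database",
    "rpmdb open failed",
)


def rpmdb_looks_corrupted(stderr_text: str) -> bool:
    """Return True if rpm/dnf stderr contains one of the known markers."""
    return any(marker in stderr_text for marker in CORRUPTION_MARKERS)


def check_host(timeout_s: int = 60) -> bool:
    """Hypothetical helper: run a read-only `rpm -qa` and scan its stderr."""
    result = subprocess.run(
        ["rpm", "-qa"], capture_output=True, text=True, timeout=timeout_s
    )
    return rpmdb_looks_corrupted(result.stderr)
```

Running `check_host()` from a periodic job would at least timestamp when the corruption appears, which may help correlate it with agent activity.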

Describe what you expected:
Database not getting corrupted

Steps to reproduce the issue:
Upgrading is enough, but I don't know what triggers it.

Additional environment details (Operating System, Cloud provider, etc):

@paulcacheux
Contributor

paulcacheux commented Mar 28, 2024

Hello! Thanks for reporting this issue. Would you mind sharing:

  • which version you upgraded from
  • how you deploy the agent (a container? using the install script?)
  • your config, excluding secrets and the API key; I would like to understand which products you have enabled

Thanks a lot in advance

@rodehoed
Author

Hi @paulcacheux ,

Sure np.

  • upgraded from 7.51.0-1.x86_64
  • The agent is installed as a Linux (rpm) agent and is managed by the Datadog Puppet Class; no container

The config comes from datadog-agent configcheck:

Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/container_image.d/conf.yaml.default
Config for instance ID: container_image:2ac6bde1700038e4
{}
~
Auto-discovery IDs:
* _container_image
===

=== container_lifecycle check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/container_lifecycle.d/conf.yaml.default
Config for instance ID: container_lifecycle:b628cf9ded5c9324
{}
~
Auto-discovery IDs:
* _container_lifecycle
===

=== cpu check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
Config for instance ID: cpu:e331d61ed1323219
{}
~
===

=== disk check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
Config for instance ID: disk:67cc0574430a16ba
use_mount: false
~
===

=== file_handle check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
Config for instance ID: file_handle:381b8b6ca58d37b0
{}
~
===

=== io check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
Config for instance ID: io:541b60d158de04a7
{}
~
===

=== load check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
Config for instance ID: load:bf7cea93fb3aa780
{}
~
===

=== memory check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
Config for instance ID: memory:3f1f6288b95b9979
{}
~
===

=== mysql check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/mysql.d/conf.yaml
Config for instance ID: mysql:75cd0f7a0853706d
options:
  disable_innodb_metrics: false
  extra_innodb_metrics: true
  extra_performance_metrics: true
  extra_status_metrics: true
  galera_cluster: true
  replication: 0
  schema_size_metrics: false
pass: "********"
port: 3306
server: 127.0.0.1
user: datadog
~
===

=== network check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
Config for instance ID: network:4b0649b7e11f0772
{}
~
===

=== nginx check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/nginx.d/conf.yaml
Config for instance ID: nginx:3833f3b9ceb3e496
nginx_status_url: http://not-my-host/nginx-status
~
Log Config:
logs:
- path: bogus/access.log
  service: staging.bogus.com
  source: nginx
  sourcecategory: http_web_access
  type: file
===

=== ntp check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
Config for instance ID: ntp:3c427a42a70bbf8
{}
~
===

=== php_fpm check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/php_fpm.d/conf.yaml
Config for instance ID: php_fpm:5726203bab636eaa
http_host: bogus-host
ping_reply: pong
ping_url: http://127.0.0.1/ping
status_url: http://127.0.0.1/fpmstatus
use_fastcgi: false
~
===

=== telemetry check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/telemetry.d/conf.yaml.default
Config for instance ID: telemetry:4d459fc318a47aa4
{}
~
===

=== uptime check ===
Configuration provider: file
Configuration source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
Config for instance ID: uptime:c72f390abdefdf1a
{}
~
===
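
To get a quick list of which checks are configured from a dump like the one above, a small parser can help. This is a rough sketch that assumes the `Config for instance ID: <check>:<hash>` line format seen in this paste; the function name is made up for illustration, and the real `datadog-agent configcheck` output may vary between versions.

```python
import re

# Matches lines such as "Config for instance ID: cpu:e331d61ed1323219".
INSTANCE_RE = re.compile(
    r"^Config for instance ID: ([\w.-]+):[0-9a-f]+$", re.MULTILINE
)


def configured_checks(configcheck_output: str) -> list:
    """Return check names in order of first appearance, without duplicates."""
    seen = []
    for name in INSTANCE_RE.findall(configcheck_output):
        if name not in seen:
            seen.append(name)
    return seen
```

Keying on the instance ID line rather than the `=== name check ===` header means a header lost during copy and paste does not hide a check.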



@paulcacheux
Contributor

Could you share the following files if present:

/etc/datadog-agent/datadog.yaml
/etc/datadog-agent/system-probe.yaml
/etc/datadog-agent/security-agent.yaml

Thanks a lot !

@rodehoed
Author

rodehoed commented Mar 28, 2024

sure:

### MANAGED BY PUPPET
---
api_key: xxxxxxxxxxxxxx
dd_url: ''
site: datadoghq.eu
cmd_port: 5001
hostname_fqdn: false
collect_ec2_tags: false
collect_gce_tags: false
confd_path: "/etc/datadog-agent/conf.d"
enable_metadata_collection: true
dogstatsd_port: 8125
dogstatsd_socket: ''
dogstatsd_non_local_traffic: false
log_file: "/var/log/datadog/agent.log"
log_level: info
tags: []
apm_config:
  enabled: true
  env: none
  apm_non_local_traffic: false
process_config:
  enabled: 'true'
  scrub_args: true
  custom_sensitive_words: []
logs_enabled: true
logs_config:
  container_collect_all: false

The system-probe and security agent config are not active.

@chouetz
Contributor

chouetz commented Mar 29, 2024

Hello,
The latest agent version comes with new telemetry that reads data from rpm. To see whether this is the culprit, could you please try disabling it by setting

enable_signing_metadata_collection: false

in your datadog.yaml configuration and restarting the Agent? Then fix the DB corruption and see whether that stops it from happening again.
Thanks in advance
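
As an illustration of applying that setting, here is a naive sketch that sets a top-level key in the text of a YAML file. It is line-based and does not parse YAML, so it only suits simple unindented keys like this one; the function name is an assumption and not part of any Datadog tooling.

```python
import re


def set_top_level_key(yaml_text: str, key: str, value: str) -> str:
    """Set `key: value` at the top level of a YAML document given as text.

    Replaces the first unindented `key:` line if present, otherwise
    appends a new line at the end. Nested keys are left untouched.
    """
    pattern = re.compile(rf"^{re.escape(key)}[ \t]*:.*$", re.MULTILINE)
    new_line = f"{key}: {value}"
    if pattern.search(yaml_text):
        # A callable replacement avoids backslash processing in the value.
        return pattern.sub(lambda _match: new_line, yaml_text, count=1)
    if yaml_text and not yaml_text.endswith("\n"):
        yaml_text += "\n"
    return yaml_text + new_line + "\n"
```

Since the datadog.yaml shared earlier in this thread is marked as managed by Puppet, a direct edit like this would probably be overwritten on the next Puppet run, so the setting would need to go through the Puppet configuration instead.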

@rodehoed
Author

rodehoed commented Apr 2, 2024

Hi All,

As of today, this config is set. I will keep you posted.

@Pythyu
Contributor

Pythyu commented Apr 15, 2024

Hi 👋
Just a quick follow-up to ask if you have any updates on the config option. Does the DB corruption still happen?
Thanks in advance

@rodehoed
Author

Hi @Pythyu

Well, no updates actually :-) I mean we haven't seen this message anymore over the last few weeks. So one might think the problem is "fixed".

@Pythyu
Contributor

Pythyu commented Apr 15, 2024

Thank you for all the answers 😃
Could you contact our support so we can get more information about your environment outside of GitHub?
It would help us a lot to reproduce the issue and potentially test the bug fix.
You can share the support ticket ID here and we'll follow up on it.

@Pythyu
Contributor

Pythyu commented Apr 22, 2024

Hi @rodehoed 👋
Please let us know if you got in touch with our support 😃
Thanks

@rodehoed
Author

rodehoed commented May 1, 2024

Hi All,

Sorry for being late! I have just opened a ticket with DD, ticket ID #689248.
