
prometheus_remote_write memory leak #20470

Open
rightly opened this issue May 10, 2024 · 1 comment
Labels
  • domain: performance (anything related to Vector's performance)
  • domain: reliability (anything related to Vector's reliability)
  • sink: prometheus_remote_write (anything `prometheus_remote_write` sink related)
  • type: bug (a code related bug)

Comments


rightly commented May 10, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Vector's prometheus_remote_write sink appears to have a memory leak.

  • Memory usage: [image]

  • Total memory usage by component: [image]

    • prometheus_remote_write memory usage is increasing linearly: [image]

I run two separate Vector deployments: one that parses logs and creates metrics, and one that sends logs to Loki. Of these, only the pods running the prometheus_remote_write sink hit OOM as shown above.

Please advise on how I can track down this issue.
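
One thing I can try in the meantime (a sketch, assuming a scrape endpoint is acceptable in this cluster): reuse the existing `vector_metrics` source and expose Vector's internal telemetry through a `prometheus_exporter` sink, then graph per-component gauges such as `buffer_byte_size` and `utilization` over time. The sink name and port below are placeholders:

  sinks:
    # Expose Vector's own metrics so per-component buffer and
    # utilization gauges can be graphed alongside pod memory.
    internal_metrics_exporter:
      type: prometheus_exporter
      inputs: ["vector_metrics"]   # internal_metrics source defined above
      address: 0.0.0.0:9598        # placeholder port

If the build supports allocation tracing (`vector --allocation-tracing`), the resulting `component_allocated_bytes` gauge should point at the component whose allocations keep growing.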

Configuration

  data_dir: /vector-data-dir
  expire_metrics_secs: 300
  api:
    enabled: true
    address: 0.0.0.0:8686
  sources:
    vector_metrics:
      type: internal_metrics
    http_input:
      type: http_server
      address: 0.0.0.0:9090
      path: /es
      auth:
        username: user
        password: pw
      decoding:
        codec: bytes
      keepalive:
        max_connection_age_secs: 60

  transforms:
    unnest_remap:
      type: "remap"
      inputs: ["http_input"]
      source: |
        . = parse_json!(string!(.message))
    json_remap:
      type: remap
      inputs: ["unnest_remap"]
      source: |
        if .maxAgeSec != "-" {
          .maxAgeSec = to_int!(.maxAgeSec)
        } else {
          .maxAgeSec = 0
        }

        .platform = "akamai"
        if .country == .serverCountry {
          .inCountry = true
        } else {
          .inCountry = false
        }

        if exists(.reqTimeSec) {
          .timestamp = to_float!(.reqTimeSec)
        }

        # parsing ipv4, ipv6
        if exists(.cliIP) {
          if is_ipv4!(.cliIP) {
            .ipVersion = "ipv4"
          } else if is_ipv6!(.cliIP) {
            .ipVersion = "ipv6"
          } else {
            .ipVersion = "unknown"
          }
        }

        # init cache status
        .cacheHit = "false"
        .cacheHitLayer = "false"

        # parsing edge cache status
        if exists(.cacheStatus) {
          if .cacheStatus != "0" {
            .cacheHit = "true"
            .cacheHitLayer = "edge"
          }
        }

        if exists(.isMidgress) {
          if .isMidgress != "0" {
            .cacheHit = "true"
            .cacheHitLayer = "midgress"
          }

          if .cacheStatus == "0" && .isMidgress != "0" {
            .cacheHit = "true"
            .cacheHitLayer = "midgress_only"
          }
        }

    json_metric:
      type: log_to_metric
      inputs: ["json_remap"]
      metrics:
        - type: counter
          field: statusCode
          namespace: datastream
          name: http_response_total
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            method: '{{ "{{reqMethod}}" }}'
            status_code: '{{ "{{statusCode}}" }}'

        - type: counter
          field: cacheHit
          namespace: datastream
          name: http_cache_status_total
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            cache_hit: '{{ "{{cacheHit}}" }}'
            cache_layer: '{{ "{{cacheHitLayer}}" }}'

        - type: counter
          field: ipVersion
          namespace: datastream
          name: http_ip_version_total
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            ip_version: '{{ "{{ipVersion}}" }}'

        - type: gauge
          field: maxAgeSec
          name: http_max_age_second
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'

        - type: summary
          field: throughput
          namespace: datastream
          name: http_throughput_kbps
          increment_by_value: true
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            cache_hit: '{{ "{{cacheHit}}" }}'
            cache_layer: '{{ "{{cacheHitLayer}}" }}'
          # ..... and so on
    metric_remap:
      type: remap
      inputs: ["json_metric"]
      source: |
        .tags.forwarder = get_hostname!()
    metric_aggregate:
      type: aggregate
      inputs: ["metric_remap", "vector_metrics"]
      interval_ms: 60000 # 60s

  sinks:
    metric_write:
      type: prometheus_remote_write
      inputs: ["metric_aggregate"]
      endpoint: endpoint1
      compression: snappy
      auth:
        strategy: basic
        user: user
        password: pw
      batch:
        max_events: 32
        timeout_secs: 1
      buffer:
        type: disk
        max_size: 2147483648 # 2GiB
        when_full: block
    kafka_sink:
      type: kafka
      inputs: ["json_remap"]
      bootstrap_servers: server1
      topic: datastream
      batch:
        max_events: 1024
        timeout_secs: 5
      buffer:
        type: disk
        max_size: 5368709120 # 5GiB
        when_full: drop_newest # default block
      compression: snappy
      encoding:
        codec: json

Version

0.38.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@rightly added the "type: bug (A code related bug.)" label on May 10, 2024
@rightly changed the title from "Log to metric memory leak" to "Log to metric process memory leak" on May 10, 2024
@rightly changed the title from "Log to metric process memory leak" to "metric write sink memory leak" on May 10, 2024
@rightly changed the title from "metric write sink memory leak" to "prometheus_remote_write memory leak" on May 10, 2024
@jszwedko added the "domain: reliability", "domain: performance", and "sink: prometheus_remote_write" labels on May 10, 2024
@jszwedko (Member) commented

Thanks for the detailed report! That does indeed look like a memory leak.
