
prometheus_remote_write memory leak #20470

Open
rightly opened this issue May 10, 2024 · 1 comment
Labels
  • domain: performance (anything related to Vector's performance)
  • domain: reliability (anything related to Vector's reliability)
  • sink: prometheus_remote_write (anything `prometheus_remote_write` sink related)
  • type: bug (a code related bug)

Comments


rightly commented May 10, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Vector's prometheus_remote_write sink appears to have a memory leak.

  • Memory usage: [image]

  • Total memory usage by component: [image]

    • prometheus_remote_write memory usage is increasing linearly: [image]

I run two separate Vector deployments: one that parses logs and creates metrics, and one that sends logs to Loki. Of these, only the pods running the prometheus_remote_write sink hit OOM as shown above.

Please advise on how I can track down this issue.
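
One thing I can try in the meantime (a sketch, assuming a scrape endpoint is acceptable in this cluster): reuse the existing `vector_metrics` source and expose Vector's internal telemetry through a `prometheus_exporter` sink, then graph per-component gauges such as `buffer_byte_size` and `utilization` over time. The sink name and port below are placeholders:

  sinks:
    # Expose Vector's own metrics so per-component buffer and
    # utilization gauges can be graphed alongside pod memory.
    internal_metrics_exporter:
      type: prometheus_exporter
      inputs: ["vector_metrics"]   # internal_metrics source defined above
      address: 0.0.0.0:9598        # placeholder port

If the build supports allocation tracing (`vector --allocation-tracing`), the resulting `component_allocated_bytes` gauge should point at the component whose allocations keep growing.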

Configuration

  data_dir: /vector-data-dir
  expire_metrics_secs: 300
  api:
    enabled: true
    address: 0.0.0.0:8686
  sources:
    vector_metrics:
      type: internal_metrics
    http_input:
      type: http_server
      address: 0.0.0.0:9090
      path: /es
      auth:
        username: user
        password: pw
      decoding:
        codec: bytes
      keepalive:
        max_connection_age_secs: 60

  transforms:
    unnest_remap:
      type: "remap"
      inputs: ["http_input"]
      source: |
        . = parse_json!(string!(.message))
    json_remap:
      type: remap
      inputs: ["unnest_remap"]
      source: |
        if .maxAgeSec != "-" {
          .maxAgeSec = to_int!(.maxAgeSec)
        } else {
          .maxAgeSec = 0
        }

        .platform = "akamai"
        if .country == .serverCountry {
          .inCountry = true
        } else {
          .inCountry = false
        }

        if exists(.reqTimeSec) {
          .timestamp = to_float!(.reqTimeSec)
        }

        # parsing ipv4, ipv6
        if exists(.cliIP) {
          if is_ipv4!(.cliIP) {
            .ipVersion = "ipv4"
          } else if is_ipv6!(.cliIP) {
            .ipVersion = "ipv6"
          } else {
            .ipVersion = "unknown"
          }
        }

        # init cache status
        .cacheHit = "false"
        .cacheHitLayer = "false"

        # parsing edge cache status
        if exists(.cacheStatus) {
          if .cacheStatus != "0" {
            .cacheHit = "true"
            .cacheHitLayer = "edge"
          }
        }

        if exists(.isMidgress) {
          if .isMidgress != "0" {
            .cacheHit = "true"
            .cacheHitLayer = "midgress"
          }

          if .cacheStatus == "0" && .isMidgress != "0" {
            .cacheHit = "true"
            .cacheHitLayer = "midgress_only"
          }
        }

    json_metric:
      type: log_to_metric
      inputs: ["json_remap"]
      metrics:
        - type: counter
          field: statusCode
          namespace: datastream
          name: http_response_total
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            method: '{{ "{{reqMethod}}" }}'
            status_code: '{{ "{{statusCode}}" }}'

        - type: counter
          field: cacheHit
          namespace: datastream
          name: http_cache_status_total
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            cache_hit: '{{ "{{cacheHit}}" }}'
            cache_layer: '{{ "{{cacheHitLayer}}" }}'

        - type: counter
          field: ipVersion
          namespace: datastream
          name: http_ip_version_total
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            ip_version: '{{ "{{ipVersion}}" }}'

        - type: gauge
          field: maxAgeSec
          name: http_max_age_second
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'

        - type: summary
          field: throughput
          namespace: datastream
          name: http_throughput_kbps
          increment_by_value: true
          tags:
            job: log_to_metric
            hostname: '{{ "{{reqHost}}" }}'
            country: '{{ "{{country}}" }}'
            cache_hit: '{{ "{{cacheHit}}" }}'
            cache_layer: '{{ "{{cacheHitLayer}}" }}'
          # ..... and so on
    metric_remap:
      type: remap
      inputs: ["json_metric"]
      source: |
        .tags.forwarder = get_hostname!()
    metric_aggregate:
      type: aggregate
      inputs: ["metric_remap", "vector_metrics"]
      interval_ms: 60000 # 60s

  sinks:
    metric_write:
      type: prometheus_remote_write
      inputs: ["metric_aggregate"]
      endpoint: endpoint1
      compression: snappy
      auth:
        strategy: basic
        user: user
        password: pw
      batch:
        max_events: 32
        timeout_secs: 1
      buffer:
        type: disk
        max_size: 2147483648 # 2GiB
        when_full: block
    kafka_sink:
      type: kafka
      inputs: ["json_remap"]
      bootstrap_servers: server1
      topic: datastream
      batch:
        max_events: 1024
        timeout_secs: 5
      buffer:
        type: disk
        max_size: 5368709120 # 5GiB
        when_full: drop_newest # default block
      compression: snappy
      encoding:
        codec: json

Version

0.38.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@rightly added the "type: bug (A code related bug.)" label on May 10, 2024
@rightly changed the title from "Log to metric memory leak" to "Log to metric process memory leak" on May 10, 2024
@rightly changed the title from "Log to metric process memory leak" to "metric write sink memory leak" on May 10, 2024
@rightly changed the title from "metric write sink memory leak" to "prometheus_remote_write memory leak" on May 10, 2024
@jszwedko added the "domain: reliability", "domain: performance", and "sink: prometheus_remote_write" labels on May 10, 2024
@jszwedko (Member) commented

Thanks for the detailed report! That does indeed look like a memory leak.
