
Tracking: Address Clustering Issues #784

Open
1 of 9 tasks

thampiotr opened this issue May 7, 2024 · 4 comments
thampiotr (Contributor) commented May 7, 2024

Request

There are a few issues that users have reported and that we are observing which can lead to data problems: 1) gaps in metrics under some circumstances when instances join the cluster, 2) elevated errors and alerts when writing to the TSDB in some cases, and 3) duplicated metrics in other cases.

The extent of these issues is not large, but since they can cause data loss, we want to fully understand the problem and address them.

Use case

Data being sent should not be dropped.

Tasks

  1. (linked issue; labels: bug, enhancement; assignee: thampiotr)
  2. (linked issue; labels: bug, needs-attention, variant/flow)
thampiotr added the enhancement (New feature or request) label May 7, 2024
thampiotr self-assigned this May 7, 2024
diguardiag commented May 8, 2024

I am noticing this behaviour on a Kubernetes cluster (~1800 pods), with an Alloy cluster of 3, Istio present, and pod autodiscovery enabled (roughly the setup sketched after the list below).

  • Pods have a 13 GB memory limit each
  • Peers change constantly, even though they are always up and running (screenshot below)

We're experiencing:

  • metrics data loss
  • many log lines with "Dropped sample for series that was not explicitly dropped via relabelling"
  • cluster pods not agreeing on who is part of the cluster (see screenshot)
    (screenshot attached: screenshot_2024-05-08_at_10.46.42)
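
For reference, the autodiscovery and clustering setup looks roughly like the sketch below. This is a minimal example, not our exact configuration; the component labels and the remote_write URL are placeholders.

discovery.kubernetes "pods" {
  role = "pod"
}

prometheus.scrape "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.default.receiver]

  // With clustering enabled, scrape ownership of each target is
  // distributed across the cluster peers.
  clustering {
    enabled = true
  }
}

prometheus.remote_write "default" {
  endpoint {
    // Placeholder URL, not the real endpoint.
    url = "http://mimir.example.svc/api/v1/push"
  }
}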

gowtham-sundara commented

I noticed an issue today where one of the pods fell out of the cluster: it is present in discovery, but none of the Alloy pods actually scrape it. This didn't go away over a long period of time, so I am not sure if it's related to #1 mentioned above.

christopher-wong commented

Here is the Helm configuration for my deployment of Alloy (24 nodes, so 24 Alloy pods). When I enable clustering by setting alloy.clustering.enabled = true, metrics stop being scraped altogether.

alloy:
  configMap:
    content: |-
      prometheus.remote_write "default" {
        endpoint {
          url = "http://mimir-gateway.monitoring.svc:80/api/v1/push"
        }
      }

      prometheus.operator.servicemonitors "services" {
        forward_to = [prometheus.remote_write.default.receiver]

        clustering {
          enabled = true
        }
      }

      prometheus.operator.podmonitors "pods" {
        forward_to = [prometheus.remote_write.default.receiver]

        clustering {
          enabled = true
        }
      }
  clustering:
    enabled: false      
  resources:
    requests:
      cpu: 100m
      memory: 2Gi
    limits:
      cpu: 1.5
      memory: 12Gi
configReloader:
  resources:
    requests:
      cpu: "1m"
      memory: "5Mi"
    limits:
      cpu: 10m
      memory: 10Mi

itjobs-levi commented

I have 3 Alloy agent replicas (CPU: 1000m / Memory: 4Gi) with clustering mode enabled on both scrape components. They are configured to scrape the unix (node) exporter and the process exporter on about 200 servers at one-minute intervals (roughly the configuration sketched below).
When scraping, many errors such as err-mimir-duplicate-label-names occur in Mimir. According to the Grafana Mimir documentation, err-mimir-duplicate-label-names appears to be caused by records that already exist. I think this is caused by the cluster splitting the scrape jobs for load balancing.

First, is it correct that this is a result of the load balancing? If it does not cause additional load, is it possible to turn off these logs?
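
A simplified sketch of that setup, assuming static targets; the addresses, ports, and remote_write URL below are placeholders rather than the real values.

prometheus.scrape "exporters" {
  // Placeholder targets; in reality ~200 servers, each running the
  // unix (node) exporter and the process exporter.
  targets = [
    {"__address__" = "server-01.example:9100"},
    {"__address__" = "server-01.example:9256"}
  ]
  forward_to      = [prometheus.remote_write.mimir.receiver]
  scrape_interval = "60s"

  // Targets are split between the 3 replicas when clustering is enabled.
  clustering {
    enabled = true
  }
}

prometheus.remote_write "mimir" {
  endpoint {
    // Placeholder URL, not the real endpoint.
    url = "http://mimir.example.svc/api/v1/push"
  }
}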
