
Adding a feature to prometheusexporter to export exemplars along with histogram metrics #9945

Merged
merged 6 commits into from Jun 28, 2022

Conversation

arun-shopify
Contributor

@arun-shopify arun-shopify commented May 10, 2022

Description:
Adding a feature to prometheusexporter to export exemplars along with histogram metrics. In order to add this feature to the prometheusexporter, we made a PR (prometheus/client_golang#986, merged) to add exemplar support to the prometheus/client_golang library.

Exemplars only show up in the OpenMetrics format, so I have added a config option, EnableOpenMetrics, to the Prometheus request handler that can be used to toggle the output format.
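
For context, the exporter's new enable_open_metrics option presumably maps to the EnableOpenMetrics option of client_golang's promhttp handler. A minimal standalone sketch of that handler option (not the exporter's actual wiring; the endpoint is illustrative):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()

	// With EnableOpenMetrics set, the handler can negotiate the OpenMetrics
	// exposition format (Accept: application/openmetrics-text), which is the
	// only format that carries exemplars.
	handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{
		EnableOpenMetrics: true,
	})
	http.Handle("/metrics", handler)
	log.Fatal(http.ListenAndServe("localhost:8889", nil))
}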

Testing:

Example config for testing the prometheusexporter in opentelemetry-collector:

receivers:
  nop: {}
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus

exporters:
  nop: {}
  prometheus:
    endpoint: "localhost:8889"
    namespace: "test"
    enable_open_metrics: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics]
      exporters: [nop]
    metrics:
      receivers: [nop]
      exporters: [prometheus]

To view the exemplars from the Prometheus exporter, you have to set the following header in your request:

Accept: application/openmetrics-text;

Example curl command:

curl -H 'Accept: application/openmetrics-text; version=0.0.1' http://localhost:8889/metrics

Sample output for one of the buckets:

test_latency_bucket{operation="TestSubSpan",service_name="tracegen",span_kind="SPAN_KIND_INTERNAL",status_code="STATUS_CODE_UNSET",le="4.0"} 140 1.645646235482e+09 # {trace_id="739d03769cc2b850576987fcec20c058"} 0.000583 1.645646235482858e+09

@linux-foundation-easycla

linux-foundation-easycla bot commented May 10, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: arun-shopify / name: Arun Mahendra (d11fea1)

@arun-shopify arun-shopify reopened this May 11, 2022
@arun-shopify arun-shopify marked this pull request as ready for review May 11, 2022 17:15
@arun-shopify arun-shopify requested review from a team and Aneurysm9 as code owners May 11, 2022 17:15
mx-psi
mx-psi previously requested changes May 12, 2022
exporter/prometheusexporter/go.mod Outdated
@mx-psi mx-psi dismissed their stale review May 13, 2022 07:41

Blocking issue was addressed

@arun-shopify arun-shopify requested a review from mx-psi May 30, 2022 19:41
go.mod
Member

@mx-psi mx-psi left a comment


Thanks for addressing the changes! I think you will need a changelog entry for this to pass. I also left a small (non-blocking) comment to get your opinion.

exporter/prometheusexporter/config.go
@mx-psi
Member

mx-psi commented Jun 28, 2022

Thanks for your patience @arun-shopify, sorry it took so long to merge this.

@alvinhom

Hi, I was waiting to test the exemplar support from the Prometheus exporter, so I did a local build with this PR. When I restarted my collector, I saw the following error in the collector logs:

2022/06/28 16:48:50 http: panic serving 10.42.15.30:39238: no bucket was found for given exemplar value
goroutine 184 [running]:
net/http.(*conn).serve.func1()
net/http/server.go:1825 +0xbf
panic({0x453d2e0, 0x5da0920})
runtime/panic.go:844 +0x258
github.com/prometheus/client_golang/prometheus.(*withExemplarsMetric).Write(0xc001504570, 0xc000234770)
github.com/prometheus/client_golang@v1.12.2/prometheus/metric.go:188 +0x1c5
github.com/prometheus/client_golang/prometheus.processMetric({0x5dc3a98, 0xc001504570}, 0xc00098e180?, 0x0?, 0x0)
github.com/prometheus/client_golang@v1.12.2/prometheus/registry.go:605 +0x98
github.com/prometheus/client_golang/prometheus.(*Registry).Gather(0xc0009b3220)
github.com/prometheus/client_golang@v1.12.2/prometheus/registry.go:509 +0x71f
github.com/prometheus/client_golang/prometheus.(*noTransactionGatherer).Gather(0x8?)
github.com/prometheus/client_golang@v1.12.2/prometheus/registry.go:1041 +0x22
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1({0x5de35c8, 0xc0013f08c0}, 0xc001c4a700)
github.com/prometheus/client_golang@v1.12.2/prometheus/promhttp/http.go:133 +0xfe
net/http.HandlerFunc.ServeHTTP(0x7f1fcb879fc0?, {0x5de35c8?, 0xc0013f08c0?}, 0x40da65?)
net/http/server.go:2084 +0x2f
net/http.(*ServeMux).ServeHTTP(0x0?, {0x5de35c8, 0xc0013f08c0}, 0xc001c4a700)
net/http/server.go:2462 +0x149
net/http.serverHandler.ServeHTTP({0xc0013b5d10?}, {0x5de35c8, 0xc0013f08c0}, 0xc001c4a700)
net/http/server.go:2916 +0x43b
net/http.(*conn).serve(0xc000475400, {0x5de5d88, 0xc0004553b0})
net/http/server.go:1966 +0x5d7
created by net/http.(*Server).Serve
net/http/server.go:3071 +0x4db

If I curl the metrics endpoint, it returns empty data. My pipeline goes like this:

Java Agent (otel exporter trace/metrics) -> otel collector (prometheus exporter) -> prometheus

The metrics were working fine previously; however, they obviously were not getting exemplars.

I tried enabling and disabling the "enable_open_metrics" flag, but it does not seem to matter; I get the same error either way.

@arun-shopify
Contributor Author

Hi, I was waiting to test the exemplar support from the Prometheus exporter, so I did a local build with this PR. When I restarted my collector, I saw the following error in the collector logs: […]

@alvinhom, I tested it out using the latest prometheusexporter code from the Otel contrib repo with the latest Otel collector (v0.54.0) and it seemed to work just fine. Here is the config that I used for testing:

receivers:
  # Dummy receiver that's never used, because a pipeline is required to have one.
  otlp/spanmetrics:
    protocols:
      grpc:
        endpoint: "localhost:12345"
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  spanmetrics:
    metrics_exporter: prometheus
    dimensions_cache_size: 500

exporters:
  logging:
    logLevel: debug
  prometheus:
    endpoint: "localhost:8889"
    send_timestamps: true
    namespace: "test"
    enable_open_metrics: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics]
      exporters: [logging]
    metrics:
      receivers: [otlp/spanmetrics]
      exporters: [prometheus]

@mx-psi
Member

mx-psi commented Jul 1, 2022

@alvinhom if the issue is not (necessarily) related to this PR, I would suggest you open an issue for this, so that more people can easily see it and participate

@alvinhom

alvinhom commented Jul 1, 2022

I am pretty sure the issue is related to this PR, as the error message is coming from the updated client_golang library and the collector was working correctly until I upgraded. Maybe it has something to do with exemplars in the +Inf range; I am not sure. I will try to get more information by adding logging to the upstream Java agent to log all the metrics with exemplars and see if that helps.

@alvinhom

alvinhom commented Jul 1, 2022

I think the issue occurs when there are exemplars associated with the last bucket, x to +Inf.

I added the following code to collector_test.go:TestConvertDoubleHistogramExemplar:

exemplars := []prometheus.Exemplar{
	{
		Timestamp: exemplarTs,
		Value:     3,
		Labels:    prometheus.Labels{"test_label_0": "label_value_0"},
	},
	{
		Timestamp: exemplarTs,
		Value:     50,
		Labels:    prometheus.Labels{"test_label_1": "label_value_1"},
	},
	{
		Timestamp: exemplarTs,
		Value:     78,
		Labels:    prometheus.Labels{"test_label_2": "label_value_2"},
	},
	{
		// Added: an exemplar whose value (100) is above the highest
		// explicit bucket bound, i.e. it falls into the implicit +Inf bucket.
		Timestamp: exemplarTs,
		Value:     100,
		Labels:    prometheus.Labels{"test_label_3": "label_value_3"},
	},
}

require.Equal(t, 100.0, buckets[3].GetExemplar().GetValue())
require.Equal(t, int32(128654848), buckets[3].GetExemplar().GetTimestamp().GetNanos())
require.Equal(t, 1, len(buckets[3].GetExemplar().GetLabel()))
require.Equal(t, "test_label_3", buckets[3].GetExemplar().GetLabel()[0].GetName())
require.Equal(t, "label_value_3", buckets[3].GetExemplar().GetLabel()[0].GetValue())

Basically, I added an exemplar associated with the last bucket.

When I run the test, I get the same error:

--- FAIL: TestConvertDoubleHistogramExemplar (0.00s)
panic: no bucket was found for given exemplar value [recovered]
panic: no bucket was found for given exemplar value

goroutine 148 [running]:
testing.tRunner.func1.2({0x4597040, 0x5dc9da0})
/usr/local/Cellar/go/1.18.3/libexec/src/testing/testing.go:1389 +0x366
testing.tRunner.func1()
/usr/local/Cellar/go/1.18.3/libexec/src/testing/testing.go:1392 +0x5d2
panic({0x4597040, 0x5dc9da0})
/usr/local/Cellar/go/1.18.3/libexec/src/runtime/panic.go:844 +0x258
github.com/prometheus/client_golang/prometheus.(*withExemplarsMetric).Write(0xc000492a50, 0xc000a93ab0)
/Users/ahom/Downloads/dev/go/pkg/mod/github.com/prometheus/client_golang@v1.12.2-0.20220318110013-3bc8f2c651ff/prometheus/metric.go:188 +0x465
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.TestConvertDoubleHistogramExemplar(0x0?)
/Users/ahom/Downloads/dev/linux/opentelemetry-collector-contrib-snapshot/exporter/prometheusexporter/collector_test.go:161 +0x11c8
testing.tRunner(0xc0006031e0, 0x4eab610)
/usr/local/Cellar/go/1.18.3/libexec/src/testing/testing.go:1439 +0x214
created by testing.(*T).Run
/usr/local/Cellar/go/1.18.3/libexec/src/testing/testing.go:1486 +0x725
FAIL github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter 1.354s
FAIL
make[2]: *** [test] Error 1

@mx-psi
Member

mx-psi commented Jul 4, 2022

So indeed the panic only happens after this PR. From the comment above the line where this panics:
https://github.com/prometheus/client_golang/blob/66837e3298bdc57a828794c23bacb253ed8c04cd/prometheus/metric.go#L187-L188

I would assume this is not supported and that the problem comes from the application producing the data, but it would be great to confirm this

@arun-shopify
Contributor Author

arun-shopify commented Jul 4, 2022

Support for the -Inf bucket was removed in the latest prometheus/client_golang release, so a panic for -Inf would make sense, but the +Inf bucket should still work, I suppose: https://github.com/prometheus/client_golang/releases/tag/v1.12.2

@arun-shopify
Contributor Author

So the panic is because the exemplar value does not fall within any of the bucket bounds. @alvinhom, would the value happen to be outside of the default bucket bounds DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}?

@arun-shopify
Contributor Author

@alvinhom, in your test case, if the value is 89 it works, since that is less than the highest upper bound in []float64{5, 25, 90}.
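
For reference, here is a minimal standalone sketch (not code from this PR; the metric name, counts, and exemplar value are illustrative) that reproduces the panic with client_golang v1.12.2's const-metric exemplar API: the exemplar value 100 falls past the highest explicit bound in []float64{5, 25, 90}, i.e. into the implicit +Inf bucket.

package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func main() {
	desc := prometheus.NewDesc("test_latency", "illustrative histogram", nil, nil)

	// Const histogram with explicit upper bounds 5, 25 and 90
	// (cumulative bucket counts 1, 3 and 5; total count 6, sum 250).
	h := prometheus.MustNewConstHistogram(desc, 6, 250, map[float64]uint64{5: 1, 25: 3, 90: 5})

	// Exemplar value 100 is greater than the highest explicit bound (90),
	// so it belongs to the implicit +Inf bucket.
	m := prometheus.MustNewMetricWithExemplars(h, prometheus.Exemplar{
		Value:     100,
		Timestamp: time.Now(),
		Labels:    prometheus.Labels{"trace_id": "739d03769cc2b850576987fcec20c058"},
	})

	var out dto.Metric
	// With client_golang v1.12.2 this panics with
	// "no bucket was found for given exemplar value"; with Value set to 89
	// instead, Write succeeds and the exemplar is attached to the le="90" bucket.
	if err := m.Write(&out); err != nil {
		fmt.Println(err)
	}
}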

mx-psi added a commit to mx-psi/opentelemetry-collector-contrib that referenced this pull request Jul 5, 2022
mx-psi added a commit to mx-psi/opentelemetry-collector-contrib that referenced this pull request Jul 5, 2022
@mx-psi
Member

mx-psi commented Jul 5, 2022

@arun-shopify @alvinhom after discussion with the release manager, since this is a widely used component and the issue can cause the Collector to crash without any known workaround, we have decided to revert this PR before it lands in any Collector release (the revert PR is #12074).

The functionality added by this PR can of course be accepted in a future PR, after we understand the issue better.

mx-psi added a commit that referenced this pull request Jul 5, 2022
Revert "Adding a feature to prometheusexporter to export exemplars along with… (#9945)" (#12074)

* Revert "Adding a feature to prometheusexporter to export exemplars along with… (#9945)"

This reverts commit e5a0a29.

* Remove release note for #9945
@arun-shopify
Contributor Author

@alvinhom would you be able to share the config you are using?

Konig-Corey pushed a commit to Konig-Corey/opentelemetry-collector-contrib that referenced this pull request Jul 5, 2022
Revert "Adding a feature to prometheusexporter to export exemplars along with… (open-telemetry#9945)" (open-telemetry#12074)

* Revert "Adding a feature to prometheusexporter to export exemplars along with… (open-telemetry#9945)"

This reverts commit e5a0a29.

* Remove release note for open-telemetry#9945
@alvinhom

alvinhom commented Jul 5, 2022

So indeed the panic only happens after this PR. From the comment above the line where this panics: https://github.com/prometheus/client_golang/blob/66837e3298bdc57a828794c23bacb253ed8c04cd/prometheus/metric.go#L187-L188

I would assume this is not supported and that the problem comes from the application producing the data, but it would be great to confirm this

The metric data is produced by the Java OpenTelemetry agent. The agent just outputs the data and forwards it to the otel-collector, and I use the Prometheus exporter to export that to Prometheus. I do not have any custom instrumentation code.

Also, I would assume that exemplar data from the largest boundary to +Inf should be supported. I am not sure why it is not.

@alvinhom

alvinhom commented Jul 5, 2022

So the panic is because the exemplar value does not fall within any of the bucket bounds. @alvinhom, would the value happen to be outside of the default bucket bounds DefBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}?

We are using the default bucket bounds, but I am pretty sure that we have HTTP requests with response times greater than 10s, as there is some code that issues complex queries to the DB. We definitely want to capture those exemplars, as they are valuable for our investigations.

@alvinhom

alvinhom commented Jul 5, 2022

@alvinhom would you be able to share the config you are using?

The data pipeline is like this:

OpenTelemetry Java Instrumentation agent (with auto instrumentation) --> OTel Collector Contrib --> Prometheus

The collector configuration file is like this:

receivers:
  otlp:
    protocols:
      grpc:
processors:
  tail_sampling:
    decision_wait: 5s
    policies:
      [
        {
           name: probability,
           type: probabilistic,
           probabilistic: { sampling_percentage: 10 },
        },
        { name: error,
          type: status_code,
          status_code: { status_codes: [ERROR]} },
        { name: healthcheck,
          type: string_attribute,
          string_attribute: {
               key: 'http.target',
               values: ['^\/(?:metrics\/.*|.*readiness|.*liveness|.*version|.*health|.*version.jsp|.*health.jsp)$'],
               enabled_regex_matching: true,
               invert_match: true } },
        { name: slow,
          type: latency,
          latency: { threshold_ms: 1000 } },
      ]
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    resource_to_telemetry_conversion:
      enabled: true
    enable_open_metrics: true
  jaeger:
    endpoint: "jaeger-es-collector.observability.svc:14250"
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [jaeger]

Konig-Corey pushed a commit to Konig-Corey/opentelemetry-collector-contrib that referenced this pull request Jul 6, 2022
Revert "Adding a feature to prometheusexporter to export exemplars along with… (open-telemetry#9945)" (open-telemetry#12074)

* Revert "Adding a feature to prometheusexporter to export exemplars along with… (open-telemetry#9945)"

This reverts commit e5a0a29.

* Remove release note for open-telemetry#9945
Konig-Corey pushed a commit to Konig-Corey/opentelemetry-collector-contrib that referenced this pull request Jul 6, 2022
Revert "Adding a feature to prometheusexporter to export exemplars along with… (open-telemetry#9945)" (open-telemetry#12074)

* Revert "Adding a feature to prometheusexporter to export exemplars along with… (open-telemetry#9945)"

This reverts commit e5a0a29.

* Remove release note for open-telemetry#9945
@tanner-bruce

Root cause of the issue appears to be in prometheus/client_golang. The const-histogram code assumes that the +Inf bucket cannot have exemplars, which is incorrect. The +Inf bucket is assumed to be implicit, but if it has exemplars, that bucket is made explicit; a similar approach should work for the const buckets.

Reverting seems like overkill here, rather than simply filtering out exemplars whose value is greater than the final bucket bound and fixing forward, but it is too late for that.
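
As a sketch of that filtering idea (a hypothetical helper, not code from this PR or from client_golang; the function and parameter names are illustrative), assuming exemplars are represented with client_golang's prometheus.Exemplar type:

package exemplarutil

import (
	"github.com/prometheus/client_golang/prometheus"
)

// filterExemplars drops any exemplar whose value lies beyond the highest
// explicit bucket bound, i.e. would land in the implicit +Inf bucket, which
// client_golang v1.12.2 cannot attach exemplars to without panicking.
// upperBounds are the histogram's explicit upper bounds in ascending order.
func filterExemplars(exemplars []prometheus.Exemplar, upperBounds []float64) []prometheus.Exemplar {
	if len(upperBounds) == 0 {
		return nil
	}
	highest := upperBounds[len(upperBounds)-1]
	kept := make([]prometheus.Exemplar, 0, len(exemplars))
	for _, e := range exemplars {
		if e.Value <= highest {
			kept = append(kept, e)
		}
	}
	return kept
}

The filtered slice could then be passed to the const-metric exemplar wrapper in place of the full list.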

@alvinhom

alvinhom commented Jul 8, 2022

I implemented the workaround suggested above to filter out exemplars in the +Inf bucket, and the panic is gone. The fix should really be in prometheus/client_golang.

I have also found that, with my pipeline from the Java agent -> otel collector, the trace ID is not set in the Prometheus labels. I had to add code to get the trace ID and span ID and add them to the Prometheus labels.

With the above changes, I was able to get exemplars working, integrate with Grafana, and see the exemplars.
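
For what it's worth, attaching the IDs does not need any special API on the Prometheus side; once the trace and span IDs are available as raw bytes, they can be hex-encoded into exemplar labels. A hypothetical sketch (the accessor shape and the span_id label are assumptions; only trace_id appears in the sample output above):

package exemplarlabels

import (
	"encoding/hex"

	"github.com/prometheus/client_golang/prometheus"
)

// exemplarLabels builds Prometheus exemplar labels from raw OTLP trace and
// span IDs (16 and 8 bytes respectively).
func exemplarLabels(traceID [16]byte, spanID [8]byte) prometheus.Labels {
	return prometheus.Labels{
		"trace_id": hex.EncodeToString(traceID[:]),
		"span_id":  hex.EncodeToString(spanID[:]),
	}
}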

atoulme pushed a commit to atoulme/opentelemetry-collector-contrib that referenced this pull request Jul 16, 2022
Adding a feature to prometheusexporter to export exemplars along with… (open-telemetry#9945)

* Adding a feature to prometheusexporter to export exemplars along with histogram metrics

* lint fix and make gotidy

* read me update

* go mod tidy
atoulme pushed a commit to atoulme/opentelemetry-collector-contrib that referenced this pull request Jul 16, 2022
Revert "Adding a feature to prometheusexporter to export exemplars along with… (open-telemetry#9945)" (open-telemetry#12074)

* Revert "Adding a feature to prometheusexporter to export exemplars along with… (open-telemetry#9945)"

This reverts commit e5a0a29.

* Remove release note for open-telemetry#9945
@arun-shopify
Contributor Author

New PR with the fix:
#13127
