Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[prometheus] Fixed a race condition happening in simultaneous OpenMetrics and PlainText requests #5517

Closed
wants to merge 31 commits into from

Conversation

hez2010
Copy link

@hez2010 hez2010 commented Apr 8, 2024

Fixes #5515

Changes

Fixed a race condition which can cause returning an incorrect response while OpenMetrics requests and PlainText requests are being processed simultaneously.

Merge requirement checklist

  • CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)

@hez2010 hez2010 requested a review from a team as a code owner April 8, 2024 05:24
Copy link

linux-foundation-easycla bot commented Apr 8, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

Copy link

codecov bot commented Apr 8, 2024

Codecov Report

Attention: Patch coverage is 71.64179% with 19 lines in your changes are missing coverage. Please review.

Project coverage is 85.49%. Comparing base (6250307) to head (665e390).
Report is 222 commits behind head on main.

❗ Current head 665e390 differs from pull request most recent head aed502c. Consider uploading reports for the commit aed502c to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5517      +/-   ##
==========================================
+ Coverage   83.38%   85.49%   +2.11%     
==========================================
  Files         297      289       -8     
  Lines       12531    12515      -16     
==========================================
+ Hits        10449    10700     +251     
+ Misses       2082     1815     -267     
Flag Coverage Δ
unittests ?
unittests-Solution-Experimental 85.25% <71.64%> (?)
unittests-Solution-Stable 85.45% <71.64%> (?)
unittests-Unstable-Core 20.01% <71.64%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...metheus.AspNetCore/PrometheusExporterMiddleware.cs 70.00% <100.00%> (ø)
....Prometheus.HttpListener/PrometheusHttpListener.cs 81.92% <100.00%> (+0.22%) ⬆️
...tpListener/Internal/PrometheusCollectionManager.cs 75.67% <67.79%> (-2.11%) ⬇️

... and 78 files with indirect coverage changes

Co-authored-by: Vishwesh Bankwar <vishweshbankwar@users.noreply.github.com>
@hez2010
Copy link
Author

hez2010 commented Apr 9, 2024

The race condition is still present and need more work. I'm converting it to draft for now.

@hez2010 hez2010 marked this pull request as draft April 9, 2024 17:36
@reyang
Copy link
Member

reyang commented Apr 9, 2024

Thank you for looking into this @hez2010!
Adding @robertcoltheart as FYI.

I think we need some architecture change here. The original design caches the response with the assumption that the same response payload can be used for all scrapers. With the ability to support both Prometheus (text/plain; charset=utf-8; version=0.0.4) and OpenMetrics (application/openmetrics-text; version=1.0.0; charset=utf-8) format, I guess we need to have separate cache for each potential response body, or we don't cache at all.

@hez2010 hez2010 marked this pull request as ready for review April 11, 2024 08:36
@hez2010
Copy link
Author

hez2010 commented Apr 11, 2024

Now it's ready for review.
/cc: @reyang @robertcoltheart @vishweshbankwar @CodeBlanch

@hez2010 hez2010 changed the title Reset cached target info cursor when OpenMetricsRequested is false Fixed a race condition happening in simultaneous OpenMetrics and PlainText requests Apr 11, 2024
Copy link
Member

@reyang reyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some minor issues regarding markdownlint and line-ending, LGTM once those are fixed.

@hez2010
Copy link
Author

hez2010 commented Apr 19, 2024

I pushed a new commit to simplify the code, and resolved another race condition due to the bad exchange of this.collectionTcs.
The test PrometheusExporterMiddlewareIntegration_MultipleRequestsOfDifferentFormatsInParallel now passed with even 100,000,000 requests on my machine.

@vishweshbankwar
Copy link
Member

I pushed a new commit to simplify the code, and resolved another race condition due to the bad exchange of this.collectionTcs. The test PrometheusExporterMiddlewareIntegration_MultipleRequestsOfDifferentFormatsInParallel now passed with even 100,000,000 requests on my machine.

@hez2010 - Could you elaborate what was the race condition?

@hez2010
Copy link
Author

hez2010 commented Apr 19, 2024

@hez2010 - Could you elaborate what was the race condition?

We are neither using lock (obj) nor inserting a memory barrier around this.collectionTcs = ..., so it can sometimes result in accessing an invalid collectionTcs by another thread due to memory order issues.
Here I changed it to use CompareExchange instead.

You can observe the issue by running the test I added (and better to increase the request times). Without the change, the test sometimes never finishes.

@vishweshbankwar
Copy link
Member

@hez2010 - Could you elaborate what was the race condition?

We are neither using lock (obj) nor inserting a memory barrier around this.collectionTcs = ..., so it can sometimes result in accessing an invalid collectionTcs by another thread due to memory order issues. Here I changed it to use CompareExchange instead.

You can observe the issue by running the test I added (and better to increase the request times). Without the change, the test sometimes never finishes.

This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here

@hez2010
Copy link
Author

hez2010 commented Apr 20, 2024

Seems that I accidentally broke the cache. Working on a fix.

@hez2010
Copy link
Author

hez2010 commented Apr 20, 2024

It turns out to be an invalid assert in the test.

@hez2010
Copy link
Author

hez2010 commented Apr 20, 2024

This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here

Yes. My newly added test sometimes never finishes against the existing code as well.

@hez2010
Copy link
Author

hez2010 commented Apr 20, 2024

All things should work expected now. PTAL.

@vishweshbankwar
Copy link
Member

This issue would surface even without any of these changes? Meaning, if we just run the test with single value passed here

Yes. My newly added test sometimes never finishes against the existing code as well.

Thanks for confirming @hez2010. Looks like this PR is now trying to solve two issues.

  1. Issue of supporting two formats simultaneously.
  2. Concurrency issue.

Here is what I think we should do.

  1. Keep this PR simple and solve the issue of not supporting two formats simultaneously. The original change you had by simply adding two buffers.

  2. Propose a separate change for fixing concurrency issue. A detailed issue with a simple test first would be helpful.

@hez2010
Copy link
Author

hez2010 commented Apr 23, 2024

I can separate this PR into two parts as you mentioned, but the tests will gonna fail without the fix for the concurrency issue (I think you may have noticed that some previous CI run timeouts because they failed to complete the test due to this) so that the PR will not be a good state that can be merged IMO.

@vishweshbankwar
Copy link
Member

I can separate this PR into two parts as you mentioned, but the tests will gonna fail without the fix for the concurrency issue (I think you may have noticed that some previous CI run timeouts because they failed to complete the test due to this) so that the PR will not be a good state that can be merged IMO.

For the test, you can change it to sequential execution for now. The test should validate that the exporter can serve both the formats. When working on concurrency issue, can add the parallel test back.

Copy link
Contributor

github-actions bot commented May 1, 2024

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

@github-actions github-actions bot added Stale and removed Stale labels May 1, 2024
@vishweshbankwar
Copy link
Member

@hez2010 - Checking to see if you are planning to do the changes based on the discussion above?

@hez2010
Copy link
Author

hez2010 commented May 6, 2024

@hez2010 - Checking to see if you are planning to do the changes based on the discussion above?

Thanks for pinging!

Sorry for the delay. I have been busy recently so may not be able to apply the changes right now.
While feel free to pick up the PR if it's urgent.

Copy link
Contributor

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

@github-actions github-actions bot added the Stale label May 14, 2024
@robertcoltheart
Copy link
Contributor

I can probably pick this up tonight if others are busy.

@robertcoltheart
Copy link
Contributor

I've raised a PR for the first issue, which is splitting the buffers here: #5623 It also optimizes the generation to only the type requested, instead of generating both open metrics and plain.

@vishweshbankwar
Copy link
Member

closing in favor of #5623

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg:OpenTelemetry.Exporter.Prometheus.AspNetCore Issues related to OpenTelemetry.Exporter.Prometheus.AspNetCore NuGet package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prometheus exporter sometimes produces malformed data
6 participants