Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ThreadRegistry#register behavior to ensure correct Prom metrics #4300

Merged
merged 4 commits into from
May 22, 2024

Conversation

michaeljmarshall
Copy link
Member

@michaeljmarshall michaeljmarshall commented Apr 19, 2024

Descriptions of the changes in this PR:

There is an issue in the ThreadRegistry that causes incomplete Prometheus metrics reporting. It is trivial to reproduce:

  1. Start bookies with multiple journal directories.
  2. Start multiple clients writing to the bookies.
  3. Observe that the bookkeeper_server_ADD_ENTRY_count does not increment at a similar rate to the journal's add entry rate.

The bug can be understood abstractly by considering the following:

  1. The threadPoolMap in ThreadRegistry goes from thread id to ThreadPoolThread.
  2. The ThreadPoolThread is defined by the thread pool and the configured id.
  3. Every time we set the id for step 2, we use 0.
  4. When we have multiple journals, we call the same ThreadRegistry.register(journalThreadName, 0); method multiple times.
  5. The threadPoolMap ends up with multiple identical values for different keys.
  6. ThreadScopedDataSketchesStatsLogger#getStatsLogger stores the identical values in the provider.opStats map and ultimately overwrites map values, which leads to an under-reporting of metrics.

Motivation

Correct bookkeeper metrics reporting by ensuring that threads are properly registered in the ThreadRegistry.

Changes

I replaced the static map in ThreadRegistry#register with a static ThreadLocal instance. While the original work in #2839, which was part of #2835, specifically indicated it did not want to use thread local storage (TLS) due to its complexity, I think it is worth it because of the added flexibility. Specifically, it is much simpler to handle cleanup of unnecessary metadata by using thread locals. Note that the thread id, previously used as the key in the static map, is not unique to the thread after the thread goes away, which poses problems if any threads recycle. While BK doesn't generally recycle these threads, the test framework does, so it helps to make tests pass while still being able to make useful assertions about state.

An interesting design detail that I noticed and did not change is documented in the ThreadRegistry. Specifically, it is that threads that go away will continue to report metrics. I think this is fine, but it's worth calling out.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great Catch

This is a bad bug affecting also 4.16. we will discuss cutting an hotfix release also for that branch

@michaeljmarshall
Copy link
Member Author

Notably, this PR is particularly relevant after 4.16 because bookkeeper_server_ADD_ENTRY_count is recommended as a replacement for bookkeeper_server_ADD_ENTRY_REQUESTS_count.

See apache/pulsar#21766, #4179, #3837

@lhotari
Copy link
Member

lhotari commented Apr 19, 2024

Great catch @michaeljmarshall

@lhotari
Copy link
Member

lhotari commented Apr 19, 2024

checkstyle error

Error:  /home/runner/work/bookkeeper/bookkeeper/stats/bookkeeper-stats-api/src/main/java/org/apache/bookkeeper/stats/ThreadRegistry.java:55: Line is longer than 120 characters (found 156). [LineLength]
Audit done.
[INFO] There is 1 error reported by Checkstyle 9.3 with buildtools/src/main/resources/bookkeeper/checkstyle.xml ruleset.
Error:  src/main/java/org/apache/bookkeeper/stats/ThreadRegistry.java:[55] (sizes) LineLength: Line is longer than 120 characters (found 156).

Copy link
Member

@shoothzj shoothzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice Catch.

Copy link
Contributor

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Great work

@eolivelli
Copy link
Contributor

in order to let the tests pass I suggest to remove the assertions and simply use "log.error". In production you will see ERRORs.
The problem is that in tests we launch multiple bookies in the same JVM

Copy link
Contributor

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaeljmarshall tests are failing because most of them spin up N bookies in the same JVM, and since ThreadRegistry is static they overlap each other.

I suggest we replace the assertion with a log.error

@shoothzj
Copy link
Member

@eolivelli @nicoloboschi Agreed, I think we can have an issue to track subsequent refactorings to non-static.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I added two minor suggestions

@nicoloboschi nicoloboschi merged commit 1d82b92 into apache:master May 22, 2024
21 checks passed
@nicoloboschi nicoloboschi added this to the 4.18.0 milestone May 22, 2024
shoothzj pushed a commit that referenced this pull request May 25, 2024
…4300)

* Make tests fail

* Fix ThreadRegistry#register to ensure correct Prom metrics

* Change style to match BK standards

* Fix tests

---------

Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
(cherry picked from commit 1d82b92)
@hangc0276
Copy link
Contributor

Hi @michaeljmarshall Could you add a test to protect this change?

zhiheng123 pushed a commit to zhiheng123/bookkeeper that referenced this pull request May 28, 2024
…pache#4300)

* Make tests fail

* Fix ThreadRegistry#register to ensure correct Prom metrics

* Change style to match BK standards

* Fix tests

---------

Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>

---------

(cherry picked from commit 1d82b92)
Conflicts:
  bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/BookieImpl.java
  bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/SyncThread.java
shoothzj pushed a commit that referenced this pull request May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants