Memory leak in native space when metrics enabled [HZ-2074] #23492
Internal Jira issue: HZ-2074

This is related to #17145

There were also other places where we created short-lived Deflaters; explicit closing has been implemented for those as well.
Running SQL soak tests in hazelcast-jet-ansible-tests with the default environment defined in hazelcast-jet-ansible results in one or more member processes being OOM-killed by the OS.
From system logs:

[system log excerpt: the OOM killer terminating the member java process]

No Java OOMs or other errors are reported in the member logs before they are killed. JVM heap usage for each member also stays around 50-60% for the entire test run. The heap is sized to 4gb max, but we can see in the log excerpt above that the RSS of the member java process is ~7.6gb; total physical memory of the instance is 8gb.
pmap data of the member process indicates most of the usage outside of the heap belongs to MALLOC_ARENA segments:

[aggregated pmap output, produced using https://github.com/bric3/java-pmap-inspector]

Running the soak tests with async-profiler set to sample malloc or mprotect syscalls produces the following flamegraph:

[flamegraph: native allocations dominated by Jet job metrics collection/publishing]
This shows a significant number of native allocations coming from Jet job metrics collecting/publishing (via java.util.zip.Deflater). Not closing Inflater/Deflater instances is a known cause of Java native memory leaks (e.g. https://medium.com/swlh/native-memory-the-silent-jvm-killer-595913cba8e7). Investigating JobsMetricsPublisher and MetricsCompressor (which has Deflaters as its members) shows that, when metrics are published for a job, old MetricsCompressors are released. However, the Deflaters they hold aren't explicitly end()ed, meaning their native resources will not be released until their respective finalize() methods are called. While finalization of these objects does release their resources, the finalization thread that runs as part of Java GC has a low priority, so problems arise when objects are enqueued for finalization faster than the finalizer thread can handle them. We can check for this by examining the heap and looking for Deflater instances:

[heap histogram: counts of live Deflater instances, by reachability]
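As an aside, a rough way to spot such a finalization backlog from inside the JVM, without taking a heap dump, is the MemoryMXBean pending-finalization counter. A minimal sketch (not part of the original investigation):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Minimal sketch: a pending-finalization count that keeps climbing suggests
// objects (here, Deflaters) are being enqueued for finalization faster than
// the finalizer thread can drain them.
public class FinalizationBacklog {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        while (true) {
            System.out.println("objects pending finalization: "
                    + memory.getObjectPendingFinalizationCount());
            Thread.sleep(5_000);
        }
    }
}
```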
The histogram shows that most of the live Deflater objects are waiting on finalization. Taking another heap dump later in the run shows that the number of Deflaters waiting to be finalized grows over time:

[second heap histogram: pending-finalization Deflater count increasing as the run progresses]
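The straightforward remedy is to release each Deflater deterministically when its MetricsCompressor is discarded. A minimal sketch of the pattern, with illustrative names rather than the actual Hazelcast code:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Illustrative sketch only, not the real MetricsCompressor: the key point is
// calling end() on release so the native zlib state is freed immediately
// instead of waiting for finalization.
final class CompressorSketch implements AutoCloseable {
    private final Deflater deflater = new Deflater();

    byte[] compress(byte[] input) {
        deflater.reset();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    @Override
    public void close() {
        deflater.end(); // frees the native memory deterministically
    }
}
```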
Disabling Hazelcast metrics is a workaround for this issue. With metrics disabled, the members are no longer OOM-killed during the soak tests and member process RSS doesn't go beyond 4.7gb. I suspect this issue is probably only visible when a lot of Jet jobs are created, which is the case with the SQL soak tests.
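For reference, a minimal sketch of the workaround, assuming the MetricsConfig API available in Hazelcast 4.x and later (the declarative configuration has an equivalent switch):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// Workaround sketch: disable metrics collection entirely so no
// MetricsCompressor (and therefore no Deflater) churn occurs.
public class NoMetricsMember {
    public static void main(String[] args) {
        Config config = new Config();
        config.getMetricsConfig().setEnabled(false);
        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);
    }
}
```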