Memory leak in native space when metrics enabled [HZ-2074] #23492

Closed
sumnerib opened this issue Jan 26, 2023 · 6 comments · Fixed by #24225

@sumnerib
Contributor

Running SQL soak tests in hazelcast-jet-ansible-tests with the default environment defined in hazelcast-jet-ansible results in one or more member processes being OOM-killed by the OS.

From system logs:

Jan 16 23:18:43 ip-10-0-0-156 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=java,pid=16797,uid=1000
Jan 16 23:18:43 ip-10-0-0-156 kernel: Out of memory: Killed process 16797 (java) total-vm:11187400kB, anon-rss:7595924kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:18880kB oom_score_adj:0
Jan 16 23:18:44 ip-10-0-0-156 hazelcast-jetd: /usr/local/sbin/hazelcast-jetd: line 3: 16797 Killed                  /home/ec2-user/java/bin/java -Xms4G -Xmx4G -XX:MaxMetaspaceSize=256M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/logs/memory.dump -Xloggc:/home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/logs/hazelcast-jet.gc.log -verbose:gc -javaagent:/home/ec2-user/ansible/jolokia-jvm-1.3.7-agent.jar -Dlog4j.configurationFile=file:/home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/config/log4j2.properties -Dhazelcast.jmx=true -Dhazelcast.config=/home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/config/hazelcast.yaml -cp /home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/lib/*:/home/ec2-user/ansible/cache-api-1.1.1.jar com.hazelcast.core.server.HazelcastMemberStarter > /home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/logs/hazelcast-jet.output.log 2>&1
Jan 16 23:18:44 ip-10-0-0-156 systemd: hazelcast-jetd.service: main process exited, code=exited, status=137/n/a

No Java OOMs or other errors are reported in the member logs before they are killed, and JVM heap usage for each member stays around 50-60% for the entire test run. The heap is capped at 4 GB, yet the log above shows the RSS of the member's java process at ~7.6 GB, while total physical memory of the instance is 8 GB. pmap data for the member process indicates that most of the usage outside the heap belongs to MALLOC_ARENA:

Memory mappings:
         JAVA_HEAP count=1     reserved=5242880    rss=4206264   
       MAPPED_FILE count=25    reserved=173820     rss=11228     
           UNKNOWN count=218   reserved=928160     rss=593140    
      MALLOC_ARENA count=68    reserved=4456448    rss=2495704
  UNKNOWN_SEGMENT2 count=16    reserved=32768      rss=32760     
SHARED_MAPPED_FILE count=73    reserved=25484      rss=332       
  UNKNOWN_SEGMENT1 count=14    reserved=43064      rss=30008     
         CODE_HEAP count=1     reserved=245760     rss=69000     

stats: vsz=11148384 kib, rss=7438436 kib

(Aggregated pmap data produced using https://github.com/bric3/java-pmap-inspector.)

Running the soak tests with async-profiler set to sample malloc or mprotect syscalls produces the following flamegraph:
[Flame graph: mprotect sampling, metrics enabled]

This shows a significant number of native allocations coming from Jet job metrics collection/publishing (via java.util.zip.Deflater). Failing to close Inflater/Deflater instances is a known cause of Java native memory leaks (see https://medium.com/swlh/native-memory-the-silent-jvm-killer-595913cba8e7). Investigating JobMetricsPublisher and MetricsCompressor (which holds Deflaters as members) shows that, when metrics are published for a job, old MetricsCompressors are released; however, the Deflaters they hold are not explicitly end()ed. This means their native resources are not released until their respective finalize() methods run. While finalization of these objects does release the resources, the finalization thread that runs as part of Java GC has a low priority, and problems occur when objects are enqueued for finalization faster than the finalizer thread can process them. We can check for this problem by examining the heap and looking for Deflater instances:

[Heap dump screenshot: live Deflater instance counts]

The above shows that most of the live Deflater objects are waiting on finalization. Taking another heap dump later in the run shows that the number of Deflaters waiting to be finalized grows over time:

[Heap dump screenshot: Deflater instances pending finalization, later in the run]

Disabling Hazelcast metrics is a workaround for this issue: the members are no longer OOM-killed during the soak tests and member process RSS doesn't go beyond 4.7 GB. This issue is probably only visible when a lot of Jet jobs are created, which is the case with the SQL soak tests.
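
To make the leak mechanism concrete, here is a minimal, hypothetical sketch (not Hazelcast's actual MetricsCompressor code): the leaky variant drops the Deflater and leaves its native zlib buffers to the finalizer, while the fixed variant calls end() in a finally block so the native memory is released deterministically.

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

class DeflaterLeakSketch {

    // Leaky pattern: the Deflater is simply dropped after use. Its native zlib
    // buffers stay allocated until the finalizer thread eventually runs
    // finalize() on it, which can lag far behind the allocation rate.
    static byte[] compressLeaky(byte[] blob) {
        Deflater deflater = new Deflater();
        deflater.setInput(blob);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    // Fixed pattern: end() releases the native memory as soon as we are done.
    static byte[] compressAndEnd(byte[] blob) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(blob);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }
}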

sumnerib added the Type: Defect, Source: Internal, and Module: Jet labels on Jan 26, 2023
sumnerib added this to the 5.3 Backlog milestone on Jan 26, 2023
sumnerib changed the title from "Memory leak in native space when metrics enabled [HZ-1969]" to "Memory leak in native space when metrics enabled" on Jan 26, 2023
degerhz modified the milestones: 5.3 Backlog → 5.3.0 on Feb 22, 2023
github-actions bot changed the title from "Memory leak in native space when metrics enabled" to "Memory leak in native space when metrics enabled [HZ-2074]" on Feb 22, 2023
@github-actions
Contributor

Internal Jira issue: HZ-2074

@k-jamroz
Contributor

k-jamroz commented Feb 22, 2023

Maybe this is also related to failures like https://jenkins.hazelcast.com/job/Hazelcast-EE-pr-builder/7090/ in HDMapMemoryLeakStressTest: com.hazelcast.memory.NativeOutOfMemoryError: Data allocation request cannot be satisfied! Not enough contiguous memory available! Cannot allocate 3647 bytes! [MaxNative: 128 MB, CommittedNative: 128 MB, UsedNative: 128 MB, FreePhysical: 587 GB]

@k-jamroz
Contributor

k-jamroz commented Apr 6, 2023

JobMetricsPublisher uses compressor.getBlobAndReset(), which ends the Deflater and creates a new one for the next metric collection. But I can see another problem: JobMetricsPublisher has an executionIdToCompressor map. It removes empty compressors, but without ending the Deflater (the one created by getBlobAndReset()).
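
A sketch of the fix shape for that removal path (class, field, and method names here are illustrative, not the actual Hazelcast implementation, which keeps MetricsCompressor instances rather than raw Deflaters in the map): end the Deflater when the compressor is evicted instead of leaving it to the finalizer.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.Deflater;

class JobMetricsPublisherSketch {

    private final Map<Long, Deflater> executionIdToCompressor = new ConcurrentHashMap<>();

    void onCompressorEmpty(long executionId) {
        // Before the fix: remove(...) silently dropped the compressor, leaving
        // the Deflater's native buffers to the finalizer.
        Deflater removed = executionIdToCompressor.remove(executionId);
        if (removed != null) {
            // After the fix: release the native zlib memory deterministically.
            removed.end();
        }
    }
}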

@k-jamroz
Contributor

k-jamroz commented Apr 6, 2023

This is related to #17145

@k-jamroz
Contributor

k-jamroz commented Apr 6, 2023

There were also other places where we created short-lived Deflaters; explicit closing has been implemented for those.
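
As an illustration of explicit closing for a short-lived Deflater (a hedged sketch, not the actual code changed in the linked PR): when a Deflater is supplied to a DeflaterOutputStream by the caller, closing the stream does not end() it, so the caller still has to release it explicitly.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

class ShortLivedDeflaterSketch {

    static byte[] compressOnce(byte[] data) throws IOException {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // Closing the stream finishes the compressed output, but because the
            // Deflater was passed in by us, close() does not end() it.
            try (DeflaterOutputStream out = new DeflaterOutputStream(bytes, deflater)) {
                out.write(data);
            }
            return bytes.toByteArray();
        } finally {
            // Explicit closing: free the native memory instead of relying on finalization.
            deflater.end();
        }
    }
}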

k-jamroz added a commit that referenced this issue Apr 18, 2023
…24225)

Free native memory used by MetricsCompressors deterministically when they are disposed of, instead of relying on the finalizer; the finalizer may be too slow in a loaded system.

Fixes #23492 [HZ-2074]