Memory leak in native space when metrics enabled [HZ-2074] #23492

Closed
sumnerib opened this issue Jan 26, 2023 · 6 comments · Fixed by #24225

@sumnerib
Contributor

Running SQL soak tests in hazelcast-jet-ansible-tests with the default environment defined in hazelcast-jet-ansible results in one or more member processes being OOM-killed by the OS.

From system logs:

Jan 16 23:18:43 ip-10-0-0-156 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=java,pid=16797,uid=1000
Jan 16 23:18:43 ip-10-0-0-156 kernel: Out of memory: Killed process 16797 (java) total-vm:11187400kB, anon-rss:7595924kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:18880kB oom_score_adj:0
Jan 16 23:18:44 ip-10-0-0-156 hazelcast-jetd: /usr/local/sbin/hazelcast-jetd: line 3: 16797 Killed                  /home/ec2-user/java/bin/java -Xms4G -Xmx4G -XX:MaxMetaspaceSize=256M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/logs/memory.dump -Xloggc:/home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/logs/hazelcast-jet.gc.log -verbose:gc -javaagent:/home/ec2-user/ansible/jolokia-jvm-1.3.7-agent.jar -Dlog4j.configurationFile=file:/home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/config/log4j2.properties -Dhazelcast.jmx=true -Dhazelcast.config=/home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/config/hazelcast.yaml -cp /home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/lib/*:/home/ec2-user/ansible/cache-api-1.1.1.jar com.hazelcast.core.server.HazelcastMemberStarter > /home/ec2-user/ansible/hazelcast-enterprise-5.3.0-SNAPSHOT/logs/hazelcast-jet.output.log 2>&1
Jan 16 23:18:44 ip-10-0-0-156 systemd: hazelcast-jetd.service: main process exited, code=exited, status=137/n/a

No Java OOMs or other errors are reported in the member logs before they are killed, and JVM heap usage for each member stays around 50-60% for the entire test run. The heap is capped at 4 GB, yet the log above shows the RSS of the member's java process at ~7.6 GB, while total physical memory of the instance is 8 GB. pmap data for the member process indicates that most of the usage outside the heap belongs to MALLOC_ARENA:

Memory mappings:
         JAVA_HEAP count=1     reserved=5242880    rss=4206264   
       MAPPED_FILE count=25    reserved=173820     rss=11228     
           UNKNOWN count=218   reserved=928160     rss=593140    
      MALLOC_ARENA count=68    reserved=4456448    rss=2495704
  UNKNOWN_SEGMENT2 count=16    reserved=32768      rss=32760     
SHARED_MAPPED_FILE count=73    reserved=25484      rss=332       
  UNKNOWN_SEGMENT1 count=14    reserved=43064      rss=30008     
         CODE_HEAP count=1     reserved=245760     rss=69000     

stats: vsz=11148384 kib, rss=7438436 kib

(Aggregated pmap data produced using https://github.com/bric3/java-pmap-inspector.)

Running the soak tests with async-profiler set to sample malloc or mprotect syscalls produces the following flamegraph:
[Flame graph: mprotect sampling, metrics enabled]

This shows a significant number of native allocations coming from Jet job metrics collection/publishing (via java.util.zip.Deflater). Failing to close Inflater/Deflater instances is a known cause of Java native memory leaks (see https://medium.com/swlh/native-memory-the-silent-jvm-killer-595913cba8e7). Investigating JobMetricsPublisher and MetricsCompressor (which holds Deflaters as members) shows that, when metrics are published for a job, old MetricsCompressors are released; however, the Deflaters they hold are not explicitly end()ed. This means their native resources are not released until their respective finalize() methods run. While finalization of these objects does release the resources, the finalization thread that runs as part of Java GC has a low priority, and problems occur when objects are enqueued for finalization faster than the finalizer thread can process them. We can check for this problem by examining the heap and looking for Deflater instances:

[Heap dump screenshot: live Deflater instance counts]

The above shows that most of the live Deflater objects are waiting on finalization. Taking another heap dump later in the run shows that the number of Deflaters waiting to be finalized grows over time:

[Heap dump screenshot: Deflater instances pending finalization, later in the run]

Disabling Hazelcast metrics is a workaround for this issue: the members are no longer OOM-killed during the soak tests and member process RSS doesn't go beyond 4.7 GB. This issue is probably only visible when a lot of Jet jobs are created, which is the case with the SQL soak tests.
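
To make the leak mechanism concrete, here is a minimal, hypothetical sketch (not Hazelcast's actual MetricsCompressor code): the leaky variant drops the Deflater and leaves its native zlib buffers to the finalizer, while the fixed variant calls end() in a finally block so the native memory is released deterministically.

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

class DeflaterLeakSketch {

    // Leaky pattern: the Deflater is simply dropped after use. Its native zlib
    // buffers stay allocated until the finalizer thread eventually runs
    // finalize() on it, which can lag far behind the allocation rate.
    static byte[] compressLeaky(byte[] blob) {
        Deflater deflater = new Deflater();
        deflater.setInput(blob);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    // Fixed pattern: end() releases the native memory as soon as we are done.
    static byte[] compressAndEnd(byte[] blob) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(blob);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                out.write(buf, 0, deflater.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }
}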

sumnerib added the Type: Defect, Source: Internal, and Module: Jet labels on Jan 26, 2023
sumnerib added this to the 5.3 Backlog milestone on Jan 26, 2023
sumnerib changed the title from "Memory leak in native space when metrics enabled [HZ-1969]" to "Memory leak in native space when metrics enabled" on Jan 26, 2023
degerhz modified the milestones: 5.3 Backlog → 5.3.0 on Feb 22, 2023
github-actions bot changed the title from "Memory leak in native space when metrics enabled" to "Memory leak in native space when metrics enabled [HZ-2074]" on Feb 22, 2023
@github-actions
Contributor

Internal Jira issue: HZ-2074

@k-jamroz
Contributor

k-jamroz commented Feb 22, 2023

Maybe this is also related to failures like https://jenkins.hazelcast.com/job/Hazelcast-EE-pr-builder/7090/ in HDMapMemoryLeakStressTest: com.hazelcast.memory.NativeOutOfMemoryError: Data allocation request cannot be satisfied! Not enough contiguous memory available! Cannot allocate 3647 bytes! [MaxNative: 128 MB, CommittedNative: 128 MB, UsedNative: 128 MB, FreePhysical: 587 GB]

@k-jamroz
Contributor

k-jamroz commented Apr 6, 2023

JobMetricsPublisher uses compressor.getBlobAndReset(), which ends the Deflater and creates a new one for the next metric collection. But I can see another problem: JobMetricsPublisher has an executionIdToCompressor map. It removes empty compressors, but without ending the Deflater (the one created by getBlobAndReset()).
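
A sketch of the fix shape for that removal path (class, field, and method names here are illustrative, not the actual Hazelcast implementation, which keeps MetricsCompressor instances rather than raw Deflaters in the map): end the Deflater when the compressor is evicted instead of leaving it to the finalizer.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.zip.Deflater;

class JobMetricsPublisherSketch {

    private final Map<Long, Deflater> executionIdToCompressor = new ConcurrentHashMap<>();

    void onCompressorEmpty(long executionId) {
        // Before the fix: remove(...) silently dropped the compressor, leaving
        // the Deflater's native buffers to the finalizer.
        Deflater removed = executionIdToCompressor.remove(executionId);
        if (removed != null) {
            // After the fix: release the native zlib memory deterministically.
            removed.end();
        }
    }
}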

@k-jamroz
Contributor

k-jamroz commented Apr 6, 2023

This is related to #17145

@k-jamroz
Contributor

k-jamroz commented Apr 6, 2023

There were also other places where we created short-lived Deflaters; explicit closing has been implemented for those.
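
As an illustration of explicit closing for a short-lived Deflater (a hedged sketch, not the actual code changed in the linked PR): when a Deflater is supplied to a DeflaterOutputStream by the caller, closing the stream does not end() it, so the caller still has to release it explicitly.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

class ShortLivedDeflaterSketch {

    static byte[] compressOnce(byte[] data) throws IOException {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // Closing the stream finishes the compressed output, but because the
            // Deflater was passed in by us, close() does not end() it.
            try (DeflaterOutputStream out = new DeflaterOutputStream(bytes, deflater)) {
                out.write(data);
            }
            return bytes.toByteArray();
        } finally {
            // Explicit closing: free the native memory instead of relying on finalization.
            deflater.end();
        }
    }
}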

k-jamroz added a commit that referenced this issue Apr 18, 2023
…24225)

Free native memory used by MetricsCompressors deterministically when they are disposed of, instead of relying on the finalizer; the finalizer may be too slow in a loaded system.

Fixes #23492 [HZ-2074]