
High Kernel memory usage #96892

Closed
mitsos1os opened this issue Nov 26, 2020 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@mitsos1os

What happened:
Cluster node is using an excessive amount of Noncache kernel memory after deployment of pods, leading to memory starvation problems in the node.
Some command outputs:

free -m:

              total        used        free      shared  buff/cache   available
Mem:           3895        3470         130           3         294         204
Swap:             0           0           0

This currently shows that my actual used memory (neither cache nor reclaimable) is around 3.4 GB, consistent with used = total - free - buff/cache from the output above (3895 - 130 - 294 ≈ 3471 MB).

Also the output of sudo smem -twk:

Area                           Used      Cache   Noncache 
firmware/hardware                 0          0          0 
kernel image                      0          0          0 
kernel dynamic memory          1.5G     184.1M       1.3G 
userspace memory               2.2G     111.1M       2.1G 
free memory                  125.5M     125.5M          0 
----------------------------------------------------------
                               3.8G     420.7M       3.4G

matches the output of free in the following way:

  • used column in free = smem kernel Noncache + userspace Noncache = 3.4 GB
  • buff/cache column in free = smem kernel Cache + userspace Cache = 294 MB

kubectl top node also matches the userspace memory reported by smem (around 2.2 GB), as do the totals from top and ps aux for the running processes.
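For reference, a rough way to reproduce these cross-checks on the node (a sketch; <node-name> is a placeholder, and the totals only match approximately because of shared pages):

# total RSS of all processes, to compare with smem's userspace figure
ps aux | awk 'NR>1 {rss += $6} END {printf "ps RSS total: %.1f MB\n", rss/1024}'

# per-area totals (the table shown above)
sudo smem -twk

# kubelet's view of node memory usage
kubectl top node <node-name>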

However, my /proc/meminfo shows:

MemTotal:        3989436 kB
MemFree:          133272 kB
MemAvailable:     209416 kB
Buffers:           10472 kB
Cached:           255628 kB
SwapCached:            0 kB
Active:          2340712 kB
Inactive:          80612 kB
Active(anon):    2156712 kB
Inactive(anon):     1752 kB
Active(file):     184000 kB
Inactive(file):    78860 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:              1404 kB
Writeback:             0 kB
AnonPages:       2155264 kB
Mapped:           111500 kB
Shmem:              3220 kB
Slab:             121856 kB
SReclaimable:      36260 kB
SUnreclaim:        85596 kB
KernelStack:       17440 kB
PageTables:        32972 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1994716 kB
Committed_AS:    8704948 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      518120 kB
DirectMap2M:     3614720 kB
DirectMap1G:           0 kB

shows total kernel memory usage of Slab + SReclaimable + SUnreclaim ≈ 238 MB, which is nowhere near the 1.3 GB shown in smem and reflected in the free report.

So where is the extra kernel memory being spent?
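For reference, the ~238 MB figure can be reproduced directly from the /proc/meminfo dump above (a sketch; note that Slab already includes SReclaimable and SUnreclaim, so if anything this overcounts, and it is still far below 1.3 GB):

# sum the slab-related fields (values are in kB) and convert to MB
awk '/^(Slab|SReclaimable|SUnreclaim):/ {sum += $2} END {printf "%.0f MB\n", sum/1024}' /proc/meminfo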

What you expected to happen:
Stable non-cache kernel memory usage, even some time after deployment.
Is there a way to track what is using the kernel memory?
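For reference, a few standard places to look for kernel memory consumers (a sketch, not specific to this cluster; the cgroup path assumes cgroup v1 with kernel memory accounting enabled and the usual kubelet "kubepods" hierarchy):

# largest slab caches, sorted by cache size
sudo slabtop -o -s c | head -n 20

# vmalloc allocations (VmallocUsed is not populated on 4.x kernels, so sum the per-allocation list)
sudo awk '{sum += $2} END {printf "vmalloc: %.0f MB\n", sum/1024/1024}' /proc/vmallocinfo

# kernel memory charged to each child cgroup under kubepods
grep -H . /sys/fs/cgroup/memory/kubepods/*/memory.kmem.usage_in_bytes | sort -t: -k2 -rn | head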

How to reproduce it (as minimally and precisely as possible):
Deploy the services; after a while, kernel memory usage increases to the same level.

Anything else we need to know?:
When draining the node, kernel memory usage returns to normal, but as soon as pods are re-deployed it goes up again!
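Draining and un-cordoning the node to observe this can be done with the standard kubectl commands (a sketch; <node-name> is a placeholder, and newer kubectl renames --delete-local-data to --delete-emptydir-data):

kubectl drain <node-name> --ignore-daemonsets --delete-local-data
# ...watch kernel Noncache drop in `sudo smem -twk`, then let the pods come back...
kubectl uncordon <node-name>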

Environment:

  • Kubernetes version (use kubectl version): 1.16.7
  • Cloud provider or hardware configuration: AWS EC2 t3.medium, 4 GB RAM
  • OS (e.g: cat /etc/os-release): Debian GNU/Linux 9 (stretch)
  • Kernel (e.g. uname -a): Linux 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
  • Install tools: kops
  • Other: pods deployment through Helm
@mitsos1os mitsos1os added the kind/bug Categorizes issue or PR as related to a bug. label Nov 26, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 26, 2020
@k8s-ci-robot
Contributor

@mitsos1os: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 26, 2020
@neolit123
Member

/sig node
for triage

Kubernetes version (use kubectl version): 1.16.7

note this version is no longer supported. you need to upgrade to 1.18.x soon.

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 30, 2020
@mitsos1os
Author

@neolit123 I am aware of that, but since upgrading would require extra effort to integrate breaking changes, we were hoping to resolve this first, since the behavior seems quite problematic.

@mitsos1os
Author

It actually ended up being an issue with some in-app logging going directly to the FluentD DaemonSet.
You can see the issue here: fluent/fluentd#3202
Closing this as not Kubernetes related.
