
High Kernel memory usage #96892

Closed
mitsos1os opened this issue Nov 26, 2020 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@mitsos1os

What happened:
Cluster node is using an excessive amount of Noncache kernel memory after deployment of pods, leading to memory starvation problems in the node.
Some command outputs:

free -m:

              total        used        free      shared  buff/cache   available
Mem:           3895        3470         130           3         294         204
Swap:             0           0           0

This currently shows that my actual used memory (neither cache nor reclaimable) is around 3.4 GB, consistent with used = total - free - buff/cache from the output above (3895 - 130 - 294 ≈ 3471 MB).

Also the output of sudo smem -twk:

Area                           Used      Cache   Noncache 
firmware/hardware                 0          0          0 
kernel image                      0          0          0 
kernel dynamic memory          1.5G     184.1M       1.3G 
userspace memory               2.2G     111.1M       2.1G 
free memory                  125.5M     125.5M          0 
----------------------------------------------------------
                               3.8G     420.7M       3.4G

matches the output of free in the following way:

  • used column in free = smem kernel Noncache + userspace Noncache = 3.4 GB
  • buff/cache column in free = smem kernel Cache + userspace Cache = 294 MB

kubectl top node also matches the userspace memory reported by smem (around 2.2 GB), as do the totals from top and ps aux for the running processes.
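For reference, a rough way to reproduce these cross-checks on the node (a sketch; <node-name> is a placeholder, and the totals only match approximately because of shared pages):

# total RSS of all processes, to compare with smem's userspace figure
ps aux | awk 'NR>1 {rss += $6} END {printf "ps RSS total: %.1f MB\n", rss/1024}'

# per-area totals (the table shown above)
sudo smem -twk

# kubelet's view of node memory usage
kubectl top node <node-name>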

However, my /proc/meminfo shows:

MemTotal:        3989436 kB
MemFree:          133272 kB
MemAvailable:     209416 kB
Buffers:           10472 kB
Cached:           255628 kB
SwapCached:            0 kB
Active:          2340712 kB
Inactive:          80612 kB
Active(anon):    2156712 kB
Inactive(anon):     1752 kB
Active(file):     184000 kB
Inactive(file):    78860 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:              1404 kB
Writeback:             0 kB
AnonPages:       2155264 kB
Mapped:           111500 kB
Shmem:              3220 kB
Slab:             121856 kB
SReclaimable:      36260 kB
SUnreclaim:        85596 kB
KernelStack:       17440 kB
PageTables:        32972 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     1994716 kB
Committed_AS:    8704948 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      518120 kB
DirectMap2M:     3614720 kB
DirectMap1G:           0 kB

shows total kernel memory usage of Slab + SReclaimable + SUnreclaim ≈ 238 MB, which is nowhere near the 1.3 GB shown in smem and reflected in the free report.

So where is the extra kernel memory being spent?
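For reference, the ~238 MB figure can be reproduced directly from the /proc/meminfo dump above (a sketch; note that Slab already includes SReclaimable and SUnreclaim, so if anything this overcounts, and it is still far below 1.3 GB):

# sum the slab-related fields (values are in kB) and convert to MB
awk '/^(Slab|SReclaimable|SUnreclaim):/ {sum += $2} END {printf "%.0f MB\n", sum/1024}' /proc/meminfo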

What you expected to happen:
Stable non-cache kernel memory usage, even some time after deployment.
Is there a way to track what is using the kernel memory?
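For reference, a few standard places to look for kernel memory consumers (a sketch, not specific to this cluster; the cgroup path assumes cgroup v1 with kernel memory accounting enabled and the usual kubelet "kubepods" hierarchy):

# largest slab caches, sorted by cache size
sudo slabtop -o -s c | head -n 20

# vmalloc allocations (VmallocUsed is not populated on 4.x kernels, so sum the per-allocation list)
sudo awk '{sum += $2} END {printf "vmalloc: %.0f MB\n", sum/1024/1024}' /proc/vmallocinfo

# kernel memory charged to each child cgroup under kubepods
grep -H . /sys/fs/cgroup/memory/kubepods/*/memory.kmem.usage_in_bytes | sort -t: -k2 -rn | head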

How to reproduce it (as minimally and precisely as possible):
Deploy the services; after a while, kernel memory usage increases to the same level.

Anything else we need to know?:
When draining the node, kernel memory usage returns to normal, but as soon as pods are re-deployed it goes up again!
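Draining and un-cordoning the node to observe this can be done with the standard kubectl commands (a sketch; <node-name> is a placeholder, and newer kubectl renames --delete-local-data to --delete-emptydir-data):

kubectl drain <node-name> --ignore-daemonsets --delete-local-data
# ...watch kernel Noncache drop in `sudo smem -twk`, then let the pods come back...
kubectl uncordon <node-name>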

Environment:

  • Kubernetes version (use kubectl version): 1.16.7
  • Cloud provider or hardware configuration: AWS EC2 t3.medium, 4 GB RAM
  • OS (e.g: cat /etc/os-release): Debian GNU/Linux 9 (stretch)
  • Kernel (e.g. uname -a): Linux 4.9.0-7-amd64 #1 SMP Debian 4.9.110-3+deb9u2 (2018-08-13) x86_64 GNU/Linux
  • Install tools: kops
  • Other: pods deployment through Helm
@mitsos1os mitsos1os added the kind/bug Categorizes issue or PR as related to a bug. label Nov 26, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 26, 2020
@k8s-ci-robot
Contributor

@mitsos1os: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 26, 2020
@neolit123
Member

/sig node
for triage

Kubernetes version (use kubectl version): 1.16.7

note this version is no longer supported. you need to upgrade to 1.18.x soon.

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 30, 2020
@mitsos1os
Author

@neolit123 I am aware of that, but since upgrading would require extra effort to integrate breaking changes, we were hoping to resolve this first, since the behavior seems quite problematic.

@mitsos1os
Author

It actually ended up being an issue with some in-app logging going directly to the FluentD DaemonSet.
You can see the issue here: fluent/fluentd#3202
Closing this as not Kubernetes related.
