Matthias Schiffer edited this page Mar 7, 2018 · 7 revisions

Introduction

Embedded devices typically have very little flash and RAM. This page helps with debugging RAM issues and tracks their status.

This page was motivated by ticket #1243 in particular.

Helpful knowledge, links and articles

Collect external resources here that might help others understand and debug OOM issues.

  • [Add some links here, explaining userspace vs. kernelspace allocations, VIRT/RES/SHR, pages, slabs, kmalloc(), kmem_cache_alloc(), vmalloc(), /proc/slabinfo, /proc/vmallocinfo, /proc/vmstat, echo 'm' > /proc/sysrq-trigger,...]

How to Debug

  • Build Gluon with
  • After an OOM and the subsequent reboot, retrieve the crash report from /sys/kernel/debug/crashlog
  • Try to find a reproducible, isolated setup!
  • Observe:
    • /proc/slabinfo
    • /proc/vmallocinfo
    • /proc/vmstat
    • echo 'm' > /proc/sysrq-trigger; dmesg
    • /sys/kernel/debug/ieee80211/phy0/aqm
  • Helpful tools:
    • Traffic monitoring: tcpdump, wireshark, etc.
    • Traffic generators: mausezahn, iperf, tcpreplay, etc.
  • ...
  • Profit
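The files listed under "Observe" are most useful when captured repeatedly, so that snapshots taken before a crash can be compared. The following is a minimal sketch of such a capture helper in POSIX sh; the function name snapshot_mem and the output layout are hypothetical choices, not part of Gluon.

```shell
# Copy a set of memory-related files into one directory.
# Usage: snapshot_mem OUTDIR FILE...
snapshot_mem() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for f in "$@"; do
        # Flatten the path into a file name, e.g. /proc/vmstat -> proc_vmstat
        name=$(printf '%s' "$f" | sed 's|^/||; s|/|_|g')
        # Files may be missing (e.g. debugfs not mounted); drop empty leftovers.
        cat "$f" > "$outdir/$name" 2>/dev/null || rm -f "$outdir/$name"
    done
}
```

A call could look like: snapshot_mem "/tmp/memdbg/$(date +%s)" /proc/slabinfo /proc/vmallocinfo /proc/vmstat /sys/kernel/debug/ieee80211/phy0/aqm. The /tmp/memdbg path is only an example; snapshots stored on the node themselves consume memory, so copying them off the device may be preferable.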

Current Issues, Observations and Status

Out-of-memory due to kernel allocations

Status: Unsolved

Issue: OOM due to allocations in kernelspace.

Related tickets: #1243, #1306, #1197

How to trigger: In networks with a high number of nodes?


Observations so far:

  • First observed after the first Gluon releases based on LEDE
  • Nothing suspicious in /proc/slabinfo on crash
    • Seems to rule out the Linux bridge and batman-adv as potential causes
  • Running 'echo fq_memory_limit 200 > /sys/kernel/debug/ieee80211/phy0/aqm' appears to have had a positive effect (unconfirmed)
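When checking /proc/slabinfo for suspicious caches, it can help to rank them by size. The sketch below assumes the slabinfo version 2.1 column layout (name, active_objs, num_objs, objsize, ...) and approximates each cache's footprint as num_objs * objsize, which ignores per-slab overhead. The function name slab_top is hypothetical.

```shell
# Rank slab caches by approximate memory use (num_objs * objsize).
# Reads slabinfo-format text on stdin; prints "KiB name", largest first.
# Optional argument: number of lines to show (default 10).
slab_top() {
    awk '
        # Skip the two header lines ("slabinfo - version" and "# name ...").
        /^slabinfo/ || /^#/ { next }
        { printf "%d %s\n", ($3 * $4) / 1024, $1 }
    ' | sort -rn | head -n "${1:-10}"
}
```

On a node this could be run as slab_top 5 < /proc/slabinfo, or applied to a saved copy of the file on another machine.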

Tasks:

  • Finding a setup to reproduce the issue in an isolated configuration.

NeoRaider's test wishlist:

  1. Observe reported memory usage in /sys/kernel/debug/ieee80211/phy0/aqm before crash
  2. Check if OOM is reproducible with all WLAN disabled (both mesh and AP), but active VPN (and possibly wired mesh)
  3. On a node with WLAN mesh only:
    1. Check if crash is reproducible with disabled AP
    2. Test different values for fq_memory_limit, which can be set in /etc/hotplug.d/ieee80211/01-gluon-core-codel-memusage. What is the highest value that reliably fixes the crashes?
    3. Like 2., but with disabled AP

All of these tests should be done on both the master and the next branch of Gluon.
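For trying different fq_memory_limit values at runtime, the echo command from the observations above can be wrapped so that it covers every phy. This is a sketch only: the function name set_fq_memory_limit and its optional base-directory argument are hypothetical, and it does not reproduce the contents of the hotplug script mentioned in the wishlist.

```shell
# Write an fq_memory_limit value to the aqm file of every phy below BASE.
# BASE defaults to the debugfs path used on this page.
# Usage: set_fq_memory_limit LIMIT [BASE]
set_fq_memory_limit() {
    limit=$1
    base=${2:-/sys/kernel/debug/ieee80211}
    for aqm in "$base"/phy*/aqm; do
        # Skip if the glob matched nothing or the driver exposes no aqm file.
        [ -e "$aqm" ] || continue
        echo "fq_memory_limit $limit" > "$aqm"
    done
}
```

For example, set_fq_memory_limit 200 applies the value from the observation above to all radios. A value set this way is assumed not to survive a reboot, so each test run would need to set it again.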

OOM on IP Fragments

Status: Solved (unreleased)

Issue: The IPv4 and IPv6 fragment reassembly buffers may hold up to 8 MB of packets in total (4 MB per address family).

Related tickets: -

How to trigger: An OOM was easily triggered by running iperf3 on a node when the packets were fragmented ($ iperf3 -l 1500). It might also be triggerable by external traffic alone, without any extra tools on the node?


Mitigated in latest master and v2017.1.x. Additional firewall rules might be considered, too.
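To check how much memory the fragment buffers may use on a given node, the configured limits can be read back. The sketch below assumes the buffers in question are the kernel's reassembly queues, whose upper limits are exposed as the sysctls net.ipv4.ipfrag_high_thresh and net.ipv6.ip6frag_high_thresh. Whether the mitigation works by changing these values is not stated on this page. The function name frag_high_thresh_total is hypothetical.

```shell
# Sum the high thresholds (in bytes) of the IPv4 and IPv6 fragment queues.
# ROOT can point at a copy of /proc/sys/net, e.g. for offline inspection.
# Usage: frag_high_thresh_total [ROOT]
frag_high_thresh_total() {
    root=${1:-/proc/sys/net}
    total=0
    for f in "$root/ipv4/ipfrag_high_thresh" "$root/ipv6/ip6frag_high_thresh"; do
        # IPv6 may be unavailable; skip files that cannot be read.
        [ -r "$f" ] || continue
        total=$((total + $(cat "$f")))
    done
    echo "$total"
}
```

With 4 MB (4194304 bytes) configured per address family, the function prints 8388608, which corresponds to the 8 MB total described in the issue.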

Archived Issues

OOM on accessing transtable_global via debugfs

Status: Solved

Issue: In setups involving ~2500 client devices, nodes crashed frequently. The cause was alfred and respondd reading the global batman-adv translation table via debugfs, which triggered high-order memory allocations due to the large table size.

Related tickets: #753

How to trigger: Spawn >2500 client devices, then 'cat /sys/kernel/debug/batman_adv/bat0/transtable_global'


The issue was fixed by implementing a netlink-based interface in batman-adv and using it in alfred and respondd to access the global batman-adv translation table.
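High-order allocations like the ones described here fail when memory is fragmented, even if enough small free blocks remain. One way to see this is /proc/buddyinfo, which is not mentioned elsewhere on this page and is suggested here only as a possible aid. The sketch assumes the usual layout, where each line reads "Node N, zone NAME" followed by the free block counts for order 0, 1, 2 and so on. The function name free_high_order is hypothetical.

```shell
# Count free blocks of at least order MIN_ORDER for each zone.
# Reads /proc/buddyinfo-format text on stdin; prints "zone count".
# Usage: free_high_order [MIN_ORDER]   (default 3)
free_high_order() {
    awk -v min="${1:-3}" '
        {
            n = 0
            # Field 5 holds order 0, field 6 order 1, and so on.
            for (i = 5 + min; i <= NF; i++) n += $i
            print $4, n
        }
    '
}
```

Running free_high_order 3 < /proc/buddyinfo shortly before a crash would show whether blocks of eight or more contiguous pages were still available.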
