Broken DNS over WAN / missing dnsmasq after quick link downs+ups #2950

Open
T-X opened this issue Aug 28, 2023 · 9 comments
Labels
0. type: bug This is a bug

@T-X
Contributor

T-X commented Aug 28, 2023

Bug report

What is the problem?

  • What is not working as expected?
    • fastd is unable to resolve hostnames and therefore unable to connect
    • "gluon-wan ping google.com" is not working, while "ping 141.1.1.1" works fine over WAN
  • How is it misbehaving?
    • The dnsmasq instance that is supposed to listen on port 54 is not running:
root@nml-pa2200:~# netstat -tulpen | grep :54
udp        0      0 :::546                  :::*                                7382/odhcp6c
udp        0      0 :::546                  :::*                                1971/odhcp6c
root@nml-pa2200:~# ps | grep dnsmasq
  886 root      2396 S    {dnsmasq} /sbin/ujail -t 5 -n dnsmasq -u -l -r /bin/ubus -r /etc/TZ -r /etc/dnsmasq.conf -r /etc/ethers -r  
  939 dnsmasq   1280 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
28289 root      1144 S    grep dnsmasq
root@nml-pa2200:~#
  • When did the problem first start showing up?
  • What were you doing when you first noticed the problem?
    • Using the Freifunk network
  • On which devices (vendor, model and revision) is it misbehaving?
  • Does the issue appear on multiple devices or targets?
    • untested
  • Workarounds?
    • the device works fine again after rebooting it, netstat then looks ok:
root@nml-pa2200:~# netstat -tulpen | grep :54
tcp        0      0 0.0.0.0:54              0.0.0.0:*               LISTEN      2448/dnsmasq
tcp        0      0 :::54                   :::*                    LISTEN      2448/dnsmasq
udp        0      0 0.0.0.0:54              0.0.0.0:*                           2448/dnsmasq
udp        0      0 :::546                  :::*                                2349/odhcp6c
udp        0      0 :::546                  :::*                                2010/odhcp6c
udp        0      0 :::54                   :::*                                2448/dnsmasq
root@nml-pa2200:~# ps | grep dnsmasq
  886 root      2396 S    {dnsmasq} /sbin/ujail -t 5 -n dnsmasq -u -l -r /bin/ubus -r /etc/TZ -r /etc/dnsmasq.conf -r /etc/ethers -r 
  932 dnsmasq   1268 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
 2544 root      1268 S    /usr/sbin/dnsmasq -x /var/run/gluon-wan-dnsmasq.pid -u root -i lo -p 54 -h -r /var/gluon/wan-dnsmasq/resolv
 4538 root      1144 S    grep dnsmasq
root@nml-pa2200:~# 
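
Note that `grep :54` also matches port 546 (the DHCPv6 client port used by odhcp6c), so the broken and the working state are easy to confuse at a glance. A minimal sketch of a stricter check, assuming the netstat column layout shown above; the function name `listens_on_port` is made up for illustration and is not part of Gluon:

```shell
# Read `netstat -tulpen` output on stdin and succeed only if some local
# address ends in exactly ":<port>" (so port 54 does not match port 546).
listens_on_port() {
	port="$1"
	awk -v p="$port" '
		# Column 4 is the local address, e.g. "0.0.0.0:54" or ":::546".
		$4 ~ (":" p "$") { found = 1 }
		END { exit !found }
	'
}

# Possible usage on the node:
#   netstat -tulpen | listens_on_port 54 || echo "WAN dnsmasq is not listening"
```

The check only looks at listening sockets; it does not tell whether the dnsmasq instance behind the socket can actually resolve names.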

What is the expected behaviour?

  • How do you think it should work instead?
    • After a link goes down and comes back up, not only routing over WAN but also DNS over WAN should work again.
  • Did it work like that before?
    • unknown

Gluon Version:

  • v2022.1

Site Configuration:

root@nml-pa2200:~# cat /lib/gluon/release 
0.16.3~exp20220914
root@nml-pa2200:~#

Custom patches:

  • none

update1:

  • added "ps" output
@T-X T-X added the 0. type: bug This is a bug label Aug 28, 2023
@dariks

dariks commented Sep 5, 2023

We are seeing this very same behaviour with our nodes too. They are running on Gluon v2022.1.2 and Gluon v2022.1.3 with Tunneldigger L2TP Mesh-VPN enabled.
For us it seems to happen when a node without mesh neighbors is on a WAN connection with Mesh-VPN enabled and the WAN connection is unavailable for a longer time. We see this issue when an ISP router is updating and/or when there is an outage at that provider.
Our nodes reboot every night and this seems to fix the issue every time.

What is even worse: in our testing, Gluon v2023.1 is in this state even on a fresh boot.

@blocktrron
Member

Can you test if #2969 mitigates the issue you are seeing?

@T-X
Contributor Author

T-X commented Sep 10, 2023

A little update and more background info:

The issue was caused by power/undervoltage problems of a Mikrotik RB260GSP switch (DC-in range: 11V-30V), which rebooted frequently whenever a cooling box's compressor on the same 12V power supply started. The RB260GSP is directly connected to the Plasmacloud PA2200 running Gluon with two LAN cables.

The RB260GSP is now on a separate 24V power supply. But I can still reproduce the Gluon issue by just disconnecting the LAN cable to the PA2200's WAN port. After about 1-5 minutes dnsmasq reproducibly segfaults:

root@nml-pa2200:~# /etc/init.d/gluon-wan-dnsmasq stop
root@nml-pa2200:~# /usr/sbin/dnsmasq -d -x /var/run/gluon-wan-dnsmasq.pid -u root -i lo -p 54 -h -r /var/gluon/wan-dnsmasq/resolv.conf --log-facility=/tmp/gluon-wan-dnsmasq.log --log-debug
dnsmasq: started, version 2.86 cachesize 150
dnsmasq: compile time options: IPv6 GNU-getopt no-DBus UBus no-i18n no-IDN DHCP no-DHCPv6 no-Lua TFTP no-conntrack no-ipset no-auth no-cryptohash no-DNSSEC no-ID loop-detect inotify dumpfile
dnsmasq: reading /var/gluon/wan-dnsmasq/resolv.conf
dnsmasq: using nameserver fe80::1%br-wan#53
dnsmasq: using nameserver 192.168.2.1#53
dnsmasq: cleared cache
dnsmasq: no servers found in /var/gluon/wan-dnsmasq/resolv.conf, will retry
Segmentation fault
root@nml-pa2200:~# date 
Sun Sep 10 01:55:21 CEST 2023
root@nml-pa2200:~# cat /tmp/gluon-wan-dnsmasq.log
Sep 10 01:51:26 dnsmasq[6583]: started, version 2.86 cachesize 150
Sep 10 01:51:26 dnsmasq[6583]: compile time options: IPv6 GNU-getopt no-DBus UBus no-i18n no-IDN DHCP no-DHCPv6 no-Lua TFTP no-conntrac
Sep 10 01:51:26 dnsmasq[6583]: reading /var/gluon/wan-dnsmasq/resolv.conf
Sep 10 01:51:26 dnsmasq[6583]: using nameserver fe80::1%br-wan#53
Sep 10 01:51:26 dnsmasq[6583]: using nameserver 192.168.2.1#53
Sep 10 01:51:26 dnsmasq[6583]: cleared cache
Sep 10 01:51:38 dnsmasq[6583]: no servers found in /var/gluon/wan-dnsmasq/resolv.conf, will retry

Original content of /var/gluon/wan-dnsmasq/resolv.conf (saved before disconnecting the LAN cable; the file becomes a 0-byte file right after disconnecting):

root@nml-pa2200:~# cat /tmp/blabla-resolv.conf 
nameserver fe80::1%br-wan
nameserver 192.168.2.1
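
Since the file is emptied on link loss and dnsmasq then logs "no servers found", it can help to watch the number of configured nameservers while reproducing. A small sketch, assuming the usual `nameserver <address>` line format; `count_nameservers` is a hypothetical helper, not something Gluon ships:

```shell
# Print the number of "nameserver" entries in a resolv.conf-style file.
# A missing, unreadable or empty file counts as 0.
count_nameservers() {
	file="$1"
	[ -r "$file" ] || { echo 0; return 0; }
	# grep -c prints the count; it exits non-zero when the count is 0,
	# so "|| true" keeps the function's exit status at 0 in that case.
	grep -c '^nameserver[[:space:]]' "$file" || true
}

# Possible usage on the node (path taken from the log above):
#   count_nameservers /var/gluon/wan-dnsmasq/resolv.conf
```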

I also tried to reproduce the issue in an x86_64 qemu KVM instance. I can set the WAN port's carrier to NO-CARRIER via the qemu monitor's set_link command. However, dnsmasq does not seem to crash there. (Maybe it's necessary to have a DHCP server connected to the WAN side to hand out IP addresses, routes and DNS servers, which I didn't have in my qemu tests yet.)

@blocktrron #2969 does indeed seem to mitigate the issue: after dnsmasq segfaults, procd restarts it successfully.
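
For readers unfamiliar with why procd helps here: a service managed by procd can be marked for respawn, so a crashed daemon is started again automatically. The following is only a sketch of such an init script and not the actual change in #2969. The dnsmasq command line is copied from the reproduction above with `-d` replaced by `-k`; the start priority and the respawn values are assumptions.

```shell
# Sketch only: illustrates procd respawn, not the real gluon-wan-dnsmasq script.
# On OpenWrt, the first line of the file would be: #!/bin/sh /etc/rc.common

USE_PROCD=1
START=60   # assumed start priority

start_service() {
	procd_open_instance
	# -k keeps dnsmasq in the foreground so that procd can supervise it.
	procd_set_param command /usr/sbin/dnsmasq -k \
		-x /var/run/gluon-wan-dnsmasq.pid -u root -i lo -p 54 -h \
		-r /var/gluon/wan-dnsmasq/resolv.conf
	# respawn <threshold> <timeout> <retry>: these are the documented
	# defaults (3600 s threshold, 5 s delay before restart, 5 retries).
	procd_set_param respawn 3600 5 5
	procd_close_instance
}
```

Respawning hides the crash from users but leaves the segfault itself in place, so it is a mitigation rather than a fix.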

@T-X
Contributor Author

T-X commented Sep 10, 2023

Also, here's a core dump of dnsmasq from this device after the segfault. It is probably of little use without debug symbols, though (at least I can't get anything useful out of it with gdb right now, and it does not want to generate a backtrace for me): https://speicher.hamburg.freifunk.net/d/b6c6c1af9fb341cfbc32/

@neocturne
Member

@T-X Can you try to generate a backtrace with a build for which you have the unstripped dnsmasq and libc binaries?

@blocktrron
Member

@T-X Good to know, although it is not a proper fix by any means.

@T-X
Contributor Author

T-X commented Sep 11, 2023

Here's a core dump made with the unstripped dnsmasq (and hopefully with the unstripped libc; the toolchain uses musl):

https://speicher.hamburg.freifunk.net/d/92a69e30cc0d478b8a3a/

However, I still can't get a backtrace from it via gdb (but at least it's not complaining about missing register info anymore):

$ ./build_dir/toolchain-arm_cortex-a7+neon-vfpv4_gcc-11.2.0_musl_eabi/gdb-11.2/gdb/gdb --exec=/home/linus/dev-priv/gluon-misc/dnsmasq-unstripped --core=/home/linus/dev-priv/gluon-misc/dnsmasq.1694403215.4419.11.core
GNU gdb (GDB) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-pc-linux-gnu --target=arm-openwrt-linux-muslgnueabi".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".

warning: Can't open file /root/dnsmasq during file-backed mapping note processing

warning: Can't open file /lib/libgcc_s.so.1 during file-backed mapping note processing

warning: Can't open file /lib/libubus.so.20220601 during file-backed mapping note processing

warning: Can't open file /lib/libubox.so.20220515 during file-backed mapping note processing

warning: Can't open file /lib/libc.so during file-backed mapping note processing

warning: Can't open file /tmp/TZ during file-backed mapping note processing

warning: core file may not match specified executable file.
[New LWP 4419]
Core was generated by `./dnsmasq -d -x /var/run/gluon-wan-dnsmasq.pid -u root -i lo -p 54 -h -r /var/g'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0050c790 in ?? ()
(gdb) bt
#0  0x0050c790 in ?? ()
#1  0x0050c870 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 

I had copied build_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/dnsmasq-nodhcpv6/dnsmasq-2.86/src/dnsmasq to pa2200:/root/dnsmasq and build_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/toolchain/ipkg-arm_cortex-a7_neon-vfpv4/libc/lib/libc.so to pa2200:/root/libs/libc.so. And then used "cd /root/; LD_LIBRARY_PATH=/root/libs ./dnsmasq ..." to call it.


Edit: that libc.so was stripped. The following core dump should be from a run with an unstripped one, but I have the same trouble getting a backtrace from gdb: https://speicher.hamburg.freifunk.net/d/de1e7a79a92c49b8acf6/
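
The "Can't open file ... during file-backed mapping note processing" warnings suggest that gdb could not find the target's shared libraries on the host. A sketch of one thing to try, assuming a local directory that mirrors the device's library layout (for example `<dir>/lib/libc.so`) with the unstripped files; the helper name `core_backtrace` and that directory are hypothetical, and whether this alone yields a usable backtrace here is untested:

```shell
# Run gdb non-interactively and print a full backtrace from a core dump.
# $1 gdb binary (the cross gdb from the toolchain)
# $2 unstripped executable
# $3 core file
# $4 directory laid out like the device's root filesystem
# Paths must not contain spaces, since they are embedded in gdb commands.
core_backtrace() {
	gdb="$1"; exe="$2"; core="$3"; sysroot="$4"
	# Set the sysroot before loading the core so that gdb resolves
	# /lib/libc.so etc. relative to it.
	"$gdb" -batch \
		-ex "set sysroot $sysroot" \
		-ex "core-file $core" \
		-ex "bt full" \
		"$exe"
}

# Possible usage (all paths are placeholders):
#   core_backtrace ./gdb ./dnsmasq-unstripped ./dnsmasq.core ./rootfs
```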

@blocktrron
Member

The files you've provided produce this backtrace for me:

(gdb) bt full
#0  0x0050c790 in order_servers.lto_priv ()
No symbol table info available.
#1  0x0050c870 in filter_servers ()
No symbol table info available.
#2  0x004f6340 in forward_query.lto_priv ()
No symbol table info available.
#3  0xbec02a30 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

@neocturne neocturne added this to the v2023.2 milestone Oct 10, 2023
@blocktrron blocktrron modified the milestones: v2023.2, v2024.1 Dec 10, 2023
@blocktrron
Member

Moving this to the next milestone, as a workaround using procd has been implemented.
