
Error when collecting sosreport from live environment: Could not enumerate network devices: [Errno 2] No such file or directory: '/mnt/sys/class/net' #3307

Open
pjmattingly opened this issue Jul 18, 2023 · 22 comments · Fixed by #3313

@pjmattingly

Scenario: Writing a KB article covering the case where a customer's system does not boot normally, but can be booted into a live environment. For troubleshooting, it is desirable to collect information from the host system while in the live environment.

The issue was observed when the live environment was booted, the host root partition was mounted on /mnt, and sos report was invoked in the following forms:

sudo sos report -a --all-logs --sysroot=/mnt --chroot=always --estimate-only
sudo sos report -a --all-logs --sysroot=/mnt --estimate-only
sudo sos report -a --all-logs --sysroot=/mnt --chroot=always
sudo sos report -a --all-logs --sysroot=/mnt

Which resulted in the error:

Could not enumerate network devices: [Errno 2] No such file or directory: '/mnt/sys/class/net'

Looking more closely at the /mnt/sys/class/net directory, it was found that the /mnt/sys directory was empty. This is to be expected when booting from a live environment.

Expected behaviour: I would expect sosreport either to report an error on finding /mnt/sys empty and exit, or to continue with some workaround (for example, falling back to the live-system kernel). Instead, when executing the above commands, sos hangs and will not exit unless killed.

Additional information:

Live environment:

uname -a
Linux ubuntu-server 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           393M  1.3M  392M   1% /run
/dev/sr0        1.9G  1.9G     0 100% /cdrom
/cow            2.0G  162M  1.8G   9% /
overlay         321M  321M     0 100% /media/filesystem
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           2.0G  4.7M  2.0G   1% /tmp
/dev/vda2        25G  6.8G   17G  30% /mnt
tmpfs           393M  4.0K  393M   1% /run/user/1001

Host environment:

uname -a
Linux kbtestmachine 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Host is a VM created through lib-virt.

@TurboTurtle
Member

I'm not sure what the ask is here?

Does sos run from that point on, and just not collect network device information? Does it spew out a traceback and exit?

You mention you'd expect it to exit but it's not clear what the behavior you're seeing is after the error. If it exits on that error, doesn't that match your expectation?

@pjmattingly
Author

Hello,

Does sos run from that point on, and just not collect network device information?

It hangs. It shows this message and doesn't proceed further:

Could not enumerate network devices: [Errno 2] No such file or directory: '/mnt/sys/class/net'

Does it spew out a traceback and exit?

No traceback is shown, and sos does not exit.

@TurboTurtle
Member

Ah, ok.

It looks like that is percolating up from _get_eth_devs(), where we're looking in sysroot via a wrapper for os.listdir(). There is an easy fix here of sticking that in a try block, but we could also fall back to not using sysroot before abandoning the inspection.

I'm working on something locally and hope to have a PR for testing/review before too long.

@TurboTurtle
Member

Although, that being said - I'm not sure why it is hanging on you.

Can you post the results (pastebin ideally) of sos report -vvv for this scenario?

@pjmattingly
Author

Hello,

I attempted the simplest command with -vvv; strangely, the output was not very large:

peter@ubuntu-server:~$ sudo sos report -vvv -a --all-logs --sysroot=/mnt
WARNING: tmp-dir is set to a tmpfs filesystem. This may increase memory pressure and cause instability on low memory systems, or when using --all-logs.

sosreport (version 4.4)

set sysroot to '/mnt' (cmdline)
Network devices not enumerated by nmcli. Will attempt to manually compile list of devices.
Could not enumerate network devices: [Errno 2] No such file or directory: '/mnt/sys/class/net'

Is that enough information to proceed? I can provide more information on my test setup if that would be helpful.

@TurboTurtle
Member

Wow, ok...that's not what I was expecting, but it does at least confirm it's within _get_eth_devs() I think.

I'm going to try and have a PR up later today for testing that should avoid the exception that gets trapped and prints that message - which in turn should hopefully break us out of this. I still don't know why it's hanging on you there, though.

@pjmattingly
Author

That sounds good. I do get alerts when this gets replies, so let me know if I can help. =/

TurboTurtle added a commit to TurboTurtle/sos that referenced this issue Jul 19, 2023
If sos is being used in a live environment to diagnose an issue, using
sysroot can cause the network device enumeration via /sys/class/net
crawling to fail. This will be the case for systems that do not use
`nmcli`.

When in a live environment, network devices will not be under
`/$sysroot/sys/class/net` but the "regular" path for the booted
environment. Similarly, if sos is being run in a container that is
properly configured, network devices will appear under `/sys/class/net`
and not (necessarily) under the sysroot path that mounts the host's
filesystem.

As such, disregard a configured sysroot when enumerating network devices
by crawling `/sys/class/net`, and trap any exceptions that may percolate
up from this in edge case environments.

Closes: sosreport#3307

Signed-off-by: Jake Hunsaker <jacob.r.hunsaker@gmail.com>
@TurboTurtle
Member

I've opened #3313 for this. Upon checking further, I don't believe there's an actual use case where using sysroot for this check would be valid, so I've removed it entirely; we now always check the "regular" /sys/class/net location for this kind of enumeration, with a catch added just in case.

Please give this a try, and let us know if this resolves your scenario. If so, we can likely include this in 4.5.6 which is closing tomorrow.

@pjmattingly
Author

Okay, I'll add that to my TODO list. Thank you. =)

@TurboTurtle TurboTurtle added this to the 4.6.0 milestone Jul 21, 2023
@pjmattingly
Author

Sorry for the delay.

I ran the PR; the error is gone, but sos still hangs.

Steps:

  1. Download https://github.com/TurboTurtle/sos/archive/dc190a9fde94f575e660e782831d922b89c00507.zip
  2. unzip, and run: sudo ./bin/sos report -a --all-logs --sysroot=/mnt
  3. output:
peter@ubuntu-server:~/sos-dc190a9fde94f575e660e782831d922b89c00507$ sudo ./bin/sos report -a --all-logs --sysroot=/mnt
WARNING: tmp-dir is set to a tmpfs filesystem. This may increase memory pressure and cause instability on low memory systems, or when using --all-logs.

sosreport (version 4.5.5)

Then, trying again with -vvv:

peter@ubuntu-server:~/sos-dc190a9fde94f575e660e782831d922b89c00507$ sudo ./bin/sos report -vvv -a --all-logs --sysroot=/mnt
WARNING: tmp-dir is set to a tmpfs filesystem. This may increase memory pressure and cause instability on low memory systems, or when using --all-logs.

sosreport (version 4.5.5)

set sysroot to '/mnt' (cmdline)
Network devices not enumerated by nmcli. Will attempt to manually compile list of devices.

Which also hangs.

Then I tried strace to see if I could spot a problem (see: https://pastebin.ubuntu.com/p/DkGwVqKgfC/ [note: this is set to expire in a month]):

root@ubuntu-server:/home/peter/sos-dc190a9fde94f575e660e782831d922b89c00507# strace ./bin/sos report -vvv -a --all-logs --sysroot=/mnt 2> ../strace_sos.txt

It starts to hang here:

futex(0x5643c96803c0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
wait4(2849, 0x7ffe2f675ddc, WNOHANG, NULL) = 0
pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=1000000}, NULL) = 0 (Timeout)

So it looks like it's not releasing a mutex properly? At any rate, I hope that's helpful.

Take care.

--P

@pmoravec
Contributor

With my limited knowledge of strace, I think a child process (one of four collecting data from plugins?) gets stuck:

clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f3d7a1e52d0) = 2849
..
wait4(2849, 0x7ffe2f675ddc, WNOHANG, NULL) = 0
pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=1000000}, NULL) = 0 (Timeout)
wait4(2849, 0x7ffe2f675ddc, WNOHANG, NULL) = 0
pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=2000000}, NULL) = 0 (Timeout)

We lack timestamps, but the Timeout suggests the cloned process (not traced by strace) started collecting some command under timeout (like timeout 300 abrt-cli status) and that hangs. Or maybe setting up plugins, e.g. generating the list of network namespaces (which also triggers similar commands under timeout), gets stuck? (Though IMHO that does not run under a child process.)

The hung sosreport should generate a directory /var/tmp/sos* with some tmp file there that would be renamed to sos_logs/sos.log at the end, with content starting like:

2023-07-28 08:49:52,074 DEBUG: set sysroot to '/' (default)
2023-07-28 08:49:52,694 INFO: [sos.report:setup] executing 'sos report -o kernel,qpid,abrt --batch --build'

Could you please provide us that file so we can understand the phase of the sosreport run where it got stuck? Or, ideally, re-run with a better strace:

strace -fttTxyC -o ../strace_sos.txt ./bin/sos report -vvv -a --all-logs --sysroot=/mnt

to get an strace with timestamps that also follows child processes (some strace options are redundant here but shouldn't harm), and provide the strace output and the temp file with the sos.log content?

@pjmattingly
Author

Okay, schedule permitting, I'll carry out those steps.

Thanks.

@pjmattingly
Author

pjmattingly commented Jul 28, 2023

As requested I ran the command and killed the process after letting it hang for a few moments:

root@ubuntu-server:/home/peter/sos-dc190a9fde94f575e660e782831d922b89c00507# strace -fttTxyC -o ../strace_sos.txt ./bin/sos report -vvv -a --all-logs --sysroot=/mnt
WARNING: tmp-dir is set to a tmpfs filesystem. This may increase memory pressure and cause instability on low memory systems, or when using --all-logs.

sosreport (version 4.5.5)

set sysroot to '/mnt' (cmdline)
Network devices not enumerated by nmcli. Will attempt to manually compile list of devices.
^Z
[2]+  Stopped                 strace -fttTxyC -o ../strace_sos.txt ./bin/sos report -vvv -a --all-logs --sysroot=/mnt

The hung sosreport should generate a directory /var/tmp/sos* ...

Then, strangely, I could not find the temp files...

root@ubuntu-server:/var/tmp# ls
systemd-private-1b36dcbf8a654bd495d6e8382c9a03d1-ModemManager.service-sVaVOs
systemd-private-1b36dcbf8a654bd495d6e8382c9a03d1-systemd-logind.service-FUrxp5
systemd-private-1b36dcbf8a654bd495d6e8382c9a03d1-systemd-resolved.service-2zA94X
systemd-private-1b36dcbf8a654bd495d6e8382c9a03d1-systemd-timesyncd.service-v2IpKi

But I did find some in /tmp that looked promising:

root@ubuntu-server:/tmp# ls
snap-private-tmp
sos.erz9jnr4
systemd-private-1b36dcbf8a654bd495d6e8382c9a03d1-ModemManager.service-hsp3BD
systemd-private-1b36dcbf8a654bd495d6e8382c9a03d1-systemd-logind.service-L9wINg
systemd-private-1b36dcbf8a654bd495d6e8382c9a03d1-systemd-resolved.service-WTMwPV
systemd-private-1b36dcbf8a654bd495d6e8382c9a03d1-systemd-timesyncd.service-m785GG
tmp50b0j6zu

Though looking at the content I'm not sure they will be helpful:

root@ubuntu-server:/tmp# ls sos.erz9jnr4/
tmp5ab75i_5  tmpbae578d6
root@ubuntu-server:/tmp# cd sos.erz9jnr4/
root@ubuntu-server:/tmp/sos.erz9jnr4# ls
tmp5ab75i_5  tmpbae578d6
root@ubuntu-server:/tmp/sos.erz9jnr4# cat tmp5ab75i_5
2023-07-28 21:28:26,514 DEBUG: set sysroot to '/mnt' (cmdline)
2023-07-28 21:28:26,545 DEBUG: Network devices not enumerated by nmcli. Will attempt to manually compile list of devices.
root@ubuntu-server:/tmp/sos.erz9jnr4# cat tmpbae578d6
root@ubuntu-server:/tmp/sos.erz9jnr4#

The strace file was generated without issue and can be found here.

TurboTurtle added a commit to TurboTurtle/sos that referenced this issue Jul 29, 2023
If sos is being used in a live environment to diagnose an issue, using
sysroot can cause the network device enumeration via /sys/class/net
crawling to fail. This will be the case for systems that do not use
`nmcli`.

When in a live environment, network devices will not be under
`/$sysroot/sys/class/net` but the "regular" path for the booted
environment. Similarly, if sos is being run in a container that is
properly configured, network devices will appear under `/sys/class/net`
and not (necessarily) under the sysroot path that mounts the host's
filesystem.

As such, disregard a configured sysroot when enumerating network devices
by crawling `/sys/class/net`, and trap any exceptions that may percolate
up from this in edge case environments.

Related: sosreport#3307

Signed-off-by: Jake Hunsaker <jacob.r.hunsaker@gmail.com>
@pmoravec
Contributor

Sigh, I expected more content in the tmp* files. But since they are almost empty and you haven't even got the

This command will collect system configuration and diagnostic information
..

text, sos got stuck at an early stage, around the lines at https://github.com/sosreport/sos/blob/main/sos/report/__init__.py#L1785-L1791.

If my understanding of the strace is right, process 2946 got stuck on an epoll of a socket communicating with 2947, which also got stuck - and it happened after 2946 successfully(?) read /etc/os-release, which is a symlink to /mnt/usr/lib/os-release. But I have no idea what sos was doing there. /o\

Unless @TurboTurtle has a different idea, could you get a sos (Python) backtrace like:

    f=Frame 0x55ea3174f358, for file /root/sos-main/sos/report/__init__.py, line 1172, in batch 
    f=Frame 0x55ea3160bc08, for file /root/sos-main/sos/report/__init__.py, line 1810, in execute 
    f=Frame 0x7fcdc8a39be0, for file /root/sos-main/sos/__init__.py, line 193, in execute 

(this example is from when sos is waiting on the ENTER prompt), either by running sos under gdb directly and pausing it when it gets stuck, or by running sos normally and, when stuck, running gdb -p <pid> /path/to/binary/python and then "py" ? (assuming you have python debuginfo installed; you have to select the ".py," lines from the mostly-C backtrace)

Also @TurboTurtle: does it make sense to add some debug logging to this pre-setup phase, to diagnose this type of issue more easily next time? (Or is this issue too rare to sacrifice microseconds of each and every sos run for that?)

@TurboTurtle
Member

I'm not opposed to more debug logging, but I'm curious what would be helpful here. Also, the fact that this only occurs in a rescue environment is puzzling. I don't have a better idea off the top of my head than drilling down with a coredump, unfortunately.

TurboTurtle added a commit that referenced this issue Jul 31, 2023
If sos is being used in a live environment to diagnose an issue, using
sysroot can cause the network device enumeration via /sys/class/net
crawling to fail. This will be the case for systems that do not use
`nmcli`.

When in a live environment, network devices will not be under
`/$sysroot/sys/class/net` but the "regular" path for the booted
environment. Similarly, if sos is being run in a container that is
properly configured, network devices will appear under `/sys/class/net`
and not (necessarily) under the sysroot path that mounts the host's
filesystem.

As such, disregard a configured sysroot when enumerating network devices
by crawling `/sys/class/net`, and trap any exceptions that may percolate
up from this in edge case environments.

Related: #3307

Signed-off-by: Jake Hunsaker <jacob.r.hunsaker@gmail.com>
@TurboTurtle
Member

This should not have been closed, re-opening. I'm guessing the original Closes tag doesn't get updated with a commit message update.

@TurboTurtle TurboTurtle reopened this Jul 31, 2023
@pjmattingly
Author

@pmoravec

Hi Pavel, just to clarify, did you want me to try this part?

...could you get a sos (Python) backtrace like:

    f=Frame 0x55ea3174f358, for file /root/sos-main/sos/report/__init__.py, line 1172, in batch 
    f=Frame 0x55ea3160bc08, for file /root/sos-main/sos/report/__init__.py, line 1810, in execute 
    f=Frame 0x7fcdc8a39be0, for file /root/sos-main/sos/__init__.py, line 193, in execute 

(this example is from when sos is waiting on the ENTER prompt), either by running sos under gdb directly and pausing it when it gets stuck, or by running sos normally and, when stuck, running gdb -p <pid> /path/to/binary/python and then "py" ? (assuming you have python debuginfo installed; you have to select the ".py," lines from the mostly-C backtrace).

Thanks.

@pmoravec
Contributor

Hello,
yes, we would like to see a Python backtrace of the stuck sos process, obtained in any of the following ways:

  • (iteratively?) add multiple debug statements to understand which particular block of code sos is stuck in,
  • or generate a coredump (e.g. via gcore PID /path/to/binary/python) and get a Python backtrace by analyzing it,
  • or attach via gdb (run gdb -p PID /path/to/binary/python and then bt to see full backtrace - I recommend running set logging on to print output to a file)
  • or run whole sos under gdb from the start (gdb /path/to/binary/python, then set args ./bin/sos report -a --all-logs and then run; once it gets stuck, Ctrl+C and bt to see backtrace)

The last two options require debuginfo packages to be installed (usually just python and glibc suffice; this might depend on the distro).

@pjmattingly
Author

pjmattingly commented Aug 1, 2023

Here's one part:

or run whole sos under gdb from the start (gdb /path/to/binary/python, then set args ./bin/sos report -a --all-logs and then run; once it gets stuck, Ctrl+C and bt to see backtrace)
  1. gdb /usr/bin/python3.10
  2. set args ./bin/sos report -a --all-logs --sysroot=/mnt
  3. set logging enabled on #log to gdb.txt
  4. run
  5. then when it hangs: ctrl+c
  6. bt
root@ubuntu-server:/home/peter/sos-dc190a9fde94f575e660e782831d922b89c00507# cat gdb.txt

Starting program: /usr/bin/python3.10 ./bin/sos report -a --all-logs --sysroot=/mnt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after vfork from child process 6804]
[Detaching after vfork from child process 6805]
[Detaching after fork from child process 6806]
[New Thread 0x7ffff5c02640 (LWP 6807)]
[Thread 0x7ffff5c02640 (LWP 6807) exited]
[Detaching after vfork from child process 6809]
[Detaching after fork from child process 6810]
[New Thread 0x7ffff5c02640 (LWP 6811)]
[Thread 0x7ffff5c02640 (LWP 6811) exited]
[Detaching after fork from child process 6812]
[New Thread 0x7ffff5c02640 (LWP 6813)]
[Thread 0x7ffff5c02640 (LWP 6813) exited]
[Detaching after fork from child process 6815]
[New Thread 0x7ffff5c02640 (LWP 6816)]
[Thread 0x7ffff5c02640 (LWP 6816) exited]
[Detaching after fork from child process 6817]
[New Thread 0x7ffff5c02640 (LWP 6818)]
[Thread 0x7ffff5c02640 (LWP 6818) exited]
[Detaching after fork from child process 6820]
[New Thread 0x7ffff5c02640 (LWP 6821)]
[Thread 0x7ffff5c02640 (LWP 6821) exited]
[Detaching after fork from child process 6823]
[New Thread 0x7ffff5c02640 (LWP 6824)]
[Thread 0x7ffff5c02640 (LWP 6824) exited]
[Detaching after fork from child process 6826]
[New Thread 0x7ffff5c02640 (LWP 6827)]
[Thread 0x7ffff5c02640 (LWP 6827) exited]
[Detaching after fork from child process 6829]
[New Thread 0x7ffff5c02640 (LWP 6830)]
[Thread 0x7ffff5c02640 (LWP 6830) exited]
[Detaching after fork from child process 6832]
[New Thread 0x7ffff5c02640 (LWP 6833)]

Thread 1 "python3.10" received signal SIGINT, Interrupt.
0x00007ffff7d747ed in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7fffffffbeb0) at ../sysdeps/unix/sysv/linux/select.c:69
69      ../sysdeps/unix/sysv/linux/select.c: No such file or directory.
#0  0x00007ffff7d747ed in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7fffffffbeb0)
    at ../sysdeps/unix/sysv/linux/select.c:69
#1  0x00005555557d1b06 in ?? ()
#2  0x00005555556af564 in ?? ()
#3  0x000055555569ea72 in _PyEval_EvalFrameDefault ()
#4  0x00005555556be391 in ?? ()
#5  0x000055555569a2fc in _PyEval_EvalFrameDefault ()
#6  0x00005555556b03ac in _PyFunction_Vectorcall ()
#7  0x000055555569914a in _PyEval_EvalFrameDefault ()
#8  0x00005555556b03ac in _PyFunction_Vectorcall ()
#9  0x000055555569a2fc in _PyEval_EvalFrameDefault ()
#10 0x00005555556be391 in ?? ()
#11 0x000055555569a2fc in _PyEval_EvalFrameDefault ()
#12 0x00005555556b03ac in _PyFunction_Vectorcall ()
#13 0x000055555569914a in _PyEval_EvalFrameDefault ()
#14 0x00005555556bbb6e in ?? ()
#15 0x00005555556ae4a8 in _PyObject_GenericGetAttrWithDict ()
#16 0x00005555556acabb in PyObject_GetAttr ()
#17 0x000055555569e713 in _PyEval_EvalFrameDefault ()
#18 0x00005555556be4de in ?? ()
#19 0x000055555569b3b0 in _PyEval_EvalFrameDefault ()
#20 0x00005555556b03ac in _PyFunction_Vectorcall ()
#21 0x000055555569ea72 in _PyEval_EvalFrameDefault ()
#22 0x00005555556b03ac in _PyFunction_Vectorcall ()
#23 0x000055555569914a in _PyEval_EvalFrameDefault ()
#24 0x00005555556cd5d2 in ?? ()
#25 0x00005555557223a6 in ?? ()
#26 0x00005555556af564 in ?? ()
#27 0x0000555555699005 in _PyEval_EvalFrameDefault ()
#28 0x00005555556b03ac in _PyFunction_Vectorcall ()
#29 0x000055555569914a in _PyEval_EvalFrameDefault ()
#30 0x00005555556b03ac in _PyFunction_Vectorcall ()
#31 0x000055555569914a in _PyEval_EvalFrameDefault ()
#32 0x00005555556b03ac in _PyFunction_Vectorcall ()
#33 0x000055555569914a in _PyEval_EvalFrameDefault ()
#34 0x00005555556b03ac in _PyFunction_Vectorcall ()
#35 0x000055555569914a in _PyEval_EvalFrameDefault ()
#36 0x00005555556b03ac in _PyFunction_Vectorcall ()
#37 0x000055555569914a in _PyEval_EvalFrameDefault ()
#38 0x00005555556b03ac in _PyFunction_Vectorcall ()
#39 0x000055555569914a in _PyEval_EvalFrameDefault ()
#40 0x0000555555695766 in ?? ()
#41 0x000055555578d456 in PyEval_EvalCode ()
#42 0x00005555557b9f08 in ?? ()
#43 0x00005555557b2d5b in ?? ()
#44 0x00005555557b9c55 in ?? ()
#45 0x00005555557b9138 in _PyRun_SimpleFileObject ()
#46 0x00005555557b8e33 in _PyRun_AnyFileObject ()
#47 0x00005555557aa0ae in Py_RunMain ()
#48 0x000055555578034d in Py_BytesMain ()
#49 0x00007ffff7c82d90 in __libc_start_call_main (main=main@entry=0x555555780310, argc=argc@entry=6, argv=argv@entry=0x7fffffffe5a8)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#50 0x00007ffff7c82e40 in __libc_start_main_impl (main=0x555555780310, argc=6, argv=0x7fffffffe5a8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffffffe598) at ../csu/libc-start.c:392
#51 0x0000555555780245 in _start ()

@pjmattingly
Author

Then the core dump:

or attach via gdb (run gdb -p PID /path/to/binary/python and then bt to see full backtrace - I recommend running set logging on to print output to a file)

  1. ./bin/sos report -a --all-logs --sysroot=/mnt
  2. ps aux | grep sos
  3. gcore <PID>

see:
https://drive.google.com/file/d/1azE_Pq7AOQ23fj6xAMYMUWOjLLFy8YZ_/view?usp=drive_link

@pjmattingly
Author

pjmattingly commented Aug 2, 2023

(iteratively?) add multiple debug statements to understand which particular block of code sos is stuck in,

I assume you mean https://en.wikipedia.org/wiki/Tracing_(software)?

  1. Adding print statements to the beginning and end of python files:
from pathlib import Path

for child in Path('./sos-dc190a9fde94f575e660e782831d922b89c00507').glob('**/*.py'):    
    with open(child, "r+") as f1:
        content = f1.read()
        f1.seek(0, 0)
        _add = f"print( 'Executing: {'/' + str(child.relative_to('./sos-dc190a9fde94f575e660e782831d922b89c00507'))}' )"
        f1.write(_add.rstrip('\r\n') + '\n' + '\n' + content)
        #print(_add.rstrip('\r\n') + '\n' + '\n' + content)

    with open(child, 'a') as f2:
        _add = "\n"
        _add += f"print( 'Leaving: {'/' + str(child.relative_to('./sos-dc190a9fde94f575e660e782831d922b89c00507'))}' )"
        f2.write(_add)
  2. ./bin/sos report -a --all-logs --sysroot=/mnt > ../out.txt

Then after several runs, a pattern emerged:

root@ubuntu-server:/home/peter# cat out.txt
Executing: /sos/__init__.py
Executing: /sos/options.py
Leaving: /sos/options.py
Leaving: /sos/__init__.py
Executing: /sos/report/__init__.py
Executing: /sos/report/plugins/__init__.py
Executing: /sos/utilities.py
Leaving: /sos/utilities.py
Executing: /sos/archive.py
Leaving: /sos/archive.py
Leaving: /sos/report/plugins/__init__.py
Executing: /sos/component.py
Leaving: /sos/component.py
Executing: /sos/policies/__init__.py
Executing: /sos/presets/__init__.py
Leaving: /sos/presets/__init__.py
Executing: /sos/policies/package_managers/__init__.py
Leaving: /sos/policies/package_managers/__init__.py
Leaving: /sos/policies/__init__.py
Executing: /sos/report/reporting.py
Leaving: /sos/report/reporting.py
Executing: /sos/cleaner/__init__.py
Executing: /sos/cleaner/preppers/__init__.py
Leaving: /sos/cleaner/preppers/__init__.py
Executing: /sos/cleaner/parsers/__init__.py
Leaving: /sos/cleaner/parsers/__init__.py
Executing: /sos/cleaner/parsers/ip_parser.py
Executing: /sos/cleaner/mappings/__init__.py
Leaving: /sos/cleaner/mappings/__init__.py
Executing: /sos/cleaner/mappings/ip_map.py
Leaving: /sos/cleaner/mappings/ip_map.py
Leaving: /sos/cleaner/parsers/ip_parser.py
Executing: /sos/cleaner/parsers/mac_parser.py
Executing: /sos/cleaner/mappings/mac_map.py
Leaving: /sos/cleaner/mappings/mac_map.py
Leaving: /sos/cleaner/parsers/mac_parser.py
Executing: /sos/cleaner/parsers/hostname_parser.py
Executing: /sos/cleaner/mappings/hostname_map.py
Leaving: /sos/cleaner/mappings/hostname_map.py
Leaving: /sos/cleaner/parsers/hostname_parser.py
Executing: /sos/cleaner/parsers/keyword_parser.py
Executing: /sos/cleaner/mappings/keyword_map.py
Leaving: /sos/cleaner/mappings/keyword_map.py
Leaving: /sos/cleaner/parsers/keyword_parser.py
Executing: /sos/cleaner/parsers/username_parser.py
Executing: /sos/cleaner/mappings/username_map.py
Leaving: /sos/cleaner/mappings/username_map.py
Leaving: /sos/cleaner/parsers/username_parser.py
Executing: /sos/cleaner/parsers/ipv6_parser.py
Executing: /sos/cleaner/mappings/ipv6_map.py
Leaving: /sos/cleaner/mappings/ipv6_map.py
Leaving: /sos/cleaner/parsers/ipv6_parser.py
Executing: /sos/cleaner/archives/__init__.py
Leaving: /sos/cleaner/archives/__init__.py
Executing: /sos/cleaner/archives/sos.py
Leaving: /sos/cleaner/archives/sos.py
Executing: /sos/cleaner/archives/generic.py
Leaving: /sos/cleaner/archives/generic.py
Executing: /sos/cleaner/archives/insights.py
Leaving: /sos/cleaner/archives/insights.py
Leaving: /sos/cleaner/__init__.py
Leaving: /sos/report/__init__.py
Executing: /sos/help/__init__.py
Leaving: /sos/help/__init__.py
Executing: /sos/collector/__init__.py
Executing: /sos/collector/sosnode.py
Executing: /sos/policies/init_systems/__init__.py
Leaving: /sos/policies/init_systems/__init__.py
Executing: /sos/collector/transports/__init__.py
Executing: /sos/collector/exceptions.py
Leaving: /sos/collector/exceptions.py
Leaving: /sos/collector/transports/__init__.py
Executing: /sos/collector/transports/juju.py
Leaving: /sos/collector/transports/juju.py
Executing: /sos/collector/transports/control_persist.py
Leaving: /sos/collector/transports/control_persist.py
Executing: /sos/collector/transports/local.py
Leaving: /sos/collector/transports/local.py
Executing: /sos/collector/transports/oc.py
Leaving: /sos/collector/transports/oc.py
Executing: /sos/collector/transports/saltstack.py
Leaving: /sos/collector/transports/saltstack.py
Leaving: /sos/collector/sosnode.py
Leaving: /sos/collector/__init__.py
WARNING: tmp-dir is set to a tmpfs filesystem. This may increase memory pressure and cause instability on low memory systems, or when using --all-logs.
Executing: /sos/policies/distros/__init__.py
Executing: /sos/policies/init_systems/systemd.py
Leaving: /sos/policies/init_systems/systemd.py
Executing: /sos/policies/runtimes/__init__.py
Leaving: /sos/policies/runtimes/__init__.py
Executing: /sos/policies/runtimes/crio.py
Leaving: /sos/policies/runtimes/crio.py
Executing: /sos/policies/runtimes/podman.py
Leaving: /sos/policies/runtimes/podman.py
Executing: /sos/policies/runtimes/docker.py
Leaving: /sos/policies/runtimes/docker.py
Leaving: /sos/policies/distros/__init__.py
Executing: /sos/policies/distros/amazon.py
Executing: /sos/policies/distros/redhat.py
Executing: /sos/presets/redhat/__init__.py
Leaving: /sos/presets/redhat/__init__.py
Executing: /sos/policies/package_managers/rpm.py
Leaving: /sos/policies/package_managers/rpm.py
Leaving: /sos/policies/distros/redhat.py
Leaving: /sos/policies/distros/amazon.py
Executing: /sos/policies/distros/anolis.py
Leaving: /sos/policies/distros/anolis.py
Executing: /sos/policies/distros/azure.py
Leaving: /sos/policies/distros/azure.py
Executing: /sos/policies/distros/circle.py
Leaving: /sos/policies/distros/circle.py
Executing: /sos/policies/distros/cos.py
Leaving: /sos/policies/distros/cos.py
Executing: /sos/policies/distros/debian.py
Executing: /sos/policies/package_managers/dpkg.py
Leaving: /sos/policies/package_managers/dpkg.py
Leaving: /sos/policies/distros/debian.py
Executing: /sos/policies/distros/opencloudos.py
Leaving: /sos/policies/distros/opencloudos.py
Executing: /sos/policies/distros/openeuler.py
Leaving: /sos/policies/distros/openeuler.py
Executing: /sos/policies/distros/rocky.py
Leaving: /sos/policies/distros/rocky.py
Executing: /sos/policies/distros/suse.py
Leaving: /sos/policies/distros/suse.py
Executing: /sos/policies/distros/ubuntu.py
Executing: /sos/policies/package_managers/snap.py
Leaving: /sos/policies/package_managers/snap.py
Leaving: /sos/policies/distros/ubuntu.py
Executing: /sos/policies/distros/uniontechserver.py
Leaving: /sos/policies/distros/uniontechserver.py

sosreport (version 4.5.5)

Executing: /sos/report/plugins/abrt.py
Leaving: /sos/report/plugins/abrt.py
Executing: /sos/report/plugins/acpid.py
Leaving: /sos/report/plugins/acpid.py
Executing: /sos/report/plugins/activemq.py
Leaving: /sos/report/plugins/activemq.py
Executing: /sos/report/plugins/alternatives.py
Leaving: /sos/report/plugins/alternatives.py
Executing: /sos/report/plugins/anaconda.py
Leaving: /sos/report/plugins/anaconda.py
Executing: /sos/report/plugins/anacron.py
Leaving: /sos/report/plugins/anacron.py

Exiting on user cancel
  3. Checking for the caller of /sos/report/plugins/anacron.py:
grep -ir "anacron" .

grep: ./sos/report/plugins/__pycache__/cron.cpython-310.pyc: binary file matches
grep: ./sos/report/plugins/__pycache__/anacron.cpython-310.pyc: binary file matches
./sos/report/plugins/cron.py:    packages = ('cron', 'anacron', 'chronie')
./sos/report/plugins/anacron.py:print( 'Executing: /sos/report/plugins/anacron.py' )
./sos/report/plugins/anacron.py:class Anacron(Plugin, IndependentPlugin):
./sos/report/plugins/anacron.py:    short_desc = 'Anacron job scheduling service'
./sos/report/plugins/anacron.py:    plugin_name = 'anacron'
./sos/report/plugins/anacron.py:    packages = ('anacron', 'chronie-anacron')
./sos/report/plugins/anacron.py:    # anacron may be provided by anacron, cronie-anacron etc.
./sos/report/plugins/anacron.py:    files = ('/etc/anacrontab',)
./sos/report/plugins/anacron.py:print( 'Leaving: /sos/report/plugins/anacron.py' )
./sos.spec:- Collect /etc/anacrontab in system plugin

I'm guessing that the plugins are run in different threads, and that either (1) there's some issue with anacron.py, or (2) there's some issue with the threading logic. I doubt (1), since execution leaves anacron.py, per the logs; but then it's not clear how anacron.py is called, as a grep of the source doesn't show it being imported or called directly. As for (2), I assume sos uses some sort of worker-pool approach, which suggests the workers may not be properly released after their work is done, so the pool - and sos - blocks waiting for a free worker to run the next plugin.
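For what it's worth, one classic way a Python program ends up in exactly this kind of futex wait is forking a subprocess while another thread holds a lock: the child inherits the lock's memory in its "held" state, and no thread exists in the child to ever release it. This is only an illustration of the general pitfall, not a claim about what sos is actually doing:

```python
import os
import threading
import time

lock = threading.Lock()


def holder():
    # Background thread holds the lock across the moment of fork,
    # simulating e.g. an internal logging or worker-pool lock.
    with lock:
        time.sleep(2)


threading.Thread(target=holder).start()
time.sleep(0.2)  # make sure the lock is held before forking

pid = os.fork()
if pid == 0:
    # Child process: the holder thread does not exist here, but the
    # lock was copied in the "held" state. A plain acquire() would
    # block forever; a timed acquire shows the lock is stuck.
    stuck = not lock.acquire(timeout=0.5)
    os._exit(1 if stuck else 0)

_, status = os.waitpid(pid, 0)
child_deadlocked = os.WEXITSTATUS(status) == 1
print('child saw a stuck lock:', child_deadlocked)
```

A child stuck like this would show up in strace exactly as above: the parent loops on wait4(pid, ..., WNOHANG) while the child sits in a futex wait.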

Looking forward to your reply.

@TurboTurtle TurboTurtle modified the milestones: 4.6.0, 4.6.1 Aug 21, 2023
@TurboTurtle TurboTurtle removed this from the 4.6.1 milestone Jan 9, 2024
@pjmattingly
Author

@TurboTurtle

Hi Jake,

I wanted to check on the status of this. I've been keeping tabs on it, as using sosreport from a live system would help me complete the KB article(s) I'm writing.

Thanks.
--Peter
