Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NNG Panic on Python shutdown #108

Open
broglep-work opened this issue Jul 14, 2022 · 5 comments · May be fixed by #125
Open

NNG Panic on Python shutdown #108

broglep-work opened this issue Jul 14, 2022 · 5 comments · May be fixed by #125

Comments

@broglep-work
Copy link

broglep-work commented Jul 14, 2022

We experience (under certain conditions) reproducible crash of python process.

panic: pthread_mutex_lock: Invalid argument
This message is indicative of a BUG.
Report this at https://github.com/nanomsg/nng/issues
/home/vsts/work/1/lib/python3.8/site-packages/pynng/_nng.abi3.so(nni_panic+0x112) [0x7fc452152762]
/home/vsts/work/1/lib/python3.8/site-packages/pynng/_nng.abi3.so(nni_aio_fini+0x20) [0x7fc45214bfe0]
/home/vsts/work/1/lib/python3.8/site-packages/pynng/_nng.abi3.so(nni_aio_free+0x1b) [0x7fc45214c11b]
/home/vsts/work/1/lib/python3.8/site-packages/pynng/_nng.abi3.so(+0x4632f) [0x7fc4520fd32f]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x158783) [0x7fc4569ea783]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x4adc) [0x7fc456a31fcc]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0xfa) [0x7fc4569c2dfa]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7b8) [0x7fc456a2dca8]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x301) [0x7fc456a2c8d1]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x18e) [0x7fc4569c2e8e]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x1337b3) [0x7fc4569c57b3]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x27c4) [0x7fc456a2fcb4]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x13770b) [0x7fc4569c970b]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x1e8624) [0x7fc456a7a624]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x1e8d67) [0x7fc456a7ad67]
/opt/hostedtoolcache/Python/3.8.13/x64/lib/libpython3.8.so.1.0(+0x1e867e) [0x7fc456a7a67e]

My investigation showed that it is caused by access of nni_aio_lk which was deinit'd by nng_fini / nni_fini / nni_aio_sys_fini. nni_aio_sys_fini is triggered by _pynng_atexit and nni_aio_free by AIOHelper.__del__

This looks like a timing issue on python shutdown: depending on atexit functions are called and objects garbage collected, it might be possible that nng was already deinitialized and pynng, but pynng is still calling the nng lib

It is currently unclear how best to address this issue. Any ideas?

(for background, this occurs reproducibly in our CI when using pynng 0.7.1, asyncio & pytest, but we did also see it from time to time when running our application on developer machines and stopping the application)

@ntakouris
Copy link

I can confirm this bug, with the same setup.

@codypiersall
Copy link
Owner

This is a good find. Thanks for posting the issue, and I think you nailed the problem.

I'm not sure the best solution either. A crash is never good. I can't completely remember why we even call nni_fini(), maybe just for good hygiene? I can only think of a couple use cases for calling it:

  1. If trying to detect a memory leak with Valgrind, it will report that a lot of stuff leaked at the process exit
  2. If a process is embedding a Python interpreter, this would allow it to shut it down and restart it (possibly)

Maybe we could add a global variable to pynng, a bool called nng_fini_at_exit, that gets checked when deciding whether to call nng_fini() or not. It should probably default to False to avoid any process-crashing race conditions.

Thoughts?

@broglep-work
Copy link
Author

I think that solution is ok, and having it default to False is also good as for most people it probably would not make much of a difference as the process is anyway about to end. For added flexibility we could expose the nng_fini in the pynng api, so that the user can call it themselves at the appropriate time. In that case maybe we could even get rid of nng_fini_at_exit and _pynng_atexit and let the user himself register atexit.register() if it is necessary for the usecase (and have an corresponding section in the documentation about that)

@jason-cso
Copy link

I ran into the same bug and wrote a simple repro case. About half the time I get nni logs like OP.

#!/usr/bin/env python3

from pynng import Surveyor0, Respondent0, Timeout
import os
import signal
import threading
import traceback
import time

address = "tcp://127.0.0.1:13812"

def daemon(function):
    t = threading.Thread(target=function, daemon=True)
    print(f"Starting thread: {function.__name__}() as {t.name}")
    t.start()
    return t

def s():
    with Surveyor0(listen=address) as surveyor:
        surveyor.survey_time = 500  # milliseconds
        while True:
            surveyor.send(b"foo")
            try:
                while True:
                    m = surveyor.recv()
                    print(f"Got {m} from client")
            except Timeout:
                pass

def c():
    with Respondent0(dial=address) as responder:
        while True:
            m = responder.recv()
            print(f"Got {m} from server")
            responder.send(b"bar")

daemon(s)
time.sleep(0.1)
daemon(c)
time.sleep(2)

@olivierpelet
Copy link

I ran into the same bug and wrote a simple repro case. About half the time I get nni logs like OP.

#!/usr/bin/env python3

from pynng import Surveyor0, Respondent0, Timeout
import os
import signal
import threading
import traceback
import time

address = "tcp://127.0.0.1:13812"

def daemon(function):
    t = threading.Thread(target=function, daemon=True)
    print(f"Starting thread: {function.__name__}() as {t.name}")
    t.start()
    return t

def s():
    with Surveyor0(listen=address) as surveyor:
        surveyor.survey_time = 500  # milliseconds
        while True:
            surveyor.send(b"foo")
            try:
                while True:
                    m = surveyor.recv()
                    print(f"Got {m} from client")
            except Timeout:
                pass

def c():
    with Respondent0(dial=address) as responder:
        while True:
            m = responder.recv()
            print(f"Got {m} from server")
            responder.send(b"bar")

daemon(s)
time.sleep(0.1)
daemon(c)
time.sleep(2)

I've forked and updated nng (and mbedtls) to the latest version, and the code does not crash anymore.
Also, I've seen that @broglep-work and another user made some patches in a fork here.
I will probably include these patchs in my branch as well

@broglep-work broglep-work linked a pull request Jan 29, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants