Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read stop signals from the process and update the process state. #5361

Merged
merged 89 commits into from May 22, 2024

Conversation

anki-code
Copy link
Member

@anki-code anki-code commented Apr 24, 2024

Reading stop signals from the process and update the process state.

The issue

Technically. In a couple of places that critical for processing signals we have os.waitpid(). The function behavior is pretty unobvious and one of things is processing return code after catching the signal. We had no good signal processing around this and this PR fixes this. See also proc_untraced_waitpid function description.

From user perspective. For example we have process that is waiting for user input from terminal e.g. python -c "input()" or fzf. If this process will be in captured pipeline e.g. !(echo 1 | fzf | head) it will be suspended by OS and the pipeline will be in the endless loop with future crashing and corrupting std at the end. This PR fixes this.

The solution

Technically. The key function is proc_untraced_waitpid - it catches the stop signals and updates the process state.

From user perspective. First of all we expect that users will use captured object !() only for capturable processes. Because of it our goal here is to just make the behavior in this case stable.
In this PR we detect that process in the pipeline is suspended and we need to finish the command pipeline carefully:

  • Show the message about suspended process.
  • Keep suspended process in jobs. The same behavior we can see in bash. This is good because we don't know what process suspended and why. May be experienced user will want to continue it manually.
  • Finish the CommandPipeline with returncode=None and suspended=True.

Before

!(fzf) # or !(python -c "input()")
# Hanging / Exceptions / OSError / No way to end the command.
# After exception:
$(echo 1)
# OSError / IO error

After

!(fzf) # or `!(ls | fzf | head)` or `!(python -c "input()")`
# Process ['fzf'] with pid 60000 suspended with signal 22 SIGTTOU and stay in `jobs`.
# This happends when process start waiting for input but there is no terminal attached in captured mode.
# CommandPipeline(returncode=None, suspended=True, ...)

$(echo 1)
# Success.

Closes #4752 #4577

Notes

  • There is pretty edge case situation when the process was terminated so fast that we can't catch pid alive and check signal (src). I leave it as is for now.

Mentions

#2159

For community

⬇️ Please click the 👍 reaction instead of leaving a +1 or 👍 comment

@anki-code
Copy link
Member Author

anki-code commented Apr 26, 2024

Helpful things for knight

image

Docs

Interactive debugging in IDE

Setup PyCharm and have an experience with breakpoints.

Tools for modeling the process behavior

sleep 10  # Tool that are easy to run in background and interrupt.
python -c 'input()'  # Easy way to run unkillable (in some cases) input-blocked app.
fzf  # Crazy tool. It reads input from STDIN, prints UI to STDERR, waits input from /dev/tty, prints result to STDOUT.

See also "Python app to test catching all signals" below.

Test in pure Linux environment

docker run --rm -it xonsh/xonsh:slim bash -c "pip install -U 'xonsh[full]' && xonsh"
# a1b2c3  # docker container id
apt update && apt install -y vim git procps strace  # to run `ps`
# Connect to container from other terminal:
docker exec -it a1b2c3 bash
# Save docker container state to reuse:
docker ps
docker commit c3f279d17e0a local/my_xonsh  # the same for update
docker run --rm -it local/my_xonsh xonsh

Trace signals with strace:

python -c 'input()' & 
# pid 123
strace -p 123
# strace: Process 123 attached
# [ Process PID=123 runs in x32 mode. ]
# --- stopped by SIGTTIN ---
kill -SIGCONT 72  # From another terminal.
# --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=48, si_uid=0} ---
# syscall_0x7ffffff90558(0x5555555c5a00, 0, 0x7fffff634940, 0, 0, 0x3f) = 0x2
# --- SIGTTIN {si_signo=SIGTTIN, si_code=SI_KERNEL} ---
# --- stopped by SIGTTIN ---

Monitor process state codes (STAT)

ps ax  # Tool to monitor processes and have the process state in STAT column.
# STAT:
D    uninterruptible sleep (usually IO)
R    running or runnable (on run queue)
S    interruptible sleep (waiting for an event to complete)
T    stopped, either by a job control signal or because it is being traced.
W    paging (not valid since the 2.6.xx kernel)
X    dead (should never be seen)
Z    defunct ("zombie") process, terminated but not reaped by its parent.

Hot to send signals

ps ax | grep fzf   # get pid=123
kill -SIGINT 123
kill -SIGCONT 123

watch -n1 'ps ax | grep fzf'  # Monitor process state after sending signals (STAT).

Decode process signals from os.waitpid

UPD: This moved to xonsh.tools.describe_waitpid_status and enriched.

pid, proc_status = os.waitpid(pid_of_child_process, os.WUNTRACED)

# WILL BE IN: from xonsh.tools import describe_waitpid_status
def describe_waitpid_status(proc_status):
    """https://docs.python.org/3/library/os.html#os.waitpid"""
    import os
    funcs = [os.WIFEXITED, os.WEXITSTATUS, os.WIFSIGNALED, os.WTERMSIG, os.WIFSTOPPED, os.WSTOPSIG]
    for f in funcs:
        print(f.__name__, '-', f(proc_status), '-', f.__doc__)

describe_waitpid_status(proc_status=5759)
# WIFEXITED - False - Return True if the process returning status exited via the exit() system call.
# WEXITSTATUS - 22 - Return the process return code from status.
# WIFSIGNALED - False - Return True if the process returning status was terminated by a signal.
# WTERMSIG - 127 - Return the signal that terminated the process that provided the status value.
# WIFSTOPPED - True - Return True if the process returning status was stopped.
# WSTOPSIG - 22 - Return the signal that stopped the process that provided the status value.

Python app to test catching all signals

#!/usr/bin/env python

import os, sys, signal, time
from datetime import datetime

ASK_INPUT = True
OUTPUT_FILE = '/tmp/sig'  # Use `tail -f /tmp/sig` to monitor the process.

def log(msg, file=sys.stdout):
    current_datetime = datetime.now()
    iso_date = current_datetime.isoformat()
    print(f"{os.getpid()} {iso_date} {msg}", file=file)

def get_signal_name(signum):
    for name in dir(signal):
        if name.startswith("SIG") and getattr(signal, name) == signum:
            return name
    return ""

def main():
    if file:
        with open(file, 'a') as f:
            log(f"start catching", file=f)

    def signal_handler(signum, frame):
        if file:
            with open(OUTPUT_FILE, 'a') as f:
                log(f"signal {signum} {get_signal_name(signum)}", file=f)
        log(f"signal {signum} {get_signal_name(signum)}", file=sys.stdout)

    # Set handlers for all signals.
    for sig in range(1, signal.NSIG):
        # Skip system signals.
        if sig in [signal.SIGKILL, signal.SIGSTOP, signal.SIGINT]:
            continue

        # You may remove this if needed.
        if sig in [signal.SIGINT]:
            continue

        try:
            signal.signal(sig, signal_handler)
        except (ValueError, OSError) as e:
            log(f"exception on setting signal {sig} {get_signal_name(sig)}: {e}", file=f)

    log(f"start catching")

    i = 0
    while True:
        if ASK_INPUT:
            # Low level method
            os.read(0, 1024)
            # High level method
            # print(repr(input('Input:')))
            sys.exit()
        else:
            log(f'sleep {i}')
            if echo_sleep:
                time.sleep(1)

        i += 1

if __name__ == "__main__":
    main()

Note! We read in the manual that "Python signal handlers are always executed in the main Python thread of the main interpreter, even if the signal was received in another thread." (source). But this is not related to situation when we run subprocess. First of all the signal (e.g. SIGINT) will go to foreground subprocess task. So you need to test this carefully for case main_thread+thread+popen_subprocess.

Stress test

while !(python -c 'import sys; sys.exit(1)') != 0:
    print('.', end='')

Test using bash process manager

You can test any case around processes and signals using bash pipes and streams *:

bash
sleep 100 &
# [1] 332211
jobs
# [1]  + running    sleep 100
ps ax | grep sleep  # pid=332211
# 332211 s005  SN     0:00.00 sleep 100
kill -SIGINT 332211

fzf 2>/dev/null &  # run fzf with hidden TUI (fzf is using stderr to show it)
fzf < /dev/null &  # disable stdin
ps fzf | grep fzf
# etc etc etc

You should know about sigmask

Process by itself can catch signals. The sigmask is describing what signals process intended to catch. I used sigmask tool to decrypt fzf sigmask:

apt install -y golang-go
go get -v github.com/r4um/sigmask
ps ax | grep fzf  # pid=123

~/go/bin/sigmask 123  # SigCgt - signals that process wants to catch by itself.
# SigPnd
# ShdPnd
# SigBlk SIGUSR1,SIGUSR2,SIGPIPE,SIGALRM,SIGTSTP,SIGTTIN,SIGTTOU,SIGURG,SIGXCPU,SIGXFSZ,SIGVTALRM,SIGIO,SIGPWR,SIGRTMIN
# SigIgn
# SigCgt SIGHUP,SIGINT,SIGQUIT,SIGILL,SIGTRAP,SIGABRT,SIGBUS,SIGFPE,SIGUSR1,SIGSEGV,SIGUSR2,SIGPIPE,SIGALRM,SIGTERM,SIGSTKFLT

Trace xonsh code using xunter

mkdir -p ~/git/ && cd ~/git/
git clone git+https://github.com/xonsh/xonsh
cd xonsh
xunter --no-rc ++cwd ++filter 'Q(filename_has="procs/")' ++output /tmp/procs.xun
# In another terminal:
tail -f /tmp/procs.xun

Interesting case for tracing

for i in range(5):
    print(i)
    sleep 10

After Ctrl-c we have continue with the next iteration. It's annoying. Here is the related code:

xonsh/xonsh/jobs.py

Lines 269 to 288 in 26c6e11

try:
_, wcode = os.waitpid(obj.pid, os.WUNTRACED)
except ChildProcessError as e: # No child processes
if return_error:
return e
else:
return _safe_wait_for_active_job(
last_task=active_task, backgrounded=backgrounded
)
if os.WIFSTOPPED(wcode):
active_task["status"] = "stopped"
backgrounded = True
elif os.WIFSIGNALED(wcode):
print() # get a newline because ^C will have been printed
obj.signal = (os.WTERMSIG(wcode), os.WCOREDUMP(wcode))
obj.returncode = None
else:
obj.returncode = os.WEXITSTATUS(wcode)
obj.signal = None
return wait_for_active_job(last_task=active_task, backgrounded=backgrounded)

When we run sleep 10 the system process go to wait and we stuck on os.waitpid. After pressing Ctrl+C (sending SIGINT) process get the SIGINT signal and we catch that signal appeared using WIFSIGNALED (line 281) but have no actions about stopping the execution entirely and we continue execution.
It would be cool to considering the cases around this (i.e. can we really stop the execution of the whole code?) and stopping the execution carefully here.

xonsh/procs/posix.py Outdated Show resolved Hide resolved
@anki-code anki-code marked this pull request as draft May 2, 2024 10:25
@anki-code
Copy link
Member Author

I converted this to draft because I want to try to create test.

anki-code added a commit that referenced this pull request May 2, 2024
### Motivation

There is annoying behavior when you run command in the loop and can't
interrupt e.g. [this
report](#5371) and a bit
#5342. After diving into this I see the issue around return code.

### The essence

Basically ``p = subprocess.Popen()`` populates ``p.returncode`` after
``p.wait()``, ``p.poll()`` or ``p.communicate()``
([doc](https://docs.python.org/3/library/os.html#os.waitpid)). But if
you're using `os.waitpid()` BEFORE these functions you're capturing
return code from a signal subsystem and ``p.returncode`` will be ``0``
like success but it's not success. So after ``os.waitid`` call you need
to set return code manually ``p.returncode = -os.WTERMSIG(status)`` like
in Popen. Example:
```xsh
python  # python interactive

import os, signal, subprocess as sp

p = sp.Popen(['sleep', '100'])
p.pid
# 123
p.wait()
# Ctrl+C or `kill -SIGINT 123` from another terminal
p.returncode
# -2

# BUT:

p = sp.Popen(['sleep', '100'])
p.pid
# 123

pid, status = os.waitpid(p.pid, os.WUNTRACED)
# status=2

# From another terminal:
kill -SIGINT 123

p.wait()
# 0
p.returncode
# 0
```

```xsh
from xonsh.tools import describe_waitpid_status
describe_waitpid_status(2)
# WIFEXITED - False - Return True if the process returning status exited via the exit() system call.
# WEXITSTATUS - 0 - Return the process return code from status.
# WIFSIGNALED - True - Return True if the process returning status was terminated by a signal.
# WTERMSIG - 2 - Return the signal that terminated the process that provided the status value.
# WIFSTOPPED - False - Return True if the process returning status was stopped.
# WSTOPSIG - 0 - Return the signal that stopped the process that provided the status value.
```

See also: [Helpful things for
knight](#5361 (comment)).

### Before

```xsh
$RAISE_SUBPROC_ERROR = True

sleep 100
# Ctrl+C
_.rtn
# 0  # It's wrong and RAISE_SUBPROC_ERROR ignored.

for i in range(5):
    print(i)
    sleep 5
# 0
# Ctrl+C  # Can't interrupt
# 1
# 2 
```

### After
```xsh
sleep 100
# Ctrl+C
_.rtn
# -2  # Like in default Popen

$RAISE_SUBPROC_ERROR = True
for i in range(5):
    print(i)
    sleep 5
# 0
# Ctrl+C
# subprocess.CalledProcessError
```

### Notes

* We need to refactor `xonsh.jobs`. It's pretty uncomfortable work with
module.
* The logic is blurry between Specs, Pipelines and Jobs. We need to
bring more clear functions.
* The `captured` variable looks like "just the way of capturing (stdout,
object)" but in fact it affects all logic very much. We need to create
table where we can see the logic difference for every type of capturing.
E.g. in `captured=stdout` mode we use `xonsh.jobs` to monitor the
command but in `captured=object` we avoid this and some logic from
`xonsh.jobs` applied in `stdout` mode but missed in `object` mode. We
need clear map.

## For community
⬇️ **Please click the 👍 reaction instead of leaving a `+1` or 👍
comment**

---------

Co-authored-by: a <1@1.1>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@anki-code anki-code changed the title Added processing the case where interactive command was suspended by OS in full captured mode Added processing the case where interactive command was suspended by OS May 13, 2024
@anki-code anki-code marked this pull request as ready for review May 15, 2024 23:25
@anki-code
Copy link
Member Author

anki-code commented May 16, 2024

@jnoortheen @gforsyth After epic struggle during 3 weeks I'm happy to say that it's ready to review.
The first message was updated and the tests were started many times to have confidence that catching signals is stable.

@jnoortheen
Copy link
Member

I came across this lib https://github.com/oconnor663/duct.py/blob/master/gotchas.md . Can this be helpful?

@anki-code
Copy link
Member Author

anki-code commented May 16, 2024

@jnoortheen it's very helpful library! Cudos to you and to the oconnor663! I will take a look deeper.
(This is NOT the stopper to merge this PR)

@jnoortheen
Copy link
Member

Yes this is great work. We can take pointers from this repo.

@anki-code
Copy link
Member Author

@gforsyth hey! It's ready to review, please take a look! Thanks!

Copy link
Collaborator

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good @anki-code -- I made a few suggestions for the error message and we should update the formatting for jobs --posix, but otherwise this can go in.

xonsh/jobs.py Outdated Show resolved Hide resolved
xonsh/procs/pipelines.py Outdated Show resolved Hide resolved
xonsh/jobs.py Show resolved Hide resolved
anki-code and others added 4 commits May 22, 2024 16:49
Co-authored-by: Gil Forsyth <gforsyth@users.noreply.github.com>
Co-authored-by: Gil Forsyth <gforsyth@users.noreply.github.com>
@anki-code anki-code requested a review from gforsyth May 22, 2024 15:04
@anki-code
Copy link
Member Author

anki-code commented May 22, 2024

@gforsyth comments resolved. Thank you for review! Waiting for final approve.
@jnoortheen thank you for review!

Copy link
Collaborator

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Nice work @anki-code -- I think the helpful notes for the knight is an especially good write-up. Really awesome stuff here.

@gforsyth gforsyth changed the title Reading stop signals from the process and update the process state. Read stop signals from the process and update the process state. May 22, 2024
@gforsyth gforsyth merged commit 0f25a5a into main May 22, 2024
12 checks passed
@gforsyth gforsyth deleted the fix_int_suspended branch May 22, 2024 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alias doesn't set returncode
3 participants