Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create base Daemon class and refactor Undertaker, Reaper, Abacus Account and Judge Repairer daemons to inherit from base class #6489

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

rdimaio
Copy link
Contributor

@rdimaio rdimaio commented Jan 26, 2024

Initial commit where base class Daemon is created, and Undertaker inherits from Daemon. If approved, I will refactor the other daemons in the same way. Tested in local fork.

(note: the changes in common.py outside of the main Daemon class are due to the black formatter adjusting style to meet PEP8)

Initial commit for #6478

rcarpa
rcarpa previously approved these changes Jan 29, 2024
@rdimaio rdimaio changed the title Create base Daemon class and refactor Undertaker to inherit from base class Create base Daemon class and refactor Undertaker and Reaper to inherit from base class Jan 30, 2024
@rdimaio
Copy link
Contributor Author

rdimaio commented Jan 30, 2024

In 29a5666, Reaper is also refactored to follow the base Daemon class. Tests pass on fork branch: rdimaio@29a5666

@rdimaio rdimaio force-pushed the patch-6478-daemon-refactor-undertaker branch 2 times, most recently from c2ad227 to 8eb51c8 Compare February 2, 2024 10:17
@rdimaio rdimaio changed the title Create base Daemon class and refactor Undertaker and Reaper to inherit from base class Create base Daemon class and refactor Undertaker, Reaper, Abacus Account and Judge Repairer daemons to inherit from base class Feb 2, 2024
Copy link
Contributor

@dchristidis dchristidis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial impression was positive; I think the proposed design has a lot of merit. But I would not recommend its inclusion in its current state.

What hinders the review is the mix of functional and aesthetical changes in the same commit. Things like folding long lines, changing indentation levels, and adding type hints to existing functions, are all very welcome, but bundling them together with the actual behavioural changes for PRs as large as this should be discouraged.

Then there’s the change of existing strings to use double instead of single quotes. The Rucio project does not have an official guideline, so the recommendation is to follow the existing pattern when modifying code and using one’s preference when introducing big blocks of new code. But I’m not too enthusiastic about trying to conform an entire existing file. What if the next developer that works on it thinks that single quotes are better? It can also be an annoyance when going through the history of a file.

I stopped halfway through lib/rucio/daemons/common.py, without looking too close at the rest. I would be very happy to continue the review, but if ends up being up to me, I think the PR may need to go through multiple iterations before I can approve it.

PS: Afterwords I saw that:

(note: the changes in common.py outside of the main Daemon class are due to the black formatter adjusting style to meet PEP8)

Black is highly opinionated. Its use induced negative reactions in the past. I would be careful.

lib/rucio/daemons/common.py Outdated Show resolved Hide resolved
lib/rucio/daemons/common.py Outdated Show resolved Hide resolved
lib/rucio/daemons/common.py Outdated Show resolved Hide resolved
lib/rucio/daemons/common.py Outdated Show resolved Hide resolved
lib/rucio/daemons/common.py Outdated Show resolved Hide resolved
lib/rucio/daemons/common.py Outdated Show resolved Hide resolved
Comment on lines 126 to 133
# Interruptible joins require a timeout.
while thread_list:
thread_list = [
thread.join(timeout=3.14)
for thread in thread_list
if thread and thread.is_alive()
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, I think placing the comment above the block makes it unclear. Better put it together with line 129.

But more importantly, do you understand this loop? threading.Thread.join() always returns None, according to its documentation. So under what circumstances could we have multiple iterations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole code block was copied from Reaper:

    # Interruptible joins require a timeout.
    while threads_list:
        threads_list = [thread.join(timeout=3.14) for thread in threads_list if thread and thread.is_alive()]

In general, every daemon involved in the refactoring had a loop of this kind, e.g. in Undertaker:

       # Interruptible joins require a timeout.
        while threads[0].is_alive():
            [t.join(timeout=3.14) for t in threads]

Based on the fact that threading.Thread.join() returns None, then I think the version of the loop present in Undertaker is the one that's actually correct, given that it loops as long as the first thread is alive, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m afraid I don’t know. That’s why I’m genuinely asking, because you’re (rightfully) copying existing implementations, and maybe they copied it over from somewhere else, but I wonder if anyone really understands them.

So, since the long-term goal of your effort is to implement this exactly once, I am taking the opportunity to try to understand exactly what it is supposed to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I look at the existing implementations, the more perplexed I am. Won’t the loop in the Reaper exit after one iteration, while the threads are still running?

>>> import threading
>>> import time
>>> def worker():
...     while True:
...         time.sleep(60)
... 
>>> threads_list = [threading.Thread(target=worker) for _ in range(2)]
>>> for thread in threads_list:
...     thread.start()
... 
>>> while threads_list:
...     print(threads_list)
...     threads_list = [thread.join(timeout=0.1) for thread in threads_list if thread and thread.is_alive()]
... 
[<Thread(Thread-1 (worker), started 140304535520960)>, <Thread(Thread-2 (worker), started 140304525035200)>]
[None, None]
>>> # Huh?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following my own little tests, I am starting to question the need for a timeout and whether the try-except structure we have have in the daemons for keyboard interrupts is working how it is supposed to.

I am also looking for reproducible example as to why the following won’t do:

for thread in threads:
    thread.join()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m getting different results in Python 3.6 and 3.9. 🤷 I think this further drives the point that we implemented a solution in Python 2, based on the situation at the time, and kept it as it is without checking if it still makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this! I'm not sure what's the best way forward here - the different results in different Python versions are a bit worrying (though AFAIK Rucio requires Python 3.9 to run, so at least from the user's perspective it will be consistent), but it might also be one of those things that "if it works now, better not touch it" (even though it doesn't really work).

It's definitely a carryover from Python 2 in some way, I found this commit under this issue for Python 3 compatibility with daemons, and that logic was left untouched

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, the implementation seen in the Undertaker has been there since the very beginning (over a decade ago, see b896d4c). The justification is lost in time.

Here’s a modern test case:

#!/usr/bin/env python3

import os
import signal
import threading
import time
from types import FrameType
from typing import Optional

GRACEFUL_STOP = threading.Event()


def worker() -> None:
    while not GRACEFUL_STOP.is_set():
        print('worker active')
        time.sleep(1)
    print('worker terminating')
    time.sleep(5)
    print('worker terminated')


def stop(signum: Optional[int] = None, frame: Optional[FrameType] = None) -> None:
    GRACEFUL_STOP.set()
    if signum == signal.SIGINT:
        if hasattr(stop, 'received_sigint'):
            exit(1)
        else:
            stop.received_sigint = True

if __name__ == '__main__':
    if 'CAPTURE_SIGINT' in os.environ:
        signal.signal(signal.SIGINT, stop)
    threads = [threading.Thread(target=worker) for _ in range(1)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print('main thread terminated')

Here’s a sample run with Python 3.9:

$ python3 test.py 
worker active
worker active
^CTraceback (most recent call last):
  File "/afs/cern.ch/user/d/dichrist/test.py", line 37, in <module>
    thread.join()
  File "/usr/lib64/python3.9/threading.py", line 1060, in join
    self._wait_for_tstate_lock()
  File "/usr/lib64/python3.9/threading.py", line 1080, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt

$ CAPTURE_SIGINT=TRUE python3 test.py 
worker active
worker active
^Cworker terminating
worker terminated
main thread terminated
$ CAPTURE_SIGINT=TRUE python3 test.py 
worker active
worker active
^C^C$ 

It all comes down to this:

  1. If we don’t care about graceful shutdown with SIGINT, then the simple implementation works.
  2. If we care about it, then we can easily adjust the stop() function which is already used for SIGTERM to exit gracefully upon the first keyboard interrupt and immediately upon the second.

@bari12 @mlassnig Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would adjust this. We should take this as an opportunity to renew the code here and don't carry outdated (not to say wrong) choices further in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the question remains: should we implement Ctrl+C leading to a graceful stop?

lib/rucio/daemons/common.py Outdated Show resolved Hide resolved
lib/rucio/daemons/common.py Outdated Show resolved Hide resolved
"""
pass

def _call_daemon(self, activities: Optional[list[str]] = None) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should take the opportunity to rethink some of the chosen names. There’s ‘pre-run’, ‘run’, ‘run once’, ‘call daemon’. Do they feel intuitive? Had I not known, I would have guessed that _call_daemon() is the outer function, _pre_run_checks() is executed before run(), and that run() and _run_once() are alternatives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am okay with changing the names if there are any suggestions, but I could not think of anything else at the moment. There are some things that I do not like, e.g. I think once and consequently run_once are not clear enough, but those names currently match some of the names from the daemon's user's perspective (e.g. the --run-once flag) so I didn't want to change them

@rdimaio
Copy link
Contributor Author

rdimaio commented Feb 14, 2024

Things like folding long lines, changing indentation levels, and adding type hints to existing functions, are all very welcome, but bundling them together with the actual behavioural changes for PRs as large as this should be discouraged.

I agree on all points except for type hinting. On the formatting side, I recognize I made a mistake in using a formatter, as it makes it harder to spot the functional changes; I'm happy to do whatever you think it's best to do here, e.g. if you want me to amend the commit so that the formatting changes are removed I'm okay with doing that, but if you have other ideas let me know. I will be careful about this for future PRs.

For the type hinting to existing functions: the functions whose type hints have been changed/added in this PR are related to the daemons that have been refactored (e.g. parse_expression is used in Reaper). I added type hints to those functions while refactoring the daemons in order to better understand the logic in the daemons themselves. I understand that these type hints are not part of the "refactoring" itself, but I personally don't see an issue in having them within the same PR, but of course this is just my opinion so I'm okay with separating them if you prefer.

@dchristidis
Copy link
Contributor

It’s the scale of the PR that makes this matter to me. If it were a few-line change in a single function, that also happened to add type hints, then I would not object to a single commit. This, however, is over one thousand lines and spans multiple files. Hence why I tend to request that behavioural and non-behavioural changes are separated (and type hints are part of the latter). Both are welcome, of course! But as a reviewer, it aids me to know when a commit is not supposed to make any effective changes.

So yes, if Martin expects the final say from me, then I need to ask you to rewrite the commits as described.

@rdimaio rdimaio force-pushed the patch-6478-daemon-refactor-undertaker branch from 8eb51c8 to 1cbb3d2 Compare February 15, 2024 13:08
@rdimaio
Copy link
Contributor Author

rdimaio commented Feb 15, 2024

It’s the scale of the PR that makes this matter to me. If it were a few-line change in a single function, that also happened to add type hints, then I would not object to a single commit. This, however, is over one thousand lines and spans multiple files. Hence why I tend to request that behavioural and non-behavioural changes are separated (and type hints are part of the latter). Both are welcome, of course! But as a reviewer, it aids me to know when a commit is not supposed to make any effective changes.

So yes, if Martin expects the final say from me, then I need to ask you to rewrite the commits as described.

I've rebased the PR to remove all unnecessary non-behavioural changes

@rdimaio rdimaio force-pushed the patch-6478-daemon-refactor-undertaker branch 2 times, most recently from a4ff256 to d45c5ad Compare February 15, 2024 13:17
@rdimaio rdimaio force-pushed the patch-6478-daemon-refactor-undertaker branch from d45c5ad to 056cb36 Compare February 29, 2024 13:24
"""
self.graceful_stop.set()
logging.info("%s: terminating", self.daemon_name)
if signum == signal.SIGINT or signum == signal.SIGTERM:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be sigint only, not sigterm as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth noting: the daemons in their current version use SIGTERM, not SIGINT, e.g.

signal.signal(signal.SIGTERM, termhandler)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Auditor is not a good example.

SIGTERM is what the OS/supervisor/scheduler is using to terminated a process under its control. It is supposed to send SIGKILL after a timeout.

SIGINT is what Ctrl+C sends. It only makes sense when one starts a daemon process from their terminal, which is unlikely, especially in a production environment.

The daemons that are (supposedly) handling SIGNIT are doing so via the KeyboardInterrupt exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I linked to Auditor as a random example, but it was part of this search https://github.com/search?q=repo%3Arucio%2Frucio%20sigterm&type=code where it seems like most daemons use SIGTERM. I agree that for the purpose of gracefully stopping manually we should use SIGINT, I just wanted to bring this up for reference

Comment on lines 74 to +77
try:
run(total_workers=args.total_workers, chunk_size=args.chunk_size, once=args.run_once,
sleep_time=args.sleep_time)
undertaker.run()
except KeyboardInterrupt:
stop()
undertaker.stop()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this try except is not needed

for _ in daemon():
pass

def run(self) -> None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be renamed to start

@rdimaio rdimaio force-pushed the patch-6478-daemon-refactor-undertaker branch from 056cb36 to 91599a5 Compare March 12, 2024 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants