-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
signal: manage shutdown requests with status codes #8659
base: master
Are you sure you want to change the base?
signal: manage shutdown requests with status codes #8659
Conversation
Important Auto Review SkippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
signal/signal.go
Outdated
// exitCode represents a channel for receiving exit codes. | ||
// It is used to communicate the exit status when the main | ||
// interrupt handler exits. | ||
exitCode chan int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why use a channel for this? channels for the rest make sense to me cause other sub-systems might wait on these but we would not wait on a specific exit code to be set.
I think this can just be a mutex protected int and probably need a bool to indicate that we have in fact exited... ie, GetExitCode can return an int along with a bool. If the bool is false, it means we have not exited yet.
@@ -630,7 +630,8 @@ var _ lnrpc.LightningServer = (*rpcServer)(nil) | |||
// dependencies are added, this will be an non-functioning RPC server only to | |||
// be used to register the LightningService with the gRPC server. | |||
func newRPCServer(cfg *Config, interceptorChain *rpcperms.InterceptorChain, | |||
implCfg *ImplementationCfg, interceptor signal.Interceptor) *rpcServer { | |||
implCfg *ImplementationCfg, | |||
interceptor *signal.Interceptor) *rpcServer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the pointer change just for cleanliness? If so, should be in a separate commit (before the main commit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it was for cleanliness purposes. However, since we are going to use a mutex-protected integer, it may no longer serve cleanliness purposes only. Passing a lock (i.e., sync. Mutex
) can lead to concurrency issues because the lock's state will be duplicated in the copy, and changes to one copy will not be reflected in the other. This can result in race conditions and other synchronization problems. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure but then let's do it in its own commit?
@@ -274,6 +274,10 @@ bitcoin peers' feefilter values into account](https://github.com/lightningnetwor | |||
maintain a healthy connection to the network by checking the number of | |||
outbound peers if they are below 6. | |||
|
|||
* [exitNoSuccessOnShutdown](https://github.com/lightningnetwork/lnd/pull/8659) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exitNoSuccessOnShutdown
is the branch name... doesnt mean much to the users reading the change log :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, when I updated the release notes I didn't have text in mind to fit in the 80 chars limit (since URL takes much) so I went with the branch name :D.
Gonna update this!
signal/signal.go
Outdated
// message from shutdownRequest Channel since it would | ||
// exit with an error status after graceful shutdown | ||
// when we close the quit channel. | ||
c.setExitCode(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm im not sure this is correct.
shutdownRequestChannel
could be closed if the StopDaemon
rpc method is called. If LND shuts down smoothly after that, then the exit code should be 0, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps instead RequestShutdown
should have an error code param... so that call-sights can determine what the error code should be.... but then also, even if a call-sight indicates a 0
error code, the shutdown process itself should be able to downgrade that to a 1 ofc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shutdownRequestChannel could be closed if the StopDaemon rpc method is called. If LND shuts down smoothly after that, then the exit code should be 0, right?
Ah - you can see this in action if you take a look at the failing itests. They all call StopDaemon
to shutdown the various nodes but now exit code of each process is 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, clicked wrong button when submitting review :)
07a3307
to
f35a0e7
Compare
f35a0e7
to
a01ad89
Compare
Thank you so much, @ellemouton, for your feedback! I've made the changes you suggested. Please let me know if there's anything else I need to adjust. Side-Updates: Updated the approach diagram and the commit message to handle if the user called the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Learned a lot while reviewing this code and it is very thoughtful of you to include a diagram in your PR description. Huge kudos and thanks. Left some comments and question.
@@ -41,4 +41,9 @@ func main() { | |||
_, _ = fmt.Fprintln(os.Stderr, err) | |||
os.Exit(1) | |||
} | |||
|
|||
exitCode, exited := shutdownInterceptor.GetExitCode() | |||
if exited { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to check if it has exited here? Would we ever encounter this code path when we are not exiting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right I don't think we don't need it since the status code is returned. I maybe misunderstood @ellemouton comment:
probably need a bool to indicate that we have in fact exited... ie, GetExitCode can return an int along with a bool. If the bool is false, it means we have not exited yet.
My view is that if we wait for a lock to be released and retrieve the status code (0/1), it implies the process has already exited so no need for additional flag. May also @ellemouton confirm if that's true?
|
||
// GetExitCode returns the exit code of the interceptor, which represents | ||
// the exit status when the main interrupt handler exits. It blocks until | ||
// the exit code is available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it truly block till the exit code is available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes since we are waiting for a lock to be available here c.exitCodeMutex.Lock()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, I think writing that it is safe for concurrent use would be clearer.
Also, I do not think there is need for a behaviour to block in this function as this function is called when all other subsystems have stopped so the exit code should be readily available then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you, but since we protect exitCode
with a mutex, it may be a good practice to wait for it. This is because we return the same exitCode
in GetExitCode
, although we're sure that the lock is not held ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking that maybe the lock might not be needed? #8659 (comment)
log.Infof("Received shutdown request.") | ||
|
||
// Set the interceptor exit code status based on the | ||
// provided error code i.e We set exit code 1 for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// provided error code i.e We set exit code 1 for | |
// provided error code i.e. we set exit code 1 for |
exitCode int | ||
|
||
// exitCodeMutex protects the exitCode. | ||
exitCodeMutex sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would exitCode be written to or read from multiple threads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what threads you are referring to here but multiple subsystems send a signal to interruptChannel
or shutdownRequestChannel
we want to make sure there is no racing on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on top of that referred to protecting the exitCode and not the channels which you are now referring to. Plus this mutex is not exported so other subsystems can't really call it. Additionally, I don't think it would result to a race condition if multiple systems send to a golang channel. What you might want to do is queue the signals for a more deterministic behaviour but that might be an overkill for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on top of that referred to protecting the exitCode and not the channels which you are now referring to.
What I meant is When the subsystems send signals to the channels (interruptChannel and shutdownRequestChannel) this is where the code for setting the exit status code exists so that's why we used mutex protected int to prevent racing on exit code.
I'm new to this part of codebase so I may be totally wrong my general understanding of this is that every subsystem sends a RequestShutdownRequest
to the shutdown channel:
Line 104 in fb632bb
interceptor.RequestShutdown() |
and also StopDaemon
RPC sends a RequestShutdownRequest
here:
Line 6570 in fb632bb
func (r *rpcServer) StopDaemon(_ context.Context, |
And the interruptChannel
intercepting signals from the OS where we handle exit code status 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is where the code for setting the exit status code exists so that's why we used mutex protected int to prevent racing on exit code.
Not an expert myself but from what you explained here and which I can see is what is happening in the code.
We do not have multiple subsystems setting the exit code. The setExitCode
function is only called in one place the mainInterruptHandler
goroutine. Once that is set, lnd and accompanying subsystems shutdown and so there would be no concurrency issue fetching that exitcode here:
Line 45 in f27a64e
exitCode, exited := shutdownInterceptor.GetExitCode() |
So we might not need the mutex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree! The reason that set the exit code once by checking if the interceptor is not already killed in case of RequestShutdown
:
Lines 100 to 102 in fb632bb
if !interceptor.Listening() { | |
return | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting for @ellemouton conf before proceeding with the change!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we take a step back, I think the main thing we need to happen here is we need lnd.Main
to actually return an error. Ie, it should not be the case that lnd.Main
returns a nil error but then we need to call GetErrorCode
just in case. Especially since we are already waiting on a signal
channel at the very end of the lnd.Main
function. So instead I think we should listen on that & depending on what it returns, we should actually return an error from lnd.Main
.
Here is a rough patch on top of your PR here showing an idea of how this can be simplified quite a bit:
signal.patch
My bad for leading you in the wrong direction on the last review! totally my fault - I should have thought it through a bit more.
Let me know what you think of the idea in the patch 馃檹
@@ -277,6 +277,9 @@ bitcoin peers' feefilter values into account](https://github.com/lightningnetwor | |||
maintain a healthy connection to the network by checking the number of | |||
outbound peers if they are below 6. | |||
|
|||
* [Shutdown Exit Handling](https://github.com/lightningnetwork/lnd/pull/8659) | |||
is added to manage shutdowns with status codes. Exits with code 1 for critical errors, and code 0 for normal shutdowns (e.g., from StopDaemon RPC call). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls wrap at 80 馃檹
@@ -630,7 +630,8 @@ var _ lnrpc.LightningServer = (*rpcServer)(nil) | |||
// dependencies are added, this will be an non-functioning RPC server only to | |||
// be used to register the LightningService with the gRPC server. | |||
func newRPCServer(cfg *Config, interceptorChain *rpcperms.InterceptorChain, | |||
implCfg *ImplementationCfg, interceptor signal.Interceptor) *rpcServer { | |||
implCfg *ImplementationCfg, | |||
interceptor *signal.Interceptor) *rpcServer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure but then let's do it in its own commit?
} | ||
|
||
channels := Interceptor{ | ||
interruptChannel: make(chan os.Signal, 1), | ||
shutdownChannel: make(chan struct{}), | ||
shutdownRequestChannel: make(chan struct{}), | ||
shutdownRequestChannel: make(chan int, 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the reason for buffering this now?
// exited indicates whether the application has exited. | ||
exited bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can actually just use the status of the quit channel to know if we have exited or not. See the func (c *Interceptor) Alive() bool
method
// the exit status when the main interrupt handler exits. It blocks until | ||
// the exit code is available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"It blocks until the exit code is available."
i dont think that's true
@mohamedawnallah, remember to re-request review from reviewers when ready |
Change Description
Replaces #5636
Closes #5625
Steps to Test
lnd.conf
or you could pass it throughlncli
.bitcoind
daemon making sure it is functional.lnd
making sure it is functional and synced with the chain backend.bitcoind
on purpose.lnd
shutting down gracefullyecho $?
the expected is1
in our case previously it was0
.Approach
Pull Request Checklist
Testing
Code Style and Documentation
[skip ci]
in the commit message for small changes.馃摑 Please see our Contribution Guidelines for further guidance.