signal: manage shutdown requests with status codes #8659

Open

mohamedawnallah wants to merge 2 commits into master from exitNoSuccessOnShutdown
Conversation

mohamedawnallah
Contributor

@mohamedawnallah mohamedawnallah commented Apr 17, 2024

Change Description

Replaces #5636
Closes #5625

Steps to Test

  1. Configure chain backend health checks in lnd.conf, or pass them through lncli.
  2. Run the bitcoind daemon, making sure it is functional.
  3. Run lnd, making sure it is functional and synced with the chain backend.
  4. Fail bitcoind on purpose.
  5. Exhaust all attempts of the chain backend health check (the default is 3).
  6. After that, you will notice lnd shutting down gracefully.
  7. Check the exit status with echo $?; the expected value is 1, whereas previously it was 0.

Approach

Approach

Pull Request Checklist

Testing

  • Your PR passes all CI checks.
  • Tests covering the positive and negative (error paths) are included.
  • Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

📝 Please see our Contribution Guidelines for further guidance.

signal/signal.go Outdated
// exitCode represents a channel for receiving exit codes.
// It is used to communicate the exit status when the main
// interrupt handler exits.
exitCode chan int
Collaborator

Why use a channel for this? Channels for the rest make sense to me because other sub-systems might wait on those, but we would not wait on a specific exit code to be set.

I think this can just be a mutex-protected int, and we probably need a bool to indicate that we have in fact exited... i.e., GetExitCode can return an int along with a bool. If the bool is false, it means we have not exited yet.
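A minimal sketch of that suggestion (hypothetical names, not lnd's actual implementation):

    package signal

    import "sync"

    // Interceptor is pared down to the fields relevant to the sketch.
    type Interceptor struct {
        mu       sync.Mutex
        exitCode int
        exited   bool
    }

    // setExitCode records the exit status once the main interrupt
    // handler has finished.
    func (c *Interceptor) setExitCode(code int) {
        c.mu.Lock()
        defer c.mu.Unlock()

        c.exitCode = code
        c.exited = true
    }

    // GetExitCode returns the recorded exit code along with a bool
    // that is false if we have not exited yet.
    func (c *Interceptor) GetExitCode() (int, bool) {
        c.mu.Lock()
        defer c.mu.Unlock()

        return c.exitCode, c.exited
    }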

@@ -630,7 +630,8 @@ var _ lnrpc.LightningServer = (*rpcServer)(nil)
// dependencies are added, this will be an non-functioning RPC server only to
// be used to register the LightningService with the gRPC server.
func newRPCServer(cfg *Config, interceptorChain *rpcperms.InterceptorChain,
implCfg *ImplementationCfg, interceptor signal.Interceptor) *rpcServer {
implCfg *ImplementationCfg,
interceptor *signal.Interceptor) *rpcServer {
Collaborator

Is the pointer change just for cleanliness? If so, it should be in a separate commit (before the main commit).

Contributor Author

Yes, it was for cleanliness purposes. However, since we are going to use a mutex-protected integer, it may no longer serve cleanliness purposes only. Passing a lock (i.e., a sync.Mutex) by value means the lock's state is duplicated in the copy, and changes to one copy will not be reflected in the other. This can result in race conditions and other synchronization problems. What do you think?
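For illustration, a minimal example of the copy hazard described above (hypothetical names; go vet's copylocks check flags exactly this pattern):

    package main

    import (
        "fmt"
        "sync"
    )

    type counter struct {
        mu sync.Mutex
        n  int
    }

    // incr takes the receiver by value, so the sync.Mutex is copied
    // along with the struct: locking the copy does not protect the
    // original, and go vet reports "incr passes lock by value".
    func (c counter) incr() {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.n++ // Mutates the copy only.
    }

    func main() {
        var c counter
        c.incr()
        fmt.Println(c.n) // Prints 0, not 1: the increment was lost.
    }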

Collaborator

sure but then let's do it in its own commit?

@@ -274,6 +274,10 @@ bitcoin peers' feefilter values into account](https://github.com/lightningnetwor
maintain a healthy connection to the network by checking the number of
outbound peers if they are below 6.

* [exitNoSuccessOnShutdown](https://github.com/lightningnetwork/lnd/pull/8659)
Collaborator

exitNoSuccessOnShutdown is the branch name... it doesn't mean much to the users reading the changelog :)

Contributor Author

Yeah, when I updated the release notes I didn't have text in mind that would fit in the 80-character limit (since the URL takes up so much of it), so I went with the branch name :D.

Gonna update this!

signal/signal.go Outdated
// message from shutdownRequest Channel since it would
// exit with an error status after graceful shutdown
// when we close the quit channel.
c.setExitCode(1)
Collaborator

Hmmm, I'm not sure this is correct.

shutdownRequestChannel could be closed if the StopDaemon rpc method is called. If LND shuts down smoothly after that, then the exit code should be 0, right?

Collaborator

Perhaps instead RequestShutdown should have an error-code param... so that call sites can determine what the error code should be. But then also, even if a call site indicates a 0 error code, the shutdown process itself should of course still be able to downgrade that to a 1.
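A rough sketch of that idea (hypothetical signatures, not the final API):

    package signal

    // Interceptor is pared down for the sketch.
    type Interceptor struct {
        shutdownRequestChannel chan int
        quit                   chan struct{}
    }

    // RequestShutdown lets the call site propose an exit code: 0 for
    // a clean stop such as the StopDaemon RPC, 1 for failures. The
    // select on quit keeps callers from blocking once shutdown has
    // already begun.
    func (c *Interceptor) RequestShutdown(exitCode int) {
        select {
        case c.shutdownRequestChannel <- exitCode:
        case <-c.quit:
        }
    }

    // handleShutdownRequest shows how the shutdown process itself
    // could still turn a requested 0 into a 1 if cleanup fails.
    func (c *Interceptor) handleShutdownRequest(shutdown func() error) int {
        code := <-c.shutdownRequestChannel
        if err := shutdown(); err != nil {
            code = 1
        }
        return code
    }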

Collaborator

shutdownRequestChannel could be closed if the StopDaemon rpc method is called. If LND shuts down smoothly after that, then the exit code should be 0, right?

Ah - you can see this in action if you take a look at the failing itests. They all call StopDaemon to shut down the various nodes, but now the exit code of each process is 1.

Collaborator

@ellemouton ellemouton left a comment

oops, clicked wrong button when submitting review :)

@mohamedawnallah mohamedawnallah changed the title signal: exit without success code on shutdown request signal: manage shutdown requests with status codes Apr 19, 2024
@mohamedawnallah
Contributor Author

mohamedawnallah commented Apr 19, 2024

Thank you so much, @ellemouton, for your feedback! I've made the changes you suggested. Please let me know if there's anything else I need to adjust.

Side updates: I updated the approach diagram and the commit message to handle the case where the user calls the StopDaemon RPC.

Contributor

@Chinwendu20 Chinwendu20 left a comment

Learned a lot while reviewing this code, and it is very thoughtful of you to include a diagram in your PR description. Huge kudos and thanks. Left some comments and questions.

@@ -41,4 +41,9 @@ func main() {
_, _ = fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}

exitCode, exited := shutdownInterceptor.GetExitCode()
if exited {
Contributor

Why do we need to check if it has exited here? Would we ever encounter this code path when we are not exiting?

Contributor Author

You're right, I don't think we need it since the status code is returned. I may have misunderstood @ellemouton's comment:

probably need a bool to indicate that we have in fact exited... ie, GetExitCode can return an int along with a bool. If the bool is false, it means we have not exited yet.

My view is that if we wait for a lock to be released and retrieve the status code (0/1), it implies the process has already exited, so there is no need for an additional flag. Could @ellemouton confirm whether that's true?


// GetExitCode returns the exit code of the interceptor, which represents
// the exit status when the main interrupt handler exits. It blocks until
// the exit code is available.
Contributor

Does it truly block till the exit code is available?

Contributor Author

Yes, since we are waiting for the lock to become available here: c.exitCodeMutex.Lock().

Contributor

IMO, I think writing that it is safe for concurrent use would be clearer.

Also, I do not think this function needs blocking behaviour, as it is called once all other subsystems have stopped, so the exit code should be readily available by then.

Contributor Author

I agree with you, but since we protect exitCode with a mutex, it may be good practice to wait for it, because we return the same exitCode in GetExitCode. Although, are we sure that the lock is not held?

Contributor

I am thinking that maybe the lock might not be needed? #8659 (comment)

log.Infof("Received shutdown request.")

// Set the interceptor exit code status based on the
// provided error code i.e We set exit code 1 for
Contributor

Suggested change
// provided error code i.e We set exit code 1 for
// provided error code i.e. we set exit code 1 for

exitCode int

// exitCodeMutex protects the exitCode.
exitCodeMutex sync.Mutex
Contributor

would exitCode be written to or read from multiple threads?

Contributor Author

Not sure what threads you are referring to here, but multiple subsystems send a signal to interruptChannel or shutdownRequestChannel, and we want to make sure there is no racing on this.

Contributor

The comment on top of that referred to protecting the exitCode, not the channels which you are now referring to. Plus, this mutex is not exported, so other subsystems can't really use it. Additionally, I don't think it would result in a race condition if multiple systems send to a Go channel. What you might want to do is queue the signals for more deterministic behaviour, but that might be overkill for this PR.

Contributor Author

The comment on top of that referred to protecting the exitCode and not the channels which you are now referring to.

What I meant is: when the subsystems send signals to the channels (interruptChannel and shutdownRequestChannel), that is where the code for setting the exit status code lives, so that's why we used a mutex-protected int to prevent racing on the exit code.

I'm new to this part of the codebase so I may be totally wrong, but my general understanding is that every subsystem calls RequestShutdown, which sends to the shutdown channel:

lnd/log.go

Line 104 in fb632bb

interceptor.RequestShutdown()

and the StopDaemon RPC also calls RequestShutdown here:

lnd/rpcserver.go

Line 6570 in fb632bb

func (r *rpcServer) StopDaemon(_ context.Context,

And the interruptChannel intercepts signals from the OS, which is where we handle exit status code 0.
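For reference, a condensed, illustrative version of the loop being described; the names mirror the discussion above, but this is not the exact code in signal/signal.go:

    package signal

    import (
        "log"
        "os"
    )

    // Interceptor is pared down to the channels mentioned above.
    type Interceptor struct {
        interruptChannel       chan os.Signal
        shutdownRequestChannel chan struct{}
        shutdownChannel        chan struct{}
        quit                   chan struct{}
    }

    func (c *Interceptor) mainInterruptHandler() {
        // Only this goroutine closes quit, so the guard below is safe.
        requestQuit := func() {
            select {
            case <-c.quit:
            default:
                close(c.quit)
            }
        }

        for {
            select {
            // OS signals (SIGINT, SIGTERM, ...) land here.
            case sig := <-c.interruptChannel:
                log.Printf("Received %v.", sig)
                requestQuit()

            // Subsystems and the StopDaemon RPC land here via
            // RequestShutdown.
            case <-c.shutdownRequestChannel:
                log.Printf("Received shutdown request.")
                requestQuit()

            // Once quit is closed, announce shutdown to any waiters
            // and exit the handler.
            case <-c.quit:
                close(c.shutdownChannel)
                return
            }
        }
    }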

Contributor

this is where the code for setting the exit status code exists so that's why we used mutex protected int to prevent racing on exit code.

Not an expert myself, but from what you explained here, and from what I can see happening in the code, we do not have multiple subsystems setting the exit code. The setExitCode function is only called in one place, the mainInterruptHandler goroutine. Once that is set, lnd and its accompanying subsystems shut down, so there would be no concurrency issue fetching that exit code here:

exitCode, exited := shutdownInterceptor.GetExitCode()

So we might not need the mutex?

Contributor Author

I agree! The reason the exit code is set only once is that RequestShutdown checks whether the interceptor is still listening before doing anything:

lnd/log.go

Lines 100 to 102 in fb632bb

if !interceptor.Listening() {
return
}

Contributor Author

Waiting for @ellemouton's confirmation before proceeding with the change!

Collaborator

@ellemouton ellemouton left a comment

If we take a step back, I think the main thing we need here is for lnd.Main to actually return an error. I.e., it should not be the case that lnd.Main returns a nil error and we then need to call GetErrorCode just in case. Especially since we are already waiting on a signal channel at the very end of the lnd.Main function, I think we should instead listen on that and, depending on what it returns, actually return an error from lnd.Main.

Here is a rough patch on top of your PR here showing an idea of how this can be simplified quite a bit:
signal.patch

My bad for leading you in the wrong direction on the last review! totally my fault - I should have thought it through a bit more.

Let me know what you think of the idea in the patch 🙏
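The patch itself isn't visible here, but a toy sketch of the general idea (all names hypothetical) might look like this:

    package main

    import (
        "errors"
        "fmt"
        "os"
    )

    // run stands in for lnd.Main: it blocks on the shutdown signal it
    // is already waiting on and turns a failure-driven shutdown into
    // a returned error, so no separate GetExitCode call is needed.
    func run(shutdownErr <-chan error) error {
        if err := <-shutdownErr; err != nil {
            return fmt.Errorf("lnd shut down due to: %w", err)
        }
        return nil
    }

    func main() {
        shutdownErr := make(chan error, 1)
        shutdownErr <- errors.New("chain backend health check failed")

        // A non-nil error from run is what drives the non-zero exit
        // status at the entry point.
        if err := run(shutdownErr); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }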

@@ -277,6 +277,9 @@ bitcoin peers' feefilter values into account](https://github.com/lightningnetwor
maintain a healthy connection to the network by checking the number of
outbound peers if they are below 6.

* [Shutdown Exit Handling](https://github.com/lightningnetwork/lnd/pull/8659)
is added to manage shutdowns with status codes. Exits with code 1 for critical errors, and code 0 for normal shutdowns (e.g., from StopDaemon RPC call).
Collaborator

pls wrap at 80 🙏


}

channels := Interceptor{
interruptChannel: make(chan os.Signal, 1),
shutdownChannel: make(chan struct{}),
shutdownRequestChannel: make(chan struct{}),
shutdownRequestChannel: make(chan int, 1),
Collaborator

what's the reason for buffering this now?
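For context, one plausible (unconfirmed) reading of the buffering, sketched in isolation:

    package main

    import "fmt"

    func main() {
        // With capacity 1, one shutdown request and its code can be
        // posted even when the interrupt handler is not yet receiving;
        // an unbuffered channel would block the sender until it is.
        requests := make(chan int, 1)
        requests <- 1 // Completes immediately thanks to the buffer.

        fmt.Println(<-requests) // Prints 1.
    }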

Comment on lines +121 to +122
// exited indicates whether the application has exited.
exited bool
Collaborator

I think we can actually just use the status of the quit channel to know if we have exited or not. See the func (c *Interceptor) Alive() bool method
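A sketch of that pattern, where the closed quit channel itself serves as the "we have exited" flag:

    package signal

    // Interceptor is pared down for the sketch.
    type Interceptor struct {
        quit chan struct{}
    }

    // Alive returns true until the quit channel has been closed, so
    // no separate exited bool is required.
    func (c *Interceptor) Alive() bool {
        select {
        case <-c.quit:
            return false
        default:
            return true
        }
    }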

Comment on lines +257 to +258
// the exit status when the main interrupt handler exits. It blocks until
// the exit code is available.
Collaborator

"It blocks until the exit code is available."

I don't think that's true.

@lightninglabs-deploy

@mohamedawnallah, remember to re-request review from reviewers when ready


Successfully merging this pull request may close these issues.

lnd should exit with no-success error code when automatically shut down due to bitcoind problems
4 participants