signal: manage shutdown requests with status codes #8659

Open

mohamedawnallah wants to merge 2 commits into master from exitNoSuccessOnShutdown
Conversation

mohamedawnallah
Contributor

@mohamedawnallah mohamedawnallah commented Apr 17, 2024

Change Description

Replaces #5636
Closes #5625

Steps to Test

  1. Configure chain backend health checks in lnd.conf, or pass them through lncli.
  2. Run the bitcoind daemon, making sure it is functional.
  3. Run lnd, making sure it is functional and synced with the chain backend.
  4. Fail bitcoind on purpose.
  5. Exhaust all attempts of the chain backend health check (the default is 3).
  6. After that, you will notice lnd shutting down gracefully.
  7. Check the exit status with echo $?; the expected value is 1, whereas previously it was 0.

Approach

Approach

Pull Request Checklist

Testing

  • Your PR passes all CI checks.
  • Tests covering the positive and negative (error paths) are included.
  • Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

📝 Please see our Contribution Guidelines for further guidance.

signal/signal.go Outdated
// exitCode represents a channel for receiving exit codes.
// It is used to communicate the exit status when the main
// interrupt handler exits.
exitCode chan int
Collaborator

Why use a channel for this? Channels for the rest make sense to me because other sub-systems might wait on those, but we would not wait on a specific exit code to be set.

I think this can just be a mutex-protected int, and we probably need a bool to indicate that we have in fact exited... i.e., GetExitCode can return an int along with a bool. If the bool is false, it means we have not exited yet.
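A minimal sketch of that suggestion (hypothetical names, not lnd's actual implementation):

    package signal

    import "sync"

    // Interceptor is pared down to the fields relevant to the sketch.
    type Interceptor struct {
        mu       sync.Mutex
        exitCode int
        exited   bool
    }

    // setExitCode records the exit status once the main interrupt
    // handler has finished.
    func (c *Interceptor) setExitCode(code int) {
        c.mu.Lock()
        defer c.mu.Unlock()

        c.exitCode = code
        c.exited = true
    }

    // GetExitCode returns the recorded exit code along with a bool
    // that is false if we have not exited yet.
    func (c *Interceptor) GetExitCode() (int, bool) {
        c.mu.Lock()
        defer c.mu.Unlock()

        return c.exitCode, c.exited
    }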

@@ -630,7 +630,8 @@ var _ lnrpc.LightningServer = (*rpcServer)(nil)
// dependencies are added, this will be an non-functioning RPC server only to
// be used to register the LightningService with the gRPC server.
func newRPCServer(cfg *Config, interceptorChain *rpcperms.InterceptorChain,
implCfg *ImplementationCfg, interceptor signal.Interceptor) *rpcServer {
implCfg *ImplementationCfg,
interceptor *signal.Interceptor) *rpcServer {
Collaborator

Is the pointer change just for cleanliness? If so, it should be in a separate commit (before the main commit).

Contributor Author

Yes, it was for cleanliness purposes. However, since we are going to use a mutex-protected integer, it may no longer serve cleanliness purposes only. Passing a lock (i.e., a sync.Mutex) by value means the lock's state is duplicated in the copy, and changes to one copy will not be reflected in the other. This can result in race conditions and other synchronization problems. What do you think?
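For illustration, a minimal example of the copy hazard described above (hypothetical names; go vet's copylocks check flags exactly this pattern):

    package main

    import (
        "fmt"
        "sync"
    )

    type counter struct {
        mu sync.Mutex
        n  int
    }

    // incr takes the receiver by value, so the sync.Mutex is copied
    // along with the struct: locking the copy does not protect the
    // original, and go vet reports "incr passes lock by value".
    func (c counter) incr() {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.n++ // Mutates the copy only.
    }

    func main() {
        var c counter
        c.incr()
        fmt.Println(c.n) // Prints 0, not 1: the increment was lost.
    }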

Collaborator

sure but then let's do it in its own commit?

@@ -274,6 +274,10 @@ bitcoin peers' feefilter values into account](https://github.com/lightningnetwor
maintain a healthy connection to the network by checking the number of
outbound peers if they are below 6.

* [exitNoSuccessOnShutdown](https://github.com/lightningnetwork/lnd/pull/8659)
Collaborator

exitNoSuccessOnShutdown is the branch name... it doesn't mean much to the users reading the changelog :)

Contributor Author

Yeah, when I updated the release notes I didn't have text in mind that would fit in the 80-character limit (since the URL takes up so much of it), so I went with the branch name :D.

Gonna update this!

signal/signal.go Outdated
// message from shutdownRequest Channel since it would
// exit with an error status after graceful shutdown
// when we close the quit channel.
c.setExitCode(1)
Collaborator

Hmmm, I'm not sure this is correct.

shutdownRequestChannel could be closed if the StopDaemon rpc method is called. If LND shuts down smoothly after that, then the exit code should be 0, right?

Collaborator

Perhaps instead RequestShutdown should have an error-code param... so that call sites can determine what the error code should be. But then also, even if a call site indicates a 0 error code, the shutdown process itself should of course still be able to downgrade that to a 1.
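A rough sketch of that idea (hypothetical signatures, not the final API):

    package signal

    // Interceptor is pared down for the sketch.
    type Interceptor struct {
        shutdownRequestChannel chan int
        quit                   chan struct{}
    }

    // RequestShutdown lets the call site propose an exit code: 0 for
    // a clean stop such as the StopDaemon RPC, 1 for failures. The
    // select on quit keeps callers from blocking once shutdown has
    // already begun.
    func (c *Interceptor) RequestShutdown(exitCode int) {
        select {
        case c.shutdownRequestChannel <- exitCode:
        case <-c.quit:
        }
    }

    // handleShutdownRequest shows how the shutdown process itself
    // could still turn a requested 0 into a 1 if cleanup fails.
    func (c *Interceptor) handleShutdownRequest(shutdown func() error) int {
        code := <-c.shutdownRequestChannel
        if err := shutdown(); err != nil {
            code = 1
        }
        return code
    }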

Collaborator

shutdownRequestChannel could be closed if the StopDaemon rpc method is called. If LND shuts down smoothly after that, then the exit code should be 0, right?

Ah - you can see this in action if you take a look at the failing itests. They all call StopDaemon to shut down the various nodes, but now the exit code of each process is 1.

Collaborator

@ellemouton ellemouton left a comment

oops, clicked wrong button when submitting review :)

@mohamedawnallah mohamedawnallah changed the title signal: exit without success code on shutdown request signal: manage shutdown requests with status codes Apr 19, 2024
@mohamedawnallah
Contributor Author

mohamedawnallah commented Apr 19, 2024

Thank you so much, @ellemouton, for your feedback! I've made the changes you suggested. Please let me know if there's anything else I need to adjust.

Side updates: I updated the approach diagram and the commit message to handle the case where the user calls the StopDaemon RPC.

Contributor

@Chinwendu20 Chinwendu20 left a comment

Learned a lot while reviewing this code, and it is very thoughtful of you to include a diagram in your PR description. Huge kudos and thanks. Left some comments and questions.

@@ -41,4 +41,9 @@ func main() {
_, _ = fmt.Fprintln(os.Stderr, err)
os.Exit(1)
}

exitCode, exited := shutdownInterceptor.GetExitCode()
if exited {
Contributor

Why do we need to check if it has exited here? Would we ever encounter this code path when we are not exiting?

Contributor Author

You're right, I don't think we need it since the status code is returned. I may have misunderstood @ellemouton's comment:

probably need a bool to indicate that we have in fact exited... ie, GetExitCode can return an int along with a bool. If the bool is false, it means we have not exited yet.

My view is that if we wait for a lock to be released and retrieve the status code (0/1), it implies the process has already exited, so there is no need for an additional flag. Could @ellemouton confirm whether that's true?


// GetExitCode returns the exit code of the interceptor, which represents
// the exit status when the main interrupt handler exits. It blocks until
// the exit code is available.
Contributor

Does it truly block till the exit code is available?

Contributor Author

Yes, since we are waiting for the lock to become available here: c.exitCodeMutex.Lock().

Contributor

IMO, I think writing that it is safe for concurrent use would be clearer.

Also, I do not think this function needs blocking behaviour, as it is called once all other subsystems have stopped, so the exit code should be readily available by then.

Contributor Author

I agree with you, but since we protect exitCode with a mutex, it may be good practice to wait for it, because we return the same exitCode in GetExitCode. Although, are we sure that the lock is not held?

Contributor

I am thinking that maybe the lock might not be needed? #8659 (comment)

log.Infof("Received shutdown request.")

// Set the interceptor exit code status based on the
// provided error code i.e We set exit code 1 for
Contributor

Suggested change
// provided error code i.e We set exit code 1 for
// provided error code i.e. we set exit code 1 for

exitCode int

// exitCodeMutex protects the exitCode.
exitCodeMutex sync.Mutex
Contributor

would exitCode be written to or read from multiple threads?

Contributor Author

Not sure what threads you are referring to here, but multiple subsystems send a signal to interruptChannel or shutdownRequestChannel, and we want to make sure there is no racing on this.

Contributor

The comment on top of that referred to protecting the exitCode, not the channels which you are now referring to. Plus, this mutex is not exported, so other subsystems can't really use it. Additionally, I don't think it would result in a race condition if multiple systems send to a Go channel. What you might want to do is queue the signals for more deterministic behaviour, but that might be overkill for this PR.

Contributor Author

The comment on top of that referred to protecting the exitCode and not the channels which you are now referring to.

What I meant is: when the subsystems send signals to the channels (interruptChannel and shutdownRequestChannel), that is where the code for setting the exit status code lives, so that's why we used a mutex-protected int to prevent racing on the exit code.

I'm new to this part of the codebase so I may be totally wrong, but my general understanding is that every subsystem calls RequestShutdown, which sends to the shutdown channel:

lnd/log.go

Line 104 in fb632bb

interceptor.RequestShutdown()

and the StopDaemon RPC also calls RequestShutdown here:

lnd/rpcserver.go

Line 6570 in fb632bb

func (r *rpcServer) StopDaemon(_ context.Context,

And the interruptChannel intercepts signals from the OS, which is where we handle exit status code 0.
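For reference, a condensed, illustrative version of the loop being described; the names mirror the discussion above, but this is not the exact code in signal/signal.go:

    package signal

    import (
        "log"
        "os"
    )

    // Interceptor is pared down to the channels mentioned above.
    type Interceptor struct {
        interruptChannel       chan os.Signal
        shutdownRequestChannel chan struct{}
        shutdownChannel        chan struct{}
        quit                   chan struct{}
    }

    func (c *Interceptor) mainInterruptHandler() {
        // Only this goroutine closes quit, so the guard below is safe.
        requestQuit := func() {
            select {
            case <-c.quit:
            default:
                close(c.quit)
            }
        }

        for {
            select {
            // OS signals (SIGINT, SIGTERM, ...) land here.
            case sig := <-c.interruptChannel:
                log.Printf("Received %v.", sig)
                requestQuit()

            // Subsystems and the StopDaemon RPC land here via
            // RequestShutdown.
            case <-c.shutdownRequestChannel:
                log.Printf("Received shutdown request.")
                requestQuit()

            // Once quit is closed, announce shutdown to any waiters
            // and exit the handler.
            case <-c.quit:
                close(c.shutdownChannel)
                return
            }
        }
    }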

Contributor

this is where the code for setting the exit status code exists so that's why we used mutex protected int to prevent racing on exit code.

Not an expert myself, but from what you explained here, and from what I can see happening in the code, we do not have multiple subsystems setting the exit code. The setExitCode function is only called in one place, the mainInterruptHandler goroutine. Once that is set, lnd and its accompanying subsystems shut down, so there would be no concurrency issue fetching that exit code here:

exitCode, exited := shutdownInterceptor.GetExitCode()

So we might not need the mutex?

Contributor Author

I agree! The reason the exit code is set only once is that RequestShutdown checks whether the interceptor is still listening before doing anything:

lnd/log.go

Lines 100 to 102 in fb632bb

if !interceptor.Listening() {
return
}

Contributor Author

Waiting for @ellemouton's confirmation before proceeding with the change!

Collaborator

@ellemouton ellemouton left a comment

If we take a step back, I think the main thing we need here is for lnd.Main to actually return an error. I.e., it should not be the case that lnd.Main returns a nil error and we then need to call GetErrorCode just in case. Especially since we are already waiting on a signal channel at the very end of the lnd.Main function, I think we should instead listen on that and, depending on what it returns, actually return an error from lnd.Main.

Here is a rough patch on top of your PR here showing an idea of how this can be simplified quite a bit:
signal.patch

My bad for leading you in the wrong direction on the last review! totally my fault - I should have thought it through a bit more.

Let me know what you think of the idea in the patch 🙏
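The patch itself isn't visible here, but a toy sketch of the general idea (all names hypothetical) might look like this:

    package main

    import (
        "errors"
        "fmt"
        "os"
    )

    // run stands in for lnd.Main: it blocks on the shutdown signal it
    // is already waiting on and turns a failure-driven shutdown into
    // a returned error, so no separate GetExitCode call is needed.
    func run(shutdownErr <-chan error) error {
        if err := <-shutdownErr; err != nil {
            return fmt.Errorf("lnd shut down due to: %w", err)
        }
        return nil
    }

    func main() {
        shutdownErr := make(chan error, 1)
        shutdownErr <- errors.New("chain backend health check failed")

        // A non-nil error from run is what drives the non-zero exit
        // status at the entry point.
        if err := run(shutdownErr); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }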

@@ -277,6 +277,9 @@ bitcoin peers' feefilter values into account](https://github.com/lightningnetwor
maintain a healthy connection to the network by checking the number of
outbound peers if they are below 6.

* [Shutdown Exit Handling](https://github.com/lightningnetwork/lnd/pull/8659)
is added to manage shutdowns with status codes. Exits with code 1 for critical errors, and code 0 for normal shutdowns (e.g., from StopDaemon RPC call).
Collaborator

pls wrap at 80 🙏


}

channels := Interceptor{
interruptChannel: make(chan os.Signal, 1),
shutdownChannel: make(chan struct{}),
shutdownRequestChannel: make(chan struct{}),
shutdownRequestChannel: make(chan int, 1),
Collaborator

what's the reason for buffering this now?
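For context, one plausible (unconfirmed) reading of the buffering, sketched in isolation:

    package main

    import "fmt"

    func main() {
        // With capacity 1, one shutdown request and its code can be
        // posted even when the interrupt handler is not yet receiving;
        // an unbuffered channel would block the sender until it is.
        requests := make(chan int, 1)
        requests <- 1 // Completes immediately thanks to the buffer.

        fmt.Println(<-requests) // Prints 1.
    }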

Comment on lines +121 to +122
// exited indicates whether the application has exited.
exited bool
Collaborator

I think we can actually just use the status of the quit channel to know if we have exited or not. See the func (c *Interceptor) Alive() bool method
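A sketch of that pattern, where the closed quit channel itself serves as the "we have exited" flag:

    package signal

    // Interceptor is pared down for the sketch.
    type Interceptor struct {
        quit chan struct{}
    }

    // Alive returns true until the quit channel has been closed, so
    // no separate exited bool is required.
    func (c *Interceptor) Alive() bool {
        select {
        case <-c.quit:
            return false
        default:
            return true
        }
    }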

Comment on lines +257 to +258
// the exit status when the main interrupt handler exits. It blocks until
// the exit code is available.
Collaborator

"It blocks until the exit code is available."

I don't think that's true.

@lightninglabs-deploy

@mohamedawnallah, remember to re-request review from reviewers when ready


Successfully merging this pull request may close these issues.

lnd should exit with no-success error code when automatically shut down due to bitcoind problems
4 participants