Graceful shutdown period #2105

Open
ewolak-sq opened this issue Aug 5, 2021 · 8 comments

Comments

@ewolak-sq (Collaborator)

When deploying a new version of a service behind a load balancer, it's useful to "gracefully" terminate the instances of the old version while directing all new traffic to the instances of the new version. In Kubernetes-land, this is usually done by signaling to the soon-to-be-terminated instances to stop responding successfully to health checks. When the load balancer notices they're failing health checks, it will stop sending new requests to them and direct all traffic to the new instances that are passing health checks. Then, after some reasonable grace period, such as 30 seconds, the old instances can shut down entirely.
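
In code terms, the pattern is roughly the following (a minimal Kotlin sketch; ShutdownAwareReadinessCheck, onSigterm, and the 30-second grace period are illustrative names and values, not Misk APIs):

    import java.util.concurrent.atomic.AtomicBoolean

    // Flipped as soon as the process receives SIGTERM.
    val shuttingDown = AtomicBoolean(false)

    // A readiness check that deliberately fails once shutdown has begun, so the
    // load balancer stops routing new requests to this instance.
    class ShutdownAwareReadinessCheck {
      fun isReady(): Boolean = !shuttingDown.get()
    }

    fun onSigterm() {
      shuttingDown.set(true)   // readiness starts failing immediately
      Thread.sleep(30_000)     // grace period: let the load balancer drain traffic
      // ...only now stop Jetty, connection pools, and the rest of the services.
    }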

From what I understand, Misk's current shutdown behavior is to just race to shut down (while preserving CoordinatedService ordering), with no provision for a grace period of 1) intentionally failing health checks, 2) refusing new requests, or 3) both, to give the upstream load balancer an opportunity to re-route traffic.

This manifests as spurious errors in services that depend on Misk services, because sometimes they make calls during the brief shutdown period and those calls fail in weird ways due to the rushed shutdown.

@ewolak-sq (Collaborator, Author)

@keeferrourke @swankjesse Do either of you have opinions on this?

@ewolak-sq (Collaborator, Author)

For Kubernetes + Istio specifically, we may not need to do this directly inside Misk, because Kubernetes immediately removes a terminating Pod from its Endpoints, and Istio will pick that up. For any external load balancer, though (such as an F5 or an AWS ALB), it would still be beneficial to signal shutdown gracefully to the LB before dying.

@r3mariano (Collaborator)

I was looking into this, and noticed that Misk already configures Jetty to do a graceful shutdown (SO). Also, because JettyService enters a terminating state during shutdown, the readiness check fails even while Jetty is still serving.
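
For reference, the standard Jetty 9 graceful-shutdown recipe is a StatisticsHandler wrapping the application handler plus a non-zero stop timeout. This is a standalone sketch of Jetty's mechanism, not Misk's actual wiring:

    import org.eclipse.jetty.server.Handler
    import org.eclipse.jetty.server.Server
    import org.eclipse.jetty.server.handler.StatisticsHandler

    fun enableGracefulShutdown(server: Server, appHandler: Handler) {
      // StatisticsHandler tracks in-flight requests so stop() can wait for them.
      val stats = StatisticsHandler()
      stats.handler = appHandler
      server.handler = stats
      // On stop(), Jetty closes its connectors, waits up to 30s for in-flight
      // requests to complete, then shuts the server down.
      server.stopTimeout = 30_000L
    }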

@jdm-square (Collaborator)

I just noticed a poor shutdown experience in merbro. It starts like this (each block lists messages newest-to-oldest because of LQ's ordering, sorry):

2021-10-21 17:38:05.666-0400 INFO merbro m.MiskApplication Thread-1 received a shutdown hook! performing an orderly shutdown  context=default dd.env=staging dd.service=merbro dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 region=us-west-2
2021-10-21 17:38:05.658-0400 WARN merbro [envoy] main 1 external/envoy/source/server/server.cc:702] caught ENVOY_SIGTERM  X-Speleo-Span-Id=- X-Speleo-Trace-Id=- envoy={} region=us-west-2
2021-10-21 17:38:05.658-0400 INFO merbro [envoy] main 1 external/envoy/source/server/server.cc:823] shutting down server instance  X-Speleo-Span-Id=- X-Speleo-Trace-Id=- envoy={} region=us-west-2
2021-10-21 17:38:05.658-0400 INFO merbro [envoy] main 1 external/envoy/source/server/server.cc:769] main dispatch loop exited  X-Speleo-Span-Id=- X-Speleo-Trace-Id=- envoy={} region=us-west-2

Then 20 milliseconds later we start shutting down the DB:

2021-10-21 17:38:05.688-0400 INFO merbro m.j.DataSourceService DataSourceService STOPPING Stopping @MerbroDb connection pool  context=default dd.env=staging dd.service=merbro dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 region=us-west-2
2021-10-21 17:38:05.688-0400 INFO merbro c.z.h.HikariDataSource DataSourceService STOPPING MerbroDb - Shutdown initiated...  context=default dd.env=staging dd.service=merbro dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 region=us-west-2
2021-10-21 17:38:05.687-0400 INFO merbro m.h.SessionFactoryService SessionFactoryService STOPPING Stopped @MerbroDb Hibernate in 1.982 ms  context=default dd.env=staging dd.service=merbro dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 region=us-west-2
2021-10-21 17:38:05.685-0400 INFO merbro c.s.c.m.t.SettlementJobEnqueuer SettlementJobEnqueuer STOPPING Stopping SettlementJobEnqueuer  context=default dd.env=staging dd.service=merbro dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 region=us-west-2
2021-10-21 17:38:05.685-0400 INFO merbro m.h.SessionFactoryService SessionFactoryService STOPPING Stopping @MerbroDb Hibernate  context=default dd.env=staging dd.service=merbro dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 region=us-west-2

Then 80ms later we work on shutting down the containers:

2021-10-21 17:38:05.764-0400 NORMAL merbro Stopping container merbro  kind=Pod kubernetes_verb=ADDED pid=1 region=us-west-2
2021-10-21 17:38:05.764-0400 NORMAL merbro Stopping container istio-proxy  kind=Pod kubernetes_verb=ADDED pid=1 region=us-west-2
2021-10-21 17:38:05.764-0400 NORMAL merbro Stopping container envoy  kind=Pod kubernetes_verb=ADDED pid=1 region=us-west-2

At this point we have our first error caused by work happening after the DB has been shut down. This is part of servicing an ongoing web action request (the underlying error is java.lang.IllegalStateException: EntityManagerFactory is closed):

2021-10-21 17:38:05.782-0400 ERRO merbro c.s.c.m.s.a.e.RealActionExecutor DefaultDispatcher-worker-5 Failed to publish events after executing action  action=Capture authorizationChannel=ONLINE calling_principal=liaison clientId=CASH_CHECKOUT context=default contextToken=MBC_dhwm33jmes3sy3re8ddr43z85 customerToken=CustomerToken(token=C_5cw14pyh0) dd.env=staging dd.service=merbro dd.span_id=2538684043169491495 dd.trace_id=1236162368246818012 dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 http_method=POST idempotenceToken=payment_3b0d0dc3-a4b3-48e2-9f04-bc217a400b81 merchantToken=M_e4o22vga paymentState=CAPTURED paymentToken=PWC_0avr93hy9jt5ekretnqfck8rd protocol=HTTP/2.0 region=us-west-2 remote_addr=:0 request_uri=/squareup.cash.merbro.api.MerbroService/CapturePayment span_id=2538684043169491495 trace_id=1236162368246818012

300ms later we report this error to the caller of the action:

2021-10-21 17:38:06.009-0400 ERRO merbro m.w.e.ExceptionHandlingInterceptor jetty-thread-529 unexpected error dispatching to CapturePaymentAction  action=CapturePaymentAction calling_principal=liaison context=default contextToken=MBC_f0z59pe62qstg1bnrjyt6rwda dd.env=staging dd.service=merbro dd.span_id=2538684043169491495 dd.trace_id=1236162368246818012 dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 http_method=POST protocol=HTTP/2.0 region=us-west-2 remote_addr=:0 request_uri=/squareup.cash.merbro.api.MerbroService/CapturePayment span_id=2538684043169491495 trace_id=1236162368246818012

We then begin to shut down Jetty:

2021-10-21 17:38:06.014-0400 INFO merbro o.e.j.s.AbstractConnector JettyShutdownThread Stopped https@790ec3b5{SSL, (ssl, alpn, h2, http/1.1)}{0.0.0.0:8443}  context=default dd.env=staging dd.service=merbro dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 region=us-west-2
2021-10-21 17:38:06.013-0400 INFO merbro o.e.j.s.AbstractConnector JettyShutdownThread Stopped http@37ed75c9{HTTP/1.1, (http/1.1, h2c)}{0.0.0.0:8080}  context=default dd.env=staging dd.service=merbro dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 region=us-west-2

20ms later we encounter another failure when processing an async SQS job that's caused by the DB having been shut down:

2021-10-21 17:38:06.036-0400 ERRO merbro m.j.s.SqsJobConsumer sqs-consumer-15 error handling job from merbro_publish_sync_entities  context=default dd.env=staging dd.service=merbro dd.span_id=6855044826902341604 dd.trace_id=1675206809507854705 dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 region=us-west-2 span_id=6855044826902341604 trace_id=1675206809507854705

Jetty is reported stopped 400ms after that:

2021-10-21 17:38:06.439-0400 INFO merbro m.w.j.JettyService JettyService STOPPING Stopping Jetty  context=default dd.env=staging dd.service=merbro dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 region=us-west-2

At this point we then shut down the SQS poller:

2021-10-21 17:38:06.441-0400 INFO merbro m.t.RepeatedTaskQueue sqs-consumer-poller the background thread for repeated task queue sqs-consumer-poller is stopping  context=default dd.env=staging dd.service=merbro dd.version=0eff61df60dc79b60a117454d8519fb8fc8dae70 region=us-west-2

And we finally trigger a graceful termination of istio:

2021-10-21 17:38:06.450-0400 INFO merbro istio-proxy Graceful termination period is 5s, starting...  envoy={} region=us-west-2

And a readiness probe begins to return 503:

2021-10-21 17:38:06.768-0400 WARN merbro Readiness probe failed: HTTP probe failed with statuscode: 503  kind=Pod kubernetes_verb=ADDED pid=1 region=us-west-2

The ordering here seems unfortunate. The DB gets shut down before other services that depend on it are shut down, and it's not clear whether the readiness probe actually starts failing before Jetty finishes shutting down.
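
If the service graph could express these relationships, the fix presumably looks like declaring that the consumers depend on the DB service, so the coordinator stops them first. A hedged sketch using Misk's ServiceModule DSL (the exact API, the JobConsumerService key, and the omitted DB qualifier are assumptions, not verified against merbro's modules):

    import misk.ServiceModule
    import misk.inject.KAbstractModule

    class JobsModule : KAbstractModule() {
      override fun configure() {
        // Declaring the dependency should make the coordinator start the data
        // source first and, crucially, stop it only after the consumer stops.
        install(
          ServiceModule<JobConsumerService>()   // hypothetical service key
            .dependsOn<DataSourceService>()
        )
      }
    }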

@jdm-square (Collaborator)

I caught another case where we see weird errors because graceful shutdowns do not appear to be taking place.

First merbro receives the signal to start shutting down:

2021-10-27 17:58:47.870-0400 WARN merbro [envoy] main 1 external/envoy/source/server/server.cc:702] caught ENVOY_SIGTERM  X-Speleo-Span-Id=- X-Speleo-Trace-Id=- envoy={} region=us-west-2
2021-10-27 17:58:47.870-0400 INFO merbro [envoy] main 1 external/envoy/source/server/server.cc:823] shutting down server instance  X-Speleo-Span-Id=- X-Speleo-Trace-Id=- envoy={} region=us-west-2
2021-10-27 17:58:47.870-0400 INFO merbro [envoy] main 1 external/envoy/source/server/server.cc:769] main dispatch loop exited  X-Speleo-Span-Id=- X-Speleo-Trace-Id=- envoy={} region=us-west-2

Then a web action that is servicing a request sees an HTTP stream reset error:

2021-10-27 17:58:47.874-0400 WARN merbro m.w.e.ExceptionHandlingInterceptor jetty-thread-497 exception dispatching to CreateMerchantAction  action=CreateMerchantAction calling_principal=liaison context=default contextToken=MBC_ebfx7fbp2v7e9y85kvvybbw1a dd.env=staging dd.service=merbro dd.span_id=3671042484694143931 dd.trace_id=5373424456427776392 dd.version=7272ebe33481c4437ba4760feaf64b743f549b82 http_method=POST protocol=HTTP/2.0 region=us-west-2 remote_addr=:0 request_uri=/squareup.cash.merbro.api.MerbroService/CreateMerchant span_id=3671042484694143931 trace_id=5373424456427776392
2021-10-27 17:58:47.874-0400 WARN merbro m.w.e.ExceptionHandlingInterceptor jetty-thread-497 exception dispatching to CreateMerchantAction  action=CreateMerchantAction calling_principal=liaison context=default contextToken=MBC_ebfx7fbp2v7e9y85kvvybbw1a dd.env=staging dd.service=merbro dd.span_id=3671042484694143931 dd.trace_id=5373424456427776392 dd.version=7272ebe33481c4437ba4760feaf64b743f549b82 http_method=POST protocol=HTTP/2.0 region=us-west-2 remote_addr=:0 request_uri=/squareup.cash.merbro.api.MerbroService/CreateMerchant span_id=3671042484694143931 trace_id=5373424456427776392
2021-10-27 17:58:47.873-0400 INFO merbro m.w.i.RequestLoggingInterceptor jetty-thread-497 CreateMerchantAction principal=liaison time=1.680 s failed request=[MerchantRequest{client_id=CASH_CHECKOUT, merchant=Merchant{business_name=██, country_code=US, currency=USD, category=5611, operation_type=PHYSICAL_RETAIL, location=GlobalAddress{postal_code=██}, avatar_url=https://cashstaging.app/qr/f/OAUTH%3D12345, store_url=cashstaging.app, reference_id=load_tester_639c46af-44b6-4bf7-86e5-021980b51202}}]  action=CreateMerchantAction calling_principal=liaison context=default contextToken=MBC_ebfx7fbp2v7e9y85kvvybbw1a dd.env=staging dd.service=merbro dd.span_id=3671042484694143931 dd.trace_id=5373424456427776392 dd.version=7272ebe33481c4437ba4760feaf64b743f549b82 http_method=POST protocol=HTTP/2.0 region=us-west-2 remote_addr=:0 request_uri=/squareup.cash.merbro.api.MerbroService/CreateMerchant response_code=200 response_time_millis=1679 span_id=3671042484694143931 trace_id=5373424456427776392
2021-10-27 17:58:47.872-0400 INFO merbro m.w.i.RequestBodyLoggingInterceptor jetty-thread-497 CreateMerchantAction principal=liaison failed  action=CreateMerchantAction calling_principal=liaison context=default contextToken=MBC_ebfx7fbp2v7e9y85kvvybbw1a dd.env=staging dd.service=merbro dd.span_id=3671042484694143931 dd.trace_id=5373424456427776392 dd.version=7272ebe33481c4437ba4760feaf64b743f549b82 http_method=POST protocol=HTTP/2.0 region=us-west-2 remote_addr=:0 request_uri=/squareup.cash.merbro.api.MerbroService/CreateMerchant span_id=3671042484694143931 trace_id=5373424456427776392

Then Jetty reports that it is starting to shut down:

2021-10-27 17:58:47.911-0400 INFO merbro o.e.j.s.AbstractConnector JettyShutdownThread Stopped http@5c2bdaed{HTTP/1.1, (http/1.1, h2c)}{0.0.0.0:8080}  context=default dd.env=staging dd.service=merbro dd.version=7272ebe33481c4437ba4760feaf64b743f549b82 region=us-west-2
2021-10-27 17:58:47.909-0400 INFO merbro m.t.RepeatedTaskQueue Thread-1 Triggered shutdown, stopping repeated task queue settlements-task-queue  context=default dd.env=staging dd.service=merbro dd.version=7272ebe33481c4437ba4760feaf64b743f549b82 region=us-west-2
2021-10-27 17:58:47.895-0400 INFO merbro m.MiskApplication Thread-1 received a shutdown hook! performing an orderly shutdown  context=default dd.env=staging dd.service=merbro dd.version=7272ebe33481c4437ba4760feaf64b743f549b82 region=us-west-2

Why are we shutting down the server before we've finished servicing requests? Why can't we make the server stop accepting new requests, then have a reasonable timeout period before shutting down the server and other services like the DB?

@shellderp (Collaborator) commented Nov 4, 2021

This is a good idea; we do it on our internal service container. The best process is roughly:

  1. stop taking new requests (fail /_readiness, but keep Jetty alive so /_liveness can still be hit)
  2. tell every service to shut down (e.g. stop taking new SQS jobs, but finish processing already-started jobs)
  3. sleep for a configurable time (perhaps via a new service that depends on JettyService and whose shutdown method is sleep(configured-time); see the sketch after this list)
  4. exit
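
Step 3 could be as small as a Guava service whose shutDown() just sleeps. A rough Kotlin sketch (the name, the 15-second default, and the wiring that makes it depend on JettyService are all assumed, not existing Misk code):

    import com.google.common.util.concurrent.AbstractIdleService
    import java.time.Duration

    // Because this service would depend on JettyService, the coordinator stops it
    // before Jetty, so the sleep elapses while Jetty is still serving traffic.
    class ShutdownDelayService(
      private val delay: Duration = Duration.ofSeconds(15)  // assumed to come from config
    ) : AbstractIdleService() {
      override fun startUp() {
        // Nothing to do at startup.
      }

      override fun shutDown() {
        Thread.sleep(delay.toMillis())
      }
    }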

@jdm-square (Collaborator)

I found something interesting when poking at this:

    /**
     * If >= 0, use a dedicated jetty thread pool for health checking.
     *
     * A dedicated thread pool ensures that health checks are not queued or rejected when the service
     * is saturated and queueing requests. If health checks are rejected and/or queued, the health
     * checks may fail and k8s will kill the container, even though it might be perfectly healthy. This
     * can cause cascading failures by sending more requests to other containers, resulting in longer
     * queues and more health checks failures.
     *
     * TODO(rhall): make this required
     */
    val health_port: Int = -1,

    if (webConfig.health_port >= 0) {
      healthExecutor = ThreadPoolExecutor(
        // 2 threads for jetty acceptor and selector. 2 threads for k8s liveness/readiness.
        4,
        // Jetty can be flaky about rejecting near full capacity, so allow some growth.
        8,
        60L, TimeUnit.SECONDS,
        SynchronousQueue(),
        ThreadFactoryBuilder()
          .setNameFormat("jetty-health-%d")
          .build()
      )

    if (healthExecutor != null) {
      healthExecutor!!.shutdown()
      healthExecutor!!.awaitTermination(10, TimeUnit.SECONDS)
Would setting a non-default health_port in our service config allow us to experiment with whether a shutdown sleep would address this issue? I'm having trouble figuring out the implications of the 10s timeout during the JettyService shutdown with the health executor enabled.
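
If I'm reading that right, turning it on is just a matter of giving health_port a non-negative value. A hedged sketch of a WebConfig for local experimentation (the port numbers are arbitrary, and I'm assuming the other constructor parameters can be left at their defaults):

    import misk.web.WebConfig

    // Serve liveness/readiness on a separate port, backed by the dedicated
    // jetty-health thread pool, so health checks aren't queued behind app traffic.
    val webConfig = WebConfig(
      port = 8080,
      health_port = 8090,  // any value >= 0 enables the dedicated health executor
    )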

@jdm-square (Collaborator)

Update: #2238 merged, and is a stopgap solution. Affected services (and there are a lot of them) can set the shutdown_sleep_ms web config property to a non-zero value to give themselves a chance to finish processing requests before other services begin shutting down (a config sketch follows the list below). I still intend to implement the following as well:

  • supporting Misk apps in declaring dependencies between the Jetty service and other services
  • modifying the liveness and readiness checks to behave differently once the app begins shutting down
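
For anyone opting into the stopgap, the change amounts to setting the new property in the web config. A hedged example (20 seconds is an arbitrary choice, and as above I'm assuming the remaining WebConfig parameters default sensibly):

    import misk.web.WebConfig

    val webConfig = WebConfig(
      port = 8080,
      // Keep serving for 20s after the shutdown hook fires, before other
      // services (DB pools, job consumers, etc.) begin stopping.
      shutdown_sleep_ms = 20_000,
    )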
