Check node or cluster health in post-start script #87

Open
MatthiasWinzeler opened this issue Jun 13, 2018 · 5 comments
Comments

@MatthiasWinzeler

Hi all

While changing configuration of our rabbitmq deployment lately, we ran into the following issue:

  • The canary was updated. It came up (the process was running), but RabbitMQ could not start properly or crashed. This rendered the node useless; however, since the process was running, Monit reported 'running' and the deployment moved on.
  • The second node was updated as well, also coming up unhealthy, but again with a running process.
  • The third node had the same issue.

Result: The whole cluster was down. The post-deploy script then failed on the nodes, so we noticed something was wrong.

If we added a post-start script, this scenario could not happen.
Since post-start scripts run on each VM before BOSH marks the instance as healthy, individual node failures would be caught before the deployment moves on to the next instance.

UAA already does that in a similar fashion: https://github.com/cloudfoundry/uaa-release/blob/develop/jobs/uaa/templates/bin/post-start

What do you think? We're happy to provide the post-start script as a PR.
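To make the idea concrete, a minimal post-start sketch could look something like the following (the rabbitmqctl path and the retry/timeout values are illustrative assumptions; node_health_check is the same per-node check the existing post-deploy script runs):

#!/usr/bin/env bash
# Hypothetical post-start sketch: fail the instance if the local RabbitMQ node
# does not report healthy within a bounded number of attempts.
set -e

# Path to rabbitmqctl is an assumption; adjust to the job's actual package layout.
RABBITMQCTL=/var/vcap/packages/rabbitmq-server/bin/rabbitmqctl

for attempt in $(seq 1 30); do
  if "${RABBITMQCTL}" node_health_check > /dev/null 2>&1; then
    echo "RabbitMQ node is healthy"
    exit 0
  fi
  echo "Attempt ${attempt}: node not healthy yet, retrying in 10s..."
  sleep 10
done

echo "RabbitMQ node did not become healthy in time; failing post-start"
exit 1

A non-zero exit here would make BOSH mark the instance as failed before it moves on to the next one, which is exactly the behaviour we are after.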

@cf-gitbot
Member

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

@mkuratczyk
Contributor

Hi Matthias,

Sorry to hear you had cluster downtime. Your idea sounds like a great improvement indeed. Would you mind submitting the PR? We'll review it next week.

Thank you,

@MatthiasWinzeler
Author

@mkuratczyk thanks for getting back to me! I created the PR in #89

@gerhard
Contributor

gerhard commented Jul 11, 2018

The RabbitMQ app might not be able to start for genuine reasons, the most common one being unavailable peers needed to complete the Mnesia sync, which is a boot step prior to starting the rabbitmq app. In this common scenario, BOSH should continue with the next instance and not stop. A failing post-start script would prevent BOSH from continuing with the next instance in the group.

If there were a genuine issue causing the RabbitMQ app to crash and not start, this should bring the entire Erlang VM down, meaning a changing PID that Monit, and ultimately BOSH, would notice and correctly mark the canary node as failing. A crashing RabbitMQ app would not bring the Erlang VM down until v3.6.13; I'm wondering if you were experiencing this known and already fixed issue: rabbitmq/rabbitmq-common#237
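For context, Monit in a BOSH job typically only tracks the Erlang VM's PID via a pidfile, not the rabbit application running inside it, which is why a stuck but running VM still shows as 'running'. A rough, hypothetical monit definition (job name and paths are assumptions) looks like:

check process rabbitmq-server
  with pidfile /var/vcap/sys/run/rabbitmq-server/pid
  start program "/var/vcap/jobs/rabbitmq-server/bin/ctl start"
  stop program "/var/vcap/jobs/rabbitmq-server/bin/ctl stop"
  group vcap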

To guard against the RabbitMQ app crashing after the Mnesia sync completes, it would make sense to run a node & cluster check in a post-deploy script, which we already do. There's also rabbitmqctl await_cluster_formation 3 --timeout 120, available since RabbitMQ v3.7.6, which might be of interest here.
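As a rough sketch, such a post-deploy guard (using the command quoted above; the node count, timeout and rabbitmqctl path are assumptions, and the exact command name may vary by RabbitMQ version) could look like:

#!/usr/bin/env bash
# Hypothetical post-deploy sketch: wait for the expected cluster members,
# then verify the local node reports healthy.
set -e

# Path to rabbitmqctl is an assumption.
RABBITMQCTL=/var/vcap/packages/rabbitmq-server/bin/rabbitmqctl

# Wait up to 120s for all 3 expected nodes to be part of the cluster
# (command and arguments as quoted above; available from RabbitMQ v3.7.6).
"${RABBITMQCTL}" await_cluster_formation 3 --timeout 120

# Per-node health check, the same one the existing post-deploy runs.
"${RABBITMQCTL}" node_health_check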

cc @nodo @albertoleal @MarcialRosales @michaelklishin

@MatthiasWinzeler
Author

@nodo Regarding your idea in #89 to investigate the logs:

=ERROR REPORT==== 1-Jun-2018::22:33:33 ===
** Generic server <0.6673.4975> terminating
** Last message in was {'$gen_cast',terminate}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,
                         <0.16961.4977>,<0.16429.4977>,<0.16961.4977>,
                         <<"100.106.163.21:56114 -> 100.106.195.20:5672">>,
                         {lstate,<0.17527.4977>,false},
                         none,2,
                         {[{1,none,{<7325.9869.1>,1548}}],[]},
                         {user,<<"ueWpLOMKblniObwR">>,
                          [management,policymaker],
                          [{rabbit_auth_backend_internal,none}]},
                         <<"96621d9c-4ec4-4540-82c2-3ca8920b6409">>,<<>>,
                         {dict,1,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],
                            [[<7325.9869.1>|
                              {resource,
                               <<"96621d9c-4ec4-4540-82c2-3ca8920b6409">>,
                               queue,<<"<REDACTED-QUEUE-NAME>">>}]],
                            [],[],[],[],[],[],[],[],[],[],[],[]}}},
                         {state,
                          {dict,1,16,16,8,80,48,
                           {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                           {{[],[],[],
                             [[<7325.9869.1>|#Ref<0.1.3680763905.11413>]],
                             [],[],[],[],[],[],[],[],[],[],[],[]}}},
                          erlang},
                         {dict,0,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                         {dict,0,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                         {set,1,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],
                            [<7325.9869.1>],
                            [],[],[],[],[],[],[],[],[],[],[],[]}}},
                         <0.15941.4977>,
                         {state,fine,20000,undefined},
                         false,1,
                         {{0,nil},{0,nil}},
                         [],
                         {{0,nil},{0,nil}},
                         [{<<"consumer_cancel_notify">>,bool,true},
                          {<<"publisher_confirms">>,bool,true},
                          {<<"basic.nack">>,bool,true},
                          {<<"authentication_failure_close">>,bool,true},
                          {<<"connection.blocked">>,bool,true},
                          {<<"exchange_exchange_bindings">>,bool,true}],
                         none,0,none,flow,[]}
** Reason for termination ==
** {badarg,
       [{ets,insert,
            [channel_metrics,
             {<0.6673.4975>,
              [{pid,<0.6673.4975>},
               {transactional,false},
               {confirm,false},
               {consumer_count,0},
               {messages_unacknowledged,1},
               {messages_unconfirmed,0},
               {messages_uncommitted,0},
               {acks_uncommitted,0},
               {prefetch_count,0},
               {global_prefetch_count,0},
               {state,running},
               {garbage_collection,
                   [{max_heap_size,0},
                    {min_bin_vheap_size,46422},
                    {min_heap_size,233},
                    {fullsweep_after,65535},
                    {minor_gcs,10}]}]}],
            []},
        {rabbit_core_metrics,channel_stats,2,
            [{file,"src/rabbit_core_metrics.erl"},{line,144}]},
        {rabbit_channel,emit_stats,2,
            [{file,"src/rabbit_channel.erl"},{line,2023}]},
        {rabbit_event,if_enabled,3,[{file,"src/rabbit_event.erl"},{line,138}]},
        {rabbit_channel,terminate,2,
            [{file,"src/rabbit_channel.erl"},{line,634}]},
        {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1147}]},
        {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}
** In 'terminate' callback with reason ==
** normal

This occurs multiple times for different queues.

I'm afraid I'm not allowed to share the full logs (since they might contain customer information).
Is this enough to make a judgement, for example whether it looks like a known bug?
Otherwise, I'll try to see whether we can work with Pivotal support again (where we can safely pass on logs).

Post-deploy logs:

node1:

Running node checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@05938b8fd4a523bd529c136ca66c20c2
Error: {killed,{gen_server2,call,[<10723.16611.1>,{info,[pid]},infinity]}}
RabbitMQ application is not running

node2:

Running node checks at Fri Jun  1 22:39:02 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@f98e1c2c6e9f9a6f7617830ceb0b7e04
Error: operation node_health_check on node rabbit@f98e1c2c6e9f9a6f7617830ceb0b7e04 timed out
RabbitMQ application is not running

node3:

Running node checks at Fri Jun  1 22:39:02 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@07632a494f7a8a8fb6b7f6af08f72f2c
Error: operation node_health_check on node rabbit@07632a494f7a8a8fb6b7f6af08f72f2c timed out
RabbitMQ application is not running
