Check node or cluster health in post-start script #87

Open
MatthiasWinzeler opened this issue Jun 13, 2018 · 5 comments
Comments

@MatthiasWinzeler

Hi all

While changing configuration of our rabbitmq deployment lately, we ran into the following issue:

  • The canary was updated. It came up (the process was running), but RabbitMQ could not start properly or crashed. This rendered the node useless; however, since the process was running, Monit reported 'running' and the deployment moved on.
  • The second node was updated as well, also coming up unhealthy, but again with a running process.
  • The third node had the same issue.

Result: The whole cluster was down. The post-deploy script then failed on the nodes, so we noticed something was wrong.

If we added a post-start script, this scenario could not happen.
Since post-start scripts run on each VM before BOSH marks the instance as healthy, individual node failures would be caught before the deployment moves on to the next instance.

UAA already does that in a similar fashion: https://github.com/cloudfoundry/uaa-release/blob/develop/jobs/uaa/templates/bin/post-start

What do you think? We're happy to provide the post-start script as a PR.
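To make the idea concrete, a minimal post-start sketch could look something like the following (the rabbitmqctl path and the retry/timeout values are illustrative assumptions; node_health_check is the same per-node check the existing post-deploy script runs):

#!/usr/bin/env bash
# Hypothetical post-start sketch: fail the instance if the local RabbitMQ node
# does not report healthy within a bounded number of attempts.
set -e

# Path to rabbitmqctl is an assumption; adjust to the job's actual package layout.
RABBITMQCTL=/var/vcap/packages/rabbitmq-server/bin/rabbitmqctl

for attempt in $(seq 1 30); do
  if "${RABBITMQCTL}" node_health_check > /dev/null 2>&1; then
    echo "RabbitMQ node is healthy"
    exit 0
  fi
  echo "Attempt ${attempt}: node not healthy yet, retrying in 10s..."
  sleep 10
done

echo "RabbitMQ node did not become healthy in time; failing post-start"
exit 1

A non-zero exit here would make BOSH mark the instance as failed before it moves on to the next one, which is exactly the behaviour we are after.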

@cf-gitbot
Member

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

@mkuratczyk
Contributor

Hi Matthias,

Sorry to hear you had cluster downtime. Your idea sounds like a great improvement indeed. Would you mind submitting the PR? We'll review it next week.

Thank you,

@MatthiasWinzeler
Author

@mkuratczyk thanks for getting back to me! I created the PR in #89

@gerhard
Contributor

gerhard commented Jul 11, 2018

The RabbitMQ app might not be able to start for genuine reasons, the most common one being unavailable peers needed to complete the Mnesia sync, which is a boot step prior to starting the rabbitmq app. In this common scenario, BOSH should continue with the next instance and not stop. A failing post-start script would prevent BOSH from continuing with the next instance in the group.

If there were a genuine issue causing the RabbitMQ app to crash and not start, this should bring the entire Erlang VM down, meaning a changing PID that Monit, and ultimately BOSH, would notice and correctly mark the canary node as failing. A crashing RabbitMQ app would not bring the Erlang VM down until v3.6.13; I'm wondering if you were experiencing this known and already fixed issue: rabbitmq/rabbitmq-common#237
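For context, Monit in a BOSH job typically only tracks the Erlang VM's PID via a pidfile, not the rabbit application running inside it, which is why a stuck but running VM still shows as 'running'. A rough, hypothetical monit definition (job name and paths are assumptions) looks like:

check process rabbitmq-server
  with pidfile /var/vcap/sys/run/rabbitmq-server/pid
  start program "/var/vcap/jobs/rabbitmq-server/bin/ctl start"
  stop program "/var/vcap/jobs/rabbitmq-server/bin/ctl stop"
  group vcap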

To guard against the RabbitMQ app crashing after the Mnesia sync completes, it would make sense to run a node & cluster check in a post-deploy script, which we already do. There's also rabbitmqctl await_cluster_formation 3 --timeout 120, available since RabbitMQ v3.7.6, which might be of interest here.
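As a rough sketch, such a post-deploy guard (using the command quoted above; the node count, timeout and rabbitmqctl path are assumptions, and the exact command name may vary by RabbitMQ version) could look like:

#!/usr/bin/env bash
# Hypothetical post-deploy sketch: wait for the expected cluster members,
# then verify the local node reports healthy.
set -e

# Path to rabbitmqctl is an assumption.
RABBITMQCTL=/var/vcap/packages/rabbitmq-server/bin/rabbitmqctl

# Wait up to 120s for all 3 expected nodes to be part of the cluster
# (command and arguments as quoted above; available from RabbitMQ v3.7.6).
"${RABBITMQCTL}" await_cluster_formation 3 --timeout 120

# Per-node health check, the same one the existing post-deploy runs.
"${RABBITMQCTL}" node_health_check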

cc @nodo @albertoleal @MarcialRosales @michaelklishin

@MatthiasWinzeler
Author

@nodo Regarding your idea in #89 to investigate the logs:

=ERROR REPORT==== 1-Jun-2018::22:33:33 ===
** Generic server <0.6673.4975> terminating
** Last message in was {'$gen_cast',terminate}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,
                         <0.16961.4977>,<0.16429.4977>,<0.16961.4977>,
                         <<"100.106.163.21:56114 -> 100.106.195.20:5672">>,
                         {lstate,<0.17527.4977>,false},
                         none,2,
                         {[{1,none,{<7325.9869.1>,1548}}],[]},
                         {user,<<"ueWpLOMKblniObwR">>,
                          [management,policymaker],
                          [{rabbit_auth_backend_internal,none}]},
                         <<"96621d9c-4ec4-4540-82c2-3ca8920b6409">>,<<>>,
                         {dict,1,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],
                            [[<7325.9869.1>|
                              {resource,
                               <<"96621d9c-4ec4-4540-82c2-3ca8920b6409">>,
                               queue,<<"<REDACTED-QUEUE-NAME>">>}]],
                            [],[],[],[],[],[],[],[],[],[],[],[]}}},
                         {state,
                          {dict,1,16,16,8,80,48,
                           {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                           {{[],[],[],
                             [[<7325.9869.1>|#Ref<0.1.3680763905.11413>]],
                             [],[],[],[],[],[],[],[],[],[],[],[]}}},
                          erlang},
                         {dict,0,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                         {dict,0,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                         {set,1,16,16,8,80,48,
                          {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                          {{[],[],[],
                            [<7325.9869.1>],
                            [],[],[],[],[],[],[],[],[],[],[],[]}}},
                         <0.15941.4977>,
                         {state,fine,20000,undefined},
                         false,1,
                         {{0,nil},{0,nil}},
                         [],
                         {{0,nil},{0,nil}},
                         [{<<"consumer_cancel_notify">>,bool,true},
                          {<<"publisher_confirms">>,bool,true},
                          {<<"basic.nack">>,bool,true},
                          {<<"authentication_failure_close">>,bool,true},
                          {<<"connection.blocked">>,bool,true},
                          {<<"exchange_exchange_bindings">>,bool,true}],
                         none,0,none,flow,[]}
** Reason for termination ==
** {badarg,
       [{ets,insert,
            [channel_metrics,
             {<0.6673.4975>,
              [{pid,<0.6673.4975>},
               {transactional,false},
               {confirm,false},
               {consumer_count,0},
               {messages_unacknowledged,1},
               {messages_unconfirmed,0},
               {messages_uncommitted,0},
               {acks_uncommitted,0},
               {prefetch_count,0},
               {global_prefetch_count,0},
               {state,running},
               {garbage_collection,
                   [{max_heap_size,0},
                    {min_bin_vheap_size,46422},
                    {min_heap_size,233},
                    {fullsweep_after,65535},
                    {minor_gcs,10}]}]}],
            []},
        {rabbit_core_metrics,channel_stats,2,
            [{file,"src/rabbit_core_metrics.erl"},{line,144}]},
        {rabbit_channel,emit_stats,2,
            [{file,"src/rabbit_channel.erl"},{line,2023}]},
        {rabbit_event,if_enabled,3,[{file,"src/rabbit_event.erl"},{line,138}]},
        {rabbit_channel,terminate,2,
            [{file,"src/rabbit_channel.erl"},{line,634}]},
        {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1147}]},
        {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}
** In 'terminate' callback with reason ==
** normal

This occurs multiple times for different queues.

I'm afraid I'm not allowed to share the full logs (since they might contain customer information).
Is this enough to make a judgement, for example whether it looks like a known bug?
Otherwise, I'll try to see whether we can work with Pivotal support again (where we can safely pass on logs).

Post-deploy logs:

node1:

Running node checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@05938b8fd4a523bd529c136ca66c20c2
Error: {killed,{gen_server2,call,[<10723.16611.1>,{info,[pid]},infinity]}}
RabbitMQ application is not running

node2:

Running node checks at Fri Jun  1 22:39:02 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@f98e1c2c6e9f9a6f7617830ceb0b7e04
Error: operation node_health_check on node rabbit@f98e1c2c6e9f9a6f7617830ceb0b7e04 timed out
RabbitMQ application is not running

node3:

Running node checks at Fri Jun  1 22:39:02 UTC 2018 from post-deploy...
Node checks running from post-deploy passed
Running cluster checks at Fri Jun  1 22:39:03 UTC 2018 from post-deploy...
Timeout: 70.0 seconds
Checking health of node rabbit@07632a494f7a8a8fb6b7f6af08f72f2c
Error: operation node_health_check on node rabbit@07632a494f7a8a8fb6b7f6af08f72f2c timed out
RabbitMQ application is not running
