Make cluster meet reliable under link failures #461

srgsanky · 2024-05-08T02:46:26Z

When there is a link failure while an ongoing MEET request is sent the sending node stops sending anymore MEET and starts sending PINGs. Since every node responds to PINGs from unknown nodes with a PONG, the receiving node never adds the sending node. But the sending node adds the receiving node when it sees a PONG. This can lead to asymmetry in cluster membership. This changes makes the sender keep sending MEET until it sees a PONG, avoiding the asymmetry.

srgsanky · 2024-05-08T02:48:05Z

Posting this for initial comments. I can migrate the test based on the new framework once #442 is merged.

codecov · 2024-05-08T03:13:01Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.72%. Comparing base (d52c8f3) to head (0cf724a).
Report is 3 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #461      +/-   ##
============================================
- Coverage     69.83%   69.72%   -0.12%     
============================================
  Files           109      109              
  Lines         61801    61810       +9     
============================================
- Hits          43158    43094      -64     
- Misses        18643    18716      +73

Files	Coverage Δ
src/cluster_legacy.c	`86.46% <100.00%> (+0.23%)`	⬆️
src/debug.c	`53.60% <100.00%> (+0.13%)`	⬆️

... and 11 files with indirect coverage changes

srgsanky · 2024-05-08T17:14:34Z

@madolson @hpatro @PingXie can one of you help review this change?

tests/cluster/tests/30-reliable-meet.tcl

src/cluster_legacy.c

madolson · 2024-05-13T01:23:16Z

src/cluster_legacy.c

-     * normal PING packets. */
-    node->flags &= ~CLUSTER_NODE_MEET;
-
+    /* NOTE: We cannot clear the MEET flag from the node until we get a response


This makes sense to me, but I thought we briefly discussed just only sending the meet once, and on reconnect just not sending another meet. The previously logic was "We'll only send a single meet", I'm wondering if there was any logic somewhere that relied on that behavior.

Are we getting into the territory that CLUSTER MEET should only be sent when it was generated by the user/admin?

The initial MEET is still triggered by an admin/user command.

/* We can clear the flag after the first packet is sent. * If we'll never receive a PONG, we'll never send new packets * to this node. Instead after the PONG is received and we * are no longer in meet/handshake status, we want to send * normal PING packets. */

The previous code comment above doesn't restrict the MEET to be sent exactly once. I can't think of any scenario where sending multiple MEETs will break - this is similar to when an admin sends MEET multiple times. We send MEET only when the connection is established in clusterLinkConnectHandler, so we still limit sending MEET when connection is newly setup and not in every cluster cron run.

this is similar to when an admin sends MEET multiple times

I believe the cluster deduplicate multiple identical requests to the same IP/port.

In the test case I am using multiple MEETs to make sure we don't stop dropping the link early.

wait_for_condition 1000 50 { [CI $b cluster_stats_messages_meet_received] >= 3 } else { fail "Cluster node $a never sent multiple MEETs to $b" }

B isn't processing these right? It's just immediately dropping the first three and not processing them, it only ever processes the 4th one once correct?

I am not concerned with sending multiple "MEET". It should be idempotent but agreed that we should add a test to validate it (not using the debug hook to fail the previous MEETs)

B isn't processing these right? It's just immediately dropping the first three and not processing them, it only ever processes the 4th one once correct?

Correct. It starts processing when we drop the filter - which can be 4th or later.

agreed that we should add a test to validate it (not using the debug hook to fail the previous MEETs)

To test this scenario, the test has to send packets to the target node pretending to be a Valkey node. This way we can send multiple MEETs. Can you point me to any test that directly sends cluster messages to nodes, so I can write one with multiple MEETs?

I think the test could look like:

Node A is told to meet Node B. We drop all incoming packets on A, so although B get's the response it is never acknowledged.

We force B to drop it's connection with A. We support DEBUG CLUSTERLINK KILL <node id> all. (I forgot about this till just now).

A will reconnect, and send another meet.

hpatro · 2024-05-13T14:23:40Z

src/cluster_legacy.c

-     * normal PING packets. */
-    node->flags &= ~CLUSTER_NODE_MEET;
-
+    /* NOTE: We cannot clear the MEET flag from the node until we get a response


Are we getting into the territory that CLUSTER MEET should only be sent when it was generated by the user/admin?

src/server.h

hpatro · 2024-05-13T14:32:01Z

I think it's worth investing on this redis/redis#11095 to avoid this issue altogether.

srgsanky · 2024-05-13T19:36:02Z

I think it's worth investing on this redis/redis#11095 to avoid this issue altogether.

Thanks, I wasn't aware of this linked issue. IMO these two issues can be solved independently. The linked issue tries to make the admin experience better for MEET command where as this PR tries to address a specific gap in MEET implementation.

With SYNC MEET, we will have to make changes to admin client timeout. This timeout can possible trickle up the stack in a control plane implementation.
If we choose to attempt handshake for a longer period of time, we either have to filter out nodes in handshake in cluster nodes output for a non-admin client or make the clients filter out the nodes with this new flag. This can require a client side change to avoid connecting to a node in handshake, experiencing availability issues.

The problem addressed in this PR (asymmetric cluster membership) can happen with SYNC MEET as well due to link failures. So, it is worth solving it. The handshake nodes will still be removed after the handshake timeout (same as node_timeout of 15s). Wdyt?

madolson · 2024-05-13T20:02:34Z

The problem addressed in this PR (asymmetric cluster membership) can happen with SYNC MEET as well due to link failures. So, it is worth solving it. The handshake nodes will still be removed after the handshake timeout (same as node_timeout of 15s). Wdyt?

Yeah, I still believe this a problem even with the #11095.

zuiderkwast · 2024-05-13T20:31:18Z

Awesome material for our next release which will be full of cluster improvements. Is it worth mentioning in release notes?

Btw @srgsanky you need to commit with -s. See the instructions on the DCO CI job's details page.

madolson · 2024-05-13T20:32:35Z

Awesome material for our next release which will be full of cluster improvements. Is it worth mentioning in release notes?

I would also be inclined to backport it.

tests/cluster/tests/30-reliable-meet.tcl

srgsanky · 2024-05-19T21:12:21Z

Awesome material for our next release which will be full of cluster improvements. Is it worth mentioning in release notes?

Btw @srgsanky you need to commit with -s. See the instructions on the DCO CI job's details page.

When I tried to merge the new changes into my fork, I ended up with a merge commit

* 2ff9879fa (HEAD -> unstable, origin/unstable, origin/HEAD) Moved test under unit and addressed other comments
*   b826ef77a Merge branch 'valkey-io:unstable' into unstable
|\
| * d52c8f30e Include stddef in zmalloc.h (#516)
| * dcc9fd4fe Resolve numtests counter error (#514)
...
| * 315b7573c Update server function's name to valkey (#456)
* | 49a884c06 Make cluster meet reliable under link failures
|/
* 4e944cede Migrate kvstore.c unit tests to new test framework. (#446)

I want to signoff just 49a884c, but the rebase is adding a signoff to all commits 315b757..d52c8f3 which are not made by me.

Do you have any recommendation to fix this?

As an alternate option, I can start fresh and add a new commit from the tip of unstable. I am not sure if I will be able to reuse this PR.

zuiderkwast · 2024-05-19T22:15:50Z

I believe it's possible to undo a merge by git reset --hard 49a884c06 (the commit before the merge commit), then rebase to add the --signoff, then do git merge unstable again. The commit you added after merge commit can be cherry-picked after all this. Just remember the commit id.

If nothing works, then it's always possible to start from scratch with a new branch and cherry-pick all your commits into it. Then you can rename the branches and force-push to this PR's branch.

madolson · 2024-05-20T02:20:49Z

@srgsanky ~~The commit missing the DCO is just the top one. You should just be able to do git commit -s --amend with a no-op and force push over what you have.~~ It's the base commit, nevermind. I believe git rebase -i HEAD~3 should allow you to manually add the signature, maybe there is a better way to do it.

madolson

Just some minor nitpicks around the tests, it overall LGTM.

tests/unit/cluster/cluster-reliable-meet.tcl

tests/cluster/tests/includes/init-tests.tcl

tests/unit/cluster/cluster-reliable-meet.tcl

madolson · 2024-05-20T02:33:08Z

src/cluster_legacy.c

-     * normal PING packets. */
-    node->flags &= ~CLUSTER_NODE_MEET;
-
+    /* NOTE: We cannot clear the MEET flag from the node until we get a response


B isn't processing these right? It's just immediately dropping the first three and not processing them, it only ever processes the 4th one once correct?

When there is a link failure while an ongoing MEET request is sent the sending node stops sending anymore MEET and starts sending PINGs. Since every node responds to PINGs from unknown nodes with a PONG, the receiving node never adds the sending node. But the sending node adds the receiving node when it sees a PONG. This can lead to asymmetry in cluster membership. This changes makes the sender keep sending MEET until it sees a PONG, avoiding the asymmetry. Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>

Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>

srgsanky · 2024-05-20T04:17:20Z

B isn't processing these right? It's just immediately dropping the first three and not processing them, it only ever processes the 4th one once correct?

Correct. It starts processing when we drop the filter - which can be 4th or later.

srgsanky · 2024-05-20T04:19:37Z

I believe it's possible to undo a merge by git reset --hard 49a884c (the commit before the merge commit), then rebase to add the --signoff, then do git merge unstable again. The commit you added after merge commit can be cherry-picked after all this. Just remember the commit id.

This worked. Thanks!

It's the base commit, nevermind. I believe git rebase -i HEAD~3 should allow you to manually add the signature, maybe there is a better way to do it.

I tried this and all the commits in the other branch of the merge was also annotated with my signoff. So, I decided to ask you folks for the best approach.

btw is there any reasoning behind the requirement for the signoff?

Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>

madolson · 2024-05-20T04:56:13Z

btw is there any reasoning behind the requirement for the signoff?

Technically we adopted it because it's an LF requirement, but it's also a good practice to force a trail of who committed what.

PingXie

this change LGTM overall.

PingXie · 2024-05-20T05:01:34Z

src/server.h

@@ -2027,6 +2027,8 @@ struct valkeyServer {
    unsigned long long cluster_link_msg_queue_limit_bytes;  /* Memory usage limit on individual link msg queue */
    int cluster_drop_packet_filter; /* Debug config that allows tactically
                                   * dropping packets of a specific type */
+    int cluster_close_link_on_packet_drop; /* Debug config that goes along with cluster_drop_packet_filter.


I would really suggest we start naming debug flags in a distinct way, either prefix or suffix works for me. This helps readability. Not a blocker of this PR though since we have many existing debug flags already.

Additionally, it'd be enough to use one bit instead of 64. How about int debug_flags for all these?

I had a different opinion which is that I think these debug flags should be entirely compiled out of our production build. They can use as much memory as they want, that also makes it clearer they are debug related, since they won't be available.

I'm not sure I agree. It's a good habit to run make test before make install. For make test, you need the debug functionality.

It's a good habit to run make test before make install.

I have never done this. If it is a thing people do, I guess I retract my position.

I don't know if people do it. :)

Anyway, I think there's some value to be able to run tests (ours or others) on an unmodified build.

I have added a prefix of debug_ and using 1 bit for the new flag. The flags are currently sandwiched between two 8 byte types, so the flag will anyway get 8 bytes.

I didn't touch int cluster_drop_packet_filter as it is initialized with -1 to avoid dropping PING packets. It can probably be set to CLUSTERMSG_TYPE_COUNT, but that can change when we add a new message type requiring a code change. It cannot be one bit.

#define CLUSTERMSG_TYPE_PING 0 /* Ping */ #define CLUSTERMSG_TYPE_PONG 1 /* Pong (reply to Ping) */ #define CLUSTERMSG_TYPE_MEET 2 /* Meet "let's join" message */ #define CLUSTERMSG_TYPE_FAIL 3 /* Mark node xxx as failing */ #define CLUSTERMSG_TYPE_PUBLISH 4 /* Pub/Sub Publish propagation */ #define CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST 5 /* May I failover? */ #define CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK 6 /* Yes, you have my vote */ #define CLUSTERMSG_TYPE_UPDATE 7 /* Another node slots configuration */ #define CLUSTERMSG_TYPE_MFSTART 8 /* Pause clients for manual failover */ #define CLUSTERMSG_TYPE_MODULE 9 /* Module cluster API message. */ #define CLUSTERMSG_TYPE_PUBLISHSHARD 10 /* Pub/Sub Publish shard propagation */ #define CLUSTERMSG_TYPE_COUNT 11 /* Total number of message types. */

Before this PR

/* 5328 | 4 */ int cluster_module_flags; /* 5332 | 4 */ int cluster_allow_reads_when_down; /* 5336 | 4 */ int cluster_config_file_lock_fd; /* XXX 4-byte hole */ /* 5344 | 8 */ unsigned long long cluster_link_msg_queue_limit_bytes; /* 5352 | 4 */ int cluster_drop_packet_filter; /* XXX 4-byte hole */ /* 5360 | 8 */ mstime_t busy_reply_threshold; /* 5368 | 4 */ int pre_command_oom_state; /* 5372 | 4 */ int script_disable_deny_script;

~~When I use 1 bit for the new field, there is extra hole introduced by gcc which I am not able to explain. It looks like I need to use either uint32_t or uint16_t.~~

My bad, I read the 7-bit as 7-bytes earlier which made me assume we have to use uint32_t or uint16_t. I am sticking to uint32_t with 1 bit representation.

When using uint32_t

/* 5328 | 4 */ int cluster_module_flags; /* 5332 | 4 */ int cluster_allow_reads_when_down; /* 5336 | 4 */ int cluster_config_file_lock_fd; /* XXX 4-byte hole */ /* 5344 | 8 */ unsigned long long cluster_link_msg_queue_limit_bytes; /* 5352 | 4 */ int cluster_drop_packet_filter; /* 5356 | 4 */ uint32_t debug_cluster_close_link_on_packet_drop; /* 5360 | 8 */ mstime_t busy_reply_threshold; /* 5368 | 4 */ int pre_command_oom_state; /* 5372 | 4 */ int script_disable_deny_script;

When using 1 bit with uint32_t

/* 5328 | 4 */ int cluster_module_flags; /* 5332 | 4 */ int cluster_allow_reads_when_down; /* 5336 | 4 */ int cluster_config_file_lock_fd; /* XXX 4-byte hole */ /* 5344 | 8 */ unsigned long long cluster_link_msg_queue_limit_bytes; /* 5352 | 4 */ int cluster_drop_packet_filter; /* 5356: 0 | 4 */ uint32_t debug_cluster_close_link_on_packet_drop : 1; /* XXX 7-bit hole */ /* XXX 3-byte hole */ /* 5360 | 8 */ mstime_t busy_reply_threshold; /* 5368 | 4 */ int pre_command_oom_state; /* 5372 | 4 */ int script_disable_deny_script;

When using uint16_t

/* 5328 | 4 */ int cluster_module_flags; /* 5332 | 4 */ int cluster_allow_reads_when_down; /* 5336 | 4 */ int cluster_config_file_lock_fd; /* XXX 4-byte hole */ /* 5344 | 8 */ unsigned long long cluster_link_msg_queue_limit_bytes; /* 5352 | 4 */ int cluster_drop_packet_filter; /* 5356 | 2 */ uint16_t debug_cluster_close_link_on_packet_drop; /* XXX 2-byte hole */ /* 5360 | 8 */ mstime_t busy_reply_threshold; /* 5368 | 4 */ int pre_command_oom_state; /* 5372 | 4 */ int script_disable_deny_script;

When using 1 bit with uint16_t

/* 5328 | 4 */ int cluster_module_flags; /* 5332 | 4 */ int cluster_allow_reads_when_down; /* 5336 | 4 */ int cluster_config_file_lock_fd; /* XXX 4-byte hole */ /* 5344 | 8 */ unsigned long long cluster_link_msg_queue_limit_bytes; /* 5352 | 4 */ int cluster_drop_packet_filter; /* 5356: 0 | 2 */ uint16_t debug_cluster_close_link_on_packet_drop : 1; /* XXX 7-bit hole */ /* XXX 3-byte hole */ /* 5360 | 8 */ mstime_t busy_reply_threshold; /* 5368 | 4 */ int pre_command_oom_state; /* 5372 | 4 */ int script_disable_deny_script;

There are many ints in the server struct that are essentially booleans. I think we can allow ourselves to rearrange the fields within the same section. We could do this:

int cluster_allow_replica_migration:1; /* Automatic replica migrations to orphaned masters and from empty masters */ int cluster_require_full_coverage:1; /* If true, put the cluster down if there is at least an uncovered slot.*/ int cluster_slave_no_failover:1; /* Prevent slave from starting a failover if the master is in failure state. */ int cluster_allow_reads_when_down:1; /* Are reads allowed when the cluster is down? */

There are probably many bytes to save, but let's not dive too deep in this, since it's a singleton.

src/cluster_legacy.c

PingXie · 2024-05-20T05:43:24Z

src/cluster_legacy.c

-     * normal PING packets. */
-    node->flags &= ~CLUSTER_NODE_MEET;
-
+    /* NOTE: We cannot clear the MEET flag from the node until we get a response


I am not concerned with sending multiple "MEET". It should be idempotent but agreed that we should add a test to validate it (not using the debug hook to fail the previous MEETs)

src/cluster_legacy.c

hpatro · 2024-05-20T15:54:22Z

src/debug.c

+"    This is valid only when DROP-CLUSTER-PACKET-FILTER is set to a valid packet type."
+"    When set to 1, the cluster link is closed after dropping a packet based on the filter."


I'm not sure if we have interdependent debug flags prior to this.

The other alternative that I can think of is to introduce something like DROP-CLUSTER-PACKET-AND-CLOSE-LINK based on the type of packet.

We also need a more fine grained debug filter based on sender IP and packet type. I think we can reuse the current config for that kind of filter as well.

I think it's fine.

tests/unit/cluster/cluster-reliable-meet.tcl

1. Reworked code comment 1. Added serverLogs 1. Renamed debug variable 1. Made close link filter to be directly coupled with drop filter Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>

madolson reviewed May 13, 2024

View reviewed changes

tests/cluster/tests/30-reliable-meet.tcl Outdated Show resolved Hide resolved

madolson reviewed May 13, 2024

View reviewed changes

src/cluster_legacy.c Outdated Show resolved Hide resolved

madolson reviewed May 13, 2024

View reviewed changes

hpatro reviewed May 13, 2024

View reviewed changes

madolson added the release-notes This issue should get a line item in the release notes label May 13, 2024

madolson reviewed May 13, 2024

View reviewed changes

tests/cluster/tests/30-reliable-meet.tcl Outdated Show resolved Hide resolved

srgsanky force-pushed the unstable branch 2 times, most recently from d8aa71c to 2ff9879 Compare May 19, 2024 21:22

madolson reviewed May 20, 2024

View reviewed changes

srgsanky added 3 commits May 19, 2024 21:07

Merge commit 'd52c8f30e' into unstable

d8e6da5

Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>

Moved test under unit and addressed other comments

ec68829

Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>

Addressed comments about the unit test

a760aa6

Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>

srgsanky force-pushed the unstable branch from 2ff9879 to a760aa6 Compare May 20, 2024 04:20

PingXie reviewed May 20, 2024

View reviewed changes

hpatro approved these changes May 20, 2024

View reviewed changes

Addressed comments

0cf724a

1. Reworked code comment 1. Added serverLogs 1. Renamed debug variable 1. Made close link filter to be directly coupled with drop filter Signed-off-by: Sankar <1890648+srgsanky@users.noreply.github.com>

		" This is valid only when DROP-CLUSTER-PACKET-FILTER is set to a valid packet type."
		" When set to 1, the cluster link is closed after dropping a packet based on the filter."

Make cluster meet reliable under link failures #461

Are you sure you want to change the base?

Make cluster meet reliable under link failures #461

Conversation

srgsanky commented May 8, 2024

srgsanky commented May 8, 2024

codecov bot commented May 8, 2024 • edited

Codecov Report

srgsanky commented May 8, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hpatro commented May 13, 2024

srgsanky commented May 13, 2024

madolson commented May 13, 2024

zuiderkwast commented May 13, 2024

madolson commented May 13, 2024

srgsanky commented May 19, 2024

zuiderkwast commented May 19, 2024

madolson commented May 20, 2024 • edited

madolson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

srgsanky commented May 20, 2024

srgsanky commented May 20, 2024

madolson commented May 20, 2024

PingXie left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

srgsanky May 21, 2024 • edited

Choose a reason for hiding this comment

When using uint32_t

When using 1 bit with uint32_t

When using uint16_t

When using 1 bit with uint16_t

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented May 8, 2024 •

edited

madolson commented May 20, 2024 •

edited

srgsanky May 21, 2024 •

edited

When using `uint32_t`

When using 1 bit with `uint32_t`

When using `uint16_t`

When using 1 bit with `uint16_t`