
Cluster Crashes When Distribution Key Is Too High #30975

Open
yuvikk opened this issue Apr 19, 2024 · 2 comments

yuvikk commented Apr 19, 2024

Describe the bug
The following change deployed successfully but crashed the entire Vespa cluster:
From:

<group distribution-key="1" name="group1">
  <node distribution-key="124" hostalias="vespa10124"/>
</group>

To:

<group distribution-key="1" name="group1">
  <node distribution-key="124001" hostalias="vespa10124"/>
  <node distribution-key="124002" hostalias="vespa10124"/>
</group>

To Reproduce
Steps to reproduce the behavior:
Deploy nodes with high distribution keys such as 124001 and 124002.

Logs

[2024-04-18 17:13:42.140] INFO    container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.MasterElectionHandler	Cluster 'vespa1': 0 is new master candidate, but needs to wait before it can take over
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	java.lang.NullPointerException: Cannot invoke "com.yahoo.vespa.clustercontroller.core.NodeInfo.getRpcAddress()" because "node" is null
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.StateChangeHandler.handleNewRpcAddress(StateChangeHandler.java:222)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.FleetController.handleNewRpcAddress(FleetController.java:337)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.rpc.SlobrokClient.updateCluster(SlobrokClient.java:144)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.FleetController.lambda$resyncLocallyCachedState$15(FleetController.java:803)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.MetricUpdater.forWork(MetricUpdater.java:115)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.FleetController.resyncLocallyCachedState(FleetController.java:803)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:521)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1031)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat java.base/java.lang.Thread.run(Thread.java:840)
[2024-04-18 17:13:42.172] ERROR   container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.FleetController	
	Cluster 'vespa1': Fatal error killed fleet controller
	exception=
	java.lang.NullPointerException: Cannot invoke "com.yahoo.vespa.clustercontroller.core.NodeInfo.getRpcAddress()" because "node" is null
	at com.yahoo.vespa.clustercontroller.core.StateChangeHandler.handleNewRpcAddress(StateChangeHandler.java:222)
	at com.yahoo.vespa.clustercontroller.core.FleetController.handleNewRpcAddress(FleetController.java:337)
	at com.yahoo.vespa.clustercontroller.core.rpc.SlobrokClient.updateCluster(SlobrokClient.java:144)
	at com.yahoo.vespa.clustercontroller.core.FleetController.lambda$resyncLocallyCachedState$15(FleetController.java:803)
	at com.yahoo.vespa.clustercontroller.core.MetricUpdater.forWork(MetricUpdater.java:115)
	at com.yahoo.vespa.clustercontroller.core.FleetController.resyncLocallyCachedState(FleetController.java:803)
	at com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:521)
	at com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1031)
	at java.base/java.lang.Thread.run(Thread.java:840)

Environment (please complete the following information):

  • OS: Linux version 4.18.0-372.9.1.el8.x86_64 (mockbuild@dal1-prod-builder001.bld.equ.rockylinux.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Tue May 10 14:48:47 UTC 2022
  • Infrastructure: self-hosted

Vespa version
8.320.68

kkraune (Member) commented Apr 19, 2024

@vekterli maybe we should consider better guidance on the consequences of high distribution keys (ref the Slack conversation). The distribution algorithm will slow down a lot with this, so having a configurable upper bound, with a good error message at deploy time, may be the better option.

vekterli (Member) commented

Yes, this particular use case (encoding host name patterns in the distribution keys) has been a recurring theme throughout the years. Doing so makes sense from an application modelling perspective, but makes the distribution algorithm give off blue smoke from burning CPU on generating pseudo-random numbers, and should therefore be discouraged.

As a start, we should certainly never allow deployments to pass validation when specifying distribution keys that exceed the internal type limits. Distribution keys are 16-bit integers internally, with UINT16_MAX treated as a special sentinel. So the valid distribution key range is never outside [0, UINT16_MAX - 1].
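A minimal sketch of what such a deploy-time check could look like, in Java since that is where the config model and cluster controller live. The class and method names are hypothetical, not Vespa's actual validation API; only the [0, UINT16_MAX - 1] range comes from the comment above.

// Hypothetical deploy-time validator; names are illustrative, not Vespa's actual API.
public class DistributionKeyValidator {

    // Distribution keys are 16-bit unsigned integers internally, and 0xFFFF (UINT16_MAX)
    // is reserved as a sentinel, so the highest usable key is UINT16_MAX - 1 = 65534.
    private static final int MAX_DISTRIBUTION_KEY = 0xFFFF - 1;

    public static void validate(int distributionKey, String hostAlias) {
        if (distributionKey < 0 || distributionKey > MAX_DISTRIBUTION_KEY) {
            throw new IllegalArgumentException(
                    "Invalid distribution-key " + distributionKey + " on node '" + hostAlias
                    + "': must be in [0, " + MAX_DISTRIBUTION_KEY + "]");
        }
    }
}

With a check like this wired into deployment validation, the 124001/124002 keys from the report above would be rejected with a clear error at deploy time instead of taking down the fleet controller at runtime.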

It would be fairly trivial to create a new version of the distribution algorithm that is O(|configured nodes|) rather than O(highest configured distribution key), but doing so in a backwards-compatible manner is Complicated™️ at the best of times, which is why it hasn't been done yet...
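For illustration only, here is a toy sketch of the two loop shapes. This is not Vespa's actual ideal-state algorithm; the seeding and scoring below are made up, and only the asymptotic difference is the point.

// Toy sketch of the cost difference; NOT Vespa's actual distribution algorithm.
import java.util.Random;
import java.util.Set;

public class DistributionCostSketch {

    // O(highest configured distribution key): a pseudo-random score is drawn for every
    // possible key up to the maximum, even for keys that have no node behind them.
    static int pickNodeByKeyRange(long bucketSeed, int highestKey, Set<Integer> configuredKeys) {
        int bestKey = -1;
        double bestScore = -1.0;
        for (int key = 0; key <= highestKey; key++) {      // over 124 000 iterations with the keys from this report
            double score = new Random(bucketSeed ^ key).nextDouble();
            if (configuredKeys.contains(key) && score > bestScore) {
                bestScore = score;
                bestKey = key;
            }
        }
        return bestKey;
    }

    // O(|configured nodes|): only the nodes that actually exist are scored.
    static int pickNodeByConfiguredNodes(long bucketSeed, Set<Integer> configuredKeys) {
        int bestKey = -1;
        double bestScore = -1.0;
        for (int key : configuredKeys) {                   // 2 iterations with the keys from this report
            double score = new Random(bucketSeed ^ key).nextDouble();
            if (score > bestScore) {
                bestScore = score;
                bestKey = key;
            }
        }
        return bestKey;
    }
}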
