
Cluster Crashes When Distribution Key Is Too High #30975

Open
yuvikk opened this issue Apr 19, 2024 · 2 comments

yuvikk commented Apr 19, 2024

Describe the bug
The following change deployed successfully but crashed the entire Vespa cluster:
From:

<group distribution-key="1" name="group1">
  <node distribution-key="124" hostalias="vespa10124"/>
</group>

To:

<group distribution-key="1" name="group1">
  <node distribution-key="124001" hostalias="vespa10124"/>
  <node distribution-key="124002" hostalias="vespa10124"/>
</group>

To Reproduce
Steps to reproduce the behavior:
Deploy nodes with high distribution keys such as 124001 and 124002.

Logs

[2024-04-18 17:13:42.140] INFO    container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.MasterElectionHandler	Cluster 'vespa1': 0 is new master candidate, but needs to wait before it can take over
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	java.lang.NullPointerException: Cannot invoke "com.yahoo.vespa.clustercontroller.core.NodeInfo.getRpcAddress()" because "node" is null
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.StateChangeHandler.handleNewRpcAddress(StateChangeHandler.java:222)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.FleetController.handleNewRpcAddress(FleetController.java:337)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.rpc.SlobrokClient.updateCluster(SlobrokClient.java:144)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.FleetController.lambda$resyncLocallyCachedState$15(FleetController.java:803)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.MetricUpdater.forWork(MetricUpdater.java:115)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.FleetController.resyncLocallyCachedState(FleetController.java:803)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:521)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1031)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr	\tat java.base/java.lang.Thread.run(Thread.java:840)
[2024-04-18 17:13:42.172] ERROR   container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.FleetController	
	Cluster 'vespa1': Fatal error killed fleet controller
	exception=
	java.lang.NullPointerException: Cannot invoke "com.yahoo.vespa.clustercontroller.core.NodeInfo.getRpcAddress()" because "node" is null
	at com.yahoo.vespa.clustercontroller.core.StateChangeHandler.handleNewRpcAddress(StateChangeHandler.java:222)
	at com.yahoo.vespa.clustercontroller.core.FleetController.handleNewRpcAddress(FleetController.java:337)
	at com.yahoo.vespa.clustercontroller.core.rpc.SlobrokClient.updateCluster(SlobrokClient.java:144)
	at com.yahoo.vespa.clustercontroller.core.FleetController.lambda$resyncLocallyCachedState$15(FleetController.java:803)
	at com.yahoo.vespa.clustercontroller.core.MetricUpdater.forWork(MetricUpdater.java:115)
	at com.yahoo.vespa.clustercontroller.core.FleetController.resyncLocallyCachedState(FleetController.java:803)
	at com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:521)
	at com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1031)
	at java.base/java.lang.Thread.run(Thread.java:840)

Environment (please complete the following information):

  • OS: Linux version 4.18.0-372.9.1.el8.x86_64 (mockbuild@dal1-prod-builder001.bld.equ.rockylinux.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Tue May 10 14:48:47 UTC 2022
  • Infrastructure: self-hosted

Vespa version
8.320.68

kkraune (Member) commented Apr 19, 2024

@vekterli maybe we should consider better guidance on the consequences of high distribution keys (ref the Slack conversation). The distribution algorithm will slow down a lot with this, so having a configurable upper bound, with a good error message at deploy time, may be the better option.

vekterli (Member) commented

Yes, this particular use case (encoding host name patterns in the distribution keys) has been a recurring theme throughout the years. Doing so makes sense from an application modelling perspective, but makes the distribution algorithm give off blue smoke from burning CPU on generating pseudo-random numbers, and should therefore be discouraged.

As a start, we should certainly never allow deployments to pass validation when specifying distribution keys that exceed the internal type limits. Distribution keys are 16-bit integers internally, with UINT16_MAX treated as a special sentinel. So the valid distribution key range is never outside [0, UINT16_MAX - 1].
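A minimal sketch of what such a deploy-time check could look like, in Java since that is where the config model and cluster controller live. The class and method names are hypothetical, not Vespa's actual validation API; only the [0, UINT16_MAX - 1] range comes from the comment above.

// Hypothetical deploy-time validator; names are illustrative, not Vespa's actual API.
public class DistributionKeyValidator {

    // Distribution keys are 16-bit unsigned integers internally, and 0xFFFF (UINT16_MAX)
    // is reserved as a sentinel, so the highest usable key is UINT16_MAX - 1 = 65534.
    private static final int MAX_DISTRIBUTION_KEY = 0xFFFF - 1;

    public static void validate(int distributionKey, String hostAlias) {
        if (distributionKey < 0 || distributionKey > MAX_DISTRIBUTION_KEY) {
            throw new IllegalArgumentException(
                    "Invalid distribution-key " + distributionKey + " on node '" + hostAlias
                    + "': must be in [0, " + MAX_DISTRIBUTION_KEY + "]");
        }
    }
}

With a check like this wired into deployment validation, the 124001/124002 keys from the report above would be rejected with a clear error at deploy time instead of taking down the fleet controller at runtime.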

It would be fairly trivial to create a new version of the distribution algorithm that is O(|configured nodes|) rather than O(highest configured distribution key), but doing so in a backwards-compatible manner is Complicated™️ at the best of times, which is why it hasn't been done yet...
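For illustration only, here is a toy sketch of the two loop shapes. This is not Vespa's actual ideal-state algorithm; the seeding and scoring below are made up, and only the asymptotic difference is the point.

// Toy sketch of the cost difference; NOT Vespa's actual distribution algorithm.
import java.util.Random;
import java.util.Set;

public class DistributionCostSketch {

    // O(highest configured distribution key): a pseudo-random score is drawn for every
    // possible key up to the maximum, even for keys that have no node behind them.
    static int pickNodeByKeyRange(long bucketSeed, int highestKey, Set<Integer> configuredKeys) {
        int bestKey = -1;
        double bestScore = -1.0;
        for (int key = 0; key <= highestKey; key++) {      // over 124 000 iterations with the keys from this report
            double score = new Random(bucketSeed ^ key).nextDouble();
            if (configuredKeys.contains(key) && score > bestScore) {
                bestScore = score;
                bestKey = key;
            }
        }
        return bestKey;
    }

    // O(|configured nodes|): only the nodes that actually exist are scored.
    static int pickNodeByConfiguredNodes(long bucketSeed, Set<Integer> configuredKeys) {
        int bestKey = -1;
        double bestScore = -1.0;
        for (int key : configuredKeys) {                   // 2 iterations with the keys from this report
            double score = new Random(bucketSeed ^ key).nextDouble();
            if (score > bestScore) {
                bestScore = score;
                bestKey = key;
            }
        }
        return bestKey;
    }
}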
