To Reproduce
Steps to reproduce the behavior:
Deploy an application with high distribution keys, such as 124001 and 124002.
Logs
[2024-04-18 17:13:42.140] INFO container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.MasterElectionHandler Cluster 'vespa1': 0 is new master candidate, but needs to wait before it can take over
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr java.lang.NullPointerException: Cannot invoke "com.yahoo.vespa.clustercontroller.core.NodeInfo.getRpcAddress()" because "node" is null
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.StateChangeHandler.handleNewRpcAddress(StateChangeHandler.java:222)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.handleNewRpcAddress(FleetController.java:337)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.rpc.SlobrokClient.updateCluster(SlobrokClient.java:144)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.lambda$resyncLocallyCachedState$15(FleetController.java:803)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.MetricUpdater.forWork(MetricUpdater.java:115)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.resyncLocallyCachedState(FleetController.java:803)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:521)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1031)
[2024-04-18 17:13:42.172] WARNING container-clustercontroller stderr \tat java.base/java.lang.Thread.run(Thread.java:840)
[2024-04-18 17:13:42.172] ERROR container-clustercontroller Container.com.yahoo.vespa.clustercontroller.core.FleetController
Cluster 'vespa1': Fatal error killed fleet controller
exception=
java.lang.NullPointerException: Cannot invoke "com.yahoo.vespa.clustercontroller.core.NodeInfo.getRpcAddress()" because "node" is null
at com.yahoo.vespa.clustercontroller.core.StateChangeHandler.handleNewRpcAddress(StateChangeHandler.java:222)
at com.yahoo.vespa.clustercontroller.core.FleetController.handleNewRpcAddress(FleetController.java:337)
at com.yahoo.vespa.clustercontroller.core.rpc.SlobrokClient.updateCluster(SlobrokClient.java:144)
at com.yahoo.vespa.clustercontroller.core.FleetController.lambda$resyncLocallyCachedState$15(FleetController.java:803)
at com.yahoo.vespa.clustercontroller.core.MetricUpdater.forWork(MetricUpdater.java:115)
at com.yahoo.vespa.clustercontroller.core.FleetController.resyncLocallyCachedState(FleetController.java:803)
at com.yahoo.vespa.clustercontroller.core.FleetController.tick(FleetController.java:521)
at com.yahoo.vespa.clustercontroller.core.FleetController.run(FleetController.java:1031)
at java.base/java.lang.Thread.run(Thread.java:840)
Environment (please complete the following information):
OS: Linux version 4.18.0-372.9.1.el8.x86_64 (mockbuild@dal1-prod-builder001.bld.equ.rockylinux.org) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-10) (GCC)) #1 SMP Tue May 10 14:48:47 UTC 2022
Infrastructure: self-hosted
Vespa version
8.320.68
@vekterli maybe we should consider better guidance on the consequences of a high distribution key (ref the Slack conversation). The distribution algorithm slows down a lot with this, so a configurable upper bound with a good error message at deploy time may be better.
Yes, this particular use case (encoding host name patterns in the distribution keys) has been a recurring theme throughout the years. Doing so makes sense from an application modelling perspective, but makes the distribution algorithm give off blue smoke from burning CPU on generating pseudo-random numbers, and should therefore be discouraged.
As a start, we should certainly never allow deployments to pass validation when specifying distribution keys that exceed the internal type limits. Distribution keys are 16-bit integers internally, with UINT16_MAX treated as a special sentinel. So the valid distribution key range is never outside [0, UINT16_MAX - 1].
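The range check described above could look roughly like the following. This is a minimal sketch, not Vespa's actual validation code: the class `DistributionKeyValidator` and its methods are hypothetical names chosen for illustration, and the only facts taken from the comment are the 16-bit width and the reserved `UINT16_MAX` sentinel.

```java
/** Hypothetical deploy-time check; the names here are illustrative, not Vespa's API. */
final class DistributionKeyValidator {
    // Distribution keys are 16-bit internally; UINT16_MAX is reserved as a sentinel.
    static final int UINT16_MAX = 0xFFFF;
    static final int MAX_VALID_KEY = UINT16_MAX - 1;

    /** Returns true if the key lies in [0, UINT16_MAX - 1]. */
    static boolean isValid(long key) {
        return key >= 0 && key <= MAX_VALID_KEY;
    }

    /** Rejects an out-of-range key with a message suitable for a deployment error. */
    static void validate(long key) {
        if (!isValid(key)) {
            throw new IllegalArgumentException(
                    "Distribution key " + key + " is outside the valid range [0, "
                            + MAX_VALID_KEY + "]");
        }
    }
}
```

With a check like this, the keys from the report (124001 and 124002) would be rejected at deploy time instead of reaching the cluster controller.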
It would be fairly trivial to create a new version of the distribution algorithm that is O(|configured nodes|) rather than O(highest configured distribution key), but doing so in a backwards compatible manner is Complicated™️ at the best of times, which is the reason why it hasn't been done yet...
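To make the cost difference concrete, here is a toy model. It is not Vespa's distribution algorithm; it only assumes, as the comments above suggest, that the cost comes from generating one pseudo-random number per key slot up to the highest configured key. The class and both methods are hypothetical, and the per-node seeding in the second method is just one conceivable alternative.

```java
import java.util.Random;

/** Toy model only; this is not Vespa's distribution algorithm. */
final class ToyDistributionCost {

    /** Counts draws when one score is generated for every slot in [0, highest key]. */
    static long drawsPerKeySlot(int[] configuredKeys, long bucketSeed) {
        int highest = 0;
        for (int key : configuredKeys) {
            highest = Math.max(highest, key);
        }
        Random rng = new Random(bucketSeed);
        long draws = 0;
        for (int slot = 0; slot <= highest; slot++) {
            rng.nextDouble(); // score for this slot; wasted work if no node uses the slot
            draws++;
        }
        return draws;
    }

    /** Counts draws when each configured node gets its own seed (hypothetical alternative). */
    static long drawsPerConfiguredNode(int[] configuredKeys, long bucketSeed) {
        long draws = 0;
        for (int key : configuredKeys) {
            new Random(bucketSeed ^ key).nextDouble(); // one draw per configured node
            draws++;
        }
        return draws;
    }
}
```

For three nodes with keys 0, 1 and 60000, the first method performs 60001 draws per bucket and the second performs 3. In this toy model the two schemes also produce different scores for the same node, which hints at why swapping the algorithm is hard to do in a backwards compatible way.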
Describe the bug
The following change deployed successfully but crashed the entire Vespa cluster:
From:
To: