Be defensive around move_plan calculations for stack components #390

Open
4 tasks done
Z3r0Sum opened this issue Feb 18, 2022 · 0 comments
Assignees: Z3r0Sum
Labels: Team:Delivery, tooling-requests (requests for tooling team)

Comments


Z3r0Sum commented Feb 18, 2022

Readiness Checklist

  • I am running the latest version
  • I checked the documentation and found no answer
  • I checked to make sure that this issue has not already been filed
  • I am reporting the issue to the correct repository (for multi-repository projects)

Expected Behavior

If a move plan cannot be computed, we should detect that and bubble the appropriate error up rather than assuming it was calculated. An example is outlined below.

Current Behavior

I tried updating to the latest cloud-sdk-go and ecctl versions locally and received the exact same stack trace as older versions (both are on v1.8.0), so I don't perceive this as a new problem; perhaps a backport should be considered when we fix it. Here's the crux of the problem:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1d3103d]

goroutine 90 [running]:
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.ComputeVacateRequest(0xc000320280, 0xc000799cd8, 0x1, 0x1, 0x2d18240, 0x0, 0x0, 0x0, 0xc0001297cb, 0xc0004fa121, ...)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/api/platformapi/allocatorapi/vacate.go:708 +0xc1d
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.newMoveClusterParams(0xc000497c70, 0xc0003d4920, 0x1, 0x1)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/api/platformapi/allocatorapi/vacate.go:404 +0x3a7
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.moveClusterByType(0xc000497c70, 0xc000497c70, 0x0)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/api/platformapi/allocatorapi/vacate.go:441 +0x45
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.VacateCluster(0xc000497c70, 0xc000488fa8, 0x135bf26)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/api/platformapi/allocatorapi/vacate.go:310 +0x77
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.VacateClusterInPool(0x22de040, 0xc000497c70, 0xc00005f501, 0x0)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/api/platformapi/allocatorapi/vacate.go:297 +0x53
github.com/elastic/cloud-sdk-go/pkg/sync/pool.Work.func2(0xc00010e180, 0xc00058e420, 0xc00058e600, 0xc00058e840, 0xc00058e960, 0xc00010e1e0, 0x21da570, 0x6fc23ac00, 0x22de040, 0xc000497c70, ...)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/sync/pool/worker.go:82 +0x44
created by github.com/elastic/cloud-sdk-go/pkg/sync/pool.Work
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/sync/pool/worker.go:81 +0x21d

"Debugging" locally:

Computation complete payload: &{Failures:0xc0004f0080 Moves:0xc0004f0100}
Computation complete payload error: &{ApmClusters:[] AppsearchClusters:[] ElasticsearchClusters:[] EnterpriseSearchClusters:[] KibanaClusters:[]}
Computation complete payload moves: <nil>
We think this is ent search, plan config: <nil>

panic: runtime error: invalid memory address or nil pointer dereference

The problem is that the plan computation returns nil and we don't check for nil values here: https://github.com/elastic/cloud-sdk-go/blob/master/pkg/api/platformapi/allocatorapi/vacate.go#L708, more specifically on this field:

https://github.com/elastic/cloud-sdk-go/blob/master/pkg/models/move_clusters_details.go#L54 -> https://github.com/elastic/cloud-sdk-go/blob/master/pkg/models/move_enterprise_search_details.go#L41

We should be checking that we have a calculated plan for everything, rather than just assuming we'll get one, as a good defensive programming practice. My recommendation is that each stack component, and each cluster in its corresponding slice, should have the check and bubble the error up for a nil calculated plan here: https://github.com/elastic/cloud-sdk-go/blob/master/pkg/api/platformapi/allocatorapi/vacate.go#L633-L718.

At the end of the day it's a chicken-and-egg issue: if ES is in a really bad state, something like Enterprise Search can't be checked, and thus we can't calculate a plan. Here's an example:

{
  "cluster_type": "elasticsearch",
  "details": "Could not make sure [ElasticsearchCluster(<redacted>)] is up and running",
  "caused_by": "no.found.constructor.plan.apm.ClusterNotReachable: Unexpected response [401 Unauthorized, {\"error\":{\"root_cause\":[{\"type\":\"security_exception\",\"reason\":\"unable to authenticate user [cloud-internal-enterprise_search-server] for REST request [/]\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer realm=\\\"security\\\"\",\"ApiKey\"]}}],\"type\":\"security_exception\",\"reason\":\"unable to authenticate user [cloud-internal-enterprise_search-server] for REST request [/]\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer realm=\\\"security\\\"\",\"ApiKey\"]}},\"status\":401}]"
}

Possible Solution

Noted above, but to make parsing easier:

We should be checking that we have a calculated plan for everything, rather than just assuming we'll get one, as a good defensive programming practice. My recommendation is that each stack component, and each cluster in its corresponding slice, should have the check and bubble the error up for a nil calculated plan here: https://github.com/elastic/cloud-sdk-go/blob/master/pkg/api/platformapi/allocatorapi/vacate.go#L633-L718.

I also don't mind working on this in the near future, but if someone wants to tackle it within the next few weeks, feel free.

Steps to Reproduce

Get ES into a really bad state with another stack component running alongside it, then attempt to vacate said stack component.

Context

I think enough was provided above :).

Your Environment

  • Version used: v1.8.0 for all the things
  • Environment name and version (e.g. Go 1.9): 1.16
@Z3r0Sum Z3r0Sum added the Team:Delivery and tooling-requests labels Feb 18, 2022
@Z3r0Sum Z3r0Sum self-assigned this Feb 18, 2022