Be defensive around move_plan calculations for stack components #390

Open
4 tasks done
Z3r0Sum opened this issue Feb 18, 2022 · 0 comments
Assignees: Z3r0Sum
Labels: Team:Delivery, tooling-requests (requests for tooling team)

Comments


Z3r0Sum commented Feb 18, 2022

Readiness Checklist

  • I am running the latest version
  • I checked the documentation and found no answer
  • I checked to make sure that this issue has not already been filed
  • I am reporting the issue to the correct repository (for multi-repository projects)

Expected Behavior

If a move plan cannot be computed, we should detect that and bubble the appropriate error up rather than assuming it was calculated. An example is outlined below.

Current Behavior

I tried updating to the latest cloud-sdk-go and ecctl versions locally and received the exact same stack trace as older versions (both are on v1.8.0), so I don't perceive this as a new problem; perhaps a backport should be considered when we fix it. Here's the crux of the problem:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1d3103d]

goroutine 90 [running]:
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.ComputeVacateRequest(0xc000320280, 0xc000799cd8, 0x1, 0x1, 0x2d18240, 0x0, 0x0, 0x0, 0xc0001297cb, 0xc0004fa121, ...)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/api/platformapi/allocatorapi/vacate.go:708 +0xc1d
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.newMoveClusterParams(0xc000497c70, 0xc0003d4920, 0x1, 0x1)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/api/platformapi/allocatorapi/vacate.go:404 +0x3a7
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.moveClusterByType(0xc000497c70, 0xc000497c70, 0x0)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/api/platformapi/allocatorapi/vacate.go:441 +0x45
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.VacateCluster(0xc000497c70, 0xc000488fa8, 0x135bf26)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/api/platformapi/allocatorapi/vacate.go:310 +0x77
github.com/elastic/cloud-sdk-go/pkg/api/platformapi/allocatorapi.VacateClusterInPool(0x22de040, 0xc000497c70, 0xc00005f501, 0x0)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/api/platformapi/allocatorapi/vacate.go:297 +0x53
github.com/elastic/cloud-sdk-go/pkg/sync/pool.Work.func2(0xc00010e180, 0xc00058e420, 0xc00058e600, 0xc00058e840, 0xc00058e960, 0xc00010e1e0, 0x21da570, 0x6fc23ac00, 0x22de040, 0xc000497c70, ...)
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/sync/pool/worker.go:82 +0x44
created by github.com/elastic/cloud-sdk-go/pkg/sync/pool.Work
	/Users/stephenschmidt/go/pkg/mod/github.com/elastic/cloud-sdk-go@v1.8.0/pkg/sync/pool/worker.go:81 +0x21d

"Debugging" locally:

Computation complete payload: &{Failures:0xc0004f0080 Moves:0xc0004f0100}
Computation complete payload error: &{ApmClusters:[] AppsearchClusters:[] ElasticsearchClusters:[] EnterpriseSearchClusters:[] KibanaClusters:[]}
Computation complete payload moves: <nil>
We think this is ent search, plan config: <nil>

panic: runtime error: invalid memory address or nil pointer dereference

The problem is that the plan computation returns nil and we don't check for nil values here: https://github.com/elastic/cloud-sdk-go/blob/master/pkg/api/platformapi/allocatorapi/vacate.go#L708, more specifically on this field:

https://github.com/elastic/cloud-sdk-go/blob/master/pkg/models/move_clusters_details.go#L54 -> https://github.com/elastic/cloud-sdk-go/blob/master/pkg/models/move_enterprise_search_details.go#L41

We should be checking that we have a calculated plan for everything, rather than just assuming we'll get one, as a good defensive programming practice. My recommendation is that each stack component, and each cluster in its corresponding slice, should have the check and bubble the error up for a nil calculated plan here: https://github.com/elastic/cloud-sdk-go/blob/master/pkg/api/platformapi/allocatorapi/vacate.go#L633-L718.

At the end of the day it's a chicken-and-egg issue: if ES is in a really bad state, something like Enterprise Search can't be checked, and thus we can't calculate a plan. Here's an example:

{
  "cluster_type": "elasticsearch",
  "details": "Could not make sure [ElasticsearchCluster(<redacted>)] is up and running",
  "caused_by": "no.found.constructor.plan.apm.ClusterNotReachable: Unexpected response [401 Unauthorized, {\"error\":{\"root_cause\":[{\"type\":\"security_exception\",\"reason\":\"unable to authenticate user [cloud-internal-enterprise_search-server] for REST request [/]\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer realm=\\\"security\\\"\",\"ApiKey\"]}}],\"type\":\"security_exception\",\"reason\":\"unable to authenticate user [cloud-internal-enterprise_search-server] for REST request [/]\",\"header\":{\"WWW-Authenticate\":[\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\",\"Bearer realm=\\\"security\\\"\",\"ApiKey\"]}},\"status\":401}]"
}

Possible Solution

Noted above, but to make parsing easier:

We should be checking that we have a calculated plan for everything, rather than just assuming we'll get one, as a good defensive programming practice. My recommendation is that each stack component, and each cluster in its corresponding slice, should have the check and bubble the error up for a nil calculated plan here: https://github.com/elastic/cloud-sdk-go/blob/master/pkg/api/platformapi/allocatorapi/vacate.go#L633-L718.

I also don't mind working on this in the near future, but if someone wants to tackle it within the next few weeks, feel free.

Steps to Reproduce

Get ES into a really bad state with another stack component running alongside it, then attempt to vacate said stack component.

Context

I think enough was provided above :).

Your Environment

  • Version used: v1.8.0 for all the things
  • Environment name and version (e.g. Go 1.9): 1.16
@Z3r0Sum Z3r0Sum added the Team:Delivery and tooling-requests labels Feb 18, 2022
@Z3r0Sum Z3r0Sum self-assigned this Feb 18, 2022