
feat(esClientDrain): enhance Drain ES Client function #168

Open · wants to merge 5 commits into base: master

Conversation

@s-vkropotko commented Apr 7, 2021

One-line summary

Update Drain functionality

Description

Add the ability to set parameters for the retry logic used with the Resty lib.
Add handling for unfinished shards migration.

About shards migration and Drain behavior

There is a bug in the waitForEmptyEsNode function.
In this function, after the retry logic finishes, we don't check the response and proceed with deleting the pod, which can be wrong (shards can still be on the pod). Validation for this case is added.

Additional logic was also added with removeFromExcludeIPList: if the check for remaining shards does not finish successfully, we now remove the pod from the exclude list in the ES cluster settings (before, it was left in the exclude list).

The idea behind this is that ES-Op can change the scaling direction after waiting for the condition (migration of remaining shards from the pod), so it is better to return to the state from before the Drain was started.
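A minimal sketch of the added validation, assuming a hypothetical helper GetShards that returns the current shard allocation and the undoExcludePodIP helper visible in the diff further down; the real change lives in operator/es_client.go and may differ in details:

// Sketch only (requires "fmt" and v1 "k8s.io/api/core/v1"): after the retry loop in
// waitForEmptyEsNode finishes, verify that no shards remain on the pod's IP before the
// pod is deleted; if shards are still there, undo the exclude-list entry so the cluster
// returns to its pre-drain state instead of silently proceeding.
func (c *ESClient) ensureNoShardsLeft(pod *v1.Pod) error {
	shards, err := c.GetShards() // hypothetical helper returning the current shard allocation
	if err != nil {
		return err
	}
	for _, shard := range shards {
		if shard.IP == pod.Status.PodIP {
			if errExclude := c.undoExcludePodIP(pod); errExclude != nil {
				return errExclude
			}
			return fmt.Errorf("Cannot migrate shards from pod '%s' with IP '%s' within provided intervals",
				pod.ObjectMeta.Name, pod.Status.PodIP)
		}
	}
	return nil
}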

Types of Changes

What types of changes does your code introduce? Keep the ones that apply:

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)

Review

List of tasks the reviewer must do to review the PR

  • Tests

Deployment Notes

To test this behavior, ES-Op should be deployed with small values for the retry options:

        - --esclient-retry-count=2
        - --esclient-retry-waittime=1s
        - --esclient-retry-max-waittime=10s
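
For reference, a sketch of how these three options could map onto a Resty client, assuming they are passed straight through to the go-resty retry settings (import path shown for resty v2; the operator's actual flag wiring and resty version may differ):

package operator

import (
	"time"

	"github.com/go-resty/resty/v2"
)

// newRetryingClient applies the --esclient-retry-* options to a Resty client:
// number of retries, initial wait time, and maximum wait time between retries.
func newRetryingClient(retryCount int, waitTime, maxWaitTime time.Duration) *resty.Client {
	return resty.New().
		SetRetryCount(retryCount).       // --esclient-retry-count
		SetRetryWaitTime(waitTime).      // --esclient-retry-waittime
		SetRetryMaxWaitTime(maxWaitTime) // --esclient-retry-max-waittime
}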

Add an ability to set parameters for retry logic with Resty lib
Add handling for not finished shards migration

Signed-off-by: vkropotko <viacheslav.kropotko@zalando.de>
Signed-off-by: vkropotko <viacheslav.kropotko@zalando.de>
@otrosien (Member)

If I understand the PR correctly, it should fix this bug: #128

vkropotko added 2 commits April 14, 2021 20:21
Signed-off-by: vkropotko <viacheslav.kropotko@zalando.de>
Signed-off-by: vkropotko <viacheslav.kropotko@zalando.de>
@s-vkropotko (Author)

#128 Yes, I think so. Sorry, I didn't check the list of existing issues 😃

@otrosien (Member)

👍

@otrosien (Member)

@mikkeloscar mind taking a look as well?

@@ -59,7 +66,8 @@ func TestGetElasticsearchEndpoint(t *testing.T) {
customEndpoint, err := url.Parse(customURL)
assert.NoError(t, err)

esOperator = NewElasticsearchOperator(faker, nil, 1*time.Second, 1*time.Second, "", "", ".cluster.local.", customEndpoint)
esOperator = NewElasticsearchOperator(faker, nil, 1*time.Second, 1*time.Second, "", "", ".cluster.local.", customEndpoint,
Collaborator:

nit: let's pass the RetryConfig from here instead of individual arguments.
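
A sketch of what that could look like; the RetryConfig name and fields here are illustrative, not necessarily the PR's final API:

// RetryConfig groups the Resty retry options so NewElasticsearchOperator can take
// a single struct instead of several loose arguments (illustrative only).
type RetryConfig struct {
	Count       int
	WaitTime    time.Duration
	MaxWaitTime time.Duration
}

// Hypothetical call in the test above, using the values from the Deployment Notes:
// esOperator = NewElasticsearchOperator(faker, nil, 1*time.Second, 1*time.Second, "", "", ".cluster.local.", customEndpoint,
//     RetryConfig{Count: 2, WaitTime: 1 * time.Second, MaxWaitTime: 10 * time.Second})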

if shard.IP == podIP {
	err = fmt.Errorf("Cannot migrate shards from pod '%s' with IP '%s' within provided intervals", pod.ObjectMeta.Name, pod.Status.PodIP)
	// if we cannot remove the node, then return it back to the active nodes pool
	if errExclude := c.undoExcludePodIP(pod); errExclude != nil {
Collaborator:

While I understand the motivation for this, I'm not sure this is the best place to handle the issue you describe.

  • If we hit the drain timeout, the right thing would be to retry draining, which is handled elsewhere in operator.go. Otherwise we can end up draining a bit, re-enabling the pod so it gets shards back, and then attempting to drain it once more.
  • If we hit the drain timeout and the operator decides it's better to scale out and not drain (as I understand the issue this aims to solve), then I think it would be better to undo the draining in the operator.go code and not have it as a side effect of the ES client drain.

Author:

Sorry, I don't understand what you suggest.
The current implementation looks like this:
separate autoscaler -> getScalingDirection -> updateEDS(parameters)
ES cluster -> run operator -> run operator.operate() at some interval -> ... -> operatePods can run the Drain func directly or from the rescaleStatefulSet func -> control is transferred to elasticsearch.go -> Drain -> wait for draining with some retry settings

The idea of the current code is to give users the ability to stop the otherwise open-ended waiting for draining (this mostly happens because of a configuration mismatch between the EDS and ES indices, or because of some problems with the pods/cluster).
My assumption was: if users specify wrong settings and Drain stops working, they can check the logs and fix the settings for their case.
ES-Op only guarantees that Drain will be rolled back in case of failure.

If we hit the drain timeout the right thing would be to retry draining which is handled elsewhere in operator.go. Otherwise, we can end up draining a bit, re-enabling the pod so it gets shards back, and then attempt to drain it once more.

Am I right that you suggest adding an additional setting for this retry? Or do you suggest getting rid of the retry done with the Resty lib?
An additional setting at the operator level would produce behavior nearly identical to the already existing Resty settings, apart from checking the cluster status. It would also add complexity when calculating the final maximum waiting interval.

If we hit the drain timeout and the operator decides it's better to scale out and not drain, as I understand the issue described that this aims to solve, then I think it would be better to undo the draining in the operator.go code and not have it as a side effect of the es client drain.

So do you suggest wrapping the undoExcludePodIP func into a new func, e.g. undoDrain, and binding it to Drain in operator.go?
Is there a case where ES-Op receives an error from the Drain func and doesn't need to run undoDrain, besides the case where an additional retry option for Drain exists?
Also, I can try to implement the following functionality (a rough sketch of undoDrain follows this list):

  • if Drain is unsuccessful -> leave the pod in the exclude list and set an annotation/status on the EDS
  • during the next operator run, check that the previous status exists, and if ES-Op doesn't decide to try to Drain it again -> run undoDrain for it
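
A rough sketch of the undoDrain idea; the Drain signature, the log call, and the call-site placement in operator.go are assumptions for illustration, only undoExcludePodIP comes from the PR:

// undoDrain reverts the exclude-list change made by Drain so the node rejoins the
// active pool; operator.go decides when to call it (sketch only).
func (c *ESClient) undoDrain(ctx context.Context, pod *v1.Pod) error {
	return c.undoExcludePodIP(pod)
}

// Hypothetical call site in operator.go: if Drain fails and the operator does not
// want to retry draining this pod, roll the exclusion back explicitly.
if err := esClient.Drain(ctx, pod); err != nil {
	if undoErr := esClient.undoDrain(ctx, pod); undoErr != nil {
		log.Errorf("failed to undo drain for pod %s: %v", pod.Name, undoErr)
	}
	return err
}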

Author:

@mikkeloscar hey, could you please check my comment?

Member:

@mikkeloscar I would be good with this behaviour because it unblocks other cluster operations, but the users should have alerting on "Cannot migrate shards from pod" in the logs, because apparently there's some wrong configuration in the cluster to be fixed manually.

Collaborator:

Hey, I'm sorry for not following up here earlier.

My main point was that the code added here is, in my mind, operator code and should belong in the logic of operator.go, not in the client for calling ES.

Right now, what would happen on a drain timeout is that the drain would be undone by the new code added here, and the next time the operator loop runs it would select the exact same pod again (because it has the drain annotation) and start draining it once more, so you basically don't win anything by having it here, is my point.

One clear problem that I see with the current behavior (before this PR) is that on scale-out we don't stop the draining of the pod and undo the drain, which would make sense if we were only draining the pod for a scale-down. However, if the pod is being drained because a node is being shut down, then draining is exactly what we want, and IMO it shouldn't be stopped.

Apart from the scale-out scenario, I don't completely follow the idea of stopping the "infinite" drain in case the cluster is misconfigured. What would be the alternative? To stop the draining and then do what? I agree with @otrosien that the operators/owners of the EDS should have some sort of alert on these kinds of misconfigurations, because it's not something the es-operator is able to handle by itself, so from my perspective it's best that it just tries to drain and a human gets alerted if the drain takes longer than expected (maybe we can create better logs or other metrics to make it easier to create such alerts?).
@otrosien you mentioned that it "unblocks other cluster operations". Are there operations other than the ones I mentioned which are blocked during the drain? Maybe I'm missing some knowledge about this and am judging it wrong?

Author:

Hey, sorry for the delayed response.
Let's clarify several things:

  1. I think I wasn't so clear when describing the current behavior; it's not an infinite cycle waiting for the Drain condition.
    The current behavior works as follows:
    the waitForEmptyEsNode function performs up to 999 retries with an increasing interval (from the SetRetryWaitTime to the SetRetryMaxWaitTime value, or 10 -> 30 seconds in our case),
    so it takes hours until it finishes.
    Then in waitForEmptyEsNode we ignore the response
    https://github.com/zalando-incubator/es-operator/pull/168/files#diff-67b73dbcc9b2b625a0749cfc62f02ad480ccd9d40abd4664e1455f01ff446965L282
    and check only err, which will be nil because unsuccessful retries are not an error condition
    (instead, we need to check another parameter - https://github.com/zalando-incubator/es-operator/pull/168/files#diff-67b73dbcc9b2b625a0749cfc62f02ad480ccd9d40abd4664e1455f01ff446965R367; see the sketch after this list).
    After this we remove the target node (pod) from the K8S cluster even though this node was not correctly removed from the ES cluster.
  2. About unblocking other cluster operations, for example:
    We have a cluster that is stuck in a scale-down operation, e.g. it cannot migrate shards to other data nodes.
    If the user updated the EDS (increased minReplicas) and the current scaling direction becomes UP, then in the current implementation (before this PR) ES-Op will continue retrying the scale-down.
    If a user (or some system condition) restarts ES-Op in this case, ES-Op will start the expected scale-up operation. But here we have another issue - the IP address of the node that ES-Op tried to remove will still be present in exclude_list.
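
For point 1, a sketch of the missing check (the _cat/shards call, the parsing, and the variable names are simplified assumptions; the PR's actual check may differ): when Resty exhausts its retries, err is still nil, so the last response has to be inspected before the pod is deleted.

// Sketch (requires "encoding/json" and "fmt"): do not rely on err alone; when all
// retries are used up, Resty returns err == nil together with the last response.
resp, err := client.R().Get(baseURL + "/_cat/shards?format=json")
if err != nil {
	return err
}
var shards []struct {
	IP string `json:"ip"`
}
if err := json.Unmarshal(resp.Body(), &shards); err != nil {
	return err
}
for _, shard := range shards {
	if shard.IP == pod.Status.PodIP {
		// retries exhausted but shards are still on the node: do not delete the pod
		return fmt.Errorf("Cannot migrate shards from pod '%s' with IP '%s' within provided intervals",
			pod.ObjectMeta.Name, pod.Status.PodIP)
	}
}
return nil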

In the PR implementation it will work as before (I didn't change the existing retry configuration), but we will have a graceful exit (ref 1) from a failed Drain. Also, users will have the ability to configure a custom timeout for Drain, which is useful for cases with regular scale in/down operations (ref 2), like scaling an EDS via CronJobs.

So according to your first comment, I think I can keep the fix for 1), and for 2) add a function undoDrain, which contains undoExclude, to operator.go.

Author:

Also, we can add an additional parameter to control the behavior in the undoDrain case.
@mikkeloscar

@otrosien (Member)

I would like to take another look at this.

@otrosien reopened this Dec 17, 2021