
feat(esClientDrain): enhance Drain ES Client function #168

Open · wants to merge 5 commits into base: master

Conversation

@s-vkropotko commented Apr 7, 2021

One-line summary

Update Drain functionality

Description

Add the ability to set parameters for the retry logic used with the Resty lib.
Add handling for unfinished shards migration.

About shards migration and Drain behavior

There is a bug in the waitForEmptyEsNode function.
In this function, after the retry logic finishes, we don't check the response and proceed with deleting the pod, which can be wrong (shards can still be on the pod). Validation for this case is added.

Additional logic was also added with removeFromExcludeIPList: if the check for remaining shards does not finish successfully, we now remove the pod from the exclude list in the ES cluster settings (before, it was left in the exclude list).

The idea behind this is that ES-Op can change the scaling direction after waiting for the condition (migration of remaining shards from the pod), so it is better to return to the state from before the Drain was started.
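A minimal sketch of the added validation, assuming a hypothetical helper GetShards that returns the current shard allocation and the undoExcludePodIP helper visible in the diff further down; the real change lives in operator/es_client.go and may differ in details:

// Sketch only (requires "fmt" and v1 "k8s.io/api/core/v1"): after the retry loop in
// waitForEmptyEsNode finishes, verify that no shards remain on the pod's IP before the
// pod is deleted; if shards are still there, undo the exclude-list entry so the cluster
// returns to its pre-drain state instead of silently proceeding.
func (c *ESClient) ensureNoShardsLeft(pod *v1.Pod) error {
	shards, err := c.GetShards() // hypothetical helper returning the current shard allocation
	if err != nil {
		return err
	}
	for _, shard := range shards {
		if shard.IP == pod.Status.PodIP {
			if errExclude := c.undoExcludePodIP(pod); errExclude != nil {
				return errExclude
			}
			return fmt.Errorf("Cannot migrate shards from pod '%s' with IP '%s' within provided intervals",
				pod.ObjectMeta.Name, pod.Status.PodIP)
		}
	}
	return nil
}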

Types of Changes

What types of changes does your code introduce? Keep the ones that apply:

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)

Review

List of tasks the reviewer must do to review the PR

  • Tests

Deployment Notes

To test this behavior, ES-Op should be deployed with small values for the retry options:

        - --esclient-retry-count=2
        - --esclient-retry-waittime=1s
        - --esclient-retry-max-waittime=10s
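
For reference, a sketch of how these three options could map onto a Resty client, assuming they are passed straight through to the go-resty retry settings (import path shown for resty v2; the operator's actual flag wiring and resty version may differ):

package operator

import (
	"time"

	"github.com/go-resty/resty/v2"
)

// newRetryingClient applies the --esclient-retry-* options to a Resty client:
// number of retries, initial wait time, and maximum wait time between retries.
func newRetryingClient(retryCount int, waitTime, maxWaitTime time.Duration) *resty.Client {
	return resty.New().
		SetRetryCount(retryCount).       // --esclient-retry-count
		SetRetryWaitTime(waitTime).      // --esclient-retry-waittime
		SetRetryMaxWaitTime(maxWaitTime) // --esclient-retry-max-waittime
}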

Add an ability to set parameters for retry logic with Resty lib
Add handling for not finished shards migration

Signed-off-by: vkropotko <viacheslav.kropotko@zalando.de>
Signed-off-by: vkropotko <viacheslav.kropotko@zalando.de>
@otrosien (Member)

If I understand the PR correctly, it should fix this bug: #128

vkropotko added 2 commits April 14, 2021 20:21
Signed-off-by: vkropotko <viacheslav.kropotko@zalando.de>
Signed-off-by: vkropotko <viacheslav.kropotko@zalando.de>
@s-vkropotko (Author)

#128 Yes, I think so. Sorry, I didn't check the list of existing issues 😃

@otrosien (Member)

👍

@otrosien (Member)

@mikkeloscar mind taking a look as well?

@@ -59,7 +66,8 @@ func TestGetElasticsearchEndpoint(t *testing.T) {
customEndpoint, err := url.Parse(customURL)
assert.NoError(t, err)

esOperator = NewElasticsearchOperator(faker, nil, 1*time.Second, 1*time.Second, "", "", ".cluster.local.", customEndpoint)
esOperator = NewElasticsearchOperator(faker, nil, 1*time.Second, 1*time.Second, "", "", ".cluster.local.", customEndpoint,
Collaborator:

nit: let's pass the RetryConfig from here instead of individual arguments.
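
A sketch of what that could look like; the RetryConfig name and fields here are illustrative, not necessarily the PR's final API:

// RetryConfig groups the Resty retry options so NewElasticsearchOperator can take
// a single struct instead of several loose arguments (illustrative only).
type RetryConfig struct {
	Count       int
	WaitTime    time.Duration
	MaxWaitTime time.Duration
}

// Hypothetical call in the test above, using the values from the Deployment Notes:
// esOperator = NewElasticsearchOperator(faker, nil, 1*time.Second, 1*time.Second, "", "", ".cluster.local.", customEndpoint,
//     RetryConfig{Count: 2, WaitTime: 1 * time.Second, MaxWaitTime: 10 * time.Second})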

if shard.IP == podIP {
	err = fmt.Errorf("Cannot migrate shards from pod '%s' with IP '%s' within provided intervals", pod.ObjectMeta.Name, pod.Status.PodIP)
	// if we cannot remove the node, then return it back to the active nodes pool
	if errExclude := c.undoExcludePodIP(pod); errExclude != nil {
Collaborator:

While I understand the motivation for this, I'm not sure this is the best place to handle the issue you describe.

  • If we hit the drain timeout, the right thing would be to retry draining, which is handled elsewhere in operator.go. Otherwise we can end up draining a bit, re-enabling the pod so it gets shards back, and then attempting to drain it once more.
  • If we hit the drain timeout and the operator decides it's better to scale out and not drain (as I understand the issue this aims to solve), then I think it would be better to undo the draining in the operator.go code and not have it as a side effect of the ES client drain.

Author:

Sorry, I don't understand what you suggest.
The current implementation looks like this:
separate autoscaler -> getScalingDirection -> updateEDS(parameters)
ES cluster -> run operator -> run operator.operate() at some interval -> ... -> operatePods can run the Drain func directly or from the rescaleStatefulSet func -> control is transferred to elasticsearch.go -> Drain -> wait for draining with some retry settings

The idea of the current code is to give users the ability to stop the otherwise open-ended waiting for draining (this mostly happens because of a configuration mismatch between the EDS and ES indices, or because of some problems with the pods/cluster).
My assumption was: if users specify wrong settings and Drain stops working, they can check the logs and fix the settings for their case.
ES-Op only guarantees that Drain will be rolled back in case of failure.

If we hit the drain timeout the right thing would be to retry draining which is handled elsewhere in operator.go. Otherwise, we can end up draining a bit, re-enabling the pod so it gets shards back, and then attempt to drain it once more.

Am I right that you suggest adding an additional setting for this retry? Or do you suggest getting rid of the retry done with the Resty lib?
An additional setting at the operator level would produce behavior nearly identical to the already existing Resty settings, apart from checking the cluster status. It would also add complexity when calculating the final maximum waiting interval.

If we hit the drain timeout and the operator decides it's better to scale out and not drain, as I understand the issue described that this aims to solve, then I think it would be better to undo the draining in the operator.go code and not have it as a side effect of the es client drain.

So do you suggest wrapping the undoExcludePodIP func into a new func, e.g. undoDrain, and binding it to Drain in operator.go?
Is there a case where ES-Op receives an error from the Drain func and doesn't need to run undoDrain, besides the case where an additional retry option for Drain exists?
Also, I can try to implement the following functionality (a rough sketch of undoDrain follows this list):

  • if Drain is unsuccessful -> leave the pod in the exclude list and set an annotation/status on the EDS
  • during the next operator run, check that the previous status exists, and if ES-Op doesn't decide to try to Drain it again -> run undoDrain for it
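
A rough sketch of the undoDrain idea; the Drain signature, the log call, and the call-site placement in operator.go are assumptions for illustration, only undoExcludePodIP comes from the PR:

// undoDrain reverts the exclude-list change made by Drain so the node rejoins the
// active pool; operator.go decides when to call it (sketch only).
func (c *ESClient) undoDrain(ctx context.Context, pod *v1.Pod) error {
	return c.undoExcludePodIP(pod)
}

// Hypothetical call site in operator.go: if Drain fails and the operator does not
// want to retry draining this pod, roll the exclusion back explicitly.
if err := esClient.Drain(ctx, pod); err != nil {
	if undoErr := esClient.undoDrain(ctx, pod); undoErr != nil {
		log.Errorf("failed to undo drain for pod %s: %v", pod.Name, undoErr)
	}
	return err
}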

Author:

@mikkeloscar hey, could you please check my comment?

Member:

@mikkeloscar I would be good with this behaviour because it unblocks other cluster operations, but the users should have alerting on "Cannot migrate shards from pod" in the logs, because apparently there's some wrong configuration in the cluster to be fixed manually.

Collaborator:

Hey, I'm sorry for not following up here earlier.

My main point was that the code added here is, in my mind, operator code and should belong in the logic of operator.go, not in the client for calling ES.

Right now, what would happen on a drain timeout is that the drain would be undone by the new code added here, and the next time the operator loop runs it would select the exact same pod again (because it has the drain annotation) and start draining it once more, so you basically don't win anything by having it here, is my point.

One clear problem that I see with the current behavior (before this PR) is that on scale-out we don't stop the draining of the pod and undo the drain, which would make sense if we were only draining the pod for a scale-down. However, if the pod is being drained because a node is being shut down, then draining is exactly what we want, and IMO it shouldn't be stopped.

Apart from the scale-out scenario, I don't completely follow the idea of stopping the "infinite" drain in case the cluster is misconfigured. What would be the alternative? To stop the draining and then do what? I agree with @otrosien that the operators/owners of the EDS should have some sort of alert on these kinds of misconfigurations, because it's not something the es-operator is able to handle by itself, so from my perspective it's best that it just tries to drain and a human gets alerted if the drain takes longer than expected (maybe we can create better logs or other metrics to make it easier to create such alerts?).
@otrosien you mentioned that it "unblocks other cluster operations". Are there operations other than the ones I mentioned which are blocked during the drain? Maybe I'm missing some knowledge about this and am judging it wrong?

Author:

Hey, sorry for the delayed response.
Let's clarify several things:

  1. I think I wasn't so clear when describing the current behavior; it's not an infinite cycle waiting for the Drain condition.
    The current behavior works as follows:
    the waitForEmptyEsNode function performs up to 999 retries with an increasing interval (from the SetRetryWaitTime to the SetRetryMaxWaitTime value, or 10 -> 30 seconds in our case),
    so it takes hours until it finishes.
    Then in waitForEmptyEsNode we ignore the response
    https://github.com/zalando-incubator/es-operator/pull/168/files#diff-67b73dbcc9b2b625a0749cfc62f02ad480ccd9d40abd4664e1455f01ff446965L282
    and check only err, which will be nil because unsuccessful retries are not an error condition
    (instead, we need to check another parameter - https://github.com/zalando-incubator/es-operator/pull/168/files#diff-67b73dbcc9b2b625a0749cfc62f02ad480ccd9d40abd4664e1455f01ff446965R367; see the sketch after this list).
    After this we remove the target node (pod) from the K8S cluster even though this node was not correctly removed from the ES cluster.
  2. About unblocking other cluster operations, for example:
    We have a cluster that is stuck in a scale-down operation, e.g. it cannot migrate shards to other data nodes.
    If the user updated the EDS (increased minReplicas) and the current scaling direction becomes UP, then in the current implementation (before this PR) ES-Op will continue retrying the scale-down.
    If a user (or some system condition) restarts ES-Op in this case, ES-Op will start the expected scale-up operation. But here we have another issue - the IP address of the node that ES-Op tried to remove will still be present in exclude_list.
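
For point 1, a sketch of the missing check (the _cat/shards call, the parsing, and the variable names are simplified assumptions; the PR's actual check may differ): when Resty exhausts its retries, err is still nil, so the last response has to be inspected before the pod is deleted.

// Sketch (requires "encoding/json" and "fmt"): do not rely on err alone; when all
// retries are used up, Resty returns err == nil together with the last response.
resp, err := client.R().Get(baseURL + "/_cat/shards?format=json")
if err != nil {
	return err
}
var shards []struct {
	IP string `json:"ip"`
}
if err := json.Unmarshal(resp.Body(), &shards); err != nil {
	return err
}
for _, shard := range shards {
	if shard.IP == pod.Status.PodIP {
		// retries exhausted but shards are still on the node: do not delete the pod
		return fmt.Errorf("Cannot migrate shards from pod '%s' with IP '%s' within provided intervals",
			pod.ObjectMeta.Name, pod.Status.PodIP)
	}
}
return nil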

In the PR implementation it will work as before (I didn't change the existing retry configuration), but we will have a graceful exit (ref 1) from a failed Drain. Also, users will have the ability to configure a custom timeout for Drain, which is useful for cases with regular scale in/down operations (ref 2), like scaling an EDS via CronJobs.

So according to your first comment, I think I can keep the fix for 1), and for 2) add a function undoDrain, which contains undoExclude, to operator.go.

Author:

Also, we can add an additional parameter to control the behavior in the undoDrain case.
@mikkeloscar

@otrosien (Member)

I would like to take another look at this.

@otrosien reopened this Dec 17, 2021