RollingUpgrade not consistent #309

Open · preflightsiren opened this issue Oct 12, 2021 · 18 comments
@preflightsiren

Is this a BUG REPORT or FEATURE REQUEST?: Bug

What happened: A RollingUpgrade created after a modification to launch templates via instance-manager did not detect any nodes that needed to be recycled. To resolve it, the RollingUpgrade is deleted and instance-manager is restarted, which recreates the RollingUpgrade.

What you expected to happen: Node is detected as being out-of-sync and replaced.

How to reproduce it (as minimally and precisely as possible): I can't consistently recreate this, but this happens often during our monthly patching cycle.

Anything else we need to know?:

Environment:

  • rolling-upgrade-controller version: 1.0.2
  • Kubernetes version : 1.19.7 | 1.20.10
$ kubectl version -o yaml

Other debugging information (if applicable):

  • RollingUpgrade status:
$ kubectl describe rollingupgrade <rollingupgrade-name>

The RollingUpgrade has already been replaced. I did see that the state was "completed".

  • controller logs:
$ kubectl logs <rolling-upgrade-controller pod>
2021-10-12T00:16:48.561Z INFO controllers.RollingUpgrade admitted new rolling upgrade {"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":2147483647}, "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
--
2021-10-12T00:16:55.636Z INFO controllers.RollingUpgrade scaling group details {"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "desiredInstances": 1, "launchConfig": "", "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:16:55.736Z INFO controllers.RollingUpgrade checking if rolling upgrade is completed {"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:16:55.736Z INFO controllers.RollingUpgrade no drift in scaling group {"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:16:55.836Z INFO controllers.RollingUpgrade rolling upgrade ended {"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6", "status": "completed"}
2021-10-12T00:17:25.837Z INFO controllers.RollingUpgrade rolling upgrade ended {"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6", "status": "completed"}
2021-10-12T00:17:47.935Z INFO controllers.RollingUpgrade rolling upgrade ended {"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6", "status": "completed"}

***** RollingUpgrade is deleted and instance-manager restarted *****

2021-10-12T00:20:59.590Z INFO controllers.RollingUpgrade rolling upgrade resource not found, deleted object from admission map {"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:21:59.133Z INFO controllers.RollingUpgrade admitted new rolling upgrade {"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":2147483647}, "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z INFO controllers.RollingUpgrade scaling group details {"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "desiredInstances": 1, "launchConfig": "", "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z INFO controllers.RollingUpgrade checking if rolling upgrade is completed {"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z INFO controllers.RollingUpgrade drift detected in scaling group {"driftedInstancesCount/DesiredInstancesCount": "(1/1)", "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z INFO controllers.RollingUpgrade selecting batch for rotation {"batch size": 1, "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z INFO controllers.RollingUpgrade rotating batch {"instances": ["i-0017fd066bbfd0e32"], "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z INFO controllers.RollingUpgrade setting instances to in-progress {"batch": ["i-0017fd066bbfd0e32"], "instances(InService)": ["i-0017fd066bbfd0e32"], "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:02.032Z INFO controllers.RollingUpgrade setting instances to stand-by {"batch": ["i-0017fd066bbfd0e32"], "instances(InService)": ["i-0017fd066bbfd0e32"], "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:02.433Z INFO controllers.RollingUpgrade operating on existing rolling upgrade {"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":2147483647}, "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:05.033Z INFO controllers.RollingUpgrade scaling group details {"scalingGroup": "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", "desiredInstances": 1, "launchConfig": "", "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:05.033Z INFO controllers.RollingUpgrade checking if rolling upgrade is completed {"name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:05.033Z INFO controllers.RollingUpgrade drift detected in scaling group {"driftedInstancesCount/DesiredInstancesCount": "(1/1)", "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
@preflightsiren
Author

I have two ideas where the fault could be:

  1. In IsScalingGroupDrifted(), awsprovider.SelectScalingGroup(r.RollingUpgrade.ScalingGroupName(), r.Cloud.ScalingGroups) might return an empty &autoscaling.Group{}, leaving the for _, instance := range scalingGroup.Instances loop with nothing to iterate over, so no work gets done (sketched below).
  2. IsInstanceDrifted() might not detect any changes due to an error in the AWS request or caching.

Unfortunately there's no more verbose logging I can enable to validate either theory.
I'm going to try setting forceRefresh: true in the InstanceGroups, which might help diagnose which is more likely.
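
To illustrate hypothesis 1, here is a minimal, self-contained sketch; selectScalingGroup below is a stand-in for awsprovider.SelectScalingGroup, not the project's actual code. If the lookup misses the cache, the returned empty group has no Instances, the drift loop never executes, and the controller would conclude there is no drift.

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// selectScalingGroup is a stand-in for awsprovider.SelectScalingGroup: it scans
// the cached groups by name and returns an empty struct on a miss.
func selectScalingGroup(name string, groups []*autoscaling.Group) *autoscaling.Group {
	for _, g := range groups {
		if aws.StringValue(g.AutoScalingGroupName) == name {
			return g
		}
	}
	return &autoscaling.Group{} // cache miss: no name, no Instances
}

func main() {
	cached := []*autoscaling.Group{} // stale or empty cache
	group := selectScalingGroup("uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b", cached)

	drifted := 0
	for _, instance := range group.Instances { // zero iterations on a miss
		_ = instance
		drifted++
	}
	// With no instances inspected, the controller would report "no drift in scaling group".
	fmt.Printf("drifted instances: %d\n", drifted)
}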

@eytan-avisror
Contributor

eytan-avisror commented Oct 12, 2021

This can be due to caching. Does the modification to the launch template happen outside of instance-manager, i.e. manually or through some other process? When instance-manager makes a change (e.g. you change some configuration), caching is not in the picture, since that write invalidates the cached item.

upgrade-manager flushes the cache before every upgrade, so when a CR is received the cache is clean, as far as I know.
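
For readers unfamiliar with the pattern being described, here is a minimal sketch of write-invalidation caching; the type and method names are illustrative, not upgrade-manager's actual cache API. Reads are served from the cache until a write for that key evicts the entry, so changes made through the controller are never served stale, while out-of-band changes can be.

package main

import (
	"fmt"
	"sync"
)

// invalidatingCache is an illustrative cache where any write through the
// controller evicts the cached entry, forcing the next read to go back to the
// source of truth (e.g. the EC2 API). Out-of-band changes never hit Invalidate,
// so stale reads can persist.
type invalidatingCache struct {
	mu    sync.Mutex
	items map[string]string
}

func (c *invalidatingCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.items[key]
	return v, ok
}

func (c *invalidatingCache) Put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = value
}

func (c *invalidatingCache) Invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.items, key)
}

func main() {
	c := &invalidatingCache{items: map[string]string{}}
	c.Put("launch-template-version", "3")

	// A change made through the controller invalidates the entry...
	c.Invalidate("launch-template-version")
	if _, ok := c.Get("launch-template-version"); !ok {
		fmt.Println("cache miss: next read refetches from AWS")
	}
}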

@eytan-avisror
Contributor

@preflightsiren Can you add your instance-group YAML and the resulting rolling-upgrade YAML as well?

@preflightsiren
Author

Interesting. The changes are made via instance-manager. We just changed it to use forceRefresh: true, which tells me there's something inside IsInstanceDrifted() that is acting weird.

I'll try and get more details when we have a failure.

@preflightsiren
Author

preflightsiren commented Oct 18, 2021

Had this happen today during patching. I've attached an example of a working and an affected InstanceGroup and RollingUpgrade YAML: 309.zip

I think the more interesting part is the logs for the broken RollingUpgrade:



2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade checking if rolling upgrade is completed {"name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
--
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-005a537c368db9bc8", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-052105961e93c27df", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-056a89fcf2f036606", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0631bae2cdbaded68", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0692b82e4cd26c1d8", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0a6180cc3f1658bb9", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0f7155d519873057f", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade no drift in scaling group {"name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}


@preflightsiren
Author

Looks like upgrade.go#457 is being hit, meaning kubeprovider.SelectNodeByInstanceID(instanceID, r.Cloud.ClusterNodes) is failing.
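
For reference, a rough sketch of what that lookup amounts to (a paraphrase, not the project's exact implementation): the instance ID is matched against the suffix of each cached node's spec.providerID, so any node missing from r.Cloud.ClusterNodes makes the lookup return nil and the instance is skipped, producing the "node object not found in clusterNodes" lines above.

package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// selectNodeByInstanceID is a paraphrase of kubeprovider.SelectNodeByInstanceID:
// the AWS providerID looks like "aws:///us-west-2b/i-00b789cfd28cd5942", so the
// instance ID is the last path segment.
func selectNodeByInstanceID(instanceID string, nodes []corev1.Node) *corev1.Node {
	for i := range nodes {
		parts := strings.Split(nodes[i].Spec.ProviderID, "/")
		if len(parts) > 0 && parts[len(parts)-1] == instanceID {
			return &nodes[i]
		}
	}
	return nil // node not cached -> "node object not found in clusterNodes"
}

func main() {
	cached := []corev1.Node{} // an incomplete node cache reproduces the skipped-node logs
	if selectNodeByInstanceID("i-005a537c368db9bc8", cached) == nil {
		fmt.Println("skipping this node for now")
	}
}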

@preflightsiren
Author

Finally got around to spending some time debugging; it looks like the r.Cloud state becomes corrupt / out of date.
The raw log:

{"cluster":"uw2p-akp-b3","kubernetes":{"container_id":"docker://97face6b823c2bb6efac2b7a5c3f44f6401031f6327a7d99dc5c90aab93c65a3","container_image":"493203180918.dkr.ecr.us-west-2.amazonaws.com/akp-mirror/docker.io/keikoproj/rolling-upgrade-controller:v1.0.2-seb","container_name":"rolling-upgrade-controller","host":"ip-172-24-138-175.us-west-2.compute.internal","labels":{"app":"rolling-upgrade-controller","pod-template-hash":"76cd7b4ccd"},"namespace_name":"upgrade-manager","pod_id":"bee06a61-5ae5-4141-b27d-dadd0bae0eeb","pod_ip":"100.65.225.107","pod_name":"rolling-upgrade-controller-76cd7b4ccd-rm6tj","pod_owner":"ReplicaSet/rolling-upgrade-controller-76cd7b4ccd"},"log":"2022-02-14T01:18:54.252Z\tINFO\tcontrollers.RollingUpgrade\tNil Node details\t{\"instanceID\": \"i-00b789cfd28cd5942\", \"clusterNodes\": [\"ip-172-24-137-169.us-west-2.compute.internal\", \"ip-172-24-138-30.us-west-2.compute.internal\", \"ip-172-24-137-205.us-west-2.compute.internal\", \"ip-172-24-139-28.us-west-2.compute.internal\", \"ip-172-24-138-231.us-west-2.compute.internal\", \"ip-172-24-139-241.us-west-2.compute.internal\", \"ip-172-24-138-206.us-west-2.compute.internal\", \"ip-172-24-136-136.us-west-2.compute.internal\", \"ip-172-24-141-90.us-west-2.compute.internal\", \"ip-172-24-140-216.us-west-2.compute.internal\", \"ip-172-24-136-181.us-west-2.compute.internal\", \"ip-172-24-137-217.us-west-2.compute.internal\", \"ip-172-24-139-39.us-west-2.compute.internal\", \"ip-172-24-136-125.us-west-2.compute.internal\", \"ip-172-24-137-230.us-west-2.compute.internal\", \"ip-172-24-136-128.us-west-2.compute.internal\"]}","stream":"stderr"}

The node being rotated:
ip-172-24-138-175.us-west-2.compute.internal
providerID: aws:///us-west-2b/i-00b789cfd28cd5942

The clusterNodes list contains 15 nodes, while the cluster has 145 nodes present, 21 of them uncordoned.

I could continue to dig, but it may just be worthwhile periodically restarting upgrade-manager like it's an NT2k program :)

@eytan-avisror
Contributor

@preflightsiren Can you check out #317 and see if that fixes the issue? I believe it should.

@preflightsiren
Author

Maybe. It seems unlikely, as the cache contains the list of nodes in the cluster, not information from EC2.
Is there a connection I'm not aware of?
I'll pull this patch into my debug branch and see if we still see the error.

@eytan-avisror
Contributor

eytan-avisror commented Feb 19, 2022

The cache contains the launch template / versions; if a new version is created and the controller is unaware of it, it would essentially skip the upgrade.
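
A hedged illustration of that failure mode, using AWS SDK types but a deliberately simplified comparison (the real IsInstanceDrifted check is more involved): if the cached scaling group still carries the old launch template version, every instance appears up to date and no drift is reported.

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// isInstanceDrifted is a simplified stand-in for the controller's drift check:
// it compares an instance's launch template version against the version the
// (possibly cached) scaling group says is current.
func isInstanceDrifted(instance *autoscaling.Instance, group *autoscaling.Group) bool {
	if instance.LaunchTemplate == nil || group.LaunchTemplate == nil {
		return false
	}
	return aws.StringValue(instance.LaunchTemplate.Version) !=
		aws.StringValue(group.LaunchTemplate.Version)
}

func main() {
	instance := &autoscaling.Instance{
		InstanceId: aws.String("i-0017fd066bbfd0e32"),
		LaunchTemplate: &autoscaling.LaunchTemplateSpecification{
			Version: aws.String("3"),
		},
	}

	// Cached view of the group, unaware that instance-manager created version 4.
	staleGroup := &autoscaling.Group{
		LaunchTemplate: &autoscaling.LaunchTemplateSpecification{Version: aws.String("3")},
	}
	fmt.Println("drift detected:", isInstanceDrifted(instance, staleGroup)) // false -> upgrade skipped

	freshGroup := &autoscaling.Group{
		LaunchTemplate: &autoscaling.LaunchTemplateSpecification{Version: aws.String("4")},
	}
	fmt.Println("drift detected:", isInstanceDrifted(instance, freshGroup)) // true
}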

@preflightsiren
Author

OK, I've deployed the patched version to our environment. 🤞
We'll know with more certainty in about 2-3 weeks.

@preflightsiren
Author

Even with #317 we still have the issue. I added some debug logging and can see:

Nil Node details	{"instanceID": "i-05d7694376d812689", "clusterNodes": []}

the r.Cloud.ClusterNodes is empty

@eytan-avisror
Contributor

eytan-avisror commented Apr 12, 2022

Not sure how this is the case. @shreyas-badiger, any idea?
Could this be a bug in how we use event filters here:

// NodesEventHandler will fetch us the nodes on corresponding events, an alternative to doing explicit API calls.
func (r *RollingUpgradeReconciler) NodeEventsHandler() predicate.Predicate {
	return predicate.Funcs{
		CreateFunc: func(e event.CreateEvent) bool {
			nodeObj, ok := e.Object.(*corev1.Node)
			if ok {
				nodeName := e.Object.GetName()
				log.Debug("nodeEventsHandler[create] nodeObj created, stored in sync map", "nodeName", nodeName)
				r.ClusterNodesMap.Store(nodeName, nodeObj)
				return false
			}
			return true
		},
		UpdateFunc: func(e event.UpdateEvent) bool {
			nodeObj, ok := e.ObjectNew.(*corev1.Node)
			if ok {
				nodeName := e.ObjectNew.GetName()
				log.Debug("nodeEventsHandler[update] nodeObj updated, updated in sync map", "nodeName", nodeName)
				r.ClusterNodesMap.Store(nodeName, nodeObj)
				return false
			}
			return true
		},
		DeleteFunc: func(e event.DeleteEvent) bool {
			_, ok := e.Object.(*corev1.Node)
			if ok {
				nodeName := e.Object.GetName()
				r.ClusterNodesMap.Delete(nodeName)
				log.Debug("nodeEventsHandler[delete] - nodeObj not found, deleted from sync map", "name", nodeName)
				return false
			}
			return true
		},
	}
}

Do we ever do an initial list of nodes when the controller starts up? If we don't, it will take the controller 30-60 seconds to fill up the cache from node events initially.

@preflightsiren Is it possible that this happens when the controller restarts? Or is it spun up right when an upgrade is submitted?
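
If an initial list turns out to be missing, a minimal sketch of seeding the sync map at startup could look like the following; this assumes a controller-runtime client on the reconciler, and the kubeClient and clusterNodesMap parameters are stand-ins for whatever fields the reconciler actually has.

package controllers

import (
	"context"
	"sync"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// seedClusterNodes lists all nodes once at startup and stores them in the sync
// map, so the node cache is warm before any node events arrive.
func seedClusterNodes(ctx context.Context, kubeClient client.Client, clusterNodesMap *sync.Map) error {
	var nodeList corev1.NodeList
	if err := kubeClient.List(ctx, &nodeList); err != nil {
		return err
	}
	for i := range nodeList.Items {
		node := nodeList.Items[i]
		clusterNodesMap.Store(node.GetName(), &node)
	}
	return nil
}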

@eytan-avisror
Contributor

I just checked, and that's not the case: even with event filters, the watch lists the nodes immediately when it's registered.
Let me dig further and see if I can understand why this is happening.
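
To illustrate the behaviour being relied on here, a small client-go sketch (illustration only, not the controller's actual wiring): a shared informer performs an initial List when started, so the Add handler fires once for every node that already exists before any live watch events arrive. Controller-runtime watches are backed by the same shared-informer machinery, so registering the Node watch should populate the map for pre-existing nodes.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Minute)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// AddFunc is called for every existing node during the initial List,
		// then for genuinely new nodes as they join.
		AddFunc: func(obj interface{}) {
			if node, ok := obj.(*corev1.Node); ok {
				fmt.Println("seeded node:", node.GetName())
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	// At this point every pre-existing node has been delivered as an Add event.
}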

@preflightsiren
Author

Is it possible that this happens when the controller restarts?
No, the controller was not restarted during the upgrade.

I have an example of a node that joined the cluster long before the RollingUpgrade was created, was not found in the list of ClusterNodes, and the ClusterNodes list was not nil (it contained 143 nodes).

Hope this adds some more flavour :)

@shreyas-badiger
Collaborator

Even with #317 we still have the issue. I added some debug logging and can see:

Nil Node details	{"instanceID": "i-05d7694376d812689", "clusterNodes": []}

the r.Cloud.ClusterNodes is empty

@preflightsiren Can you confirm whether the nodes were "Ready" or "Not Ready" at this point in time? I'm interested to know whether the nodes were already part of the cluster or not.

@preflightsiren
Author

Sorry for the delay in the response; we've been busy debugging during the upgrade process.
During the upgrade process there's always a mix of ready and unready nodes, but there are always ready nodes in the cluster servicing requests.

The one change we've applied that seems to have the best result is setting --max-parallel=1; since we applied that, every environment and region has completed without issue.

@CAR6807

CAR6807 commented Dec 14, 2022

Bump. I've run into this when the new node coming up never joins the cluster (bad AMI update, etc.) but the EC2 instance is a valid member of the ASG.
In my case I reverted the bad AMI, but the RollingUpgrade is stuck since the new node never joins, so it can't continue: the cache map of kube nodes and instances is out of whack.
Force-removing the instances from the ASG and recycling the controller (to clear the cache map), then redeploying the CRDs, works.
Love the project, btw. The issue happens every now and then.
