
Watches are not working for some controller after informers_map.go:204: watch of *v1alpha1.SFService ended with: too old resource version: 79387199 (79398464) #869

Closed
vivekzhere opened this issue Mar 20, 2020 · 5 comments
Labels
kind/support Categorizes issue or PR as a support question.
Comments

@vivekzhere

We have four controllers for four different CRs in our operator. Once in a while we see logs like this:

10:45:57.185826       1 reflector.go:326] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:204: watch of *v1alpha1.SFService ended with: too old
resource version: 79387199 (79398464)

These logs appear for three of the CRs; for one CR this log does not appear. After that, the controller for this one resource does not process any requests. Only a restart of the operator fixes this.

The CR for which the watch is failing has a huge number of resources (~30K), while the other resources number fewer than 10K each.

Also, we have two instances of the operator running with leader election. We see a pattern where this issue happens when the slave becomes master, goes down after a few minutes (for an update), and the other one becomes master again. In the first switch-over, the logs for all four resources are seen, but in the second switch-over only three resources are watched.

We are using controller-runtime 0.4.0.

(kubernetes/kubernetes#22024)
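
For context, here is a minimal sketch of how the manager and controllers are wired up. Import paths, the reconciler body, the namespace, and the leader-election ID below are placeholders, not our actual code:

```go
package main

import (
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Placeholder import path for our CRD types; the real module path differs.
	"example.com/operator/api/v1alpha1"
)

// SFServiceReconciler stands in for one of the four reconcilers.
type SFServiceReconciler struct {
	client.Client
}

func (r *SFServiceReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	// Actual reconcile logic omitted in this sketch.
	return ctrl.Result{}, nil
}

func main() {
	scheme := runtime.NewScheme()
	_ = v1alpha1.AddToScheme(scheme) // register the CRD types with the manager's scheme

	// One manager, shared by all four controllers, with leader election enabled.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme:                  scheme,
		LeaderElection:          true,
		LeaderElectionID:        "operator-leader-election", // placeholder ID
		LeaderElectionNamespace: "default",                  // placeholder namespace
	})
	if err != nil {
		panic(err)
	}

	// One controller per CR; SFService is the one whose watch stops delivering events.
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.SFService{}).
		Complete(&SFServiceReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	// ...the other three controllers are registered the same way...

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

All four controllers share this single manager and its informer cache.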

@alvaroaleman
Member

@vivekzhere the log message is generally nothing to worry about and expected to occasionally appear: kubernetes/kubernetes#22024 (comment)

After that the controller for this one resource is not processing any request. Only a restart of the operator fixes this.

That sounds like a bug. Does your reconciler still get requests and blocks when processing them or does it not even get them?

We see a pattern that this issue happens when the slave becomes master and after few minutes go down (for update) and the other one becomes master again.

I don't follow, are you referring to leader election with the slave/master terminology? If yes, I don't understand the "and the other one becomes master again" part. The one that initially did not have the leader lease can only get it if the current leader drops it. That should only happen if it is stopped (unless there are timeouts refreshing the lease). If it is stopped, how can it later on become master again?

Could you list the sequence in which the instances get started/stopped and get the leader lease?
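
To check which of the two it is, something like the following (a rough sketch against the 0.4.0 Reconcile signature; type and field names are just examples) at the very top of that reconciler would help:

```go
package controllers

import (
	"github.com/go-logr/logr"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Placeholder reconciler type; only the entry log line matters here.
type SFServiceReconciler struct {
	client.Client
	Log logr.Logger
}

func (r *SFServiceReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	// If this line stops appearing for SFService after the switch-over, the
	// controller is not getting requests at all; if it appears but the method
	// never returns, the reconciler is blocking while processing.
	r.Log.Info("reconcile request received", "request", req.NamespacedName)

	// ...existing reconcile logic...
	return ctrl.Result{}, nil
}
```

If the log line never shows up after the second switch-over, the problem is on the watch/queue side rather than in your reconcile logic.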

@vivekzhere
Author

vivekzhere commented Mar 23, 2020

@alvaroaleman

That sounds like a bug. Does your reconciler still get requests and blocks when processing them or does it not even get them?

No. The reconciler for this one controller is not getting any requests at all.

I don't follow, are you referring to leader election with the slave/master terminology?

Yes. I am referring to leader election with the slave/master terminology.

Could you list the sequence in which the instances get started/stopped and get the leader lease?

There are two instances of the operator running (inst-0 and inst-1).
Initially inst-0 is the leader. Then inst-0 goes down for an upgrade and inst-1 becomes the leader. The upgrade process takes some time (at least a couple of minutes), so inst-1 starts reconciling resources. At this point all controllers are working correctly in inst-1. Then inst-0 comes back up after the upgrade and waits for the leader lease. Then inst-1 goes down for its upgrade and inst-0 becomes the leader again. Now in inst-0 only three controllers are reconciling resources; the reconciler for one controller is not receiving any requests.

PS: inst-0 and inst-1 are deployed on two different VMs. We are not deploying the operator on k8s; we provide the kubeconfig for the k8s API server within the VM.

Now we have identified that the issue happens only during upgrades which require the recreation of these VMs; in such cases the issue happens every time. We do not see the issue during upgrades which just replace the operator binaries and do not recreate the VMs.

@alvaroaleman
Member

Now we have identified that the issue happens only during upgrades which require the recreation of these VMs; in such cases the issue happens every time. We do not see the issue during upgrades which just replace the operator binaries and do not recreate the VMs.

Okay, so are you sure this is an issue with controller-runtime then?

@vivekzhere
Author

@alvaroaleman We are not really sure this is a controller runtime issue anymore.

We were running `kustomize build config/crd | kubectl apply -f -` before starting inst-0 and inst-1, so during an update this was happening twice, which is redundant (but I assumed it should not cause any issue). We removed this step from inst-1, and after this change we are no longer able to reproduce the issue.

To elaborate a little more on our setup: we also have two API servers deployed along with the operator. On the VM on which inst-0 is running, an instance of the API server is also running, and inst-0 points to this API server. Similarly, on the VM on which inst-1 is running, an instance of the API server is also running, and inst-1 points to that API server. Both API servers point to the same etcd.

@vincepri vincepri added this to the Next milestone Mar 26, 2020
@vincepri
Member

/triage support
