Use metadata only watches for configmaps and secrets resources on workload clusters #3887

Closed
akutz opened this issue Oct 29, 2020 · 7 comments

akutz (Contributor) commented on Oct 29, 2020

User Story

As an operator,
I would like CAPI controllers to use metadata-only watches for secrets resources on the management cluster,
Because otherwise CAPI has trouble scaling to large environments with hundreds/thousands of workload clusters.

As an operator,
I would like CAPI controllers to use metadata-only watches for configmaps and secrets resources on workload clusters,
Because otherwise CAPI has trouble scaling to large environments with hundreds/thousands of workload clusters.

As an operator,
I would like CAPI controllers to use metadata-only watches for configmaps and secrets resources on workload clusters,
Because otherwise anyone on a workload cluster can force an OOM error in the CABPK and KCP pods on the management cluster by creating enough configmaps or secrets resources to exceed the memory limit of those pods.

Detailed Description

Today, Cluster API watches [1] many different resources; here are some that are particularly relevant to this issue:

| Resource Type | Resource Origin | Controller | Controller Location | Reason |
| --- | --- | --- | --- | --- |
| configmaps | Management Cluster | CABPK | Management Cluster | Control plane init mutex |
| configmaps | Management Cluster | CRS | Management Cluster | React to changes for applying data into workload cluster |
| configmaps | Workload Cluster(s) | KCP | Management Cluster | CoreDNS |
| configmaps | Workload Cluster(s) | KCP | Management Cluster | Kubelet config |
| configmaps | Workload Cluster(s) | KCP | Management Cluster | Kubeadm config |
| nodes | Workload Cluster(s) | KCP | Management Cluster | Getting control plane nodes |
| nodes | Workload Cluster(s) | Machine | Management Cluster | Finding node ref |
| nodes | Workload Cluster(s) | MachinePool | Management Cluster | Finding node ref |
| secrets | Management Cluster | CABPK | Management Cluster | spec.files from secret ref |
| secrets | Management Cluster | CRS | Management Cluster | React to changes for applying data into workload cluster |
| secrets | Workload Cluster(s) | CABPK | Management Cluster | Bootstrap token refresh |

These resources are important, especially the configmaps and secrets. A secrets resource is created on the management cluster by CABPK for every machine in a workload cluster. In CAPI v1alpha2, before CABPK, this resource was around 16KiB in size; today it is 45KiB. That means that for 300 workload clusters with three nodes each, secrets resources alone will consume ~40MiB in the cache of every controller that watches, explicitly or implicitly, secrets resources.

The situation with configmaps is even more concerning. Because CAPI, from the management cluster, watches and caches resources from all workload clusters, at best it is simply not possible to accurately estimate the memory requirements of the CAPI pods. At worst, a rogue workload cluster administrator could willfully and maliciously get the CAPI pods running on the management cluster terminated with OOM errors by creating configmaps and secrets resources on a workload cluster.

While it will not solve any of these issues completely (metadata-only objects still carry labels and annotations, which can be arbitrarily large), CAPI should adopt metadata-only watches (kubernetes-sigs/controller-runtime#1174) for configmaps and secrets resources, opting to do live reads where the full object is needed. This is the minimum CAPI should consider, and in the future this strategy could be extended to additional resources as well.
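For illustration, here is a minimal sketch of what the combination of a metadata-only watch and live reads could look like with controller-runtime v0.7+, which includes the metadata-only support referenced above. The SecretReconciler type, its fields, and the package name are hypothetical, not the actual CAPI change:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// SecretReconciler keeps only Secret metadata in its cache and falls back to
// live, uncached reads when the full object is actually needed.
type SecretReconciler struct {
	Client    client.Client // cache-backed client (metadata only for Secrets)
	APIReader client.Reader // uncached reader for one-off live GETs
}

func (r *SecretReconciler) SetupWithManager(mgr ctrl.Manager) error {
	r.Client = mgr.GetClient()
	r.APIReader = mgr.GetAPIReader()
	return ctrl.NewControllerManagedBy(mgr).
		// builder.OnlyMetadata makes the controller watch and cache only
		// PartialObjectMetadata for Secrets, so the (potentially large)
		// payload never enters the informer cache.
		For(&corev1.Secret{}, builder.OnlyMetadata).
		Complete(r)
}

func (r *SecretReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Cached lookups see only labels, annotations, and other metadata.
	meta := &metav1.PartialObjectMetadata{}
	meta.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("Secret"))
	if err := r.Client.Get(ctx, req.NamespacedName, meta); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// When the data itself is required, do a live read that bypasses the
	// cache instead of letting the caching client start a full informer.
	secret := &corev1.Secret{}
	if err := r.APIReader.Get(ctx, req.NamespacedName, secret); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	return ctrl.Result{}, nil
}
```

The trade-off is that every read of the full object becomes an extra API call, which is the cost this issue proposes accepting in exchange for a bounded cache.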

Anything else you would like to add:

/kind feature

  [1] These may not be explicit watches. Per controller-runtime#1222 (Support APIReader/NoCache option for ControllerUtil.CreateOr(Update|Patch) functions to avoid OOM errors), the controller-runtime client sets up an implicit watch for the type of resource being fetched as part of a Get or List call.
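To make the footnote concrete, here is a small, hedged sketch contrasting the two read paths; readConfigMap, its arguments, and the package name are hypothetical, while GetClient and GetAPIReader are real Manager methods:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
)

// readConfigMap contrasts a cached read (which sets up an implicit watch)
// with an uncached read against the API server.
func readConfigMap(ctx context.Context, mgr ctrl.Manager, key types.NamespacedName) (*corev1.ConfigMap, error) {
	cm := &corev1.ConfigMap{}

	// mgr.GetClient() returns the delegating, cache-backed client. The first
	// Get or List for a type lazily starts an informer (the implicit watch
	// described above), which then caches every ConfigMap the controller can see:
	//
	//   err := mgr.GetClient().Get(ctx, key, cm)

	// mgr.GetAPIReader() returns an uncached client.Reader. A Get here is a
	// one-off live read and adds nothing to the cache.
	if err := mgr.GetAPIReader().Get(ctx, key, cm); err != nil {
		return nil, err
	}
	return cm, nil
}
```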
k8s-ci-robot added the kind/feature label on Oct 29, 2020
fabriziopandini (Member) commented:
/milestone v0.4.0

k8s-ci-robot added this to the v0.4.0 milestone on Oct 30, 2020
fabriziopandini (Member) commented:
Backporting would be nice, but it requires some work in older versions of controller-runtime.

ncdc (Contributor) commented on Oct 30, 2020

Other possible things to consider:

  • We could avoid using a caching client for remote clusters in ClusterCacheTracker, or make it configurable
  • We could scope a remote cluster cache to a specific namespace, so we only cache items in a single namespace (both options are sketched below)
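Hedged sketches of both alternatives, assuming a controller-runtime version from around this time (~v0.6 to v0.8); the package and function names are made up, neither is what ClusterCacheTracker actually does, and newer controller-runtime releases express per-namespace caching differently than cache.Options.Namespace:

```go
package remote

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Option 1: skip caching entirely for a remote cluster by building a plain,
// non-caching client from its rest.Config; every read becomes a live API call.
func newUncachedRemoteClient(cfg *rest.Config, scheme *runtime.Scheme) (client.Client, error) {
	return client.New(cfg, client.Options{Scheme: scheme})
}

// Option 2: cache only a single namespace of the remote cluster, so at most
// one namespace worth of objects is ever held in memory per cluster.
func newNamespacedRemoteCache(cfg *rest.Config, scheme *runtime.Scheme, namespace string) (cache.Cache, error) {
	return cache.New(cfg, cache.Options{Scheme: scheme, Namespace: namespace})
}
```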

akutz (Contributor, Author) commented on Oct 30, 2020

> We could scope a remote cluster cache to a specific namespace, so we only cache items in a single namespace

This is interesting. It would be really nice if we could create some sort of caching intermediary, like a squid proxy for the API server, so that the cache for remote clusters would live in the same namespace as the objects that represent those clusters.

fabriziopandini (Member) commented:

@vincepri is there any work remaining for this issue after #4023 is merged?
Should we backport it, and is it possible to backport, given that this change is deeply tied to the controller-runtime version?

vincepri (Member) commented:

This has been fixed by #4023 and #3986 for v0.3.x

/close

k8s-ci-robot (Contributor) commented:

@vincepri: Closing this issue.

In response to this:

> This has been fixed by #4023 and #3986 for v0.3.x
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
