
🌱 Backport bug fixes for v0.7.1 release #1329

Merged
merged 2 commits into kubernetes-sigs:release-0.7 on Jan 11, 2021

Conversation

vincepri
Member

@vincepri vincepri commented Jan 11, 2021

This PR backports bug fixes that have been merged to the mainline branch and are important to include in the current release (v0.7.x).
/assign @alvaroaleman
/milestone v0.7.x

alvaroaleman and others added 2 commits January 11, 2021 08:09
The delegating Logger causes a high number of memory allocations
when no actual Logger is set. To avoid that, this change sets a
NullLogger after 30 seconds if none has been set by then.
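For context, a minimal Go sketch of the pattern this fix relies on, assuming a logr-style delegating logger; the type and method names below are illustrative, not the actual controller-runtime internals:

```go
import (
	"sync"
	"time"

	"github.com/go-logr/logr"
)

// delegatingLogger stands in for controller-runtime's promise-based
// delegating logger: call records accumulate until a real logger is set.
type delegatingLogger struct {
	mu     sync.Mutex
	actual *logr.Logger
}

// Fulfill sets the concrete logger once; later calls are ignored.
func (d *delegatingLogger) Fulfill(l logr.Logger) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.actual == nil {
		d.actual = &l
	}
}

// fulfillWithNoopAfter mirrors the fix: if nothing has been set by the
// deadline, fall back to a no-op logger so buffered records (and their
// allocations) stop piling up. logr.Discard plays the NullLogger role here.
func (d *delegatingLogger) fulfillWithNoopAfter(timeout time.Duration) {
	time.AfterFunc(timeout, func() {
		d.Fulfill(logr.Discard())
	})
}
```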
In admission v1, the API server requires that Patch and PatchType are
either both provided or both omitted. Admission v1beta1 has no such
requirement.

In controller-runtime, PatchResponseFromRaw sets PatchType regardless
of whether a patch exists. If the patch is empty, the response contains
only PatchType, which the API server rejects, so the webhook call fails.

This change fixes the issue by not setting PatchType when the patch is
empty.
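Condensed, the corrected logic looks roughly like this (a sketch of the behavior, not a verbatim copy of the controller-runtime source):

```go
import (
	"net/http"

	"gomodules.xyz/jsonpatch/v2"
	admissionv1 "k8s.io/api/admission/v1"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// patchResponseFromRaw sketches the fix: PatchType is attached only when
// the computed patch is non-empty, which satisfies the admission v1 rule
// that Patch and PatchType must come together.
func patchResponseFromRaw(original, current []byte) admission.Response {
	patches, err := jsonpatch.CreatePatch(original, current)
	if err != nil {
		return admission.Errored(http.StatusInternalServerError, err)
	}
	resp := admission.Response{
		Patches:           patches,
		AdmissionResponse: admissionv1.AdmissionResponse{Allowed: true},
	}
	if len(patches) > 0 {
		pt := admissionv1.PatchTypeJSONPatch
		resp.PatchType = &pt
	}
	return resp
}
```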
@k8s-ci-robot k8s-ci-robot added this to the v0.7.x milestone Jan 11, 2021
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 11, 2021
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 11, 2021
@vincepri vincepri changed the title Backport bug fixes for v0.7.1 release 🌱 Backport bug fixes for v0.7.1 release Jan 11, 2021
Member

@alvaroaleman alvaroaleman left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 11, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alvaroaleman,vincepri]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1d5261b into kubernetes-sigs:release-0.7 Jan 11, 2021
sonnysideup added a commit to dominodatalab/distributed-compute-operator that referenced this pull request Mar 6, 2021
controller-runtime 0.7.0 was causing API submissions to error out
whenever no default values were applied. this was fixed in version 0.7.1

kubernetes-sigs/controller-runtime#1329
sonnysideup added a commit to dominodatalab/distributed-compute-operator that referenced this pull request Mar 10, 2021
* creates raycluster api

along with other scaffolded files and types (i.e. controller)

* adds RayCluster API

* fixes linting error

* adds more badges to README

don't we need stinking badges???

* runs tests using bash shell

make target requires the use of the `source' command

* sets bash as the default shell in Makefile

since this uses "/bin/sh" by default. if that path doesn't point to
bash, then the scripting will error out because the "source" command is
unavailable

* updates RayClusterSpec API doc strings

* changes controller-gen crds opts

generating CRDs that support the latest and greatest

* adds basic validation to RayClusterSpec fields

also sets fields with default values as "optional" since the v1 CRD spec
requires this distinction. "omitempty" accomplishes the same thing, but
preference is given to the former annotation since it more clearly conveys
our intention.

update sample ray CR with a real field value; more will come
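A sketch of what such a field might look like, with an illustrative name and marker set (the real RayClusterSpec fields may differ):

```go
// Illustrative field; the actual RayClusterSpec may differ.
type RayClusterSpec struct {
	// Port is the port the ray head service listens on.
	// controller-gen reads the markers below when generating the v1 CRD;
	// `omitempty` only affects JSON serialization of the Go struct.
	// +kubebuilder:validation:Minimum=1024
	// +optional
	Port int32 `json:"port,omitempty"`
}
```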

* adds generic metadata functions

for use by all resources

* adds RayClusterReconciler loop structure

and scaffolds structure for creating all the resources needed to build a
ray cluster. serviceaccount is implemented. tests will come

* adds util funcs used to generate deployments

along with tests

* adds Volumes and VolumeMounts to RayClusterSpec

these will be used to mount additional volumes into cluster pods

* updates RayClusterSpec

fixes default repo for ray image
adds NodeSelector and Affinity fields

* changes RayClusterSpec

adds realistic ObjectStoreMemoryBytes min validation in accordance with
the app's requirements. Points to OCIImageDefinition object

* adds naive creation of head/worker deployments

naive in the sense that the controller creates them w/o checks, but the
generation of the deployment structs is well tested.

also adds a head service and empty stub files for other required
resources.

* moves util pkg out of resources

this is a bit cleaner, more descriptive

* oh so many changes

adds ClientServerPort to RayClusterSpec
exposes proper ports in head service for ray
changes default ray image to standard image (w/o extra ML libs)
refactors ray resource lib funcs and adds more unit tests

* adds more resources to controller reconciliation

creates network policy
creates and binds pod security policy
adds watches to controller for "owned" resources (still naive)
changes ObjectStoreMemoryBytes type to account for nil

* fixes ObjectStoreMemoryBytes field and processing

should be int64 so we can represent more than 2.2Gi
fixes bug with cmd args passing value instead of pointer addr
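A sketch of the class of bug being fixed, with illustrative names; formatting the pointer itself would embed an address in the args:

```go
import "fmt"

// objectStoreArgs sketches the fix: with an optional *int64 field, the
// value must be dereferenced; formatting the pointer itself would yield
// something like "--object-store-memory=0xc000018030".
func objectStoreArgs(objectStoreMemoryBytes *int64) []string {
	args := []string{"ray", "start", "--head"}
	if objectStoreMemoryBytes != nil {
		args = append(args,
			fmt.Sprintf("--object-store-memory=%d", *objectStoreMemoryBytes))
	}
	return args
}
```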

* reworks psp support

while we can create a PSP for every cluster, the logistics behind
managing both cluster and namespace scoped resources are gross. these
changes make it possible for users to provide a psp "name". the psp will
be verified, and its use will then be enabled via RBAC resources.

* bumps go version to 1.16

* another load of changes

enables primitive autoscaling
separates head/worker resource requests (different load profiles)
adds printer columns for `kubectl get' output
uses "recreate" strategy to manage head deployment
updates example RayCluster YAML

* PR feedback changes

exposes ScaleDownStabilizationWindowSeconds for HPA
adds ImagePullSecrets to spec for pulling private images
makes RayCluster.Spec field required

* adds network policy support for clients

* changes default docker IMG name

default value is not very descriptive

* adds pkg/ to dockerfile

required to build project since we leverage libraries contained therein

* adds docker image build/push steps to CI

* logging into registry

* adds docker labels and layer caching to ci

* fixes docker build in ci

corrects labels and uses registry cache instead of local fs

* removes useless comment

* removes Build from CI

step is implicitly performed during `make test'

* removes fmt vet targets from Makefile

and adds build step back to ci; that should handle the main caching bits

* updates golangci-lint and configures more linters

we can use this tool for vet, fmt and other linting checks instead of
running those checks as separate steps

* migrates from using "flag" to "cobra"

so we can add subcommands to managing CRDs

* adds crd-apply / crd-delete commands

* moves reconcileDeployments into a separate func

we're going to have to perform some advanced checks and the linting
error was also bothering me

* fixes Dockerfile

adds crd defs to Dockerfile so they can be embedded during build time.

* adds beginnings of a helm chart

added Namespace to manager config so we can watch resources within the
operator deployment namespace. without this, the manager requires
cluster-scope watch permissions which does not really make sense.
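A sketch of the manager wiring this describes, assuming the v0.7/v0.8-era Namespace option and a downward-API-injected POD_NAMESPACE variable:

```go
import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

// newManager restricts the manager's cache and watches to the operator's
// own namespace. POD_NAMESPACE is assumed to be injected via the
// downward API; Namespace is the controller-runtime v0.7/v0.8-era option.
func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Namespace: os.Getenv("POD_NAMESPACE"),
	})
}
```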

* makes netpol client labels configurable

users can provide their own client labels when desired; default stays
the same as before

* removes helm chart service.yaml

and adds an update to the generated crd that was missed in the last
commit

* decouples head/worker common attributes

these changes make it possible to assign labels, annotations, volumes,
etc..., to head and worker nodes separately. we can add "global" fields
if/when necessary

* adds missing "start" cmd to `make run'

* removes superfluous "hack" files from generator

we're not using this so there's no need to preserve the original
dir/args

* excludes unit tests from gocyclo linting

in the spirit of writing clearer tests instead of making everything
modular

* adds extra global settings to RayClusterSpec

adds pod security context
adds extra env vars added to all ray containers
adds the ability to use an external service account

* enhances hpa creation

exposes optional field for MinReplicas. when specified, that value will
be applied to the underlying HPA. when absent, the worker replicas value
will be set as a minimum.
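The fallback reduces to roughly this sketch (names illustrative):

```go
// hpaMinReplicas sketches the described behavior: an explicit MinReplicas
// wins; otherwise the configured worker replica count becomes the floor.
func hpaMinReplicas(specMinReplicas *int32, workerReplicas int32) *int32 {
	if specMinReplicas != nil {
		return specMinReplicas
	}
	return &workerReplicas
}
```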

* corrects cluster-owned resource names

adding some reference to "ray" makes it a lot easier to differentiate
cluster resources from others in the same NS

* tweaks exhaustive linter

don't really need to include ComponentNone in switch checks since that
value is used elsewhere

* adding coverage report

* fixes leftover exhaustive linting errors

* updates README and adds notes

* splits controller processResources into separate funcs

the optional creation of the underlying resources and the updates that
will need to be performed for only certain resources merit expanding this
single func into separate, dedicated, resource-specific funcs

* adds service updates to reconciliation loop

this will cause the head service ports to change whenever related spec
field values change.

* fixes printercolumn

forgot to update this when separating head/worker fields api

* adds silly logo

* updates README

a little more shine. shine on

* fixes deployment creation

default service account name never updated to reflect "-ray" change
sets both head/worker deployment strategy to "Recreate"

If the image tag is updated, the head pod and some worker pods will
terminate immediately. the remaining worker pods can remain in an error
loop for an extended period of time if the new image is large.
furthermore, there's no telling how the ray cluster will behave when
mixing different versions. a simple solution for now is to recreate the
entire cluster from scratch using the new image.
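A sketch of the strategy setting in question:

```go
import appsv1 "k8s.io/api/apps/v1"

// Both head and worker deployments opt out of rolling updates, so an
// image change tears down and recreates the whole cluster at once.
func deploymentStrategy() appsv1.DeploymentStrategy {
	return appsv1.DeploymentStrategy{
		Type: appsv1.RecreateDeploymentStrategyType,
	}
}
```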

* adds advanced resource reconciliation

now updates owned resources when parent CR fields change

* adds head/worker pod names to CR status

* updates .dockerignore

avoids uploading 250Mi+ into docker context when building

* updates ray controller

adds metadata to "owned" event sources for adding visibility
uses company-specific patch annotations for tracking updates

* updates helm chart rbac permissions

provides the extra update/delete rights for managing updates
grants pod list rights so we can add their names to the spec status

* updates RayClusterReconciler rbac annotations

so they're in-line with what's actually required by the operator and
generate the appropriate rbac resource for deploy via `kustomize'

* adds RayCluster webhooks

adding defaulting and validation webhooks to controller
ignores long kubebuilder comment annotations when linting
adds back boilerplate text (required by `operator-sdk' generators)

* makes ginko test suites more usable

setting the envtest kubebuilder assets dir so that we can run tests w/o
having to invoke `make test`. separates test assets install procedure
into a separate make target.
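A sketch of one way to do this, assuming envtest binaries are pre-installed under a known directory (the path and helper are hypothetical):

```go
import (
	"os"
	"path/filepath"

	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// newTestEnv points envtest at pre-installed control-plane binaries so
// plain `go test` works without the make wrapper. The directory below is
// a hypothetical install location.
func newTestEnv() (*envtest.Environment, error) {
	if os.Getenv("KUBEBUILDER_ASSETS") == "" {
		// envtest consults KUBEBUILDER_ASSETS to find etcd/kube-apiserver.
		os.Setenv("KUBEBUILDER_ASSETS", filepath.Join("..", "testbin", "bin"))
	}
	testEnv := &envtest.Environment{}
	if _, err := testEnv.Start(); err != nil {
		return nil, err
	}
	return testEnv, nil
}
```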

* adds validation webhook logic

adding `omitempty` tag to optional fields to prevent serialization errs
migrates primitive validation from CRD to webhook
adds first set of integration tests (validation)

* adds webhook support to helm chart

* moves helm chart into nested dir

* refactors helm chart macro names

i'm getting tired of typing "distributed-compute-operator" and since
this is NOT a library chart, it seems fine to use whatever prefix i
prefer

* renames logo file

* adds development setup script

* fixes mutating webhook / removes operator-sdk

the project was revised to use `kubebuilder' instead of `operator-sdk'
for a number of reasons:

- operator-sdk uses kubebuilder under the hood
- kubebuilder improves faster than operator-sdk
- we do not need the extra OLM and scorecard nonsense
- the code itself is not directly reliant on one or the other

* fixes github workflow

build target name changed

* adds SHELL override back to Makefile

* adds desired code formatter

so devs can format their code before pushing

* build, test, lint

* points to local goimports bin in Makefile

* adds go tooling cache to ci

* removes lint target from test/build/run

* run linting before uploading coverage report

* adds caches for go build cache in CI

and updates ginkgo/gomega

* adds a way to disable webhooks

to aid in development, an envvar can be used to disable webhooks and
execute the project outside k8s
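A sketch of such an escape hatch; the variable name is an assumption:

```go
import "os"

// webhooksEnabled gates webhook registration so the manager can run
// outside the cluster. DISABLE_WEBHOOKS is an assumed name, not
// necessarily the one this project uses.
func webhooksEnabled() bool {
	return os.Getenv("DISABLE_WEBHOOKS") == ""
}
```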

* removes "test" from "docker-build" make target

i want this, i don't want this. bottom line, we run these steps
separately in CI so we'll always test before building a docker image.
devs will just need to be diligent...eh?

* reworks hpa

exposes "scale" subresource on RayCluster
uses "scale" subresource to scale worker deployment
targeting RayCluster with HPA instead of deployment
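A sketch of the scale-subresource marker involved; the JSONPaths are assumptions based on the spec described in earlier commits:

```go
// The marker below (paths are assumptions) makes controller-gen emit a
// /scale subresource on the CRD, giving the HPA a stable target without
// tying it to the worker deployment directly.
// +kubebuilder:subresource:scale:specpath=.spec.worker.replicas,statuspath=.status.workerReplicas,selectorpath=.status.workerSelector
type RayCluster struct {
	// ... spec and status omitted
}
```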

* fixes linting issue

* fixes patch issue with defaulting webhook

controller-runtime 0.7.0 was causing API submissions to error out
whenever no default values were applied. this was fixed in version 0.7.1

kubernetes-sigs/controller-runtime#1329

* moves all validation/defaulting into webhooks

reworks HPA creation behavior
adds integration tests for webhook logic

* removes RayClusterWorker.Replicas field requirement

we already default this field and validate user input. making this
optional allows the user to specify other worker fields (e.g. resources)
without having to explicitly provide a replica count as well.

* modifies notes

* bumps controller-runtime to latest version

version 0.8 allows us to exclude error stack traces in logs. this does
3 things:
- produces less noise in production logs
- forces us to add context to errors with wrapping
- still allows us to trace errors in development when fixing bugs
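A sketch of the zap wiring this likely refers to, assuming controller-runtime v0.8's StacktraceLevel option:

```go
import (
	"go.uber.org/zap/zapcore"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

// setupLogger raises the level at which zap attaches stack traces, so
// ordinary wrapped errors log without a trace but panics still carry one.
func setupLogger() {
	opts := zap.Options{
		StacktraceLevel: zapcore.PanicLevel, // assumed to be the knob meant here
	}
	ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
}
```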

* exposes all log config opts in helm chart

removes "development mode" default

* adds TypeMeta to generated k8s resources

this helps us log type information when creating owned resources

* uses WithName logger func correctly

* updates crd after bumping k8s api dependency

minor changes moving from 1.19 to 1.20

* adds decent controller logging

logging create/update/delete of owned components
debug logging for status updates and object patches
debug++ logging for reconciliation loop triggers

* ignoring magic number in logger V() func

...because

* adds script to help with developer workflow tasks

* fixes RayCluster status updates

when updates to a stale object occur, we just requeue the key so it can
be reprocessed with the latest version of the RayCluster object.
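The requeue-on-stale pattern reduces to roughly this sketch:

```go
import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
)

// handleStatusUpdateErr sketches the stale-object handling: a conflict
// (stale resourceVersion) requeues the key instead of surfacing an error.
func handleStatusUpdateErr(err error) (ctrl.Result, error) {
	if apierrors.IsConflict(err) {
		return ctrl.Result{Requeue: true}, nil
	}
	return ctrl.Result{}, err
}
```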

* adds image tag prefix const to development script

* enforces worker cpu resource requests when autoscaling is enabled

* refactor development script to split minikube profile name out from common name (which represents image, chart, and deployment names)

* bumps controller-runtime version

0.8.3 removes deprecation warnings that were muddling the logs

* fixes issue with script and refactors some more

change MINIKUBE_PROFILE to evaluate as $MINIKUBE_PROFILE
adds extra vars with defaults to give users override capabilities

* extracts context logging from RayClusterReconciler

now this can be used by any reconciler or other types that need to
propagate a contextual logger

* removes superfluous TypeMeta from generated objs

this was only added to enrich log statements. instead, we can grab the
type metadata directly from the scheme and add it to the logs prior to
create/update operations.

* adds developer comment to reconcileAutoscaler

indicates the reason why the shape of this operation is slightly
different from the other reconcile functions.

* standardizes approach to checking apierrors.IsNotFound

using ifs to assert whether we need to take one action or another, or
return an err
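A sketch of the standardized check, with illustrative names:

```go
import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureExists sketches the standardized pattern: one if per outcome,
// with unexpected errors returned immediately.
func ensureExists(ctx context.Context, c client.Client, key client.ObjectKey, obj client.Object) error {
	err := c.Get(ctx, key, obj)
	if apierrors.IsNotFound(err) {
		return c.Create(ctx, obj) // absent: create it
	}
	if err != nil {
		return err // unexpected error: surface it
	}
	return nil // already present
}
```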

* changes image defaulting behavior

default image definition is added when completely missing. if a user
provides an image, then we validate that the minimum required fields are
present.
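A sketch of the described behavior; the OCIImageDefinition fields and default values are assumptions:

```go
import "errors"

// OCIImageDefinition stands in for the project's image type; the field
// names here are assumptions.
type OCIImageDefinition struct {
	Repository string
	Tag        string
}

// defaultImage fills in a full default only when the image is entirely
// missing; a partially specified image is validated, not patched.
func defaultImage(img *OCIImageDefinition) (*OCIImageDefinition, error) {
	if img == nil {
		// Default values are illustrative, not the project's actual defaults.
		return &OCIImageDefinition{Repository: "rayproject/ray", Tag: "latest"}, nil
	}
	if img.Repository == "" || img.Tag == "" {
		return nil, errors.New("image repository and tag are required")
	}
	return img, nil
}
```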

Co-authored-by: Adam Udi <adam.udi@dominodatalab.com>