
Kubernetes deployment #2353

Merged — 48 commits, merged on Jan 10, 2022
Conversation

@carlobeltrame (Member) commented Dec 14, 2021

This PR adds a helm chart which allows installing ecamp3 on any Kubernetes cluster, even multiple times. It also adds the ability to do feature branch deployments.

Fixes #2283, closes #1883

How to do a feature branch deployment: To deploy a feature branch, first do a code review, because deploying gives the code in the PR access to secrets such as login credentials for the Kubernetes cluster, and consequently to all secrets and data of all environments. Once you have reviewed the code for malicious changes, simply set the deploy! label on the PR, and it will be deployed within the next hour.

About helm: Helm is the package manager of Kubernetes, similar to apt-get, composer or npm. A chart (helm's term for a package) basically contains a bunch of Kubernetes resource definitions in YAML files. All of these YAML files can be templates, i.e. parts of them can be filled dynamically with configuration (e.g. the API domain, the Sentry DSN, or the Docker image tag that should be deployed). So instead of our deploy.sh script, we now use these templates to fill in environment variables etc. All configurable values are listed in values.yaml along with their defaults.
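
As an illustration, a template can reference such configurable values roughly like this (a minimal sketch, not an excerpt from the actual chart; the value names apiDomain and sentryDsn are made up):

# Hypothetical template: a config map whose values come from values.yaml or --set
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-frontend-config
data:
  API_DOMAIN: {{ .Values.apiDomain | quote }}
  SENTRY_DSN: {{ .Values.sentryDsn | default "" | quote }}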

About Kubernetes resources: "Resources" in Kubernetes means everything that runs or lives on the cluster. All resources can be described in YAML files (a minimal example follows this list). Examples of resources:

  • a deployment describes a pod (which in most cases runs a single container) and associated metadata: which docker image should be deployed, which environment variables are defined in the pod / container, how many replicas of the pod should be running, how the liveness and readiness of the pod are checked, how new versions of the pods are rolled out, etc. This is more or less the replacement for the docker-compose.yml we had before, but a lot fancier and more automatic.
  • a service exposes ports of a pod in the cluster-internal network, so they can be referenced by other pods
  • an ingress exposes a service on the internet. This is the replacement for the nginx reverse proxy we had before, and is in fact internally realized using nginx.
  • config maps and secrets contain settings and secrets for the pods, and can be mounted into a container or set as environment variables or files inside a container. This simplifies the assembly of e.g. the environment.js or .env files that our deploy.sh previously had to create.
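
For illustration, a minimal deployment plus service could look roughly like this (a sketch with made-up names, ports and image tag, not taken from our chart):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: ecamp/ecamp3-frontend:some-tag  # hypothetical image and tag
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /
              port: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    app: frontend
  ports:
    - port: 80
      targetPort: 3000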

I have written down some documentation locally which describes how to deploy and so on. I need to upload that somewhere at some point.

TODO:

  • Set up the Kubernetes cluster on digitalocean
  • Connect with the local kubectl client to the cluster
  • Set up an nginx ingress controller on the cluster, so we can route network traffic to the pods
  • Set up the DNS entries at Cloudflare to forward the *-dev.ecamp3.ch domains to the cluster
  • Re-add and adapt the helm chart that API platform comes with (i.e. deploy the API)
  • Add the frontend to the deployment
  • Sending registration emails doesn't work because twig complains that it can't find the templates directory for the __main__ namespace or something? -> solved, I had to re-add a line in the Dockerfile that had been removed in a clean-up during the setup of API platform
  • Switch to digitalocean managed Postgres
  • Add the rest of the services
    • print (front page cannot be printed currently because of an unrelated bug, but everything else works)
    • mail
    • files
    • rabbitmq
    • worker puppeteer
  • Add a stricter securityContext by default, and run the processes inside the containers as non-root. The end goal would be to be able to deploy directly to OpenShift [1] [2]
  • Automatically restart pods that depend on environment variables during startup when these environment variables change (see the sketch below this list). https://helm.sh/docs/howto/charts_tips_and_tricks/#automatically-roll-deployments
  • Create a POC feature branch deployment, using bitnami/external-dns to automate the DNS change at Cloudflare
    • Automatically create the database in the managed postgres DB on install, and drop it on uninstall
  • For automated deployments, create a GitHub action based on this one from API Platform: https://github.com/api-platform/demo/blob/main/.github/workflows/cd.yml#L172
    • The GitHub Action could be triggered by adding a label on a PR, or (re-)opening a PR with the label. There is no way on GitHub to grant PRs from forks access to repository secrets, so we cannot directly trigger the deployment actions from events on the PR itself.
    • The GitHub Action is scheduled regularly, e.g. every 30 minutes. It fetches all open PRs with the deploy! label and all currently active deployments, and installs, upgrades and uninstalls any deployments that need it.
    • Dev should only be deployed if CI has passed for the commit that is deployed
    • Deployments should only be updated when there is a code change, but it should still be possible to deploy by re-adding a previously removed deploy! label
    • When removing the label or merging or closing the PR, the deployment should be uninstalled from the cluster
  • Caching of the docker builds somehow doesn't work as documented... works now!
  • Write more helm chart tests?
  • The GitHub Environments display shows mixed data from environments (used for segregating secrets) and deployments (used to indicate the status of deployments), and lists feature-branch alongside e.g. pr2353 as separate environments, even though the pr2353 deployment just happens to use the feature-branch environment. Not sure we can do anything about that...
  • On this PR, it still says "this branch has not been deployed" even though it has: https://pr2353.ecamp3.ch. But that might resolve itself once the workflow really runs on the ecamp/ecamp3 repo instead of carlobeltrame/ecamp3
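
The trick for automatically rolling deployments on config changes (referenced in the TODO above) looks roughly like this, assuming the relevant config lives in a template called configmap.yaml; this is a sketch based on the linked helm documentation, not our actual chart:

# Annotation on the deployment's pod template: when the rendered configmap.yaml
# changes, the checksum changes, and Kubernetes rolls the pods.
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}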

Loading fixtures in production would require making alice a non-dev dependency. And either way, prod fixtures are better covered using database migrations: https://stackoverflow.com/a/47902192

Finishes what was started in commit 4c7d667

At least as long as we are running the container as root. These lines were made obsolete when we started running the containers as non-root in development. Maybe once we do the same in production, we'll need another way to set the permissions correctly.
@BacLuc (Contributor) commented Dec 15, 2021

Kommt sich "Switch to digitalocean managed Postgres" nicht mit feature deployment in die Quere?
Oder würden wir bei feature deployments eine datenbank erstellen?

@carlobeltrame (Member, Author) commented:

Doesn't "Switch to digitalocean managed Postgres" get in the way of feature deployments? Or would we create a database for each feature deployment?

My plan was to create the database automatically during the deployment, since that is cheaper: we pay a flat rate for the managed Postgres, as opposed to the larger cluster size we would need if we deployed an additional Postgres container for every feature branch. But whether the managed Postgres or an extra container with its own Postgres should be used can also be configured per deployment. It is purely a configuration question in the helm chart from API Platform: https://github.com/carlobeltrame/ecamp3/blob/devel/.helm/ecamp3/values.yaml#L75..L81
So I was able to tick off this item today without making any code changes; I only had to change the arguments I pass to helm on the command line.
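
For illustration, the choice looks roughly like this in the chart values (the key names here are assumptions loosely based on the API Platform chart, not a verbatim excerpt from our values.yaml):

# Hypothetical values excerpt
postgresql:
  # false = use an external / managed database instead of deploying a postgres container
  enabled: false

# Hypothetical key: connection string for the managed database,
# e.g. overridden per deployment with `helm upgrade --install ... --set databaseUrl=...`
databaseUrl: postgresql://user:password@managed-db-host:25060/ecamp3_feature_branch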

…tically

This is a step towards feature branch deployments. Default is to create
the database automatically but not drop it automatically, just in case
someone does not know about this feature and wants to quickly re-install
ecamp3 on their cluster.

This is better than running the migrations in the entrypoint script, because when there are long-running migrations, it's possible that the pod running them exceeds its liveness probe limit and is killed before finishing the migrations.
https://itnext.io/database-migrations-on-kubernetes-using-helm-hooks-fb80c0d97805
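
Roughly, such a migration hook is a Job with helm hook annotations (a sketch only; the actual definition lives in .helm/ecamp3/templates/hook_db_migrate.yaml and may differ in image, timing and naming):

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    # Run as a helm hook on install/upgrade, outside the normal pod lifecycle,
    # so liveness probes don't apply while the migrations run
    "helm.sh/hook": post-install,post-upgrade
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: ecamp/ecamp3-api:some-tag  # hypothetical image and tag
          command: ["php", "bin/console", "doctrine:migrations:migrate", "--no-interaction"]
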
@carlobeltrame added the "deploy!" label (Creates a feature branch deployment for this PR) on Jan 4, 2022
@carlobeltrame (Member, Author) commented Jan 4, 2022

@ecamp/core Kubernetes and feature branch deployments are officially ready. The open TODOs are possible future improvements. If you want to have a look before the meeting, that might help us discuss any concerns there.

I have removed the GitHub Actions for the old deployment and for separately building and pushing the docker images, in favor of a single workflow file "continuous-deployment.yml". That file checks which PRs have a deploy! label and calculates which branches need to be newly deployed, upgraded or uninstalled. An example run of this workflow can be seen here in my fork: https://github.com/carlobeltrame/ecamp3/actions/runs/1654671254

The deployment is done using helm, which I have described a little above. Anyone who wants to deploy eCamp v3 to a Kubernetes cluster can download our helm chart and use the helm CLI to deploy it, even multiple times on the same cluster. I have written up some documentation in the wiki on how to do that.

All deployments use our managed postgres instance; a database is created and migrated automatically when installing a new deployment, and deleted automatically when uninstalling one. The helm chart also supports running a postgres db in a container instead.
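
For a rough idea of the structure (a sketch only; the step commands and the reconcile script are made up, see the real continuous-deployment.yml for the actual implementation):

name: continuous-deployment
on:
  schedule:
    - cron: '*/30 * * * *'  # interval is illustrative
  workflow_dispatch: {}

jobs:
  sync-deployments:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: List open PRs with the deploy! label
        run: gh pr list --label 'deploy!' --json number,headRefName > desired.json
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - name: Install, upgrade or uninstall helm releases as needed
        # Hypothetical helper that compares desired.json with the releases on the cluster
        run: ./reconcile-deployments.sh desired.json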

@usu (Member) left a comment


Yeah, very nice 😻 Thanks a lot!

A few comments (and mainly questions) below and in the code. But generally looks good to me.

Improvement ideas

  • Is there any possibility to indicate on the PR, after the deployment is successful, that it succeeded (incl. a URL to the frontend?), e.g. using https://github.com/marketplace/actions/comment-pull-request? (See the sketch after these two points.) Or is the information just not visible on the PR because the actions are not yet running on our repository?
  • If I understand correctly, the workaround with checking every 30min is due to the limitation in accessing secrets from PR branches. Would it be thinkable that at a later stage we switch to deploying on branch push for dev (and maybe later also for stage and prod)?
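
One way to post such a PR comment from the workflow, sketched with actions/github-script instead of the marketplace action mentioned above (the deployment URL construction is hypothetical):

- name: Comment deployment URL on the PR
  uses: actions/github-script@v5
  with:
    script: |
      // Hypothetical: assumes the PR number is available in the workflow context
      await github.rest.issues.createComment({
        owner: context.repo.owner,
        repo: context.repo.repo,
        issue_number: context.issue.number,
        body: 'Deployed to https://pr' + context.issue.number + '.ecamp3.ch',
      });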

Questions

  • Stealing secrets: Theoretically, someone could add commits after the "deploy!" tag has been given and therefore include malicious code. Right?

.github/workflows/continuous-deployment.yml — resolved review thread
.github/workflows/continuous-deployment.yml — resolved review thread
.github/workflows/continuous-deployment.yml — outdated, resolved review thread
fi

# Run any pending doctrine migrations if migration files exist
if ls -A migrations/*.php >/dev/null 2>&1; then
    php bin/console doctrine:migrations:migrate --no-interaction
fi
(Member) commented:
If something goes wrong during migration, would that be visible somewhere in the logs?

(Member, Author) replied:
The migrations are run in a separate pod (i.e. a php container is spun up only to run the migrations) every time we install or upgrade a release. This pod is configured in .helm/ecamp3/templates/hook_db_migrate.yaml. If that pod runs into an error, it will remain in the cluster. So in there, we'd be able to see the logs.
If the pod runs smoothly, it's removed from the cluster afterwards. We could choose to leave it there, but then our cluster would slowly fill up with these finished pods from old feature branch deployments... I think when uninstalling the feature branch deployment, Helm cannot automatically remove these old pods that were created by a hook. We'd have to add another command in continuous-deployment.yml after helm delete to accomplish that, and it could get forgotten if we ever manually delete a feature branch deployment.
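
The keep-on-failure / clean-up-on-success behavior described above corresponds to a hook delete policy along these lines (a sketch; the exact annotations in hook_db_migrate.yaml may differ):

metadata:
  annotations:
    # Failed hook pods are kept in the cluster so their logs can be inspected;
    # succeeded ones are deleted, and any leftover is replaced on the next run
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded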

api/docker/php/docker-entrypoint.sh — resolved review thread
api/docker/php/docker-entrypoint.sh — resolved review thread
.helm/ecamp3/templates/hook_db_drop.yaml — resolved review thread
@carlobeltrame (Member, Author) commented:

Is there any possibility to indicate on the PR, after the deployment is successful, that it succeeded (incl. a URL to the frontend?), e.g. using https://github.com/marketplace/actions/comment-pull-request? Or is the information just not visible on the PR because the actions are not yet running on our repository?

I think that should sort itself out once the PR is merged. Have a look at my fork, where the active GitHub deployments (Environments) are visible (even though they are kind of wrong): https://github.com/carlobeltrame/ecamp3
I expect that once the workflow runs on ecamp/ecamp3, the deployments will show up correctly, and the PRs should also lose their "This branch has not been deployed" comment. If not, I'd propose we debug that then, once we can actually test things out on the origin.

If I understand correctly, the workaround with checking every 30min is due to the limitation in accessing secrets from PR branches. Would it be thinkable that at a later stage we switch to deploying on branch push for dev (and maybe later also for stage and prod)?

It is thinkable. Having a single large workflow that looks at all open labeled PRs will still be necessary, because PRs from forks cannot trigger workflow runs with secrets (they can trigger runs, but the secrets aren't available in them). And due to the syncing mechanism, this large workflow also needs to know about dev or any other special deployments that should remain even without a labeled PR.
We could just add a trigger like

on:
  push:
    branches:
      - devel

to this large workflow, but I doubt we'd get a lot of speedup, because the whole workflow takes ~10-15mins due to the docker build cache not working correctly.
Another option would be to create a separate workflow for deploying special branches, which only runs on branch push. But still, the large feature branch syncing workflow would need to know about all these special deployments, so it doesn't uninstall them. (Or we could install dev, stage and prod into different namespaces on Kubernetes, which might be a good solution.) But for this first version, I wanted to avoid the yaml code duplication that this would imply.

Stealing secrets: Theoretically, someone could add commits after the "deploy!" tag has been given and therefore include malicious code. Right?

That is indeed a valid attack vector. I guess we need to be extra careful when deploying a PR from an unknown collaborator.

@usu (Member) commented Jan 5, 2022

That is indeed a valid attack vector. I guess we need to be extra careful when deploying a PR from an unknown collaborator.

Ok, I see. So it makes extra sense not to share any secrets between the PR/dev environments and the stage/prod environments, so that in the worst case an attacker would only have access to dev secrets.

@usu (Member) commented Jan 5, 2022

Are both https://pr2353.ecamp3.ch/ and https://dev.ecamp3.ch/ supposed to work? When trying to login with test-user I receive a 401 response (invalid credentials) on pr2353 and a 500 response on dev.

@usu (Member) commented Jan 7, 2022

  • Caching of the docker builds somehow doesn't work as documented...

I tried to dig into this one a bit. If I understood correctly, every build image (api, caddy, frontend, etc.) needs its own cache. Otherwise they invalidate each other.

It's not super well documented, but I think the scope property can be used for this. See the buildkit documentation and this example from an issue discussion in the build-push-action repo.

Another option might be the other documented solution, using the actions/cache workflow with a local cache. But there too, separate caches are needed for each build image. See a corresponding discussion and example here.
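
For illustration, per-image cache scopes in a build-push-action step might look roughly like this (a sketch; image name, tag and context are made up):

- name: Build and push api image
  uses: docker/build-push-action@v2
  with:
    context: ./api
    push: true
    tags: ecamp/ecamp3-api:latest  # hypothetical tag
    # Each image gets its own cache scope so the builds don't evict each other
    cache-from: type=gha,scope=api
    cache-to: type=gha,mode=max,scope=api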

@carlobeltrame (Member, Author) commented:

Are both https://pr2353.ecamp3.ch/ and https://dev.ecamp3.ch/ supposed to work? When trying to login with test-user I receive a 401 response (invalid credentials) on pr2353 and a 500 response on dev.

The fixtures are not loaded on either of them, so the test-user will never work (unless you register it manually). I was able to register and log in normally with a new account on pr2353.
On dev, the database was in a broken state, probably a leftover from before I made sure to only deploy the correct version. Some migrations had been applied there which weren't present in the deployed version of the code, one of which was the one from #2241. I fixed it by manually deleting the dev environment and dropping the database, and letting GitHub re-deploy it.

@carlobeltrame (Member, Author) commented:

It's not super well documented, but I think the scope property can be used for this. See the buildkit documentation and this example from an issue discussion in the build-push-action repo.

Setting the scope was it! Previously, all workflow runs that actually did something took around 15 minutes; now workflow runs with unchanged docker images take less than 2 minutes.

@BacLuc (Contributor) left a comment


Mega cool. I hope you still got some holidays in besides the Kubernetes deployment.

@manuelmeister (Member) left a comment


So cool!

@usu (Member) commented Jan 18, 2022

By the way, I found a write-up by someone who ran into the same problem with pull requests + secrets, but solved it with a somewhat different approach. In case we never warm up to the cron jobs:
https://blog.jupyter.org/how-i-automated-authorised-cloud-deployments-from-pull-requests-with-github-actions-13f890538e32
https://github.com/sgibson91/test-this-pr-action

Inspired by:
https://github.com/imjohnbo/ok-to-test

@carlobeltrame mentioned this pull request on Feb 3, 2022
Labels: deploy! (Creates a feature branch deployment for this PR)
Successfully merging this pull request may close these issues: Multiple deployments
4 participants