Add autoscaling policy to add an instance when load is high #1317

Open
KlaasH opened this issue Aug 16, 2023 · 1 comment

KlaasH commented Aug 16, 2023

We have an autoscaling policy to reduce the instance count when the site is not heavily loaded, but we don't have one to increase capacity when it is.
This is causing downtime: we always have worker churn within the app tasks, and as load gets higher, the chances that all 4 workers will be down at the same time increase. When that happens the health check fails and ECS replaces the task, but since we only have one EC2 instance, it can't start a new task before stopping the old one. So the service goes down entirely while it makes the switch (it also seems like we might be waiting on some sort of timeout or cooldown, because the swap usually takes a little more than an hour, which is longer than it should).

Increasing the instance count when load is high, and keeping the desired task count in ECS permanently high, should mean that a new instance comes up and, hopefully, a new task is running on it by the time the existing task gets killed for failing health checks. It might even reduce the chances of health check failure in the first place by absorbing some of the load. Either way, it seems worth doing.
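
A minimal sketch of what the host-level scale-up side could look like with boto3. The ASG name, cluster name, and threshold are all placeholders, and a step scaling policy on cluster CPU utilization is just one way to express "add an instance when load is high":

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Step scaling policy on a hypothetical "app-asg" Auto Scaling group
# that adds one instance whenever the attached alarm fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="app-asg",
    PolicyName="scale-up-on-high-load",
    PolicyType="StepScaling",
    AdjustmentType="ChangeInCapacity",
    StepAdjustments=[{"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 1}],
)

# Alarm on cluster-level CPU utilization; 6 one-minute periods mirrors
# the 6-minute window of the existing scale-down rule. 80% is a guess.
cloudwatch.put_metric_alarm(
    AlarmName="app-cluster-cpu-high",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterName", "Value": "app-cluster"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=6,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```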

ddohler commented Aug 16, 2023

This jogged a memory -- I'm not sure how functional the existing scale-down rule actually is. I took another look at it, and it scales down if CPUReservation < 100 for 6 minutes, but that CPUReservation metric doesn't measure application load -- it just measures the total requested vCPUs of all the running tasks in the cluster. Currently it's always 100 (unless we manually scale), because our tasks request exactly as many vCPUs as the hosts have.

[Screenshot from 2023-08-16 14-38-14: the CPUReservation metric holding at 100%]

In other words, the rule will scale down if the number of tasks doesn't "fill up" the available vCPUs of the running hosts, but it otherwise won't trigger. So I think its practical effect is more to ensure that the number of hosts isn't overprovisioned for the number of tasks we're running; I don't think it'll adjust the number of running tasks for us. (The reason this jogged a memory is that there was a period when it was always trying to scale down, because the number of CPUs requested by the tasks didn't fit evenly into the available CPUs on the hosts, so the metric was always below 100%.)
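
For reference, a quick way to confirm the metric is pinned at 100 is to pull it directly (the cluster name here is a placeholder):

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Fetch the last 24h of CPUReservation for the cluster; if the scale-down
# rule's premise holds, every datapoint should be exactly 100.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="CPUReservation",
    Dimensions=[{"Name": "ClusterName", "Value": "app-cluster"}],
    StartTime=now - timedelta(hours=24),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), point["Average"])
```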

To scale the tasks up or down we'll need Service Auto Scaling, and it doesn't look like we have either scale-up or scale-down rules set up for the service currently:
[Screenshot from 2023-08-16 14-40-05: the ECS service with no auto scaling policies configured]

So I think we actually need three new rules (see the sketch after this list):

  • A scale-up rule for hosts so that if we try to provision more tasks than we have hosts, we get more hosts
  • A scale-up rule for tasks so that we get more if CPU load is high
  • A scale-down rule for tasks so that we get fewer if CPU load drops
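
Here's roughly what the task-level rules (the second and third bullets) could look like via Application Auto Scaling. Names, capacities, and targets are placeholders; a single target-tracking policy gives us both the scale-up and scale-down behavior, and the host-level rule in the first bullet would be something like the ASG policy sketched earlier, or an ECS capacity provider with managed scaling:

```python
import boto3

appscaling = boto3.client("application-autoscaling")

# Register the service's DesiredCount as a scalable target, with
# hypothetical cluster/service names and min/max task counts.
appscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/app-cluster/app-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Attach a target-tracking policy: this one policy both adds tasks when
# average service CPU exceeds the target and removes them when it drops.
appscaling.put_scaling_policy(
    PolicyName="app-service-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/app-cluster/app-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```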
