Add autoscaling policy to add an instance when load is high #1317

Open
KlaasH opened this issue Aug 16, 2023 · 1 comment

KlaasH commented Aug 16, 2023

We have an autoscaling policy to reduce the instance count when the site is not heavily loaded, but we don't have one to increase capacity when it is.
This is causing downtime: we always have worker churn within the app tasks, and as load gets higher, the chances that all 4 workers will be down at the same time increase. When that happens the health check fails and ECS replaces the task, but since we only have one EC2 instance, it can't start a new task before stopping the old one. So the service goes down entirely while it makes the switch (it also seems like we might be waiting on some sort of timeout or cooldown, because the swap usually takes a little more than an hour, which is longer than it should).

Increasing the instance count when load is high, and keeping the desired task count in ECS permanently high, should mean that a new instance comes up and, hopefully, a new task is running on it by the time the existing task gets killed for failing health checks. It might even reduce the chances of health check failure in the first place by absorbing some of the load. Either way, it seems worth doing.
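
A minimal sketch of what the host-level scale-up side could look like with boto3. The ASG name, cluster name, and threshold are all placeholders, and a step scaling policy on cluster CPU utilization is just one way to express "add an instance when load is high":

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Step scaling policy on a hypothetical "app-asg" Auto Scaling group
# that adds one instance whenever the attached alarm fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="app-asg",
    PolicyName="scale-up-on-high-load",
    PolicyType="StepScaling",
    AdjustmentType="ChangeInCapacity",
    StepAdjustments=[{"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 1}],
)

# Alarm on cluster-level CPU utilization; 6 one-minute periods mirrors
# the 6-minute window of the existing scale-down rule. 80% is a guess.
cloudwatch.put_metric_alarm(
    AlarmName="app-cluster-cpu-high",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "ClusterName", "Value": "app-cluster"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=6,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```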

ddohler commented Aug 16, 2023

This jogged a memory -- I'm not sure how functional the existing scale-down rule actually is. I took another look at it, and it scales down if CPUReservation < 100 for 6 minutes, but that CPUReservation metric doesn't measure application load -- it just measures the total requested vCPUs of all the running tasks in the cluster. Currently it's always 100 (unless we manually scale), because our tasks request exactly as many vCPUs as the hosts have.

[Screenshot from 2023-08-16 14-38-14: the CPUReservation metric holding at 100%]

In other words, the rule will scale down if the number of tasks doesn't "fill up" the available vCPUs of the running hosts, but it otherwise won't trigger. So I think its practical effect is more to ensure that the number of hosts isn't overprovisioned for the number of tasks we're running; I don't think it'll adjust the number of running tasks for us. (The reason this jogged a memory is that there was a period when it was always trying to scale down, because the number of CPUs requested by the tasks didn't fit evenly into the available CPUs on the hosts, so the metric was always below 100%.)
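
For reference, a quick way to confirm the metric is pinned at 100 is to pull it directly (the cluster name here is a placeholder):

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Fetch the last 24h of CPUReservation for the cluster; if the scale-down
# rule's premise holds, every datapoint should be exactly 100.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="CPUReservation",
    Dimensions=[{"Name": "ClusterName", "Value": "app-cluster"}],
    StartTime=now - timedelta(hours=24),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), point["Average"])
```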

To scale the tasks up or down we'll need Service Auto Scaling, and it doesn't look like we have either scale-up or scale-down rules set up for the service currently:
[Screenshot from 2023-08-16 14-40-05: the ECS service with no auto scaling policies configured]

So I think we actually need three new rules (see the sketch after this list):

  • A scale-up rule for hosts so that if we try to provision more tasks than we have hosts, we get more hosts
  • A scale-up rule for tasks so that we get more if CPU load is high
  • A scale-down rule for tasks so that we get fewer if CPU load drops
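
Here's roughly what the task-level rules (the second and third bullets) could look like via Application Auto Scaling. Names, capacities, and targets are placeholders; a single target-tracking policy gives us both the scale-up and scale-down behavior, and the host-level rule in the first bullet would be something like the ASG policy sketched earlier, or an ECS capacity provider with managed scaling:

```python
import boto3

appscaling = boto3.client("application-autoscaling")

# Register the service's DesiredCount as a scalable target, with
# hypothetical cluster/service names and min/max task counts.
appscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/app-cluster/app-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Attach a target-tracking policy: this one policy both adds tasks when
# average service CPU exceeds the target and removes them when it drops.
appscaling.put_scaling_policy(
    PolicyName="app-service-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/app-cluster/app-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```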
