Lower ECS Agent cleanup interval #1310

KlaasH · 2023-06-13T17:57:45Z

Per this discussion on Slack and this documentation, at least some of the 6/12-6/13/2023 instability seems to have been caused by the EC2 instances running out of disk space. It turns out the containers take about 5GB each, and they get cleaned up by a process that checks every 30 minutes and deletes any that are older than 3 hours. That means they can stick around for up to 3.5 hours, so when there's a lot of crashing happening, we end up exceeding the 10 images it takes to fill up the disk.
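To make the arithmetic concrete, here is a rough sketch of the worst case under the current settings. The numbers are the estimates quoted above; the function names and the assumption that one task dies at a steady interval are mine, purely for illustration.

```python
import math

# Figures quoted in this issue; treat them as rough estimates.
CLEANUP_WAIT_MIN = 180      # only containers older than 3 hours get deleted
CLEANUP_INTERVAL_MIN = 30   # the cleanup process checks every 30 minutes
DISK_CAPACITY = 10          # roughly 10 of them fill the disk


def worst_case_retention_min(wait_min, interval_min):
    """Longest a stopped container can stay on disk: it can become
    eligible right after one check and has to wait for the next one."""
    return wait_min + interval_min


def stopped_containers_on_disk(avg_task_duration_min, wait_min, interval_min):
    """Upper bound on stopped containers left on disk, assuming one task
    dies every avg_task_duration_min minutes (a simplification)."""
    retention = worst_case_retention_min(wait_min, interval_min)
    return math.ceil(retention / avg_task_duration_min)


retention = worst_case_retention_min(CLEANUP_WAIT_MIN, CLEANUP_INTERVAL_MIN)
piled_up = stopped_containers_on_disk(10, CLEANUP_WAIT_MIN, CLEANUP_INTERVAL_MIN)
print(f"worst-case retention: {retention} min")
print(f"tasks dying every 10 min leave up to {piled_up} containers "
      f"(disk fills at about {DISK_CAPACITY})")
```

So with tasks crashing every 10 minutes, the backlog could reach about twice what the disk holds before the first cleanup catches up.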

We should lower the ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION and ECS_IMAGE_CLEANUP_INTERVAL parameters for the ECS agent. If we set them to 50 and 10 minutes, respectively, nothing would stick around for more than an hour, so we could handle anything over a 6 minute average task duration (which might be lower than what's actually possible anyway, based on health check grace periods etc). Update: I just saw two tasks die after 4 minutes then 2 minutes, so to remove disk space as a factor, we would probably want to make the CLEANUP_WAIT_DURATION quite short.
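A quick sketch of where the 6 minute figure comes from, and what "quite short" might mean for the 2 minute failures. It assumes the same rough model as above (about 10 containers fill the disk, tasks die at a steady rate); the helper names are hypothetical, and I haven't checked what minimum values the agent actually accepts.

```python
DISK_CAPACITY = 10  # roughly 10 containers fill the disk (estimate from this issue)


def min_safe_task_duration_min(wait_min, interval_min, capacity=DISK_CAPACITY):
    """Shortest average task duration that keeps the number of stopped
    containers at or below capacity, given worst-case retention."""
    return (wait_min + interval_min) / capacity


def max_wait_for_task_duration_min(task_duration_min, interval_min,
                                   capacity=DISK_CAPACITY):
    """Largest cleanup wait that still copes with tasks dying every
    task_duration_min minutes, for a fixed cleanup interval."""
    return task_duration_min * capacity - interval_min


current = min_safe_task_duration_min(180, 30)   # 3 hour wait, 30 minute interval
proposed = min_safe_task_duration_min(50, 10)   # 50 minute wait, 10 minute interval
print(f"current settings cope with tasks averaging {current} min or longer")
print(f"proposed settings cope with tasks averaging {proposed} min or longer")

# For the 2 minute failures mentioned in the update, keeping a 10 minute interval:
print(f"wait needed for 2 min tasks: {max_wait_for_task_duration_min(2, 10)} min")
```

Under these assumptions, covering tasks that die after 2 minutes would mean a cleanup wait of about 10 minutes rather than 50.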

KlaasH mentioned this issue Jun 23, 2023

Increase instance size to improve stability #1311

Closed
