Upgrade from Amazon Linux 2 to 2022 (or 2023?) #1278

Open
BryanQuigley opened this issue Oct 28, 2022 · 8 comments

@BryanQuigley
Contributor

BryanQuigley commented Oct 28, 2022

Detailed Description

This does not have any current urgency, but wanted to get these notes and context written down.

AWS has announced a new Amazon Linux release structure with Amazon Linux 2022. They also released an ECS-optimized version, which is the variant of AL2 that DB uses. The 2022 versions are all in preview right now.

Context

  • It's the next version of Amazon Linux (and we use ECS optimized builds), so we are going to eventually want to move to it.
  • We can drop older cgroup support
  • General performance improvement including around memory and cgroup improvements
  • We can try items to help save memory using newer features - like zswap or zram (didn't work on AL2)

Alternatives

If they introduce a newer Fargate that is backed by this more modern OS, there may be a chance we can switch to Fargate instead. That is blocked on being able to set the kernel parameter vm.max_map_count.
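For reference, here is a minimal sketch of how the current value of that parameter could be checked on a Linux host or inside a container. It assumes the standard procfs location for the sysctl; the function names and the `required` threshold are hypothetical, and the value the app actually needs is not stated in this issue.

```python
from pathlib import Path
from typing import Optional

# Standard procfs location of the vm.max_map_count sysctl on Linux.
MAX_MAP_COUNT_PATH = Path("/proc/sys/vm/max_map_count")


def read_max_map_count(path: Path = MAX_MAP_COUNT_PATH) -> Optional[int]:
    """Return the current vm.max_map_count, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def meets_requirement(current: Optional[int], required: int) -> bool:
    """True only when the value is known and at least `required` (hypothetical threshold)."""
    return current is not None and current >= required
```

Running `read_max_map_count()` in a task on each platform would show whether the value can be raised there at all.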

Possible Implementation

Changing to it is trivial, has been slightly tested: c780b2f

We don't necessarily need to wait until it's out of preview, but we do want to do a lot of performance testing/comparison to confirm it's better.

  1. Make a fresh deployment on staging
  2. Get fresh benchmark numbers (using https://github.com/PublicMapping/districtbuilder/tree/develop/load-tests) and memory usage
  3. Make the terraform change to the 2022 image in a test branch so it's on staging
  4. Get 2022 image benchmark numbers and memory usage
  5. Decide if it's worth pursuing for a production move. If no, document. If yes, continue
  6. Make terraform change PR.
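
The comparison in steps 2, 4 and 5 could be reduced to a per-metric percent change, roughly as sketched below. The metric names are hypothetical placeholders, not the actual output of the load-tests directory, and whether a given change is "better" depends on the metric (lower is better for latency and memory, higher for throughput).

```python
from typing import Dict


def percent_change(baseline: float, candidate: float) -> float:
    """Percent change from baseline to candidate; negative means the value went down."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def compare_runs(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Percent change for every metric present in both runs."""
    shared = baseline.keys() & candidate.keys()
    return {name: percent_change(baseline[name], candidate[name]) for name in sorted(shared)}


# Hypothetical numbers, for illustration only.
al2_run = {"p95_latency_ms": 400.0, "peak_memory_mb": 2000.0}
al2022_run = {"p95_latency_ms": 300.0, "peak_memory_mb": 1500.0}
print(compare_runs(al2_run, al2022_run))
```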
@KlaasH
Collaborator

KlaasH commented Mar 9, 2023

Ok, an update on where this stands:

  • It's now called Amazon Linux 2023
  • It's still in preview. They released an RC0 some time in mid-late 2022 and they're now on RC3, released in February.
  • I don't see any indication that there's a concrete, or even not-concrete, timeline for finalizing it. Presumably they'll keep making RCs and fixing bugs until they feel it's ready for production.
  • The release notes and the FAQ say "Release Candidate is not recommended for production workloads."

I think it's in our interest to wait until they're fully done. The new OS might provide better performance or it might be about the same, but there's no reason to think it'll make such a significant difference that we want to get on board the moment it's available. Most of the work of this task will be load testing and evaluation, and doing that early, on a preview, is probably not a substitute for doing it on the actual final AMI. So it makes sense to do it once, when the thing is ready.

@BryanQuigley
Contributor Author

@aaronxsu
Collaborator

Hey @KlaasH, I saw the above message from Bryan that the newer version is now released. Would you suggest that we make the move now, or would you like to review the above doc next sprint before we decide?

@KlaasH KlaasH removed their assignment Mar 28, 2023
@KlaasH
Collaborator

KlaasH commented Mar 28, 2023

I don't have a very clear sense of how much performance improvement we expect from upgrading and how important it is that we capture it. The zswap thing sounds promising, but also sounds like it could be a trade-off since I assume we'd have to reserve a chunk of memory for that, so the amount of normal memory available would be reduced. I don't know the implications of the cgroup changes.
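
As a rough way to size that trade-off, the sketch below checks whether zswap is on and computes the ceiling of its compressed pool. It assumes the zswap parameters are exposed under `/sys/module/zswap/parameters`, as described in the kernel's zswap documentation; the function names are hypothetical. As far as I understand the kernel docs, `max_pool_percent` is an upper bound on a pool that grows on demand rather than memory set aside up front, but that is worth confirming on the actual AMI.

```python
from pathlib import Path
from typing import Optional

# Assumed sysfs location of the zswap module parameters.
ZSWAP_PARAMS = Path("/sys/module/zswap/parameters")


def zswap_enabled(params_dir: Path = ZSWAP_PARAMS) -> Optional[bool]:
    """True/False if the `enabled` parameter is readable, None if zswap is not exposed."""
    try:
        raw = (params_dir / "enabled").read_text().strip()
    except OSError:
        return None
    return raw in ("Y", "1")


def zswap_pool_ceiling_bytes(total_ram_bytes: int, max_pool_percent: int) -> int:
    """Largest size the compressed pool may reach for a given max_pool_percent."""
    return total_ram_bytes * max_pool_percent // 100
```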

I think the first big question for me is whether it makes sense to upgrade our EC2 instances or if we're hoping to be able to upgrade to Fargate and think it will be possible soon enough that it doesn't make sense to spend time on an intermediate upgrade. The issue description above mentions that we need to be able to increase vm.max_map_count. Here's an issue where a number of people mention needing that parameter, and the latest update, from February, is that they're making progress (at least on the broader issue of sysctl support in general, no specific word on whether that parameter will be available, though I would think it's on their list to at least try to add, since it sounds like there's a common use case that requires it).

So yeah, we definitely could make the move now, but whether we should depends on how much we expect to gain from it, how important those gains are to the functionality or stability of the app, and whether we're likely to keep it for a while or change again soon.

@aaronxsu
Collaborator

aaronxsu commented Mar 28, 2023

Thanks @KlaasH, I was skeptical about the zswap bit as well, and I agree with your reasoning. I think we can get a sense of the performance implications through load tests once we have capacity to do load testing again. Regarding a new version of Fargate with sysctl support (which it seems would unblock us from switching, given our need to configure vm.max_map_count), did you have any luck finding this on their roadmap? A light search did not help me much...

Hey @BryanQuigley could you expand on why we may drop the code block for reading control group memory max when determining the docker memory limit after this upgrade please? I think I am missing some context here.

@BryanQuigley
Contributor Author

I don't understand how they plan to implement the sysctl kernel changes - but it sounds like it's a way off.

> for reading control group memory max

The cgroup commit that can be changed is: 700d413

All our Linux/Mac machines are running cgroups v2, while our production/staging sites are using v1. There are also numerous memory improvements in cgroups v2 (https://docs.kernel.org/admin-guide/cgroup-v2.html#issues-with-v1-and-rationales-for-v2) that may be related to some of the issues with the agent being killed. In other words, I don't know if it's worth troubleshooting why it's happening on the old OS when a new one is available.
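
To illustrate the v1/v2 difference when reading the container memory limit, here is a minimal sketch. It is not the code from the commit above; it assumes the conventional mount point `/sys/fs/cgroup` and the usual file names (`memory.max` on v2, `memory/memory.limit_in_bytes` on v1), and the function names are hypothetical. Supporting only v2 would let the v1 branch go away.

```python
from pathlib import Path
from typing import Optional

# Conventional cgroup mount point inside a container (assumption).
CGROUP_ROOT = Path("/sys/fs/cgroup")


def detect_cgroup_version(root: Path = CGROUP_ROOT) -> int:
    """Return 2 if `root` looks like a unified (v2) hierarchy, else 1."""
    # cgroup.controllers only exists in a cgroup v2 hierarchy.
    return 2 if (root / "cgroup.controllers").exists() else 1


def read_memory_limit_bytes(root: Path = CGROUP_ROOT) -> Optional[int]:
    """Return the memory limit in bytes, or None if unlimited (v2) or unreadable."""
    if detect_cgroup_version(root) == 2:
        limit_file = root / "memory.max"
    else:
        limit_file = root / "memory" / "memory.limit_in_bytes"
    try:
        raw = limit_file.read_text().strip()
    except OSError:
        return None
    if raw == "max":  # v2 spelling of "no limit"
        return None
    try:
        # On v1, "no limit" shows up as a very large number, returned as-is here.
        return int(raw)
    except ValueError:
        return None
```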

@KlaasH
Collaborator

KlaasH commented Aug 14, 2023

Update (sort of) re Fargate:
The issue I was watching on the AWS "containers roadmap" repo about adding sysctl parameter control to Fargate (aws/containers-roadmap#460) was resolved, but without adding the one we need (vm.max_map_count).

There's another issue, aws/containers-roadmap#1452, that's specific to that one parameter. There hasn't been much activity on it, but until three days ago the last comment was "#460 should handle this issue." Now that that hasn't happened, maybe it will get some attention. Then again maybe not—max_map_count was mentioned in the discussion for #460, so it's possible they concluded it's not feasible to expose that one. But hopefully they just decided to prioritize other ones but will be continuing to work on it. We shall see.

@aaronxsu
Collaborator

Thanks @KlaasH for continuing to keep an eye on it. Please keep us posted!


3 participants