Old EFS certificates not removed #124

Open
ballock opened this issue Mar 21, 2022 · 4 comments

Comments

@ballock

ballock commented Mar 21, 2022

We are running an EC2 instance with 512MB of memory and 3 EFS mounts, using the EFS helper.

After 6 months of uptime, the mounts failed and the machine ran into a number of issues caused by a full /run filesystem.
du shows
13556 ./fs-3ac8f8f3.efs.ROTATED_OUT.20137+/certs
certs# ls -l |wc -l
3390

The directory holds hourly certificates from the last 6 months. With 3 EFS mounts on the machine, these certificates filled up its 47MB /run filesystem.

Please implement a garbage collector for the certs.
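To make the request concrete, here is a minimal sketch of what such a garbage collector could do. It is not efs-utils code: the helper name `prune_old_files` and the `keep` parameter are hypothetical, and it assumes that only the most recently written files in a certs directory are still needed.

```python
from pathlib import Path


def prune_old_files(directory, keep=2):
    """Delete all but the `keep` most recently modified regular files.

    Hypothetical helper, not part of efs-utils. Returns the removed paths.
    """
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    # Sort newest first so the files to keep sit at the front of the list.
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    removed = []
    for path in files[keep:]:
        path.unlink()
        removed.append(path)
    return removed
```

Keeping the newest few files instead of deleting by age avoids removing a certificate that may still be in use if the clock jumps or rotation stalls. Whether that is safe for the real certs directory is an assumption that would need checking against the watchdog code.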

OS: Debian 10.11 (has /var/run symlinked to /run on ramfs)
EFS helper version: 1.30.2 (I checked, and the latest version doesn't have cleanup either)

@Cappuccinuo
Contributor

Hey,

Thanks for the report.

The certs are stored in /var/run/efs rather than /run. According to the log you posted, the certs take 13556KB, which is about 13MiB. Can you confirm that the efs-utils pem files are what is causing the system issue?

We do have cleanup logic running in our watchdog (https://github.com/aws/efs-utils/blob/master/src/watchdog/__init__.py#L824-L834). If the file system is unmounted and then mounted again, the certs should be cleaned up at that point. Can you elaborate on the failed mounts: did the mount fail outright, or did it stop making progress? If the folder cannot be cleaned up, can you:

  1. Make sure the watchdog is running by using systemctl status amazon-efs-mount-watchdog

  2. Turn on debug logging by setting logging_level = DEBUG in the config file (/etc/amazon/efs/efs-utils.conf), restart the watchdog process, and check whether any error is logged when those mount state directories are removed?
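The two steps above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the config edit assumes the file contains a line of the form `logging_level = ...`, and the status check uses `systemctl is-active --quiet` (a scripting-friendly alternative to `systemctl status`) with the unit name given in step 1.

```python
import re
import subprocess


def set_logging_level(config_text, level="DEBUG"):
    """Return config_text with each `logging_level = ...` line set to `level`.

    Works on the text line by line so comments and other settings are kept.
    """
    pattern = re.compile(r"^([ \t]*logging_level[ \t]*=[ \t]*).*$", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + level, config_text)


def watchdog_is_active(unit="amazon-efs-mount-watchdog"):
    """True if systemd reports the unit as active. Requires systemctl."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0
```

Writing the modified text back to /etc/amazon/efs/efs-utils.conf and restarting the watchdog still has to be done with root privileges; the sketch leaves those steps to the operator.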

@ballock
Author

ballock commented Mar 30, 2022

> The certs are stored on /var/run/efs, instead of /run,

At least on Debian-based systems, /var/run is a symlink to /run. Thus, /var/run/efs is effectively in the /run tmpfs filesystem.

> from the log you posted, the certs take 13556KB, which is 13MiB.

That's correct for one filesystem mount. I have 3 EFS mounts on the machine, and together with some system files normally in /run they take up all 47MB that are available on the 512MB memory machine.
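The arithmetic can be checked with the numbers reported in this thread. The sketch below treats du's figure as KiB and df's 47M as MiB, and assumes all three mounts hold about as many certs as the one measured; both are assumptions.

```python
CERT_DIR_KIB = 13556  # du output for one mount's certs directory
CERT_COUNT = 3390     # from `ls -l | wc -l`, so it includes the "total" line
MOUNTS = 3            # EFS mounts on the machine
RUN_SIZE_MIB = 47     # size of the /run tmpfs


def cert_usage_mib(dir_kib=CERT_DIR_KIB, mounts=MOUNTS):
    """Estimated space taken by certs across all mounts, in MiB."""
    return dir_kib * mounts / 1024


total = cert_usage_mib()
print(f"certs across {MOUNTS} mounts: {total:.1f} MiB of {RUN_SIZE_MIB} MiB")
print(f"average size per entry: {CERT_DIR_KIB / CERT_COUNT:.1f} KiB")
print(f"left for everything else: {RUN_SIZE_MIB - total:.1f} MiB")
```

Under these assumptions the certs alone account for most of the tmpfs, which leaves very little room for the system files that normally live in /run.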

This is the output from the current run, taken after cleaning up the pem files. You can see there is room for up to 39M more of pem files.

admin@VM:~$ df -h /var/run
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            47M  7.7M   39M  17% /run
admin@VM:~$ free
              total        used        free      shared  buff/cache   available
Mem:         476032      159396       68440        7884      248196      296536
Swap:             0           0           0
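A check like the df call above can be automated so that a nearly full tmpfs is noticed before certificate creation starts to fail. This is a sketch using the standard library's `shutil.disk_usage`; the function names, the 80% threshold, and the choice of /run as the default path are assumptions, not efs-utils behavior.

```python
import shutil


def usage_fraction(path="/run"):
    """Fraction (0.0 to 1.0) of the filesystem containing `path` that is used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def is_nearly_full(path="/run", threshold=0.8):
    """True when the used fraction has reached the threshold."""
    return usage_fraction(path) >= threshold
```

Calling `is_nearly_full("/var/run")` from a cron job or monitoring agent would cover the same filesystem on Debian, since /var/run is a symlink to /run there. The value can differ slightly from df's Use% column, which is computed from used and available blocks.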

> Can you double confirm that the efs-utils pem is causing system issue?

Yes.

> We do have cleanup logic running in our watchdog (https://github.com/aws/efs-utils/blob/master/src/watchdog/__init__.py#L824-L834). If the file system is umounted and then mount again, the certs should be cleaned up then. Can you elaborate on the failed the mount part, is that mount failed, or the mount is not making any progress?

I guess I was ambiguous. No umount was attempted on the 3 current EFS mounts. They had been running for 6 months when the EFS watchdog failed to create new pem certificates and so could not provide them to stunnel. Stunnel failed to re-establish the link, and the mounts became stale. It was also impossible to re-mount the EFS mounts.

I guess you can reproduce the problem by filling up /run on a Debian machine with random data and waiting for another re-keying attempt from the EFS watchdog.

@Cappuccinuo
Contributor

Thanks, got your point.

While we have someone investigating the issue, can you for now unmount the file system once a month so that the watchdog can clean up the state file directory?

@ballock
Author

ballock commented Apr 6, 2022

Thanks for taking this seriously. I'll work around the issue for the time being.
