/usr/bin/amazon-efs-mount-watchdog - OSError: [Errno 28] No space left on device #154

Open
vparmeland opened this issue Jan 17, 2023 · 6 comments

@vparmeland

vparmeland commented Jan 17, 2023

Our servers regularly crash and become inaccessible via SSM and SSH. The only fix is to stop the server; sometimes it restarts normally, and sometimes we have to destroy it.

After investigating, I found the following log entries, which coincide with the servers becoming unavailable:

Jan 16 13:31:11 ip-XX-XX-XX-142 dhclient[3600]: XMT: Solicit on eth0, interval 120860ms.
Jan 16 13:33:12 ip-XX-XX-XX-142 dhclient[3600]: XMT: Solicit on eth0, interval 115990ms.
Jan 16 13:35:08 ip-XX-XX-XX-142 dhclient[3600]: XMT: Solicit on eth0, interval 129620ms.
Jan 16 13:37:18 ip-XX-XX-XX-142 dhclient[3600]: XMT: Solicit on eth0, interval 108240ms.
Jan 16 13:37:46 ip-XX-XX-XX-142 env: OSError: [Errno 28] No space left on device
Jan 16 13:37:46 ip-XX-XX-XX-142 env: During handling of the above exception, another exception occurred:
Jan 16 13:37:46 ip-XX-XX-XX-142 env: Traceback (most recent call last):
Jan 16 13:37:46 ip-XX-XX-XX-142 env: File "amazon-efs-mount-watchdog", line 2014, in
Jan 16 13:37:46 ip-XX-XX-XX-142 env: main()
Jan 16 13:37:46 ip-XX-XX-XX-142 env: File "/usr/bin/amazon-efs-mount-watchdog", line 2004, in main
Jan 16 13:37:47 ip-XX-XX-XX-142 env: unmount_count_for_consistency,
Jan 16 13:37:47 ip-XX-XX-XX-142 env: File "/usr/bin/amazon-efs-mount-watchdog", line 1005, in check_efs_mounts
Jan 16 13:37:47 ip-XX-XX-XX-142 env: rewrite_state_file(state, state_file_dir, state_file)
Jan 16 13:37:47 ip-XX-XX-XX-142 env: File "/usr/bin/amazon-efs-mount-watchdog", line 921, in rewrite_state_file
Jan 16 13:37:47 ip-XX-XX-XX-142 env: json.dump(state, f)
Jan 16 13:37:47 ip-XX-XX-XX-142 env: OSError: [Errno 28] No space left on device
Jan 16 13:37:47 ip-XX-XX-XX-142 systemd-udevd: fork of child failed: Cannot allocate memory
Jan 16 13:37:47 ip-XX-XX-XX-142 systemd: amazon-efs-mount-watchdog.service: main process exited, code=exited, status=1/FAILURE
Jan 16 13:37:47 ip-XX-XX-XX-142 systemd: Unit amazon-efs-mount-watchdog.service entered failed state.
Jan 16 13:37:47 ip-XX-XX-XX-142 systemd: amazon-efs-mount-watchdog.service failed.
Jan 16 13:38:27 ip-XX-XX-XX-142 systemd: amazon-efs-mount-watchdog.service holdoff time over, scheduling restart.
Jan 16 13:38:45 ip-XX-XX-XX-142 systemd: Stopped amazon-efs-mount-watchdog.
Jan 16 13:38:52 ip-XX-XX-XX-142 systemd: Started amazon-efs-mount-watchdog.
Jan 16 13:39:46 ip-XX-XX-XX-142 dhclient[3600]: XMT: Solicit on eth0, interval 126470ms.
Jan 16 13:41:45 ip-XX-XX-XX-142 dhclient[3600]: XMT: Solicit on eth0, interval 126770ms.
Jan 16 13:42:21 ip-XX-XX-XX-142 amazon-ssm-agent: runtime/cgo: pthread_create failed: Resource temporarily unavailable
Jan 16 13:42:51 ip-XX-XX-XX-142 amazon-ssm-agent: SIGABRT: abort
Jan 16 13:43:09 ip-XX-XX-XX-142 amazon-ssm-agent: PC=0x7fad8a3b4051 m=2 sigcode=18446744073709551610
Jan 16 13:43:16 ip-XX-XX-XX-142 amazon-ssm-agent: goroutine 0 [idle]:
Jan 16 13:43:49 ip-XX-XX-XX-142 amazon-ssm-agent: runtime: unknown pc 0x7fad8a3b4051
Jan 16 13:44:24 ip-XX-XX-XX-142 amazon-ssm-agent: stack: frame={sp:0x7fad63126bc0, fp:0x0} stack=[0x7fad62727678,0x7fad63127278)
Jan 16 13:43:49 ip-XX-XX-XX-142 amazon-ssm-agent: runtime: unknown pc 0x7fad8a3b4051
Jan 16 13:44:24 ip-XX-XX-XX-142 amazon-ssm-agent: stack: frame={sp:0x7fad63126bc0, fp:0x0} stack=[0x7fad62727678,0x7fad63127278)
Jan 16 13:50:40 ip-XX-XX-XX-142 journal: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.5G available → current limit 1.5G).
Jan 16 13:50:40 ip-XX-XX-XX-142 kernel: Linux version 4.14.301-224.520.amzn2.x86_64 (mockbuild@ip-10-0-47-71) (gcc version 7.3.1 20180712 (Red Hat 7.3.1-15) (GCC)) #1 SMP Fri Dec 9 09:57:03 UTC 2022
Jan 16 13:50:40 ip-XX-XX-XX-142 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-4.14.301-224.520.amzn2.x86_64 root=UUID=a482dce8-a78a-42c8-931e-7a3bbdd3eb43 ro console=tty0 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0 nvme_core.io_timeout=4294967295 rd.emergency=poweroff rd.shell=0
Jan 16 13:50:40 ip-XX-XX-XX-142 kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jan 16 13:50:40 ip-XX-XX-XX-142 kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Jan 16 13:50:40 ip-XX-XX-XX-142 kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Jan 16 13:50:40 ip-XX-XX-XX-142 kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Jan 16 13:50:40 ip-XX-XX-XX-142 kernel: x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.

Storage:

[ssm-user@ip-xxx-xxx-xxx-142 bin]$ df -H
Filesystem Size Used Avail Use% Mounted on
devtmpfs 17G 0 17G 0% /dev
tmpfs 17G 0 17G 0% /dev/shm
tmpfs 17G 521k 17G 1% /run
tmpfs 17G 0 17G 0% /sys/fs/cgroup
/dev/nvme0n1p1 275G 17G 259G 6% /
127.0.0.1:/ 9.3E 56G 9.3E 1% /experiments
tmpfs 3.4G 0 3.4G 0% /run/user/1000

@RyanStan
Member

RyanStan commented Jan 23, 2023

Hi @nk74, can you run ls /var/run/efs?

I found this online:

[OSError: [Errno 28] No space left on device] error will be triggered in any situation in which the data or the metadata associated with an I/O operation can't be written down anywhere because of lack of space.

I'm wondering if perhaps there are too many files written to this directory.
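
Since the error can come from either data or metadata running out of room, it may help to check free space and free inodes for the filesystem that backs the directory. A minimal Python sketch using os.statvfs; the helper name space_report is hypothetical and is not part of efs-utils:

```python
import os

def space_report(path):
    """Return free bytes and inode counts for the filesystem containing path."""
    st = os.statvfs(path)
    return {
        # Blocks available to unprivileged users, converted to bytes
        "free_bytes": st.f_bavail * st.f_frsize,
        # Inodes available to unprivileged users
        "free_inodes": st.f_favail,
        # Total inodes on the filesystem (may be 0 on filesystems that do not report it)
        "total_inodes": st.f_files,
    }

if __name__ == "__main__":
    # Use /var/run/efs when it exists, otherwise fall back to the root filesystem
    target = "/var/run/efs" if os.path.isdir("/var/run/efs") else "/"
    print(target, space_report(target))
```

If free_bytes is large but free_inodes is near zero, the filesystem has run out of inodes rather than blocks.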

@vparmeland
Author

vparmeland commented Jan 23, 2023

Same result on each instance:

fs-xxxx.xxxxx.20749  fs-xxxx.xxxx.20749+  stunnel-config.fs-xxxx.xxxxx.20749
fs-xxx.xxxxx.20430  fs-xxx.xxxxx.20430+  stunnel-config.fs-xxx.xxxx.20430

@RyanStan
Member

I'm wondering if this is related to "Old EFS Certificates not removed". I'm worried that if you have a long-running mount, it could be using up too many inodes because these certs are not getting cleaned up.

Two things:

  1. Can you run df -i to see how many free inodes you have?
  2. Can we see how many files are in certs? ls -l fs-xxxx.xxxx.20749+/certs | wc -l
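
Counting the entries directly avoids the extra "total" line that ls -l prints. A small Python sketch, assuming it runs with enough privileges to read the directory; count_entries is a hypothetical helper, and the path passed to it should be the real certs directory on the instance:

```python
import os

def count_entries(directory):
    """Count directory entries without building a full listing in memory."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)
```

For example, count_entries("/var/run/efs") returns the number of entries at the top level of the state directory.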

@vparmeland
Author

Inodes (instance A & B)

~% df -i
Filesystem       Inodes  IUsed    IFree IUse% Mounted on
devtmpfs        4056041    314  4055727    1% /dev
tmpfs           4058035      1  4058034    1% /dev/shm
tmpfs           4058035    433  4057602    1% /run
tmpfs           4058035     16  4058019    1% /sys/fs/cgroup
/dev/nvme0n1p1 41941952 626071 41315881    2% /
127.0.0.1:/           0      0        0     - /efs
tmpfs           4058035      1  4058034    1% /run/user/1000
df -i
Filesystem        Inodes  IUsed     IFree IUse% Mounted on
devtmpfs         4067049    314   4066735    1% /dev
tmpfs            4069043      1   4069042    1% /dev/shm
tmpfs            4069043    460   4068583    1% /run
tmpfs            4069043     16   4069027    1% /sys/fs/cgroup
/dev/nvme0n1p1 134216640 545901 133670739    1% /
127.0.0.1:/            0      0         0     - /efs
tmpfs            4069043      1   4069042    1% /run/user/1000
tmpfs            4069043      1   4069042    1% /run/user/0

Certs

% ls -l | wc -l fs-xxxx.xxxxx.20369+/certs
wc: fs-xxxxx.xxxxxx.20369+/certs: No such file or directory
/var/run/efs% sudo ls -al fs-xxxx.xxxxxx.21004+/certs
total 4
drwxr-x--- 2 root root   60 Jan 24 08:58 .
drwxr-xr-x 4 root root  180 Jan 24 08:58 ..
-rw-r--r-- 1 root root 3317 Jan 24 08:58 00.pem
/var/run/efs% cd /var/run/efs
sudo ls -al fs-xxxx.xxxxx.20369+/certs
ls: cannot access fs-xxxx.xxxxx.20369+/certs: No such file or directory
/var/run/efs% sudo ls -al fs-xxxx.xxxxx.20369+/certs
ls: cannot access  fs-xxxx.xxxxx.20369+/certs: No such file or directory

@RyanStan
Member

RyanStan commented Jan 25, 2023

Looks like there are plenty of free inodes. I'll need to dig deeper into why that json.dump call may be throwing that error. We may need to add some logging output to see the size of the file that json.dump is attempting to write, and whether a bug is causing it to become oversized.
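
As a rough illustration of that idea, the state could be serialized first so its size can be logged before anything touches the disk. This is only a sketch and not the efs-utils implementation: the function name write_state_logged is hypothetical, and the argument order simply mirrors the rewrite_state_file(state, state_file_dir, state_file) call in the traceback.

```python
import json
import logging
import os
import tempfile

def write_state_logged(state, state_file_dir, state_file):
    """Serialize state, log its size, then atomically replace the state file.

    Returns the number of bytes written.
    """
    payload = json.dumps(state).encode("utf-8")
    logging.debug("Writing %d bytes to state file %s", len(payload), state_file)

    # Write to a temp file in the same directory so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=state_file_dir, prefix=state_file + ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp_path, os.path.join(state_file_dir, state_file))
    except OSError:
        # For example ENOSPC: remove the partial temp file, then re-raise for the caller
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    return len(payload)
```

Logging the size before the write means the value is still recorded even when the write itself fails.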

@RyanStan
Member

RyanStan commented May 12, 2023

Have you run into this lately? As part of the 1.35.0 release, we put in a debug line to track the size of the state file that we write to disk, which is the line that we saw crashing earlier in your log (the json.dump).

You can enable debug logging with sed -i '/logging_level = INFO/s//logging_level = DEBUG/g' /etc/amazon/efs/efs-utils.conf. Then if the crash happens again, we can see the size of the file that we were attempting to write. The log is stored at /var/log/amazon/efs/mount.log
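
If sed is inconvenient (for example in a provisioning script), the same substitution can be sketched in Python. The helper name enable_debug_logging is hypothetical; it assumes the config file contains the literal text "logging_level = INFO", exactly as the sed command does, and it needs write access to the file:

```python
def enable_debug_logging(conf_path):
    """Replace 'logging_level = INFO' with 'logging_level = DEBUG' in place.

    Returns True if the file was changed, False if no matching text was found.
    """
    with open(conf_path) as f:
        text = f.read()
    updated = text.replace("logging_level = INFO", "logging_level = DEBUG")
    if updated != text:
        with open(conf_path, "w") as f:
            f.write(updated)
    return updated != text
```

Calling enable_debug_logging("/etc/amazon/efs/efs-utils.conf") as root would apply the change to the config file named above.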

@RyanStan RyanStan added the bug label May 15, 2023