
In pods with multiple EFS PVC mounts, the EFS-CSI driver seems to sometimes mix up mount points #635

Closed
jgoeres opened this issue Feb 17, 2022 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@jgoeres

jgoeres commented Feb 17, 2022

/kind bug

Hi all,

we are using EFS CSI driver v1.3.2 on EKS v1.21.5.

We are running a big data app where significant parts of the number crunching logic has been implemented with Spark 3.1.1. We use the Spark operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) v1.2.1-3.0.0 (the issue described here also occurs with 1.2.3-3.1.1) to run the Spark jobs.
The Spark operator defines a CRD "SparkApplication". In our application pods, whenever we need to run a number crunching job, we create a SparkApplication instance, which the Spark operator picks up to create a Spark driver pod. The driver itself then creates one or more executor pods which do most of the actual work.

In our Spark applications, we require access to two different PVCs (e.g., A and B) which are both provisioned through the EFS-CSI driver and which we mount into the Spark pods (say, to mount points /a and /b). To make this happen, we add the corresponding volumes and volume mounts definitions to the SparkApplication objects.
(As a side note, due to some technical constraints inside the Spark operator, when it creates the driver pod, it cannot add these volumes and volume mounts to the driver pod spec directly. Instead, it defines a mutating webhook to intercept its own pod creation calls and add the volumes that way.)

We run roughly 500 of these Spark jobs every night.

What we have been observing for a while now is that on average about one of these jobs fails each night (sometimes a few more, on some nights none). When searching for the cause of the failures, we found that the two volumes had been mounted into the wrong locations, i.e., PVC A was mounted to /b and PVC B to /a.

Our first guess was that something in the whole Spark (operator) volume handling was wrong. At first we of course blamed ourselves and added a bit of logging to our own app to see in detail what the SparkApplication objects we create actually look like. However, this logging confirmed that we created the SparkApplication objects properly (i.e., A => /a, B => /b). The next thing we considered was that maybe the whole webhook magic of the Spark operator had an issue. So we added a watch for pod creation events to check if the driver and/or executor pods are created with "flipped" volume mounts by the Spark operator. But the pod definitions were also as they should be.

Then we noticed that the issues started around the time when we switched from the old EFS provisioner to the EFS CSI driver.
So we started looking at the logs of the efs-csi-controller and efs-csi-node pods. The controllers looked unsuspicious (they barely log anything), but we found a lot of error messages in the efs-csi-node pods. Interestingly, these error messages tend to happen more or less all the time, not only when the pods with the "mismounted" volumes are starting. We also see them on efs-csi-node pods on nodes where the Spark jobs are not running (we have a dedicated node group exclusively for the Spark pods). Alas, we couldn't make much sense of the logs; below is a pretty representative excerpt:

E0217 00:26:30.830919       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/de5eda52-6866-4b20-b883-d51e2352e027/volumes/kubernetes.io~csi/pvc-3442cb2f-35fe-460c-9d39-a0e3e58c122a/mount": unmount failed: waitid: no child processes
Unmounting arguments: /var/lib/kubelet/pods/de5eda52-6866-4b20-b883-d51e2352e027/volumes/kubernetes.io~csi/pvc-3442cb2f-35fe-460c-9d39-a0e3e58c122a/mount
Output: 
E0217 00:29:10.065894       1 mount_linux.go:150] Mount failed: wait: no child processes
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0107eb66a3b257f7e,tls fs-5098120b:/ /var/lib/kubelet/pods/21589d39-2369-4dde-acbd-425dd91da1de/volumes/kubernetes.io~csi/pvc-278df91d-9579-403b-8c88-a3608f8272f6/mount
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"

E0217 00:29:10.065957       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Could not mount "fs-5098120b:/" at "/var/lib/kubelet/pods/21589d39-2369-4dde-acbd-425dd91da1de/volumes/kubernetes.io~csi/pvc-278df91d-9579-403b-8c88-a3608f8272f6/mount": mount failed: wait: no child processes
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-0107eb66a3b257f7e,tls fs-5098120b:/ /var/lib/kubelet/pods/21589d39-2369-4dde-acbd-425dd91da1de/volumes/kubernetes.io~csi/pvc-278df91d-9579-403b-8c88-a3608f8272f6/mount
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
E0217 00:45:27.978968       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/de3d6089-cb21-4379-8d0b-0bd6de633583/volumes/kubernetes.io~csi/pvc-3442cb2f-35fe-460c-9d39-a0e3e58c122a/mount": unmount failed: wait: no child processes
Unmounting arguments: /var/lib/kubelet/pods/de3d6089-cb21-4379-8d0b-0bd6de633583/volumes/kubernetes.io~csi/pvc-3442cb2f-35fe-460c-9d39-a0e3e58c122a/mount
Output: 
E0217 00:46:45.178755       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/c8cd133e-1cd4-48b5-b779-ce17b29927bb/volumes/kubernetes.io~csi/pvc-278df91d-9579-403b-8c88-a3608f8272f6/mount": unmount failed: wait: no child processes
Unmounting arguments: /var/lib/kubelet/pods/c8cd133e-1cd4-48b5-b779-ce17b29927bb/volumes/kubernetes.io~csi/pvc-278df91d-9579-403b-8c88-a3608f8272f6/mount
Output: 
E0217 00:50:28.226563       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/4d2996de-7158-4610-a15d-b416f1b91461/volumes/kubernetes.io~csi/pvc-278df91d-9579-403b-8c88-a3608f8272f6/mount": unmount failed: waitid: no child processes
Unmounting arguments: /var/lib/kubelet/pods/4d2996de-7158-4610-a15d-b416f1b91461/volumes/kubernetes.io~csi/pvc-278df91d-9579-403b-8c88-a3608f8272f6/mount
Output: 
E0217 01:04:15.403109       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/0f9f1a2e-2b8d-48a0-b1a7-70e993ddfea5/volumes/kubernetes.io~csi/pvc-3442cb2f-35fe-460c-9d39-a0e3e58c122a/mount": unmount failed: wait: no child processes
Unmounting arguments: /var/lib/kubelet/pods/0f9f1a2e-2b8d-48a0-b1a7-70e993ddfea5/volumes/kubernetes.io~csi/pvc-3442cb2f-35fe-460c-9d39-a0e3e58c122a/mount
Output: 
E0217 01:11:19.014288       1 mount_linux.go:150] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t efs -o accesspoint=fsap-02e9c79694e5c8ba4,tls fs-5098120b:/ /var/lib/kubelet/pods/2a7274ec-55d6-45ba-83e1-725abfeca83e/volumes/kubernetes.io~csi/pvc-3442cb2f-35fe-460c-9d39-a0e3e58c122a/mount
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/lib64/python3.7/threading.py", line 926, in _bootstrap_inner
	self.run()
  File "/lib64/python3.7/threading.py", line 870, in run
	self._target(*self._args, **self._kwargs)
  File "/sbin/mount.efs", line 1051, in poll_tunnel_process
	test_tunnel_process(tunnel_proc, fs_id)
  File "/sbin/mount.efs", line 1041, in test_tunnel_process
	% (tunnel_proc.returncode, out.strip(), err.strip()))
AttributeError: 'NoneType' object has no attribute 'strip'

b'mount.nfs4: mounting 127.0.0.1:/ failed, reason given by server: No such file or directory'

Could anyone provide any indication of what could be going on here?
Is this a known issue? I searched through the current issues but couldn't find one, which is not too big a surprise, since having more than one mount, and having these mounted into pods that are launched many times a day, is probably a rather special use case.
Any indications on how to proceed with finding the root cause? Can we, e.g., increase the log level to see each and every mount operation in more detail?

Regards

J

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 17, 2022
@617m4rc

617m4rc commented Feb 18, 2022

Sounds pretty much like #282 doesn't it?

@bnapalm

bnapalm commented Feb 22, 2022

We have experienced a similar bug multiple times, although I am not 100% sure if it is the exact same as described here; it is more in line with the behaviour described in the closed issue #282.

In our case, while browsing via a shell on one of the pods, I noticed that the same volume was actually mounted in two places, although those mount points should have held different volumes (one of them was correct and the other was the wrong volume).

Our workload writes files into dated folders, so by finding dated files from months back in the wrong volumes I could see that this issue has recurred multiple times over a few months, but it has gone unnoticed most of the time.

I don't have logs and we usually only notice this after the fact when some file read causes a mysterious "file not found" error, or some data suddenly appears in the wrong volume. It happens rarely and at random times, to random pods, while other pods from the same deployment/replicaset are working as expected.

What I have noticed is this: inside the pod, these volumes are mounted as nfs4 mounts from localhost with random port numbers. When the issue happens, multiple mounts that should be different volumes show up with the exact same port number, and those are also the volumes that have been mixed up. My uneducated guess would be that the driver assigns a random port number but does not check for conflicting ports (or, if it does, the check fails). I might be wrong, but I suspect it has something to do with the same port being assigned twice.

I think this is potentially a very serious issue, as some workloads might be deleting data from efs volumes, and deleting data from the wrong volume would be a serious problem.

@jgoeres I would suggest checking the port numbers when you observe this bug, to see if the behaviour is the same.
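For example, something along these lines should flag conflicting ports (a quick sketch in Python, assuming the in-pod EFS mounts show up in /proc/mounts as nfs4 mounts from 127.0.0.1 with a port=... option, which is what I observed; run it inside the affected pod):

import re
from collections import defaultdict

# map stunnel port -> list of mount points using it
ports = defaultdict(list)

with open("/proc/mounts") as f:
    for line in f:
        device, mountpoint, fstype, options = line.split()[:4]
        if fstype == "nfs4" and device.startswith("127.0.0.1"):
            match = re.search(r"(?:^|,)port=(\d+)(?:,|$)", options)
            if match:
                ports[match.group(1)].append(mountpoint)

for port, mountpoints in sorted(ports.items()):
    status = "CONFLICT" if len(mountpoints) > 1 else "ok"
    print(f"port {port}: {status} -> {', '.join(mountpoints)}")

Two different volumes sharing the same port here would match what I described above.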

@jgoeres
Author

jgoeres commented Mar 21, 2022

Since our setup described in my initial post is a bit involved, we tried to build a simple reproducer. We create a configurable number of PVCs via an EFS-CSI-driver-backed storage class. A "controller" pod mounts these and puts a unique marker file in each. It then proceeds to launch "validator" pods in rapid succession. The validators check if the marker files are in the expected share. If they are, they report success to the controller and exit (to be restarted a second later by the controller); if not, they report an error to the controller and stay running (the idea being that this would allow us to exec into the pod and check for the root cause).
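For illustration, the per-mount check each validator performs boils down to roughly the following (a sketch only; the mount points, marker names, and the report-to-controller step are placeholders, not our actual code):

import os
import sys
import time

# expected marker file per mount point (placeholders)
MOUNTS = {"/a": "marker_a", "/b": "marker_b", "/c": "marker_c", "/d": "marker_d", "/e": "marker_e"}

def check_mounts() -> list:
    """Return one error description per mismounted directory."""
    errors = []
    for mountpoint, expected_marker in MOUNTS.items():
        found = sorted(os.listdir(mountpoint))
        if expected_marker not in found:
            errors.append(f"{mountpoint}: expected {expected_marker}, found {found}")
    return errors

if __name__ == "__main__":
    errors = check_mounts()
    if errors:
        print("\n".join(errors))  # plus: report the errors to the controller (placeholder)
        while True:               # stay alive so we can exec into the pod and inspect
            time.sleep(60)
    sys.exit(0)                   # success: exit, the controller relaunches us shortly after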
Funnily enough, this did not reproduce the problem (despite launching many more pods and having more than just two mounts). We even made the validators write lots of data to the shares to see if the issue appears under load - no difference.
In our production setup (with Spark) the issue still appears. We meanwhile have three mounts per pod, and as expected, the problem tends to appear more often now.
Any pointers on how to proceed with analyzing the issue (e.g., which logs to provide here, which log levels to increase, etc.)?

@bnapalm

bnapalm commented Mar 21, 2022

I find it puzzling that you don't observe the problem in your synthetic test setup.

Are the "validator" pods being ran one by one? I have no evidence of this, but maybe the issue is more likely to appear if many pods mount shares? At least in our case, the pods experiencing this employ HPA, so they are often scaled from 3 to a few dozen and back. Is your production spark workload similar?

Maybe you can modify your test to run as a deployment with a few dozen replicas. The ones that "succeed" and exit would get re-launched as new pods by the ReplicaSet to keep the desired replica count. If you define the marker files as known beforehand (like a file named "a" on the volume meant to be mounted to /mnt/a/, or something like that), you could probably even skip the controller - you would be able to tell a pod has the bug just by its running time - but I don't know your setup, you might still want an alert in one way or another.

If many parallel pods don't reproduce the issue, another edge case one could try is setting node affinity or anti-affinity for the parallel pods. Maybe the issue is more likely with multiple parallel pods with the same EFS mounts running on the same instance. Or maybe the other way around - if they are on different hosts. Again, I am just guessing here.

@jgoeres
Author

jgoeres commented Mar 28, 2022

In the meantime, the reproducer we built now actually manages to reproduce the issue. I had the controller permanently (re-)creating 30 "validator" pods and let it run since Friday evening. When I checked today, the issue had occurred 9 times out of 145358 launched pods (each mounting the same 5 PVCs into directories /a to /e).
I will pick one occurrence, validator-5. These are the contents of the 5 mounts as observed and logged by the validator:

Contents of directory /a
  /a/marker_a
Contents of directory /b
  /b/marker_b
Contents of directory /c
  /c/marker_a
Contents of directory /d
  /d/marker_d
Contents of directory /e
  /e/marker_e

As you can see, directory /c, which should contain a marker file marker_c, instead contains marker_a.
Whenever a validator runs into that problem, it tells the coordinator about it. The coordinator then dumps the last hour of logs of all containers in all relevant pods of the EFS CSI driver and writes them to disk (actually the location is another share that is only mounted by the coordinator). "Relevant pods" here means the two instances of the controller deployment and the pod of the efs-csi-node daemonset on the node on which the affected validator was running. Interestingly, the efs-csi-controller pods rarely log anything at all (only their liveness-probe container), while the efs-csi-node pods generally show a lot of errors that we couldn't make much sense of and couldn't really correlate with the actual problem the validator observed. The same log messages also seem to appear on all efs-csi-node pods, including on nodes where the issue did not occur.
A few example error messages; this one is usually repeated in "bursts", several times within a few milliseconds:

2022-03-27T12:17:06.812380740Z E0327 12:17:06.812231       1 mount_linux.go:150] Mount failed: exit status 32
2022-03-27T12:17:06.812410161Z Mounting command: mount
2022-03-27T12:17:06.812413727Z Mounting arguments: -t efs -o accesspoint=fsap-08e9cdd468b96df59,tls fs-xxxxxxxx:/ /var/lib/kubelet/pods/f354fc99-d404-430e-be7c-d77ba2a6cb6f/volumes/kubernetes.io~csi/pvc-00fb90e1-5891-4e1d-814e-f487ffb38998/mount
2022-03-27T12:17:06.812416825Z Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
2022-03-27T12:17:06.812419735Z Exception in thread Thread-1:
2022-03-27T12:17:06.812421993Z Traceback (most recent call last):
2022-03-27T12:17:06.812424264Z   File "/lib64/python3.7/threading.py", line 926, in _bootstrap_inner
2022-03-27T12:17:06.812426677Z     self.run()
2022-03-27T12:17:06.812428808Z   File "/lib64/python3.7/threading.py", line 870, in run
2022-03-27T12:17:06.812431082Z     self._target(*self._args, **self._kwargs)
2022-03-27T12:17:06.812433340Z   File "/sbin/mount.efs", line 1051, in poll_tunnel_process
2022-03-27T12:17:06.812447056Z     test_tunnel_process(tunnel_proc, fs_id)
2022-03-27T12:17:06.812449968Z   File "/sbin/mount.efs", line 1041, in test_tunnel_process
2022-03-27T12:17:06.812452246Z     % (tunnel_proc.returncode, out.strip(), err.strip()))
2022-03-27T12:17:06.812454551Z AttributeError: 'NoneType' object has no attribute 'strip'
2022-03-27T12:17:06.812456779Z
2022-03-27T12:17:06.812458838Z b'mount.nfs4: Connection refused'

Just as mounting fails, unmounting does, too:

2022-03-27T12:11:26.143588884Z E0327 12:11:26.143469       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/3ffb9a49-6c24-4bdd-91f5-b46848532f99/volumes/kubernetes.io~csi/pvc-efe1a6d0-6cfe-4180-9931-ab098c2c44cc/mount": unmount failed: waitid: no child processes
2022-03-27T12:11:26.143613860Z Unmounting arguments: /var/lib/kubelet/pods/3ffb9a49-6c24-4bdd-91f5-b46848532f99/volumes/kubernetes.io~csi/pvc-efe1a6d0-6cfe-4180-9931-ab098c2c44cc/mount
2022-03-27T12:11:26.143617830Z Output:

This second type of error also sometimes comes with an additional message:

2022-03-27T11:34:19.180656638Z E0327 11:34:19.180495       1 driver.go:103] GRPC error: rpc error: code = Internal desc = failed to check if volume is mounted: could not get consistent content of /proc/mounts after 3 attempts
2022-03-27T11:34:19.223541283Z E0327 11:34:19.223421       1 driver.go:103] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/445e64b9-a7f2-4322-894d-2ca016e027e2/volumes/kubernetes.io~csi/pvc-791253b1-cc8f-4694-ad9c-0ca5eae4ecb4/mount": unmount failed: wait: no child processes
2022-03-27T11:34:19.223567312Z Unmounting arguments: /var/lib/kubelet/pods/445e64b9-a7f2-4322-894d-2ca016e027e2/volumes/kubernetes.io~csi/pvc-791253b1-cc8f-4694-ad9c-0ca5eae4ecb4/mount
2022-03-27T11:34:19.223571035Z Output:
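(For reference, the "consistent content of /proc/mounts" part of that message appears to come from the mount library re-reading /proc/mounts several times and only trusting the result once two consecutive reads are identical; under heavy concurrent (un)mount activity that can fail. Schematically, in Python rather than the driver's Go:)

def consistent_read(path: str, attempts: int = 3) -> bytes:
    """Read a file that may change underneath us (e.g. /proc/mounts) until two
    consecutive reads return identical content. A sketch of the pattern, not the
    driver's actual code."""
    with open(path, "rb") as f:
        old = f.read()
    for _ in range(attempts):
        with open(path, "rb") as f:
            new = f.read()
        if new == old:
            return new
        old = new
    raise RuntimeError(f"could not get consistent content of {path} after {attempts} attempts")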

Another observation is that of the four nodes onto which the validator pods could be scheduled, 8 out of 9 validators that ran into the problem had been scheduled on the same node:

validator-11   0/1     Running    0   2d5h    10.0.142.28    ip-10-0-142-51.eu-central-1.compute.internal   
validator-13   0/1     Running    0   27h     10.0.143.235   ip-10-0-142-51.eu-central-1.compute.internal   
validator-15   0/1     Running    0   2d12h   10.0.142.69    ip-10-0-142-51.eu-central-1.compute.internal   
validator-17   0/1     Running    0   2d12h   10.0.143.82    ip-10-0-142-51.eu-central-1.compute.internal   
validator-19   0/1     Running    0   30h     10.0.141.1     ip-10-0-142-51.eu-central-1.compute.internal   
validator-2    0/1     Running    0   2d10h   10.0.140.204   ip-10-0-142-51.eu-central-1.compute.internal   
validator-23   0/1     Running    0   2d9h    10.0.142.86    ip-10-0-142-51.eu-central-1.compute.internal   
validator-29   0/1     Running    0   8h      10.0.144.246   ip-10-0-145-52.eu-central-1.compute.internal   
validator-5    0/1     Running    0   27h     10.0.140.151   ip-10-0-142-51.eu-central-1.compute.internal   

It is also interesting that the error messages sometimes occur in the seconds before the error is reported by the validator, while sometimes the last log statement of the efs-csi-node pod happened several minutes before the incident.
These are the log files the coordinator dumped for the incident of validator-5 (naming scheme is [nameOfValidator]-[nameOfEfsPod]-[nameOfContainer].log):

 bash-4.4# ls -la validator-5*
 -rw-r--r-- 1 50062 50062      0 Mar 27 12:30 validator-5_20220327_123016_efs-csi-controller-77595f6b9f-7hhl6_csi-provisioner.log
 -rw-r--r-- 1 50062 50062      0 Mar 27 12:30 validator-5_20220327_123016_efs-csi-controller-77595f6b9f-7hhl6_efs-plugin.log
 -rw-r--r-- 1 50062 50062  41400 Mar 27 12:30 validator-5_20220327_123016_efs-csi-controller-77595f6b9f-7hhl6_liveness-probe.log
 -rw-r--r-- 1 50062 50062      0 Mar 27 12:30 validator-5_20220327_123016_efs-csi-controller-77595f6b9f-vcb2q_csi-provisioner.log
 -rw-r--r-- 1 50062 50062      0 Mar 27 12:30 validator-5_20220327_123016_efs-csi-controller-77595f6b9f-vcb2q_efs-plugin.log
 -rw-r--r-- 1 50062 50062  41400 Mar 27 12:30 validator-5_20220327_123016_efs-csi-controller-77595f6b9f-vcb2q_liveness-probe.log
 -rw-r--r-- 1 50062 50062      0 Mar 27 12:30 validator-5_20220327_123016_efs-csi-node-lmnh8_csi-driver-registrar.log
 -rw-r--r-- 1 50062 50062  77328 Mar 27 12:30 validator-5_20220327_123016_efs-csi-node-lmnh8_efs-plugin.log
 -rw-r--r-- 1 50062 50062 207000 Mar 27 12:30 validator-5_20220327_123016_efs-csi-node-lmnh8_liveness-probe.log

In this example, there appears to be a temporal correlation between the time of the incident (12:30:15) and the last log messages of that node's efs-csi-node pod (a failed unmount at 12:30:07). You can find the last hour of logs of that efs-csi-node pod attached.
Since the validators keep running when they run into the problem, we have plenty of time to exec into these pods and look for the root cause. Inside the affected validator pod, netstat doesn't show any open TCP connections, only a Unix socket:

bash-4.4# netstat -aop
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
udp        0      0 0.0.0.0:60441           0.0.0.0:*                           1/java               off (0.00/0/0)
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node   PID/Program name     Path
unix  2      [ ]         STREAM     CONNECTED     61502970 1/java
Active Bluetooth connections (servers and established)
Proto  Destination       Source            State         PSM DCID   SCID      IMTU    OMTU Security
Proto  Destination       Source            State     Channel

The efs-csi-node pod has several hundred listening ports, established connections and unix sockets (see attached file).

Any pointers on how to proceed from here would be welcome; this issue has been plaguing us for a while now, and since we now have three mounts per pod instead of two, it is getting more frequent.

Regards

J

efs-csi-node_netstat.txt
validator-5_20220327_123016_efs-csi-node-lmnh8_efs-plugin.log

@0x0013

0x0013 commented Apr 22, 2022

@jgoeres that's a great amount of detail. I hope someone knowledgeable can use your data to get to the bottom of this, seeing as it is hard to reproduce on purpose.

We are also plagued by this issue. Out of curiosity, did you observe the conflicting mounts using the same port in the output of the mount command?

@ihatecodingaf

@jgoeres We were facing a similar kind of issue. It seems the EFS driver version needs to be upgraded to v1.3.7. After this change we did not notice the issue anymore.

@zoramthanga

We're hitting the same issue with efs-csi driver v1.3.4. Can anyone else confirm whether v1.3.7 might have fixed the problem? cc: @wongma7

@0x0013

0x0013 commented Jul 15, 2022

We were having this issue regularly with helm chart up to v2.2.4 (driver v1.3.6). After updating to chart v2.2.6 (driver v1.3.8) the reports of this issue stopped for more than a month.
But after updating to chart v2.2.7 (driver v1.4.0) yesterday, we experienced the same issue again today.

The behavior was the same as I noticed previously - the same EFS volume was mounted on different paths, and I also confirmed in the mount point list (the mount command) that both conflicting mounts showed up with the exact same port number.

@padumergue

padumergue commented Aug 1, 2022

My team is being impacted by the same issue, observed with versions 1.3.6 and 1.3.8. We mount three persistent volumes backed by a single EFS file system using dynamic provisioning. Our workload tries to read specific files from specific volumes and sometimes fails because of the mis-mount.

I tried digging a little, and so far I'm thinking efs-utils, which provides the support for mount -t efs, might be the source of our troubles.

There is a PR open there that tries to fix a race condition with port selection for the connection to EFS (aws/efs-utils#129). I'm not familiar with the CSI retry policy and how this driver handles mount retries, but it sounds quite related to the issue here.

@tsmetana

Hello. I was wondering: I ran into something that sounds very similar (issue #695). Does anybody see similar symptoms in the CSI node driver and kubelet logs (some volumes are mounted multiple times)?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 24, 2022
@jgoeres
Author

jgoeres commented Nov 24, 2022

/remove-lifecycle stale
This issue has just today resurfaced on a production system, so it should be kept active.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 24, 2022
@jsafrane
Contributor

jsafrane commented Dec 1, 2022

I think it's caused by aws/efs-utils#125 - the race can result in a timeout/failure, or in very rare cases it could IMO surface as efs-utils using the same stunnel port for two different volumes and mounting an unexpected one there. I have no reproducer though.

@lemosric

lemosric commented Dec 1, 2022

We have identified the race condition that could cause this issue. The root cause is in the TLS port selection: the race happens in the window between sock.close() and the stunnel being established. We are already working on the fix.
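To sketch the kind of window involved (purely schematic Python, not the efs-utils code; the function names are placeholders): the mount helper finds a free local port by binding a socket, releases it, and only then starts stunnel on that port - a concurrent mount can grab the same port in between.

import socket
import subprocess

def choose_free_tls_port() -> int:
    """Schematic version of the racy pattern: let the kernel pick a free port,
    then release it before stunnel has been started."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("localhost", 0))
    port = sock.getsockname()[1]
    sock.close()   # from here on the port is free again ...
    return port

def start_stunnel(stunnel_config_path: str) -> subprocess.Popen:
    # ... until stunnel (configured to accept on 127.0.0.1:<port>) is actually
    # listening. A concurrent mount running choose_free_tls_port() in this window
    # can be handed the same port, and the NFS mount using "port=<port>" then
    # talks to whichever tunnel ends up owning it.
    return subprocess.Popen(["stunnel", stunnel_config_path])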

@jsafrane
Contributor

jsafrane commented Dec 1, 2022

@lemosric have you considered aws/efs-utils#129 ?

@lemosric

lemosric commented Dec 1, 2022

Hi @jsafrane,
Yes, we looked into the solution provided in aws/efs-utils#129. The PR goes in the right direction by closing the socket later. We are also shuffling the port within the available port range.
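Purely as an illustration of that direction (schematic Python, not the actual change; the port range and names below are placeholders): candidate ports are tried in random order, and the chosen port stays bound until the tunnel is ready.

import random
import socket

def find_tls_port(lower: int = 20000, upper: int = 20500):
    """Pick a random free local port and return it together with the still-bound
    socket, so the caller can keep holding the port until stunnel is ready.
    Illustrative range and logic only."""
    candidates = list(range(lower, upper))
    random.shuffle(candidates)        # spread concurrent mounts across the range
    for port in candidates:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("localhost", port))
            return sock, port         # caller closes the socket once stunnel is listening
        except OSError:
            sock.close()
    raise RuntimeError("no available TLS port in range")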

@Ashley-wenyizha
Contributor

The latest release of the efs-csi-driver includes efs-utils v1.34.4, which contains the fix. Closing.
