CSI volumes staging mount path collision between namespaces with CSI plugins that support staging #18741
I've tested it with
So the issue seems to be originating from https://github.com/hashicorp/nomad/blob/v1.6.2/client/pluginmanager/csimanager/volume.go#L170 which doesn't include the namespace, if any:
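To illustrate the collision, here is a minimal Go sketch (hypothetical helper names, not the actual Nomad implementation) of a staging path keyed only by plugin and volume ID, versus one that also includes the volume's namespace:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// stagingPathLegacy mirrors the pre-fix behavior: the staging path is keyed
// only by plugin and volume ID, so two volumes with the same ID in different
// namespaces resolve to the same directory on the host.
func stagingPathLegacy(dataDir, pluginID, volID string) string {
	return filepath.Join(dataDir, "client", "csi", "node", pluginID, "staging", volID)
}

// stagingPathNamespaced adds the volume's namespace as a path component,
// so same-ID volumes from different namespaces get distinct staging dirs.
func stagingPathNamespaced(dataDir, pluginID, ns, volID string) string {
	return filepath.Join(dataDir, "client", "csi", "node", pluginID, "staging", ns, volID)
}

func main() {
	// Two namespaces, same volume ID: legacy paths collide.
	a := stagingPathLegacy("/var/nomad/data", "ceph-csi", "testvol")
	b := stagingPathLegacy("/var/nomad/data", "ceph-csi", "testvol")
	fmt.Println(a == b) // collision

	// With the namespace in the path, they no longer collide.
	na := stagingPathNamespaced("/var/nomad/data", "ceph-csi", "A", "testvol")
	nb := stagingPathNamespaced("/var/nomad/data", "ceph-csi", "B", "testvol")
	fmt.Println(na == nb)
	fmt.Println(na)
}
```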
Thanks @the-nando and @ygg-drop for raising this issue; I'll add this to our roadmapping list.
I've tested this in my lab and see the same results with the staging directory. I also noticed another oddity when creating volumes in separate namespaces where the volume name is the same: only one volume was created on the storage side, and it was accessible by both jobs in different namespaces. Output from both scenarios is below.

Same ID in volume file

Volume 1
Volume 2
VOLUME1 - NS A
VOLUME2 - NS B
Job 1
Job 2

Same name in volume file

Volume 1
Volume 2
VOLUME1 - NS A
VOLUME2 - NS B
Controller log during volume create
Job 1 run and file created in volume
Job 2 run and ls of its volume
Hi folks, just a heads up that I'm picking this up (as well as #20424). But @ron-savoia I've split out #20530 around the
CSI volumes are namespaced. But the client does not include the namespace in the staging mount path. This causes CSI volumes with the same volume ID but different namespace to collide if they happen to be placed on the same host. Fixes: #18741
Initial draft PR is up here #20532. I think the upgrade path ends up being ok, but I need to do some end-to-end testing to verify that before marking this ready for review. Will do that testing early next week.
Upgrade testing didn't go so well, and I've at least broken unstaging when clients are upgraded before servers (which isn't the recommended upgrade path but we want to handle it gracefully). Going to do some test code rework that'll help debug this plus #20424 in finer detail.
CSI volumes are namespaced. But the client does not include the namespace in the staging mount path. This causes CSI volumes with the same volume ID but different namespace to collide if they happen to be placed on the same host. Fixes: #18741
After a bit of rework I've tested the upgrade path from 1.8.0-beta.1 to the patch I've got in #20532. Looks like this should work now and I'll mark it ready for review. Test details below.

Existing behavior

First, I started with 1.8.0-beta.1 and a running allocation that consumes a CSI volume. I see the following filesystem and mounts.

$ sudo tree /var/nomad/data/client/csi
/var/nomad/data/client/csi
├── controller
│ └── org.democratic-csi.nfs
├── node
│ └── org.democratic-csi.nfs
│ ├── per-alloc
│ │ └── 55113309-8135-fd80-112b-d9f0f2c4cc6f # consuming alloc
│ │ └── csi-volume-nfs
│ │ └── rw-file-system-single-node-writer # per-alloc mount point
│ │ └── test.txt
│ └── staging
│ └── csi-volume-nfs # staging is not namespaced
│ └── rw-file-system-single-node-writer # staging mount point
│ └── test.txt
└── plugins
├── 13bf0a2a-7866-7ded-8436-2c53f1268a41
│ └── csi.sock
└── 60e62cfc-0bbd-19ff-8f4d-a97b8e17d5cd
└── csi.sock
14 directories, 4 files
$ mount | grep csi-volume-nfs
192.168.1.170:/srv/nfs_data/v/csi-volume-nfs on /var/nomad/data/client/csi/node/org.democratic-csi.nfs/staging/csi-volume-nfs/rw-file-system-single-node-writer type nfs4 (rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.37.105.17,local_lock=none,addr=192.168.1.170)
192.168.1.170:/srv/nfs_data/v/csi-volume-nfs on /var/nomad/data/client/csi/node/org.democratic-csi.nfs/per-alloc/55113309-8135-fd80-112b-d9f0f2c4cc6f/csi-volume-nfs/rw-file-system-single-node-writer type nfs4 (rw,noatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.37.105.17,local_lock=none,addr=192.168.1.170)

I then stop the job, so that we have a baseline behavior.
Note that this leaves behind the parent directories of the mount points. That's mostly harmless but not very tidy, so I've opened #20544 to follow up on fixing that.

Client upgrade

The first test is upgrading the client first (although this isn't our recommended approach). First I run the job, and I see the following filesystem and mounts.
Then I upgrade the client from 1.8.0-beta.1 to the patch in #20532 and restart. I checked the filesystem and mounts are unchanged after restoring the allocation (as expected). Then I stopped the job, and see that the old mounts and paths are cleaned up just as before (the claim is also released):
Then I started the job again, and see that staging is properly namespaced.
Server upgrade

Next I wiped the client and server and started over from a clean datadir and 1.8.0-beta.1. After deploying the job that consumes the volume, I have the following filesystem and mounts:
Then I upgrade the server to the patched version and restart it. Then, just to verify restore is good I restarted the client without upgrading. As expected, that's all good. Next I stopped the job, and everything unmounted as expected. The volume claim was also freed.
Then I re-ran the job (now with upgraded server but non-upgraded client), and see the old client behavior is still safely in place.
Lastly, I upgraded the client as well. I stopped the job, and restarted the job.
Bah, I missed that the unstage code path for the new code isn't quite working as expected. Need to fix that. The plugin returns no error in that case but it's not unstaging because the unstage path for some reason doesn't include the namespace. I've had a pass through the code but it's not yet obvious why this is missing. Will pick that up tomorrow morning.
Ok, the problem was that I was checking the existence of the staging path using the path inside the plugin container, which of course will never exist from the perspective of the CSI hook. With that adjustment, everything appears to be working as we want.

Mount / unmount with patched version

After running the job, our filesystem and mounts are as expected.
I stop the job and see the mounts are cleaned up.
Just to double-check everything is good, I grabbed the trace logs for the allocation and those look ok.
Client upgrade

Reset both hosts to 1.8.0-beta.1, start from a fresh datadir, and deploy the job.
Upgrade the client to the patched version and restart it. Restoring the alloc looks good in the trace logs.
Stop the job and see everything is unmounted as we'd hope.
Start the job again (still using the old server, but with the new client), and see the namespaced staging dir now.
Stop the job again, just to verify that the old server can't interfere with cleanup on a new client.
Server upgrade

Reset both client and server to 1.8.0-beta.1, start from a fresh data dir, and run the job.
Upgrade the server and restart it. Then stop the job. Everything is unmounted as expected.
Run the job again (without upgrading client), just to make sure a new server can't force incorrect behavior on an old client. This looks as expected -- the bug is still in place on the client but the mount works.
Now update the client and restart it, and stop the job. Everything is unmounted as expected.
Start the job again, showing that new server + new client results in namespaced staging as we'd expect.
CSI volumes are namespaced. But the client does not include the namespace in the staging mount path. This causes CSI volumes with the same volume ID but different namespace to collide if they happen to be placed on the same host. Fixes: #18741
CSI volumes are namespaced. But the client does not include the namespace in the staging mount path. This causes CSI volumes with the same volume ID but different namespace to collide if they happen to be placed on the same host. The per-allocation paths don't need to be namespaced, because an allocation can only mount volumes from its job's own namespace. Rework the CSI hook tests to have more fine-grained control over the mock on-disk state. Add tests covering upgrades from staging paths missing namespaces. Fixes: #18741
…e/1.6.x (#20572) CSI volumes are namespaced. But the client does not include the namespace in the staging mount path. This causes CSI volumes with the same volume ID but different namespace to collide if they happen to be placed on the same host. The per-allocation paths don't need to be namespaced, because an allocation can only mount volumes from its job's own namespace. Rework the CSI hook tests to have more fine-grained control over the mock on-disk state. Add tests covering upgrades from staging paths missing namespaces. Fixes: #18741 Co-authored-by: Tim Gross <tgross@hashicorp.com>
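The commit message's point about per-allocation paths can be seen in path terms: the per-alloc mount is keyed by the allocation ID, which is already unique across the cluster, so no namespace component is needed there. A hypothetical sketch (helper name is illustrative, not Nomad's actual code), matching the layout shown in the tree output above:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// allocPath mirrors the per-alloc mount layout from the tree output above.
// It is keyed by allocation ID, a UUID that is unique cluster-wide, so two
// allocations consuming same-ID volumes from different namespaces can never
// collide here, unlike the shared staging directory.
func allocPath(dataDir, pluginID, allocID, volID, usageMode string) string {
	return filepath.Join(dataDir, "client", "csi", "node", pluginID,
		"per-alloc", allocID, volID, usageMode)
}

func main() {
	p := allocPath("/var/nomad/data", "org.democratic-csi.nfs",
		"55113309-8135-fd80-112b-d9f0f2c4cc6f", "csi-volume-nfs",
		"rw-file-system-single-node-writer")
	fmt.Println(p)
}
```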
#20532 has been merged and will ship in Nomad 1.8.0 (with backports to supported versions)
Nomad version

Output from `nomad version`:
Operating system and Environment details
Ubuntu 23.04
Issue
CSI volumes are namespace-scoped in Nomad, yet Nomad does not include the namespace name in the mount path for CSI plugins that support staging (like `ceph-csi`). If there are namespaces with CSI volumes that have the same ID, then when jobs from those namespaces get scheduled on the same node, the staging mount paths for those volumes will collide.

Reproduction steps

1. Run a CSI plugin that supports staging (e.g. `ceph-csi`)
2. Create a volume `testvol` in both namespaces
3. Run a job in namespace A that mounts `testvol` and is constrained to a certain node
4. Run a job in namespace B that mounts `testvol` and is constrained to the same node as the job in namespace A

Expected Result

Both jobs should run successfully and each have access to the `testvol` CSI volume from their respective namespace.

Actual Result

I only tested with `multi-node-multi-writer` (CephFS volume) and the result was that `testvol` from namespace A was bind-mounted to the `per-alloc` directory for an allocation from a job from namespace B. This is a potential security issue.

The staging path looks like `$NOMAD_DATA_DIR/client/csi/node/$CSI_PLUGIN_ID/staging/testvol/rw-file-system-multi-node-multi-writer`. The namespace is not included in the path.

Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)