Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create .so symlinks for driver libraries in container #326

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

elezar
Copy link
Member

@elezar elezar commented Feb 1, 2024

When NVIDIA_DOT_SO_SYMLINKS is not equal to enabled:

elezar@dgx0126:~/src/container-toolkit$ docker run --rm -ti --runtime=nvidia-dev -e NVIDIA_DOT_SO_SYMLINKS=not-enabled -e NVIDIA_VI
SIBLE_DEVICES=all ubuntu bash -c "ls -al /usr/lib/x86_64-linux-gnu/libnvidia-ml.*"
lrwxrwxrwx 1 root root      26 Nov 27 14:19 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 -> libnvidia-ml.so.515.105.01
-rw-r--r-- 1 root root 1683960 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.105.01

When NVIDIA_DOT_SO_SYMLINKS=enabled:

elezar@dgx0126:~/src/container-toolkit$ docker run --rm -ti --runtime=nvidia-dev -e NVIDIA_DOT_SO_SYMLINKS=enabled -e NVIDIA_VISIBL
E_DEVICES=all ubuntu bash -c "ls -al /usr/lib/x86_64-linux-gnu/libnvidia-ml.*"
lrwxrwxrwx 1 root root      17 Nov 27 14:19 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx 1 root root      26 Nov 27 14:19 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 -> libnvidia-ml.so.515.105.01
-rw-r--r-- 1 root root 1683960 Feb 27  2023 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.105.01

This also works with CDI:

docker run --rm -ti --device elezar.nvidia.com/gpu=all ubuntu
root@9d74d452dc2c:/# ls -al /usr/lib/x86_64-linux-gnu/libnvidia-ml.*
lrwxrwxrwx 1 root root      17 Apr  3 14:35 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx 1 root root      25 Apr  3 14:35 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 -> libnvidia-ml.so.550.54.15
-rw-r--r-- 1 root root 2078360 Mar  5 22:01 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.54.15
root@9d74d452dc2c:/# ls -al /usr/lib/x86_64-linux-gnu/libcuda.*
lrwxrwxrwx 1 root root       12 Apr  3 14:35 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Apr  3 14:35 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.550.54.15
-rw-r--r-- 1 root root 28392536 Mar  5 22:26 /usr/lib/x86_64-linux-gnu/libcuda.so.550.54.15

Note that a --no-dot-so-symlinks option is added to the nvidia-ctk cdi generate command.

Migrated from https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/510

Depends on #327

@elezar elezar self-assigned this Feb 13, 2024
@elezar elezar force-pushed the CNT-4766/create-so-symlinks branch 4 times, most recently from e20826f to 7b5f6b3 Compare April 3, 2024 15:10
@elezar elezar marked this pull request as ready for review April 3, 2024 15:10
Copy link
Contributor

@klueska klueska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial pass. I lost some steam near the end, but wanted to drop the comments I had for now.

// Create the '' command
c := cli.Command{
Name: "create-dot-so-symlinks",
Usage: "A hook to create symlinks in the container. This can be used to process CSV mount specs",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this specific to CSV mount specs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This is not correct. Will update.

Comment on lines 90 to 94
libs, err := lookup.NewLibraryLocator(
lookup.WithLogger(m.logger),
lookup.WithRoot(containerRoot),
lookup.WithOptional(true),
).Locate("*.so." + cfg.driverVersion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you separate this into two calls? Chaining on separate lines like this always looks awkward to me in go code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will update.

continue
}
libSoPath := strings.TrimSuffix(lib, "."+cfg.driverVersion)
libSoXPaths, err := filepath.Glob(libSoPath + ".[0-9]")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably fine for our purposes, but should this actually be ".[0-9]+" t include multi-digit versions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should. I was thinking one should probably rather extract the SONAME from the target library and do it that way, but thought this heuristic was enough.

One issue is that a Glob pattern is not a regex and the + is not available there.

I suppose instead of including this logic, we chould just symlink .so -> .so.driverVersion directly and bypass the .so.[0-9] symlink here. Thoughts?

Comment on lines 39 to 36
// DotSoSymlinks allows for the creation of .so symlinks to .so.1 driver
// files to be opted out of.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically it allows them to opt-in, correct? We just happen to set the default to true so that opting-out is the "opposite" action, but the flag still exists so that one may opt-in (meaning true==create them, false==don't create them).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I initially had this as "on-by-default" with opt-out logic, but rolled that back and didn't update the comment. Will update.

Comment on lines 28 to 30
// featureNotControlledByEnvvar is used for features that have no envvar to
// allow per-container opt-in.
featureNotControlledByEnvvar = ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not immediately obvious reading this top-to-bottom how this will be used. From what I can tell, we will allow the feature being added here to be controlled by an envvar (namely NVIDIA_DOT_SO_SYMLINKS), so I'm not sure where this is going to apply yet (maybe it will be come clear later on).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: I see now -- this is just giving a name to what was previously a hard coded "".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said -- my gut says that leaving it as a hard-coded "" with a comment above the one place it is checked is clearer than adding this constant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I initially had this as "" in the code and thought a constant read a little bit better. Let me have another look at it.

Comment on lines 73 to 75
libCudaPaths, err := cuda.New(
r.Libraries(),
).Locate(".*.*")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this chaining looks awkward to me, but I don't feel too strongly against it if that is your preference.

Copy link
Member Author

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some of the issues were because this was out of date.

// Create the '' command
c := cli.Command{
Name: "create-dot-so-symlinks",
Usage: "A hook to create symlinks in the container. This can be used to process CSV mount specs",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This is not correct. Will update.

Comment on lines 90 to 94
libs, err := lookup.NewLibraryLocator(
lookup.WithLogger(m.logger),
lookup.WithRoot(containerRoot),
lookup.WithOptional(true),
).Locate("*.so." + cfg.driverVersion)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will update.

continue
}
libSoPath := strings.TrimSuffix(lib, "."+cfg.driverVersion)
libSoXPaths, err := filepath.Glob(libSoPath + ".[0-9]")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should. I was thinking one should probably rather extract the SONAME from the target library and do it that way, but thought this heuristic was enough.

One issue is that a Glob pattern is not a regex and the + is not available there.

I suppose instead of including this logic, we chould just symlink .so -> .so.driverVersion directly and bypass the .so.[0-9] symlink here. Thoughts?

Comment on lines 28 to 30
// featureNotControlledByEnvvar is used for features that have no envvar to
// allow per-container opt-in.
featureNotControlledByEnvvar = ""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I initially had this as "" in the code and thought a constant read a little bit better. Let me have another look at it.

@elezar elezar force-pushed the CNT-4766/create-so-symlinks branch 2 times, most recently from df99dc3 to 93753a9 Compare April 15, 2024 14:28
This change adds an opt-in feature for creating .so symlinks to
all injected driver files in a contianer.

If features.dot-so-symlinks = true is set in the config.toml, the creation
of symlinks for driver files is enabled. This can also be triggered on a
per-container basis using the envvar NVIDIA_DOT_SO_SYMLINKS=enabled.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the CNT-4766/create-so-symlinks branch from 93753a9 to 020f598 Compare April 15, 2024 14:31
@elezar elezar requested a review from klueska May 15, 2024 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants