
Inject additional libraries required for full display functionality #490

Open · wants to merge 7 commits into base: main

Conversation


@ehfd ehfd commented May 10, 2024

Requested reviewer: @elezar

This PR tries to fix the remaining issues that prevent proper X11 deployment in containers.

Largely, two things:

  • Add $LD_LIBRARY_PATH/gbm/nvidia-drm_gbm.so and other related libraries, including libnvidia-egl-wayland.so.1, that are not already included by the container toolkit.
  • Make the container toolkit search more paths than $LD_LIBRARY_PATH/nvidia/xorg/libglxserver_nvidia.so and $LD_LIBRARY_PATH/nvidia/xorg/nvidia_drv.so for the critical X11 libraries. This is because NVIDIA drivers installed with the .run installer place the libraries in /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.530.41.03 and /usr/lib/xorg/modules/drivers/nvidia_drv.so, as specified in https://download.nvidia.com/XFree86/Linux-x86_64/550.78/README/installedcomponents.html.

Please edit the PR as necessary.

A few things of note:

libnvidia-vulkan-producer.so is required before r550 but was removed in r550.

Member

@elezar elezar left a comment


Thanks @ehfd.

Some questions to start with.

Comment on lines 278 to 284
libRoot + "/nvidia/xorg",
libRoot + "/xorg/modules/extensions",
"/usr/lib/xorg/modules/extensions",
"/usr/X11R6/modules/extensions",
libRoot + "/xorg/modules/drivers",
"/usr/lib/xorg/modules/drivers",
"/usr/X11R6/modules/drivers",
Member

First question: Is the ordering of these important? Does it make sense to group the libRoot search paths?

Second: Are any of these paths controllable by envvars (such as XDG_DATA_DIRS for icd loaders)?

Author

@ehfd ehfd May 10, 2024

  1. Ordering is not important. I moved the libRoot paths together.

  2. Envvars: ... I don't really think there's an envvar specifically for where the X.Org modules are.

The additional paths are needed because these libraries are not guaranteed to be installed in libRoot for some installation types, including .run, despite the distro's preference (via pkg-config) for a certain different location, so they have to be hard-coded.

Author

Check the below comment for an answer to this: #490 (comment)

@@ -271,13 +274,21 @@ func newXorgDiscoverer(logger logger.Interface, driver *root.Driver, nvidiaCTKPa
lookup.NewFileLocator(
lookup.WithLogger(logger),
lookup.WithRoot(driver.Root),
lookup.WithSearchPaths(libRoot, "/usr/lib/x86_64-linux-gnu"),
lookup.WithSearchPaths([]string{
libRoot + "/nvidia/xorg",
Member

If libcuda.so.<RM_VERSION> is at /usr/lib/x86_64-linux-gnu, for example, libRoot will be /usr/lib/x86_64-linux-gnu. Is this the intent? Or should this always be /usr/lib/... or /usr/X11R6/...?

Author

@ehfd ehfd May 10, 2024

Yes. This is the intent. I removed "/usr/lib/x86_64-linux-gnu" because it's not arch-neutral and redundant with libRoot.

In Ubuntu's APT installation for NVIDIA drivers, the paths where these packages are installed resolve to /usr/lib/x86_64-linux-gnu/nvidia/xorg for both X11-related .so files, as intended.

Also, /usr/lib/x86_64-linux-gnu/xorg/modules/extensions and /usr/lib/x86_64-linux-gnu/xorg/modules/drivers are existing paths, so it's also possible for these .so files to end up there.

@@ -56,6 +56,9 @@ func NewGraphicsMountsDiscoverer(logger logger.Interface, driver *root.Driver, n
driver.Root,
[]string{
"libnvidia-egl-gbm.so.*",
"gbm/nvidia-drm_gbm.so",
"libnvidia-egl-wayland.so.*",
"libnvidia-vulkan-producer.so",
Member

libnvidia-vulkan-producer.so is a symlink which should resolve to libnvidia-vulkan-producer.so.RM_VERSION. The latter should be detected by the "standard" driver libraries unless it's at a different location to libcuda.so.RM_VERSION.

Is the issue that the .RM_VERSION library is not injected, or that the symlink is not created?

Looking at the driver manifest we have:

libnvidia-vulkan-producer.so.535.86.10 0755 OPENGL_LIB NATIVE MODULE:egl
libnvidia-vulkan-producer.so 0000 UTILITY_LIB_SYMLINK NATIVE libnvidia-vulkan-producer.so.535.86.10 MODULE:egl

Author

Neither are injected. Is it ok if I simply put "libnvidia-vulkan-producer.so*",?

Author

Looking at the driver manifest we have:

Where can I see this?

Member

The manifest is available when extracting a driver .run file.

Neither are injected. Is it ok if I simply put "libnvidia-vulkan-producer.so*",?

No, since we need to treat symlinks correctly. I assume that the issue is then that libnvidia-vulkan-producer.so.<RM_VERSION> doesn't exist at the same location as libcuda.so.<RM_VERSION> on your system. Can you confirm this?

Author

I believe so.


Symlinks should be created via hook. A bind mount may fail.

I can rewrite this for the vulkan producer and gbm/nvidia-drm_gbm.so


Symlinks should be created via hook. A bind mount may fail.

After looking at the code, this is not true here. The driver locator will take care of symlinks and resolve them on the host side.

@elezar
Member

elezar commented May 10, 2024

If you could provide some steps on how you would go about validating correct functionality, that would also be great -- we are looking to improve our automated test coverage and having a starting point would be welcome.

…ix paths

Signed-off-by: Seungmin Kim <8457324+ehfd@users.noreply.github.com>
@ehfd
Author

ehfd commented May 10, 2024

If you could provide some steps on how you would go about validating correct functionality, that would also be great -- we are looking to improve our automated test coverage and having a starting point would be welcome.

This is a pretty hard question...

The first priority, regardless, is to consult the NVIDIA driver team on how libraries are structured: something like a communication channel through which you can stay informed about library structures and related aspects of the driver in various environments.

You have cooperative colleagues at work, so consider working with them.

  --x-prefix=X-PREFIX
      The prefix under which the X components of the NVIDIA driver will be installed; the default is '/usr/X11R6'
      unless nvidia-installer detects that X.Org >= 7.0 is installed, in which case the default is '/usr'.  Only under
      rare circumstances should this option be used.

  --xfree86-prefix=XFREE86-PREFIX
      This is a deprecated synonym for --x-prefix.

  --x-module-path=X-MODULE-PATH
      The path under which the NVIDIA X server modules will be installed.  If this option is not specified,
      nvidia-installer uses the following search order and selects the first valid directory it finds: 1) `X
      -showDefaultModulePath`, 2) `pkg-config --variable=moduledir xorg-server`, or 3) the X library path (see the
      '--x-library-path' option) plus either 'modules' (for X servers older than X.Org 7.0) or 'xorg/modules' (for
      X.Org 7.0 or later).

  --x-library-path=X-LIBRARY-PATH
      The path under which the NVIDIA X libraries will be installed.  If this option is not specified, nvidia-installer
      uses the following search order and selects the first valid directory it finds: 1) `X -showDefaultLibPath`, 2)
      `pkg-config --variable=libdir xorg-server`, or 3) the X prefix (see the '--x-prefix' option) plus 'lib' on 32bit
      systems, and either 'lib64' or 'lib' on 64bit systems, depending on the installed Linux distribution.

  --x-sysconfig-path=X-SYSCONFIG-PATH
      The path under which X system configuration files will be installed.  If this option is not specified,
      nvidia-installer uses the following search order and selects the first valid directory it finds: 1) `pkg-config
      --variable=sysconfigdir xorg-server`, or 2) /usr/share/X11/xorg.conf.d.

For example, the above (visible through ./nvidia-installer -A after extracting with sh NVIDIA-Linux-x86_64-550.78.run -x) is what causes the issue here when someone installs the NVIDIA driver on a host meant to be a K8s node without any X-related libraries installed.

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path
           '/usr/lib64/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X
           driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your
           distribution and reinstall the driver.

Leading to:

/usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so
/usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so.550.78
/usr/lib64/xorg/modules/drivers/nvidia_drv.so

For container hosts, neither X (provided by xserver-xorg in Ubuntu) nor the pkg-config file for xorg-server (provided by xserver-xorg-dev in Ubuntu) is typically installed (especially on K8s clusters).

If they exist in Ubuntu:
X -showDefaultModulePath: /usr/lib/xorg/modules
pkg-config --variable=moduledir xorg-server: /usr/lib/xorg/modules
X -showDefaultLibPath: /usr/lib/x86_64-linux-gnu
pkg-config --variable=libdir xorg-server: /usr/lib/x86_64-linux-gnu
pkg-config --variable=sysconfigdir xorg-server: /usr/share/X11/xorg.conf.d

Similarly for libglvnd: pkg-config --variable=datadir libglvnd: /usr/share

These are the candidate "environment variables" you are looking for, at least for .run installers. This is not guaranteed when a distro-provided NVIDIA driver package is installed instead of the .run file.


The second priority is to test different driver installation methods (because the .run path-selection logic above does not necessarily apply to them): Ubuntu/Debian's APT repository, RHEL/Fedora/SUSE-based RPM repositories (perhaps even Arch), drivers within the NVIDIA CUDA repository (which contains .rpm and .deb as well as other package forms), and of course the .run file, then check that the set of libraries provisioned into the container from the host is consistent.

Although the above should prevent most issues, checking whether X11 is started correctly is a different story, and probably outside the scope of this project unless NVIDIA decides to consult my container: https://github.com/selkies-project/docker-nvidia-glx-desktop and provides an official one.


In conclusion, this is a really complex issue, and this PR only partially solves it.

For example, on top of this PR, it would be ideal to place nvidia_drv.so and libglxserver_nvidia.so.* into /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (in a different notation, libRoot + /nvidia/xorg/) inside the container regardless of where they were found, and to have the container toolkit generate a /usr/share/X11/xorg.conf.d/10-nvidia.conf file with the following content regardless of whether 10-nvidia.conf exists on the host:

Section "OutputClass"
    Identifier "nvidia"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "AllowEmptyInitialConfiguration"
    ModulePath "/usr/lib/x86_64-linux-gnu/nvidia/xorg"
EndSection

The above is the default behavior for the nvidia-driver-550 from Ubuntu APT and I generally like this approach a lot.


nvidia-xconfig (and possibly also nvidia-config, together with its dependencies the NVIDIA GTK libraries and libnvidia-wayland-client.so) should also be pushed into the container for a full X11 experience.

Then, the X11 aspect will be solved and we developers can call it a day.

Also, injecting 32-bit libraries is definitely desirable for usage with Wine/Proton/etc.

@ehfd ehfd requested a review from elezar May 10, 2024 15:53
Signed-off-by: Seungmin Kim <8457324+ehfd@users.noreply.github.com>
@@ -56,6 +56,9 @@ func NewGraphicsMountsDiscoverer(logger logger.Interface, driver *root.Driver, n
driver.Root,
[]string{
"libnvidia-egl-gbm.so.*",
"gbm/nvidia-drm_gbm.so",
"libnvidia-egl-wayland.so.*",
"libnvidia-vulkan-producer.so*",
Member

We want:

Suggested change
"libnvidia-vulkan-producer.so*",
"libnvidia-vulkan-producer.so.*",

and need to create the .so symlink if required as a hook.

libRoot + "/xorg/modules/extensions",
libRoot + "/xorg/modules/drivers",
"/usr/lib/xorg/modules/extensions",
"/usr/lib/xorg/modules/drivers",
Member

This looks like we're searching the following 2 paths: []string{"extensions", "drivers"} in []string{"modules", "modules/updates"} across the following "roots": []string{libRoot + "/xorg", "/usr/lib", "/usr/lib64", "/usr/X11R6"}.

Does it make sense to construct this dynamically?
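The dynamic construction suggested here could be sketched as follows. This is a hypothetical helper that follows the decomposition stated in this comment (roots × module directories × components), not the toolkit's actual code.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// buildXorgSearchPaths builds the cross product of the roots, module
// directories, and component subdirectories listed in the comment above,
// yielding every candidate X.Org module search path.
func buildXorgSearchPaths(libRoot string) []string {
	roots := []string{libRoot + "/xorg", "/usr/lib", "/usr/lib64", "/usr/X11R6"}
	moduleDirs := []string{"modules", "modules/updates"}
	components := []string{"extensions", "drivers"}

	var paths []string
	for _, root := range roots {
		for _, m := range moduleDirs {
			for _, c := range components {
				paths = append(paths, filepath.Join(root, m, c))
			}
		}
	}
	return paths
}

func main() {
	for _, p := range buildXorgSearchPaths("/usr/lib/x86_64-linux-gnu") {
		fmt.Println(p)
	}
}
```

Ordering is preserved deterministically (libRoot paths first), which matches the earlier note that the libRoot entries should be grouped together.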

Author

Yes, it would.

Author

Feel free to push to my branch.

Author

@elezar How's the progress?


Addressed this one

Co-authored-by: Evan Lezar <evanlezar@gmail.com>
Signed-off-by: Seungmin Kim <8457324+ehfd@users.noreply.github.com>
@ehfd
Author

ehfd commented Jun 2, 2024

Another update: gbm/nvidia-drm_gbm.so seems to be a symlink to libnvidia-allocator.so.1.

games-on-whales/wolf#52 (comment)

@elezar

@elezar
Member

elezar commented Jun 3, 2024

Thanks for the update @ehfd. I will have a look at this again this week.

linkDir := filepath.Dir(mount.Path)
linkPath := filepath.Join(linkDir, "gbm", "nvidia-drm_gbm.so")
target := filepath.Join("..", filename)
links = append(links, fmt.Sprintf("%s::%s", target, linkPath))


@elezar I changed this to a symlink hook, but I'm not sure if this should be done this way.

When doing it via NewMounts as before, the locator will attempt to resolve the symlink on the host side. That means the path gets only injected when the library actually exists on the host and future changes to this path (link target changes, becoming a regular file, etc...) will also be covered.

So I'd advocate putting this back into NewMounts, as originally suggested by @ehfd.

Your thoughts?

Author

This depends on @elezar 's policies on this matter.

Member

I don't think that adding gbm/nvidia-drm_gbm.so to the NewMounts will have the desired effect. Since this resolves to libnvidia-allocator.so we will only include libnvidia-allocator.so in the list of mounts and NOT create the symlink.

Note that your comment on handling changes to this path is not 100% accurate. While this is true for gbm/nvidia-drm_gbm.so, changing this to a regular file would have the downside that we may lose the resolution of the target library in this case.

Your comment does raise an interesting question though. If gbm/nvidia-drm_gbm.so exists on the host, then we could follow a similar strategy to what we do for Tegra-based systems here.

Where we distinguish between regular libraries and symlinks. In both cases any symlinks are followed to ensure that we only mount regular files into a container, but in the case of symlinks we recreate the entire symlink chain in the container. If we were to factor out this logic to be reused here, we could treat the symlinks here explicitly as separate entities to discover.

I do think that this can be done as a follow-up to this PR though.

Author

I agree that merging first is better since this is taking such a long time.

@elezar So are you satisfied with his changes?

@ehfd
Author

ehfd commented Jun 5, 2024

Thank you for the contributions on the Go code! @tux-rampage

@ehfd
Author

ehfd commented Jun 5, 2024

@tux-rampage Note that something like Signed-off-by: Name <email@users.noreply.github.com> is required in your commits in order to be merged into this repository. Might need to rebase.

@ehfd
Author

ehfd commented Jun 5, 2024

> In conclusion, this is a really complex issue, and this PR only partially solves it. […] Also, injecting 32-bit libraries is definitely desirable for usage with Wine/Proton/etc.

Some thoughts or contributions on this part of the issue can also help a lot. @tux-rampage


3 participants