
move nvidia-ctk hook command into own binary #474

Merged: 1 commit merged into NVIDIA:main from deitch:separate-hook-binary on May 21, 2024

Conversation

@deitch (Contributor) commented Apr 24, 2024

Fixes #435

This does two things.

First, it creates a new binary called nvidia-cdi-hook, as suggested by @klueska in #435. For backwards compatibility and ease of migration, I kept the functionality in nvidia-ctk as well as in nvidia-cdi-hook, so the following are equivalent:

$ nvidia-ctk hook create-symlinks
$ nvidia-cdi-hook create-symlinks

To avoid duplication, all of the command code that was under cmd/nvidia-ctk/hook/{create-symlinks,update-ldcache,chmod}/ was moved to cmd/nvidia-cdi-hook/, and cmd/nvidia-ctk/hook/hook.go now references those packages. This keeps the same functionality in place for nvidia-ctk without duplicating code.
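To make the sharing arrangement concrete, here is a minimal, self-contained sketch of the idea. The package layout, command names, and placeholder actions are hypothetical, not the repository's actual code; only the urfave/cli v2 shapes mirror the flag and command snippets quoted later in this thread. One set of hook subcommands is defined once and wired into both the standalone nvidia-cdi-hook app and the nvidia-ctk hook command group.

// Illustrative sketch only: shows how one set of hook subcommands can back
// both binaries. Names and placeholder actions are hypothetical.
package main

import (
	"log"
	"os"

	"github.com/urfave/cli/v2"
)

// hookSubcommands stands in for the commands that live under
// cmd/nvidia-cdi-hook/{create-symlinks,update-ldcache,chmod} in this PR.
func hookSubcommands() []*cli.Command {
	var cmds []*cli.Command
	for _, name := range []string{"create-symlinks", "update-ldcache", "chmod"} {
		name := name
		cmds = append(cmds, &cli.Command{
			Name:  name,
			Usage: "placeholder for the real " + name + " hook implementation",
			Action: func(c *cli.Context) error {
				log.Printf("%s invoked with args %v", name, c.Args().Slice())
				return nil
			},
		})
	}
	return cmds
}

func main() {
	// The standalone binary exposes the subcommands directly ...
	cdiHook := &cli.App{Name: "nvidia-cdi-hook", Commands: hookSubcommands()}
	_ = cdiHook

	// ... while nvidia-ctk keeps a `hook` command group that reuses the same
	// subcommands, so `nvidia-ctk hook X` and `nvidia-cdi-hook X` stay equivalent.
	ctk := &cli.App{
		Name: "nvidia-ctk",
		Commands: []*cli.Command{
			{Name: "hook", Subcommands: hookSubcommands()},
		},
	}
	if err := ctk.Run(os.Args); err != nil {
		log.Fatal(err)
	}
}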

Second, I changed nvidia-ctk cdi generate so that the generated hooks call nvidia-cdi-hook {subcommand} instead of nvidia-ctk hook {subcommand}. I am not convinced I got it entirely right, so I kept properties that reference nvidiaCTKPath in addition to nvidiaCDIHookPath, just in case.

There are 16 places left where nvidiaCTKPath remains in the code, almost all of them properties or setting options, which means they probably could be removed.

As far as I could tell, and I am not positive, references to nvidia-ctk inside the code are used only for creating hooks. But I am not 100% sure, hence leaving them around. Looking forward to comments.

@deitch (Contributor Author) commented Apr 24, 2024

I also am concerned that I missed some of the tests, or that there is a config property that needs to be added.

@elezar self-assigned this Apr 24, 2024
@elezar (Member) commented Apr 24, 2024

@deitch thanks for the contribution. I'll have a look at it early next week.

* `create-symlinks` - Create symlinks inside the directory path to be mounted into a container.
* `update-ldcache` - Update the dynamic linker cache inside the directory path to be mounted into a container.

The CDI itself is created separately, normally using the `nvidia-ctk cdi` command.


Can you document how to actually use the hook? Is it triggered automatically or users need to configure something?

@deitch (Contributor Author)

I don't quite understand; it doesn't change the current behaviour. The hook is inserted into the CDI YAML when you run nvidia-ctk cdi generate.
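For illustration, here is a sketch (not output from a real system) of the kind of hook entry that ends up in the generated spec: an executable path plus the subcommand as the leading arguments, run at the createContainer stage. The Hook shape matches the fields exercised in the ldconfig test diff quoted later in this thread; the binary path and the --folder argument are examples only.

// Sketch of a generated hook entry; values are illustrative.
package main

import "fmt"

// Hook mirrors the fields used in TestLDCacheUpdateHook later in this thread:
// an executable path, its argv, and the OCI lifecycle stage it runs in.
type Hook struct {
	Path      string
	Args      []string
	Lifecycle string
}

func main() {
	updateLDCache := Hook{
		Path:      "/usr/bin/nvidia-cdi-hook",
		Args:      []string{"nvidia-cdi-hook", "update-ldcache", "--folder", "/usr/lib/x86_64-linux-gnu"},
		Lifecycle: "createContainer",
	}
	fmt.Printf("%+v\n", updateLDCache)
}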

@lahwaacz commented Apr 25, 2024

So, can you document the current behavior?

For me it does not work by default, though: nvidia-ctk cdi generate creates a CDI spec where nvidia-ctk is used only in the hooks: sections, but it is missing from the mounts: section. Then, e.g., when I run podman run --rm -it --gpus all nvidia/cuda:12.4.1-devel-ubuntu22.04, /usr/bin/nvidia-ctk is not available in the container.

@deitch (Contributor Author)

So, can you document the current behavior?

Sure, on its way.

@deitch (Contributor Author)

For me it does not work by default though, nvidia-ctk cdi generate creates a CDI spec where nvidia-ctk is used only in hooks: sections, but it is missing in the mounts: section. Then e.g. when I run podman run --rm -it --gpus all nvidia/cuda:12.4.1-devel-ubuntu22.04, /usr/bin/nvidia-ctk is not available in the container.

Do you mind posting your CDI here? I have 2 Jetsons, but unfortunately no access to them for the next week, so I have nothing to generate CDIs on in order to test it. I also don't have remote access to anything with proper GPUs, so I am stuck working this through here without proper testing.


FWIW, I see the same behavior on the main branch, but not on the release-1.15 branch. I've compiled the same way from all 3 branches so the issue does not seem related to compiler flags etc.

@deitch (Contributor Author)

Oh interesting. So I didn't introduce the bug.

If you've compiled it from main, do you mind generating the CDI yaml from main, so we can compare? It doesn't get us all of the way, but it helps.

Hopefully @klueska or one of the other maintainers will know what we need to do to complete the devices.


@deitch (Contributor Author)

Thanks @lahwaacz . Those are identical, except for the executable. That is a good sign.

It means that whatever introduced the issue with the busId is in main, and was not introduced by this PR.

@deitch (Contributor Author)

I did a quick check of main vs v1.15.0; there is quite a bit in there that I do not understand. I think we are stuck until someone who understands how this works explains how to fix it in main (so we can apply it here).

@deitch (Contributor Author) commented May 1, 2024

Following up on this. Can someone help us understand what is different in main vs 1.15.0, such that the config is missing devices? How can we fix this?

@elezar (Member) commented May 3, 2024

Following up on this. Can someone help us understand what is different in main vs 1.15.0, such that the config is missing devices? How can we fix this?

I was able to reproduce the issue regarding the bus information for the device being empty on main. It seems as if this was related to our changes to the go-nvml wrappers. After updating the go-nvml dependency to v0.12.0-5, this issue no longer presents.

@deitch (Contributor Author) commented May 3, 2024

Excellent, thank you @elezar. It looks like that is already in, so I should be able to rebase on main and it will work?

@deitch (Contributor Author) commented May 3, 2024

@lahwaacz I rebased on main, with the fix that @elezar put in. Do you mind running your generation again? Off of v1.15.0, main and this branch, and then posting the 3 in a linked gist, so we can compare?

@elezar (Member) left a comment

@deitch I think I'm ok with the mechanics of this. The one question I have is how to support the transition from the use of nvidia-ctk hook to nvidia-cdi-hook.

Note that we would also need to include the new executable in the deb and rpm package definitions.

Name: "nvidia-ctk-path",
Usage: "Specify the path to use for the nvidia-ctk in the generated CDI specification. If this is left empty, the path will be searched.",
Destination: &opts.nvidiaCTKPath,
Name: "nvidia-cdi-hook-path",
(Member)

Note that this is a breaking change. Does it make sense to add an Alias here instead and then add logic somewhere(TM) for determining whether the nvidia-cdi-hook or nvidia-ctk is used to know how we will generate the CDI spec?

@deitch (Contributor Author)

breaking change

Do you mean that someone who might have some command process (human or scripted) that called this with --nvidia-ctk-path will now break, because that option doesn't exist?

I would think (cautiously, and probably wrongly) that it isn't that heavily used an option, and if you install a new version of the toolkit, it is ok for this to break.

If we do not want to allow that, then, sure, I can add an Alias. Your call.

(Member)

Yes, I know that this may not be the most used option, but I would prefer to either add an alias or add a new option -- possibly marking the old one as deprecated. Given the discussion around a hook constructor below, maybe a separate option is better -- calling out in the usage text that this is intended to replace the nvidia-ctk command.

@deitch (Contributor Author)

Sure. Will add it.

@deitch (Contributor Author)

What kind of behaviour do we want for that? The deprecation is easy to add to the comment:

		&cli.StringFlag{
			Name:        "nvidia-ctk-path",
			Usage:       "Specify the path to use for the nvidia-ctk in the generated CDI specification. If this is left empty, the path will be searched. DEPRECATED. Use flag --nvidia-cdi-hook-path instead.",
		},

but when someone uses it, should we:

  • treat it like they called --nvidia-cdi-hook-path and set that? I don't much like it, as they are distinct binaries.
  • do nothing with it? That doesn't seem to add value.
  • save it to somewhere and use it? But we don't use that anywhere.

🤷‍♂️

@deitch (Contributor Author)

Note that this ASSUMES that users have not renamed their nvidia-ctk executable for the purpose of injecting the hook

If a user does that, they only have themselves to blame. You always can break something if you try hard enough. If they replace /usr/bin/containerd with a symlink to /usr/sbin/halt, I don't think we can help them 😆

@deitch (Contributor Author)

We would include logic to handle both the nvidia-cdi-hook and nvidia-ctk cases from the perspective of CDI spec generation. This means that we can add aliases --nvidia-cdi-hook-path and --nvidia-ctk-path and have them treated consistently.

So

  • if the user uses --nvidia-ctk-path, then it would inject nvidia-ctk hook (2 words) everywhere in CDI
  • if the user uses --nvidia-cdi-hook-path, then it would inject nvidia-cdi-hook (1 word) everywhere in CDI
  • if the user uses both, it is an error
  • if the user uses neither, it uses nvidia-cdi-hook as the default

If I understood you correctly, sure, that works.
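A minimal sketch of the dispatch described in the list above (the helper below is illustrative, not the repository's actual code; it anticipates the basename check in the ResolveNVIDIACDIHookPath suggestion further down the thread): the basename of the configured hook path decides whether generated hooks use the legacy two-word form or the standalone binary.

// Sketch only: choose the hook argv prefix from the configured binary path.
package main

import (
	"fmt"
	"path/filepath"
)

// hookArgs is a hypothetical helper: the legacy two-word form is kept when the
// configured binary is nvidia-ctk, otherwise the standalone form is used.
func hookArgs(hookPath, subcommand string) []string {
	if filepath.Base(hookPath) == "nvidia-ctk" {
		return []string{"nvidia-ctk", "hook", subcommand}
	}
	return []string{"nvidia-cdi-hook", subcommand}
}

func main() {
	fmt.Println(hookArgs("/usr/bin/nvidia-ctk", "create-symlinks"))      // [nvidia-ctk hook create-symlinks]
	fmt.Println(hookArgs("/usr/bin/nvidia-cdi-hook", "create-symlinks")) // [nvidia-cdi-hook create-symlinks]
}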

@deitch (Contributor Author)

Adopted your suggestions. Once I saw them in place, it totally made sense.

@deitch (Contributor Author)

Anything else we need for this?

(Member)

Does:

diff --git a/cmd/nvidia-ctk/cdi/generate/generate.go b/cmd/nvidia-ctk/cdi/generate/generate.go
index fba22301..9f9e994b 100644
--- a/cmd/nvidia-ctk/cdi/generate/generate.go
+++ b/cmd/nvidia-ctk/cdi/generate/generate.go
@@ -132,13 +132,11 @@ func (m command) build() *cli.Command {
 			Destination: &opts.librarySearchPaths,
 		},
 		&cli.StringFlag{
-			Name:        "nvidia-cdi-hook-path",
-			Usage:       "Specify the path to use for the nvidia-cdi-hook in the generated CDI specification. If this is left empty, the path will be searched.",
-			Destination: &opts.nvidiaCDIHookPath,
-		},
-		&cli.StringFlag{
-			Name:        "nvidia-ctk-path",
-			Usage:       "Specify the path to use for the nvidia-ctk in the generated CDI specification. If this is left empty, the path will be searched. DEPRECATED. Use flag --nvidia-cdi-hook-path instead.",
+			Name:    "nvidia-cdi-hook-path",
+			Aliases: []string{"nvidia-ctk-path"},
+			Usage: "Specify the path to use for the nvidia-cdi-hook in the generated CDI specification. " +
+				"If not specified, the PATH will be searched for `nvidia-cdi-hook`. " +
+				"NOTE: That if this is specified as `nvidia-ctk`, the PATH will be searched for `nvidia-ctk` instead.",
 			Destination: &opts.nvidiaCDIHookPath,
 		},
 		&cli.StringFlag{

make sense?

internal/discover/graphics.go (review thread resolved)
@deitch (Contributor Author) commented May 3, 2024

The one question I have is how to support the transition from the use of nvidia-ctk hook to nvidia-cdi-hook.

Is there anything to do for this? Since they are packaged together, if you install and run a newer version of a package with ctk, it installs nvidia-cdi-hook, and nvidia-ctk cdi generate will generate CDIs with nvidia-cdi-hook rather than nvidia-ctk hook. If you update your package but have a CDI generated by an older version of nvidia-ctk cdi generate, it will still work, because the newer version of nvidia-ctk continues to support nvidia-ctk hook.

Note that we would also need to include the new executable in the deb and rpm package definitions

Is that in here?

@elezar (Member) commented May 6, 2024

Is that in here?

Yes. That's the location where the executable should be added (to the nvidia-container-toolkit-base packages).

@deitch (Contributor Author) commented May 6, 2024

Yes. That's the location where the executable should be added (to the nvidia-container-toolkit-base packages).

OK, I think I got those @elezar ; please take a look.

The only remaining thing is this.

We keep --nvidia-ctk-path <path> around as an option, but what do we do with it?

@elezar (Member) left a comment

Thanks @deitch. It's looking much better already.

There are some minor changes required.

cmd/nvidia-cdi-hook/README.md (review threads resolved)
[`nvidia-ctk cdi`](../nvidia-ctk/) command.

When `nvidia-ctk cdi generate` is run, the CDI specification is generated as a yaml file.
The CDI file provides instructions for a container runtime to set up devices, files and
(Member)

Suggested change
The CDI file provides instructions for a container runtime to set up devices, files and
The CDI specification provides instructions for a container runtime to set up devices, files and

cmd/nvidia-cdi-hook/README.md (review thread resolved)
cmd/nvidia-cdi-hook/main.go (review thread resolved)
Name: "nvidia-ctk-path",
Usage: "Specify the path to use for the nvidia-ctk in the generated CDI specification. If this is left empty, the path will be searched.",
Destination: &opts.nvidiaCTKPath,
Name: "nvidia-cdi-hook-path",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does:

diff --git a/cmd/nvidia-ctk/cdi/generate/generate.go b/cmd/nvidia-ctk/cdi/generate/generate.go
index fba22301..9f9e994b 100644
--- a/cmd/nvidia-ctk/cdi/generate/generate.go
+++ b/cmd/nvidia-ctk/cdi/generate/generate.go
@@ -132,13 +132,11 @@ func (m command) build() *cli.Command {
 			Destination: &opts.librarySearchPaths,
 		},
 		&cli.StringFlag{
-			Name:        "nvidia-cdi-hook-path",
-			Usage:       "Specify the path to use for the nvidia-cdi-hook in the generated CDI specification. If this is left empty, the path will be searched.",
-			Destination: &opts.nvidiaCDIHookPath,
-		},
-		&cli.StringFlag{
-			Name:        "nvidia-ctk-path",
-			Usage:       "Specify the path to use for the nvidia-ctk in the generated CDI specification. If this is left empty, the path will be searched. DEPRECATED. Use flag --nvidia-cdi-hook-path instead.",
+			Name:    "nvidia-cdi-hook-path",
+			Aliases: []string{"nvidia-ctk-path"},
+			Usage: "Specify the path to use for the nvidia-cdi-hook in the generated CDI specification. " +
+				"If not specified, the PATH will be searched for `nvidia-cdi-hook`. " +
+				"NOTE: That if this is specified as `nvidia-ctk`, the PATH will be searched for `nvidia-ctk` instead.",
 			Destination: &opts.nvidiaCDIHookPath,
 		},
 		&cli.StringFlag{

make sense?

@@ -49,6 +49,7 @@ type nvcdilib struct {
driverRoot string
devRoot string
nvidiaCTKPath string
nvidiaCDIHookPath string
(Member)

Since we're effectively replacing nvidiaCTKPath with this member, I would recommend applying the following diff:

diff --git a/internal/modifier/cdi.go b/internal/modifier/cdi.go
index c53cb996..c5af4f88 100644
--- a/internal/modifier/cdi.go
+++ b/internal/modifier/cdi.go
@@ -185,7 +185,7 @@ func newAutomaticCDISpecModifier(logger logger.Interface, cfg *config.Config, de
 func generateAutomaticCDISpec(logger logger.Interface, cfg *config.Config, devices []string) (spec.Interface, error) {
 	cdilib, err := nvcdi.New(
 		nvcdi.WithLogger(logger),
-		nvcdi.WithNVIDIACTKPath(cfg.NVIDIACTKConfig.Path),
+		nvcdi.WithNVIDIACDIHookPath(cfg.NVIDIACTKConfig.Path),
 		nvcdi.WithDriverRoot(cfg.NVIDIAContainerCLIConfig.Root),
 		nvcdi.WithVendor("runtime.nvidia.com"),
 		nvcdi.WithClass("gpu"),
diff --git a/internal/modifier/csv.go b/internal/modifier/csv.go
index 54304428..0905d5da 100644
--- a/internal/modifier/csv.go
+++ b/internal/modifier/csv.go
@@ -62,7 +62,7 @@ func NewCSVModifier(logger logger.Interface, cfg *config.Config, image image.CUD
 	cdilib, err := nvcdi.New(
 		nvcdi.WithLogger(logger),
 		nvcdi.WithDriverRoot(cfg.NVIDIAContainerCLIConfig.Root),
-		nvcdi.WithNVIDIACTKPath(cfg.NVIDIACTKConfig.Path),
+		nvcdi.WithNVIDIACDIHookPath(cfg.NVIDIACTKConfig.Path),
 		nvcdi.WithMode(nvcdi.ModeCSV),
 		nvcdi.WithCSVFiles(csvFiles),
 	)
diff --git a/internal/platform-support/tegra/csv_test.go b/internal/platform-support/tegra/csv_test.go
index 2e4aca34..2e8e42fe 100644
--- a/internal/platform-support/tegra/csv_test.go
+++ b/internal/platform-support/tegra/csv_test.go
@@ -187,7 +187,6 @@ func TestDiscovererFromCSVFiles(t *testing.T) {
 
 			o := tegraOptions{
 				logger:              logger,
-				nvidiaCTKPath:       "/usr/bin/nvidia-ctk",
 				nvidiaCDIHookPath:   "/usr/bin/nvidia-cdi-hook",
 				csvFiles:            []string{"dummy"},
 				ignorePatterns:      tc.ignorePatterns,
diff --git a/internal/platform-support/tegra/tegra.go b/internal/platform-support/tegra/tegra.go
index 597d8b16..1031fc72 100644
--- a/internal/platform-support/tegra/tegra.go
+++ b/internal/platform-support/tegra/tegra.go
@@ -30,7 +30,6 @@ type tegraOptions struct {
 	csvFiles           []string
 	driverRoot         string
 	devRoot            string
-	nvidiaCTKPath      string
 	nvidiaCDIHookPath  string
 	ldconfigPath       string
 	librarySearchPaths []string
@@ -134,13 +133,6 @@ func WithCSVFiles(csvFiles []string) Option {
 	}
 }
 
-// WithNVIDIACTKPath sets the path to the nvidia-container-toolkit binary.
-func WithNVIDIACTKPath(nvidiaCTKPath string) Option {
-	return func(o *tegraOptions) {
-		o.nvidiaCTKPath = nvidiaCTKPath
-	}
-}
-
 // WithNVIDIACDIHookPath sets the path to the nvidia-cdi-hook binary.
 func WithNVIDIACDIHookPath(nvidiaCDIHookPath string) Option {
 	return func(o *tegraOptions) {
diff --git a/pkg/nvcdi/lib-csv.go b/pkg/nvcdi/lib-csv.go
index f6f652ba..649b801a 100644
--- a/pkg/nvcdi/lib-csv.go
+++ b/pkg/nvcdi/lib-csv.go
@@ -44,7 +44,6 @@ func (l *csvlib) GetAllDeviceSpecs() ([]specs.Device, error) {
 		tegra.WithLogger(l.logger),
 		tegra.WithDriverRoot(l.driverRoot),
 		tegra.WithDevRoot(l.devRoot),
-		tegra.WithNVIDIACTKPath(l.nvidiaCTKPath),
 		tegra.WithNVIDIACDIHookPath(l.nvidiaCDIHookPath),
 		tegra.WithLdconfigPath(l.ldconfigPath),
 		tegra.WithCSVFiles(l.csvFiles),
diff --git a/pkg/nvcdi/lib.go b/pkg/nvcdi/lib.go
index 31e08e82..c56c18e1 100644
--- a/pkg/nvcdi/lib.go
+++ b/pkg/nvcdi/lib.go
@@ -48,7 +48,6 @@ type nvcdilib struct {
 	deviceNamers       DeviceNamers
 	driverRoot         string
 	devRoot            string
-	nvidiaCTKPath      string
 	nvidiaCDIHookPath  string
 	ldconfigPath       string
 	configSearchPaths  []string
@@ -88,9 +87,6 @@ func New(opts ...Option) (Interface, error) {
 	if l.devRoot == "" {
 		l.devRoot = l.driverRoot
 	}
-	if l.nvidiaCTKPath == "" {
-		l.nvidiaCTKPath = "/usr/bin/nvidia-ctk"
-	}
 	if l.nvidiaCDIHookPath == "" {
 		l.nvidiaCDIHookPath = "/usr/bin/nvidia-cdi-hook"
 	}
diff --git a/pkg/nvcdi/options.go b/pkg/nvcdi/options.go
index a90450f6..5a490619 100644
--- a/pkg/nvcdi/options.go
+++ b/pkg/nvcdi/options.go
@@ -63,10 +63,10 @@ func WithLogger(logger logger.Interface) Option {
 }
 
 // WithNVIDIACTKPath sets the path to the NVIDIA Container Toolkit CLI path for the library
+//
+// Deprecated: Use WithNVIDIACDIHookPath instead.
 func WithNVIDIACTKPath(path string) Option {
-	return func(l *nvcdilib) {
-		l.nvidiaCTKPath = path
-	}
+	return WithNVIDIACDIHookPath(path)
 }
 
 // WithNVIDIACDIHookPath sets the path to the NVIDIA Container Toolkit CLI path for the library

This switches to using nvidiaCDIHookPath everywhere -- except where we read the nvidia-ctk.path config option and pass that to the constructor. We also mark the `nvcdi.WithNVIDIACTKPath` option as deprecated (although we could also not do so and just set the member).

The internal usage in the tegra package has also been changed to refer to nvidiaCDIHookPath directly. I don't think we need to deprecate that option since we don't consume it in any of our other components.

@deitch (Contributor Author)

make sense?

Sure, a clean alias, and an explanation. Added.

@deitch (Contributor Author)

Done.

packaging/rpm/SPECS/nvidia-container-toolkit.spec (review thread resolved)
@deitch force-pushed the separate-hook-binary branch 2 times, most recently from 9fc3845 to cdc878b on May 7, 2024 at 15:43
@deitch (Contributor Author) commented May 7, 2024

All comments and suggestions added; thanks for them @elezar . Back to you.

@deitch (Contributor Author) commented May 14, 2024

Is anything else needed for this @elezar ?

@deitch (Contributor Author) commented May 14, 2024

It was a Go unit test failure. Fixed.

@elezar (Member) left a comment

In addition to the comment below, we also need:

diff --git a/internal/discover/ldconfig_test.go b/internal/discover/ldconfig_test.go
index 67149845..0b214c77 100644
--- a/internal/discover/ldconfig_test.go
+++ b/internal/discover/ldconfig_test.go
@@ -25,8 +25,8 @@ import (
 )
 
 const (
-	testNvidiaCDIPath = "/foo/bar/nvidia-cdi-path"
-	testLdconfigPath  = "/bar/baz/ldconfig"
+	testNvidiaCDIHookPath = "/foo/bar/nvidia-cdi-hook"
+	testLdconfigPath      = "/bar/baz/ldconfig"
 )
 
 func TestLDCacheUpdateHook(t *testing.T) {
@@ -92,12 +92,12 @@ func TestLDCacheUpdateHook(t *testing.T) {
 				},
 			}
 			expectedHook := Hook{
-				Path:      testNvidiaCDIPath,
+				Path:      testNvidiaCDIHookPath,
 				Args:      tc.expectedArgs,
 				Lifecycle: "createContainer",
 			}
 
-			d, err := NewLDCacheUpdateHook(logger, mountMock, testNvidiaCDIPath, tc.ldconfigPath)
+			d, err := NewLDCacheUpdateHook(logger, mountMock, testNvidiaCDIHookPath, tc.ldconfigPath)
 			require.NoError(t, err)
 
 			hooks, err := d.Hooks()

@@ -186,6 +187,22 @@ func ResolveNVIDIACTKPath(logger logger.Interface, nvidiaCTKPath string) string
)
}

// ResolveNVIDIACDIHookPath resolves the path to the nvidia-cdi-hook binary.
(Member)

Looking at this and ResolveNVIDIACTKPath above again, what do you think of rewriting this as:

// ResolveNVIDIACTKPath resolves the path to the nvidia-ctk binary.
// This executable is used in hooks and needs to be an absolute path.
// If the path is specified as an absolute path, it is used directly
// without checking for existence of an executable at that path.
//
// Deprecated: Use ResolveNVIDIACDIHookPath directly instead.
func ResolveNVIDIACTKPath(logger logger.Interface, nvidiaCTKPath string) string {
	return resolveWithDefault(
		logger,
		"NVIDIA Container Toolkit CLI",
		nvidiaCTKPath,
		nvidiaCTKDefaultFilePath,
	)
}

// ResolveNVIDIACDIHookPath resolves the path to the nvidia-cdi-hook binary.
// This executable is used in hooks and needs to be an absolute path.
// If the path is specified as an absolute path, it is used directly
// without checking for existence of an executable at that path.
func ResolveNVIDIACDIHookPath(logger logger.Interface, nvidiaCDIHookPath string) string {
	if filepath.Base(nvidiaCDIHookPath) == "nvidia-ctk" {
		return resolveWithDefault(
			logger,
			"NVIDIA Container Toolkit CLI",
			nvidiaCDIHookPath,
			nvidiaCTKDefaultFilePath,
		)
	}
	return resolveWithDefault(
		logger,
		"NVIDIA CDI Hook CLI",
		nvidiaCDIHookPath,
		nvidiaCDIHookDefaultFilePath,
	)
}

and then adding:

	//nolint:staticcheck  // TODO(elezar): We should switch the nvidia-container-runtime from using nvidia-ctk to using nvidia-cdi-hook.
	cfg.NVIDIACTKConfig.Path = config.ResolveNVIDIACTKPath(r.logger, cfg.NVIDIACTKConfig.Path)

where we call ResolveNVIDIACTKPath?

We can work on properly handling the config changes for the runtime in a follow-up.

@deitch (Contributor Author)

I like it. Change made and pushed.

@deitch (Contributor Author) commented May 14, 2024

In addition to the comment below, we also need:

Yup already working on that one. Since it causes CI to fail, I get to see it. :-)

// This executable is used in hooks and needs to be an absolute path.
// If the path is specified as an absolute path, it is used directly
// without checking for existence of an executable at that path.
func ResolveNVIDIACDIHookPath(logger logger.Interface, nvidiaCDIHookPath string) string {
(Member)

In responding to cncf-tags/container-device-interface#203 (comment) I had the thought that the concept of a CDI Hook executable could be standardized so that the nvidia-cdi-hook is a vendor-specific implementation. Does that mean that we should use cdiHookPath instead of using nvidiaCDIHookPath for our variables and config values?

@deitch (Contributor Author)

I don't think it matters all that much. Until such time as that actually happens, might as well get this in and iterate on it?

(Member)

Sure. It can definitely be done as a follow-up.

internal/config/config.go (review thread resolved)
@elezar (Member) commented May 17, 2024

One final pass @deitch. I have created deitch#1 with a proposal for a changelog addition and also adding a constructor for the hook commands to ensure that these are consistent across the two binaries.

I think with those in we can wrap this up.

@deitch (Contributor Author) commented May 17, 2024

@elezar do you mind changing deitch#1 to be a PR against the branch I used for this PR separate-hook-binary rather than against deitch:main? That way I can see just the changed files there and merge it into my branch?

@elezar (Member) commented May 17, 2024

@deitch updated.

@deitch (Contributor Author) commented May 17, 2024

Merged them in. Should probably kick off CI again here.

The commit message for the merged commit:

This change creates an nvidia-cdi-hook binary for implementing
CDI hooks. This allows for these hooks to be separated from the
nvidia-ctk command which may, for example, require libnvidia-ml
to support other functionality.

The nvidia-ctk hook subcommand is maintained as an alias for the
time being to allow for existing CDI specifications referring to
this path to work as expected.

Signed-off-by: Avi Deitcher <avi@deitcher.net>
@elezar (Member) left a comment

Thanks for the work on this @deitch.

I think this is cleaner than having the hooks as a subcommand of the nvidia-ctk -- especially given the discussion around cncf-tags/container-device-interface#203

@elezar merged commit 7b988f1 into NVIDIA:main on May 21, 2024. 8 checks passed.
@deitch deleted the separate-hook-binary branch on May 21, 2024 at 10:57.
@deitch (Contributor Author) commented May 21, 2024

Thanks for the work on this @deitch

My pleasure. I am glad it worked out.

Successfully merging this pull request may close these issues:

Consider separating nvidia-ctk hook from other functions? (#435)