
Consider separating nvidia-ctk hook from other functions? #435

Closed
deitch opened this issue Apr 2, 2024 · 9 comments · Fixed by #474

Comments

@deitch
Contributor

deitch commented Apr 2, 2024

Would the team consider separating nvidia-ctk hook from other functions?

The rationale is as follows.

The hook behaviours - currently chmod, create-symlinks, and update-ldcache - perform standard filesystem operations. They do not depend on anything GPU-specific, on libnvidia-ml.so, or on CUDA in any way. Generating a CDI spec for a particular device (preparation) and performing the hook functions (runtime) are distinct actions, and can be executed at different times. I might even use a corporate or other non-JetPack OS which is not Ubuntu-based, generate my CDI spec inside a container, save the CDI yaml, and then have a different process (one without access to the correct versions of glibc, libnvidia-ml.so, or other dependencies) run the hooks.

This would let me separate those two functionalities. It would also make building the hook tool much simpler.

One final advantage is understandability. When I run nvidia-ctk cdi generate, I have the option --nvidia-ctk-path <path>, which really means "what path to nvidia-ctk should I place in the CDI yaml?" This can be confusing: am I not executing nvidia-ctk right now? What does it mean to change the path to... myself?

If we separate it out as, e.g., nvidia-cdi-hook, it becomes easier to understand: nvidia-ctk cdi generate --nvidia-cdi-hook-path /path/to/nvidia-cdi-hook.

I would be happy to open a PR for it, but I'm not going to spend the time if the maintainers don't want it.

@deitch
Contributor Author

deitch commented Apr 8, 2024

Any feedback on this?

@elezar
Member

elezar commented Apr 8, 2024

This does sound feasible. We would have to provide a migration path for cases where nvidia-ctk is used directly.

@klueska what are your thoughts?

@deitch
Contributor Author

deitch commented Apr 8, 2024

The binary is not all that big, and the functionality handled in the hooks part is quite small. I was thinking we could duplicate it - leave it in both nvidia-ctk and the new binary for a while.

@deitch
Contributor Author

deitch commented Apr 19, 2024

Hi, coming back to this. I'm still interested, and am willing to get it done. Let me know?

@klueska
Contributor

klueska commented Apr 22, 2024

Just to make sure I understand -- the proposal is to:

  1. Ship an additional binary in nvidia-container-toolkit-base called nvidia-cdi-hook
  2. This new binary essentially replaces nvidia-ctk hook <args> with nvidia-cdi-hook <args>
  3. For now, we would leave the duplicated functionality in nvidia-ctk hook, but may consider removing it later.

The primary reason for this is that (at least today) none of the hooks rely on having an NVIDIA driver installed in order to run. The secondary reason is readability (i.e. pointing at a specific --hook-path rather than back at the full nvidia-ctk path directly).

This seems reasonable to me -- with the second argument actually being the stronger one from my perspective, mostly because I don't want to preclude us from creating a hook at some point in the future that does rely on calling into e.g. NVML if necessary.

@deitch
Contributor Author

deitch commented Apr 22, 2024

Ship an additional binary in nvidia-container-toolkit-base called nvidia-cdi-hook

Yes, although I have no opinion on the name, whatever works is fine with me.

This new binary essentially replaces nvidia-ctk hook <args> with nvidia-cdi-hook <args>

Yes.

For now, we would leave the duplicated functionality in nvidia-ctk hook, but may consider removing it later.

Yes. I would extract it all into a Go package, and then both binaries can import it, which avoids most of the duplication.

Mostly because I don't want to forbid us from creating a hook at some point in the future that does rely on calling into e.g. NVML if necessary.

That is a good point. I think it might be easier and cleaner, if we ever get there, to have nvidia-cdi-hook call nvidia-ctk, but I could be wrong. In any case, that is a problem for the future.

I will see if I can find time in the next week or two to put together a draft PR.

@deitch
Contributor Author

deitch commented May 3, 2024

As I work on that open PR - which is ready in theory, although it has issues that exist in main as well, and we are stuck on that part; help is appreciated - it occurs to me that the three things nvidia-cdi-hook does are, well, basic:

  • create symlinks (in the context of a container)
  • chmod (container root prepended to target paths; see the sketch after this list)
  • update ldcache (run ldconfig in the chroot of the container)

Would it make sense for these things to be "basic" CDI services? The CDI spec includes containerEdits entries like mounts, env, deviceNodes, additionalGIDs, and of course hooks. I think these functions, or at least some of them, would make sense as first-level capabilities of containerEdits. Thoughts?

@elezar
Member

elezar commented May 3, 2024

@deitch extending the CDI specification to handle these operations as first-class concepts may make sense. The reason these are currently run as hooks is that they depend on the container root, which is not available at the point of CDI spec generation. Could you create an issue against the CDI repository? I can then raise the feature request at a future working group meeting.

@deitch
Contributor Author

deitch commented May 3, 2024

Hi Evan, certainly. Just did it, right after you posted your response. See this CDI issue.

That doesn't eliminate the need for the PR, which I would still like to get through. It could be a long time before the spec extension is approved (if it is approved at all), merged, and supported by high-level container runtimes like containerd.

Do you mind commenting on #474, if you have any insight into why main is missing things that v1.15.0 is not?
