
Option to precompile cuRand and gpu array functions #311

Open
carsonswope opened this issue Aug 27, 2021 · 2 comments

Hi,

I want to remove the requirement for the MSVC and NVCC compilers to be available in the runtime environment, so that I can distribute a program I'm writing with PyCUDA. I've managed to compile my custom kernels into .fatbin files and import them using module_from_buffer.

However, it looks like some other PyCUDA functions still rely on generating and compiling CUDA kernels at runtime. Specifically, I'm having trouble with the cuRAND integration, as well as the GPUArray.fill(x) function. Presumably many of the other GPUArray helper functions will have the same problem.

Is there a way to package the kernels used by these functions into .fatbin files and to rely on those files rather than on runtime compilation? And/or what code changes would be required in PyCUDA to support this?

Thanks!

inducer (Owner) commented Aug 27, 2021

I suspect the most fruitful approach would be to modify the kernel caching layer to support this. Maybe allow setting a mode where all used kernels can be "recorded". (This would have to happen at context creation time, otherwise some kernels may already be loaded and might get missed.) This recording would then generate the appropriate fat binary files that can be shipped with an application, likely stored as a cache (which would have to be free of collisions) based on the provided source code. IMO, this would allow for minimal interface changes on the application side while avoiding a hard dependency on the compiler at runtime.
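A minimal sketch of such a record-and-replay layer, under the assumption that the cache is keyed off a hash of the provided source and compile options (names and on-disk layout are hypothetical; a real integration would hook into pycuda.compiler's existing caching):

```python
import hashlib
import os

def cached_compile(source, compile_fn, cache_dir, options=()):
    """Return a compiled binary for `source`, recording it under `cache_dir`.

    On a build machine, `compile_fn` is a real compiler (e.g. a thin
    wrapper around pycuda.compiler.compile). On a deployed machine it
    can simply raise, since every kernel the application uses should
    already have been recorded during the "recording" run.
    """
    # Key on source text plus options so different option sets don't collide.
    key = hashlib.sha256(
        source.encode() + b"\0" + "\0".join(options).encode()
    ).hexdigest()
    path = os.path.join(cache_dir, key + ".fatbin")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    binary = compile_fn(source, options)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(binary)
    return binary
```

Shipping the populated `cache_dir` alongside the application would then let the second branch (reading from disk) serve every kernel without a compiler present.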

You could also revive this NVRTC patch set and base your work on that; then you'd only need to save PTX. (Though, to be fair, the generated PTX may still vary by architecture.)

carsonswope (Author) commented Aug 30, 2021

Thanks for the quick response! I will look into setting up a kernel caching layer. Ideally it would also work the same way for custom kernels compiled with SourceModule. The only issue I see with keying the cache off the provided source code is the use of #include statements and other preprocessor directives. I'm including the C++ library GLM, as well as a few of my own utils.hpp files, in my kernels, so if anything in those dependencies changes, the cache wouldn't be properly invalidated. Keying off the full preprocessor output would fix that, but producing that output requires access to the compiler, which is exactly what we're trying to avoid. This seems like an edge case, but I'm wondering if you have any ideas about it.
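One way to approximate invalidation without a preprocessor is to hash the kernel source together with an explicitly supplied list of header dependencies. A hypothetical sketch (the dependency list must be maintained by hand, since without a compiler there is no way to ask the preprocessor for it):

```python
import hashlib

def cache_key(source, include_files):
    # Hash the kernel source together with the contents of its known
    # header dependencies, so that editing e.g. utils.hpp produces a
    # different key and invalidates the cached binary.
    h = hashlib.sha256(source.encode())
    for path in sorted(include_files):
        h.update(path.encode())
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()
```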

[Edit: okay, never mind about the cache-busting logic. The important functionality is less a cache and more the ability to record and store binaries for all kernels compiled by a given application.]

Is this something you'd be interested in accepting as a PR?

(NVRTC looks interesting, but it still requires users to have nvcc installed, which makes it not ideal for me.)
