Another attempt at supporting non-contiguous arrays #172

SyamGadde · 2018-03-01T18:56:48Z

This supercedes #171 after I polluted my original 'noncontig' branch with some commits. This squashes all the related updates into reasonably partitioned changes.

Issues that may be impacted/fixed by this include:
#15, #66, #121, #145, #151, #154, #162

Here is the original comment from the original commit:

Inspired by:
https://lists.tiker.net/pipermail/pycuda/2016-December/004986.html
https://gitlab.tiker.net/inducer/pycuda/merge_requests/1

I tried a new approach to supporting non-contiguous arrays in PyCUDA (could be ported to PyOpenCL somewhat easily I think). The goals (some elicited by the above discussion and comments in the WIP) were:

element-wise support for arbitrarily-strided arrays (including negative strides)
backwards compatibility with current uses of get_elwise_module and get_elwise_range_module
limited performance impact on contiguous arrays
avoid use of '*', '%', and '/' in calculating indices, at least within the element-wise loop

The only way I could think to support all those goals was to delay compilation (and source generation) to call-time, to take advantage of knowledge of input array strides. Contiguous arrays get the kernels that PyCUDA has always given them, non-contiguous arrays get specialized kernels. The nice thing about doing this is that the actual shape and strides can be sent as '#define's to aid compiler optimization (could even help with the contiguous kernels). The tricky thing about doing this is that some functions in the current implementation require the Module/Function before call-time, to get texrefs etc. So I basically implemented a Proxy class for SourceModule, called DeferredSourceModule which also defers the generation of the values created by get_function(), get_texref(), etc. until call-time.

To make this all work, indexing (for non-contiguous arrays at least) for an array A needs to be A[A_i], rather than A[i]. If it detects matching contiguous arrays as inputs, then X_i is '#define'd to be i, so kernels using the old method will still work (as long as the input arrays are contiguous and match in strides). No regexes needed to transform the user-specified kernel fragments, it's all directed by the user. Also, if you want to support non-contiguous arrays, you need to send the actual GPUArray objects, rather than their gpudata members to the call or prepare_ functions.

All existing tests succeed. More would probably need to be added if it made sense to integrate this into PyCUDA.

Positive side-effects: GPUArray.get() and GPUArray.copy() now work for arbitrarily sliced/strided arrays.

The performance hit for contiguous arrays is around 15% for modest-sized arrays (i.e. the 1000x100 array tested by Keegan in the above discussion) and, looking at profiling output, I think the hit is due to generating the key (in ElementwiseSourceModule.create_key()) for hashing cached kernels. This could probably be fixed. Performance for non-contiguous arrays is infinitely better, given that they weren't supported before, but I've seen a 40% slowdown over the contiguous version for the b1[::2,:-1]**2 test Keegan tried, due to the need to calculate indexes at each iteration of the loop. It tries to do this in a smart way, by pre-calculating the per-thread step for each dimension, and only using division/modulo to calculate the starting indices for each thread before the loop.

Independently of whether these changes are merged in, I will continue to use and develop them to support some local needs, so comments are welcome. I hope this is useful!

…avior of users of get_elwise_module or get_elwise_range_module, to test backwards compatibility and performance. This squashes multiple commits to put all major changes in one place.

…e it too.

…ctual GPUArrays and using ARRAY_i indices (rather than just i). Squashed multiple commits into one to make changes clearer.

bailsman · 2018-03-20T11:38:48Z

Hi SyamGadde: Is it possible to use constant memory with the new Deferred SourceModule?
I tried to add get_global here: #173

I think any memcpy to a constant memory symbol would happen before the module is called. Is that right? Would this make it impossible to have both support for non-contiguous arrays and support for constant memory in ElementwiseKernel at the same time?

I could try to implement a deferred memcpy to support that usecase, but I'm not sure if that's a good idea. The more 'magic' happens behind the scenes, the more unpredictable the performance becomes.

On the other hand, maybe people who want low level control should compile their own source modules anyway, and ElementwiseKernel is for convenience already.

Curious to know your thoughts.

bailsman · 2018-03-20T12:03:17Z

By the way, get_texrefs and get_module are not documented yet, so we could decide to remove those altogether if it makes adding noncontiguous support easier.

SyamGadde · 2018-03-20T14:55:59Z

Thanks for the comments!

To tell you the truth, I don't really understand texrefs and globals yet, I merely understood that there was a need internally for get_texref, and deferring the binding of texrefs was the easiest way to maintain the current "flow" (pre-generated callable objects that took (now) arbitrary array arguments). But I agree about introducing too much magic.

Are there use cases for get_global that I could look at to get a better idea how it might be used? If it truly is constant and wouldn't change for any version of the kernel, perhaps a set_global() call could be added, and the value would be stored and applied after compilation?

bailsman · 2018-03-20T15:58:06Z

I'm a beginner with CUDA. So far as I understand, globals work like this:

First you declare some data __constant__ in your kernel source code
Then (from host code) you call get_global() on the module, which gives you a device memory address.
Then you transfer data to that device address, e.g. with memcpy_htod. Now when the kernel uses the pointer that was declared __constant__ it will use the constant cache.

If I understand correctly, on modern devices the constant cache is not strictly necessary, because the normal caches will already be sufficient for most purposes. You can get a small benefit from the constant cache if your constants are constantly being pushed out of the other caches by lots of memory traffic. (Additionally, the constant cache is only useful if every thread in the warp accesses the same value)

If we get texture objects into PyCUDA, I don't think it would be really necessary to have get_texref and get_global on an ElementwiseKernel, since you could use texture objects for textures and normal global memory instead of constant memory. People who need lower level control could use normal SourceModules.

If removing those two functions makes implementing noncontiguous support simpler, I'd say we should go with that. Does it? Was get_texref the only thing requiring deferred sourcemodule?

SyamGadde · 2018-03-20T17:59:47Z

No -- the main reason for DeferredSourceModule was to defer compilation until the time when you call the module/kernel with actual array arguments, and then choose between a generic kernel (contiguous) or a custom kernel (non-contiguous). Supporting the deferred binding of texrefs added some complexity but is not the majority of deferred.py. I don't know enough about the costs/benefits of Textures in some of the elementwise-kernels to advise on how to remove them. My goal was to disturb Andreas' code as little as possible and to remain backwards-compatible, at least as a first stab.

inducer · 2018-04-16T15:16:33Z

Just wanted to report in to say that this hasn't fallen off my radar, but due to tenure-related crunch time at work, I'll have to push this out to mid-May (by when I have to submit my materials).

…iable name change from "slicer" to "flipper".

…er anyway).

@Profile

- move as much computation as possible out of create_key(), either to __init__() if it can be pre-computed without knowing shape/strides of args, or to create_source() otherwise. - add support for kernel to see the N-D indices via the do_indices keyword argument and the INDEX kernel variable - change all internal kernel variable names to all-caps and precede by underscore, aside from those variables likely to be used by custom kernels, such as indexing variables (i, A_i, etc.), NDIM, INDEX, total_threads (used by curandom) - allow user to explicitly specify the arguments (by index into arguments) that should be traversed elementwise as arrays - allow user to specify order (fortran or C) to traverse elementwise arrays - remove some accidentally committed @Profile decorators

…tructors.

…comes _eval) and documentation. Get rid of evalcont parameter in favor of explicit _set_mod() call.

… 0. This seems to allow some forms of broadcasting, though technically callers should be sending all elementwise arrays with the same shape (and valid strides) anyway.

…args specified by elwise_arg_inds be the same shape, eliminating the need for shape_arg_index.

Syam Gadde added 6 commits March 1, 2018 13:42

Commit only DeferredSourceModule support without changing calling beh…

5b2e0f6

…avior of users of get_elwise_module or get_elwise_range_module, to test backwards compatibility and performance. This squashes multiple commits to put all major changes in one place.

Smarter _new_like_me that handles discontiguous input. Have copy() us…

a1525de

…e it too.

Allow existing kernel calls to use non-contiguous arrays by sending a…

8522e7f

…ctual GPUArrays and using ARRAY_i indices (rather than just i). Squashed multiple commits into one to make changes clearer.

Allow setting scalars.

028c1a6

Fix kernel name.

97c6b8a

Fix _array_like_helper to work with non-contiguous arrays.

aa83f43

SyamGadde mentioned this pull request Mar 1, 2018

Another attempt at supporting non-contiguous arrays #171

Closed

Syam Gadde added 5 commits March 15, 2018 10:42

Minor, but important fixes to index calculations!

d50fa84

Let _memcpy_discontig fall back to generic assignment kernel.

a0b62fa

Fix arange for complex arguments.

5a9d7d2

Prettify/rearrange kernel source.

cf24d72

Fix contiguity-match test.

65562c2

Syam Gadde added 4 commits March 22, 2018 13:49

Make "slow" memcpy work for host arrays too.

83a95aa

Just in case there are any strides == 0.

0300f31

Fix memcpy_discontig_slow for host-side src.

42804f4

Fix some instances of incompatible strides.

2cb1a92

Syam Gadde added 8 commits May 29, 2018 11:03

Compile regexes once per source object.

540516e

Faster _flip_negative_strides (and earlier short-circuit), plus a var…

ca6434e

…iable name change from "slicer" to "flipper".

Be smarter/faster calculating keys for deferred source.

0ead8e2

Fix prototype.

eed20f8

Add tests for non-contiguous arrays.

6f09266

Don't need to transfer src to host.

a4cfd36

Ignore zero-strides in singleton dimensions (which will be culled lat…

ba984f4

…er anyway).

"precalc" performance improvements.

084a15d

Syam Gadde added 6 commits June 25, 2018 17:42

Don't delete a variable that is used later.

6effa4d

Allow DeferredSource to take a string in constructor.

f4bdd52

Fix string interpolation bug.

c437bcf

Small performance improvement.

637939f

Fix comment.

6442bb3

fjarri mentioned this pull request Jul 25, 2018

Creating padded arrays fjarri/reikna#35

Closed

Syam Gadde added 6 commits July 31, 2018 11:09

Interpolation fixes.

88f197c

Make sure casting works for both built-in types and classes with cons…

39d950a

…tructors.

Clean up DeferredVal implementation (_eval becomes _get, _evalbase be…

07022cd

…comes _eval) and documentation. Get rid of evalcont parameter in favor of explicit _set_mod() call.

Fix contig test when only one non-contiguous argument.

0411e5d

For elementwise arrays, always treat singleton dimensions as stride =…

886de8a

… 0. This seems to allow some forms of broadcasting, though technically callers should be sending all elementwise arrays with the same shape (and valid strides) anyway.

Rename array_arg_inds -> elwise_arg_inds, and require that all array …

c2413ef

…args specified by elwise_arg_inds be the same shape, eliminating the need for shape_arg_index.

Base automatically changed from master to main March 8, 2021 05:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Another attempt at supporting non-contiguous arrays #172

Another attempt at supporting non-contiguous arrays #172

SyamGadde commented Mar 1, 2018

bailsman commented Mar 20, 2018 •

edited

bailsman commented Mar 20, 2018

SyamGadde commented Mar 20, 2018

bailsman commented Mar 20, 2018 •

edited

SyamGadde commented Mar 20, 2018

inducer commented Apr 16, 2018

Another attempt at supporting non-contiguous arrays #172

Are you sure you want to change the base?

Another attempt at supporting non-contiguous arrays #172

Conversation

SyamGadde commented Mar 1, 2018

bailsman commented Mar 20, 2018 • edited

bailsman commented Mar 20, 2018

SyamGadde commented Mar 20, 2018

bailsman commented Mar 20, 2018 • edited

SyamGadde commented Mar 20, 2018

inducer commented Apr 16, 2018

bailsman commented Mar 20, 2018 •

edited

bailsman commented Mar 20, 2018 •

edited