UDF mode `process_frame_stack(stack)` (#1506)
I have indeed thought about this before, the last time when working on the WDD UDF - there, compression is currently done frame-by-frame (so with two matrix-vector operations). It's clear that this can be improved upon by compressing multiple frames at once (which is then two matrix-matrix operations), but it was less clear to me how this would work for arbitrary tiles (it should be possible to "decompose" the compression such that it works on tiles, but it is of course much less complex to just work on full stacks of frames). /cc @bangunarya

I think we may want to start with a benchmark, to understand the performance better - one reason why the interface is the way it is currently is to kind-of match the I/O size to fit into the L3 cache, which works out for (some) individual frames, or deep-but-slim tiles, but probably no longer for stacks of full frames. Did you benchmark the `fft2` case? It would be interesting to see the gains over the per-frame version.
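As an aside, the frame-by-frame vs. stacked compression mentioned above can be illustrated with plain NumPy broadcasting (which also touches on the broadcasting question raised further down in this thread). This is a minimal sketch with assumed shapes, where `U` and `V` are made-up compression bases, not the real WDD matrices:

```python
import numpy as np

# Illustrative shapes only; U and V stand in for whatever compression
# bases the UDF uses - they are assumptions, not the actual WDD code.
h, w, kh, kw, n = 256, 256, 32, 32, 64
U = np.random.rand(kh, h).astype(np.float32)
V = np.random.rand(w, kw).astype(np.float32)
stack = np.random.rand(n, h, w).astype(np.float32)

# frame-by-frame: two matrix products per frame, in a Python loop
per_frame = np.stack([U @ frame @ V for frame in stack])

# whole stack at once: np.matmul broadcasts over the leading axis,
# turning the same computation into batched matrix-matrix products
batched = U @ stack @ V

assert np.allclose(per_frame, batched, rtol=1e-4)
```

On the CPU this mainly saves Python-loop overhead, while BLAS-backed batched products can additionally use the hardware better; whether that wins or loses against the L3-cache argument above is exactly what a benchmark would show.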
Here is a benchmark with code like this (with `stack_sizes`, `sig_shapes`, `fft` and `xp` presumably defined earlier in the notebook):

```python
for stack_size in stack_sizes:
    for sig_shape in sig_shapes.keys():
        # warm up caches / FFT plans before timing
        warmup_data = xp.random.uniform(size=(stack_size, *sig_shape)).astype(xp.float32)
        fft.fft2(warmup_data)
        fft.fft2(warmup_data[0])
        stack_data = xp.random.uniform(size=(stack_size, *sig_shape)).astype(xp.float32)
        # time the whole stack in one call...
        res_stack = %timeit -o fft.fft2(stack_data)
        # ...vs. one call per frame
        frame_data = [f.copy() for f in stack_data]
        res_frame = %timeit -o [fft.fft2(f) for f in frame_data]
```

For useful frame sizes, it seems like the CPU implementations don't gain much or are even worse when processing a stack of frames, while the CuPy implementation gains a lot (not surprising!). This doesn't account for the potential additional bonus of reading a whole batch of frames in one I/O operation, and less […]
👍
I don't think this benchmark will give you correct results for the CuPy case - operations like `fft2` are executed asynchronously on the GPU […] which, on CPU, has to be weighted against falling off the L3 cache cliff.

I think a full perf benchmark would indeed be interesting, as there are many components at play. A quick way to hack this is to include a synchronization step in the timed code.
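For reference, a sketch of what synchronization-aware timing could look like with CuPy - `cupyx.profiler.benchmark` handles warmup and device synchronization itself; the array shape here is arbitrary:

```python
import time
import cupy as cp
from cupyx.profiler import benchmark

data = cp.random.uniform(size=(32, 512, 512), dtype=cp.float32)

# benchmark() repeats the call, synchronizing the device around each
# repetition, and reports CPU- and GPU-side timings separately
print(benchmark(cp.fft.fft2, (data,), n_repeat=20))

# manual equivalent: kernel launches return immediately, so the device
# has to be synchronized before reading the clock
cp.cuda.Device().synchronize()
t0 = time.perf_counter()
res = cp.fft.fft2(data)
cp.cuda.Device().synchronize()
print("elapsed:", time.perf_counter() - t0)
```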
Thanks for the advice, am new to GPU perf benchmarking! Will adapt the code once I'm back from vacation and integrate it into a full UDF example while I'm at it 🤞
Hi @matbryan52, @sk1p, so yeah, it would be very interesting if we can compress full stacks of frames. I am wondering why we could not directly implement broadcasting in this case, @sk1p?
Here is the same graph, but with the benchmark adapted as suggested (synchronizing the GPU timings).

As before, the results are only interesting on smaller frames, and then only really with […].

I think for this to be implementable there would have to be a clear way for a UDF to have multiple implementations (tile, frame, stack, partition) and to choose which to use at runtime based on the sig_shape and proposed partition size. It's hard to see how this could be done while also maintaining the current system of tileshape negotiation (which is influenced by the dataset and the UDF(s)). Instead, a pragmatic option could be a mode where (for performance and/or algorithmic reasons) a UDF can fully specify the tileshape it wants to receive at runtime, ignoring things like […]
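To make the proposal concrete, here is a purely hypothetical sketch - `process_frame_stack` does not exist in the current LiberTEM UDF API, and the buffer handling is simplified:

```python
import numpy as np
from libertem.udf import UDF

class StackFFTUDF(UDF):
    def get_result_buffers(self):
        return {
            'spectrum_sum': self.buffer(kind='sig', dtype=np.complex64),
        }

    def process_frame_stack(self, stack):
        # HYPOTHETICAL method: `stack` would arrive with shape
        # (n_frames, *sig_shape), so fft2 vectorizes over the last
        # two axes in a single call
        self.results.spectrum_sum[:] += np.fft.fft2(stack).sum(axis=0)

    def merge(self, dest, src):
        # standard LiberTEM merge of per-partition results
        dest.spectrum_sum[:] += src.spectrum_sum
```

The negotiation question would then be how large `n_frames` may get - similar to `process_partition`, but capped to some maximum stack depth.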
I don't really remember the details here - I think the conclusion was that the current per-frame method was already quite fast. Maybe once Ptychography-4-0/ptychography#72 is merged, we can do a follow-up on performance, with both stack-of-frames and full-partition considerations, and maybe have another go at a GPU version, too.
Thanks for the updated graph! I didn't fully digest it yet, but one interesting outcome is the notable difference between […]
This seems to have some intersection with #1353 and related plans.

One thought about multiple UDFs and compatibility/performance: right now, if we combine a […]. An improvement on this would be to read with a by-frame scheme, use that to run the […].
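For context, the current API already runs several UDFs in one pass over the data, sharing a single negotiated tiling scheme - which is where the compatibility question above comes from. A minimal sketch (the dataset path and UDF choices are assumptions):

```python
import libertem.api as lt
from libertem.udf.sum import SumUDF
from libertem.udf.sumsigudf import SumSigUDF

ctx = lt.Context()
# hypothetical dataset path; 'auto' lets LiberTEM guess the format
ds = ctx.load('auto', path='./data/scan.mib')

# both UDFs share one pass over the data and one tiling scheme,
# so their tiling preferences have to be reconciled
res_sum, res_sumsig = ctx.run_udf(dataset=ds, udf=[SumUDF(), SumSigUDF()])
print(res_sum['intensity'].data.shape)
```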
I seem to frequently encounter a situation where I am implementing a UDF which could quite easily be vectorized along a stack of whole frames. For example, applying an FFT or some kind of filter always needs the whole frame at once, so I can't use `process_tile`, but I could easily process a whole stack of frames in one operation. A lot of image processing functions can be applied to the last two axes of a stack (e.g. `numpy.fft.fft2`) for (in some cases) free speedups - particularly on GPU for convolution (think of `(nBatch, h, w, nFeatures)` in deep learning).

I think this could essentially be forced even with the current API by playing with tileshape negotiation, i.e. implement `process_tile(tile)` but require tiles which are whole frames tall/wide (see the sketch below). It's a bit clunky, though. There might be space for a `process_frame_stack` method, given that any `process_tile` function should automatically work as a `process_frame[stack]()`. It would be similar to `process_partition` but with some mechanism to restrict the size of a stack to less than a whole partition.

How this could be concretely provided to the user I am not yet sure. I was wondering if this has ever been discussed or thought about?
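A sketch of that workaround, using the existing `get_tiling_preferences()` hook - note this only expresses a preference, so whether negotiation actually yields whole-frame tiles depends on the dataset, and the depth of 32 is an arbitrary assumption:

```python
import numpy as np
from libertem.udf import UDF

class WholeFramesTileUDF(UDF):
    def get_result_buffers(self):
        return {
            'intensity_sum': self.buffer(kind='sig', dtype=np.float32),
        }

    def get_tiling_preferences(self):
        return {
            # prefer deep tiles with no size cap, nudging negotiation
            # towards stacks of whole frames (not guaranteed!)
            'depth': 32,
            'total_size': UDF.TILE_SIZE_MAX,
        }

    def process_tile(self, tile):
        # only correct if the tile really spans whole frames; fft2 then
        # vectorizes over the (depth, h, w) stack in a single call
        self.results.intensity_sum[:] += np.abs(np.fft.fft2(tile)).sum(axis=0)

    def merge(self, dest, src):
        dest.intensity_sum[:] += src.intensity_sum
```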