
chore: trying atomics and tree reduction for CUDA reducer kernels #3123

Draft · wants to merge 11 commits into main

Conversation

@ManasviGoyal (Collaborator) commented May 16, 2024

Kernels tested for different block sizes (a sketch of the general approach follows the list):

  • awkward_reduce_argmax
  • awkward_reduce_argmax_bool_64
  • awkward_reduce_argmin
  • awkward_reduce_argmin_bool_64
  • awkward_reduce_count_64
  • awkward_reduce_countnonzero
  • awkward_reduce_max
  • awkward_reduce_min
  • awkward_reduce_prod_bool
  • awkward_reduce_sum
  • awkward_reduce_sum_bool
  • awkward_reduce_sum_int32_bool_64
  • awkward_reduce_sum_int64_bool_64
  • awkward_reduce_argmax_complex
  • awkward_reduce_argmin_complex
  • awkward_reduce_countnonzero_complex
  • awkward_reduce_max_complex
  • awkward_reduce_min_complex
  • awkward_reduce_prod
  • awkward_reduce_prod_bool_complex
  • awkward_reduce_prod_complex
  • awkward_reduce_sum_bool_complex
  • awkward_reduce_sum_complex
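
As a point of reference for the two strategies named in the title, here is a minimal, hypothetical sketch of a block-level tree reduction whose per-block partial sums are folded into the output with one atomic per block, written as a standalone CuPy RawKernel. The kernel name, launch configuration, and everything below are illustrative assumptions, not this PR's actual kernel code; the real reducer kernels also have to handle awkward's segmented (per-sublist) reductions rather than one global sum.

import cupy as cp

# Hypothetical sketch -- not the PR's kernel code.
source = r'''
extern "C" __global__
void reduce_sum_sketch(const long long* in, long long* out, long long n) {
    extern __shared__ long long shared[];
    unsigned tid = threadIdx.x;
    long long i = (long long)blockIdx.x * blockDim.x + tid;
    shared[tid] = (i < n) ? in[i] : 0;   // 0 is the identity for sum
    __syncthreads();
    // tree reduction: halve the number of active threads each step
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) shared[tid] += shared[tid + s];
        __syncthreads();
    }
    // one atomic per block folds the partial sums into the result
    if (tid == 0)
        atomicAdd((unsigned long long*)out, (unsigned long long)shared[0]);
}
'''
kernel = cp.RawKernel(source, "reduce_sum_sketch")

data = cp.arange(1_000_000, dtype=cp.int64)
out = cp.zeros(1, dtype=cp.int64)
block = 256                                  # must be a power of two here
grid = (data.size + block - 1) // block
kernel((grid,), (block,), (data, out, cp.int64(data.size)),
       shared_mem=block * data.itemsize)
assert int(out[0]) == int(data.sum())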

@ManasviGoyal marked this pull request as draft May 16, 2024 15:51
@ManasviGoyal added the gpu label (Concerns the GPU implementation, backend="cuda") May 16, 2024
@lgray (Contributor) commented Jun 3, 2024

@ManasviGoyal I'm working on implementing https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb as much as possible in CUDA kernels to start benchmarking realistic throughput.

We've already put together CuPy-based histograms that conform to HEP expectations, so we can nominally do full analysis workflows on the GPU.

@nsmith- will be working on a first try at uproot-on-GPU using DMA over PCI Express.

I'll be working on a mock-up using Parquet and cuDF so we can understand the full workload's performance.

The first thing we're missing is the ability to slice arrays, which I understand from talking to Jim is intertwined with the reducer implementation. I'm happy to help test things in realistic use cases when you have implementations ready. Keep us in the loop and we'll be responsive!

@ManasviGoyal (Collaborator, Author) replied:

> @ManasviGoyal I'm working on implementing https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb as much as possible in CUDA kernels to start benchmarking realistic throughput. […] I'm happy to help test things in realistic use cases when you have implementations ready.

Sure, I'll keep you updated. I am still figuring out how to handle some cases for reducers. Are there any specific kernels that you need first for slicing? I can prioritize them. The best way to test would be to write the test with arrays in the CUDA backend and see what error message you get: it gives the name of the missing kernel that the function needs.

@lgray (Contributor) commented Jun 3, 2024

I only have access to virtualized GPUs (MIG-partitioned A100s at Fermilab), and for some reason, instead of giving me an error, it hangs forever! So that's a bit of a showstopper on my side.

As highest priority we need boolean slicing, then index-based slicing as the next highest priority. After that we'll need argmin and argmax on the reducer side! (A minimal illustration of the two slicing modes follows.)
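
For concreteness, the two slicing modes being requested look like this in awkward (a minimal illustration on the CPU backend, which already supports both; the CUDA backend aims for the same semantics):

>>> import awkward as ak
>>> a = ak.Array([[1, 2, 3], [], [4, 5]])
>>> a[a > 2]
<Array [[3], [], [4, 5]] type='3 * var * int64'>
>>> a[[2, 0]]
<Array [[4, 5], [1, 2, 3]] type='2 * var * int64'>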

@lgray (Contributor) commented Jun 3, 2024

If you have a FNAL computing account I can help you reproduce the failure mode I am seeing.

@ManasviGoyal (Collaborator, Author) commented Jun 3, 2024

> If you have a FNAL computing account I can help you reproduce the failure mode I am seeing.

I don't have a FNAL computing account, but in the current state it should give you a "kernel not implemented" error. If you get any other error, it is probably failing for a different reason. Maybe you can open an issue explaining the steps to reproduce and the error, and I can check it on my GPU.

@lgray (Contributor) commented Jun 3, 2024

The major problem blocking a simple, common reproducer is that it involves setting up Kubernetes and mounting a MIG-partitioned virtualized GPU into a container in order to trigger the faulty behavior. Some of these configuration options are not possible with a consumer GPU (particularly MIG partitioning), and I have no idea which component is causing the problem.

Do you have access to a cluster with such a setup through other means?

@jpivarski (Member) commented:

I thought the error we were talking about was just slicing ragged arrays:

>>> import awkward as ak
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]], backend="cuda")
>>> array > 3
<Array [[False, False, True], [], [True, True]] type='3 * var * bool'>
>>> array[array > 3]

although this does give the expected "kernel not found" error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward/src/awkward/highlevel.py", line 1065, in __getitem__
    prepare_layout(self._layout[where]),
                   ~~~~~~~~~~~~^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 512, in __getitem__
    return self._getitem(where)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 565, in _getitem
    return self._getitem(where.layout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 640, in _getitem
    return self._getitem((where,))
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 557, in _getitem
    out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/regulararray.py", line 698, in _getitem_next
    down = self._content._getitem_next_jagged(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/listoffsetarray.py", line 423, in _getitem_next_jagged
    return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/listarray.py", line 546, in _getitem_next_jagged
    self._backend[
  File "/home/jpivarski/irishep/awkward/src/awkward/_backends/cupy.py", line 43, in __getitem__
    raise AssertionError(f"CuPyKernel not found: {index!r}")
AssertionError: CuPyKernel not found: ('awkward_ListArray_getitem_jagged_apply', <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>)

I can ssh into pivarski@cmslpc-el8.fnal.gov and pivarski@cmslpc-el9.fnal.gov. Is this something that I could run and at least find the relevant parts for @ManasviGoyal to investigate?

@ManasviGoyal (Collaborator, Author) commented Jun 4, 2024

@lgray I have started working on the slicing kernels along with the reducers so that you can start testing. I tested the example @jpivarski gave in #3140 and it works. You can try it and see if this simple example works on your GPU. There is one more slicing kernel left now (I still need to test the ones I have added more extensively). I will add the rest of the kernels you mentioned as soon as possible.

>>> import awkward as ak
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]], backend="cuda")
>>> array > 3
<Array [[False, False, True], [], [True, True]] type='3 * var * bool'>
>>> array[array > 3]
<Array [[3.3], [], [4.4, 5.5]] type='3 * var * float64'>

@lgray (Contributor) commented Jun 4, 2024

awesome!

@lgray (Contributor) commented Jun 4, 2024

Ah, also: combinations/argcombinations are relatively high priority as well.
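
For reference, a minimal illustration of the operations being requested (the values are made up and the reprs are approximate):

>>> import awkward as ak
>>> a = ak.Array([[1, 2, 3], [4]])
>>> ak.combinations(a, 2)
<Array [[(1, 2), (1, 3), (2, 3)], []] type='2 * var * (int64, int64)'>
>>> ak.argcombinations(a, 2)
<Array [[(0, 1), (0, 2), (1, 2)], []] type='2 * var * (int64, int64)'>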

@lgray (Contributor) commented Jun 4, 2024

@jpivarski the error I was talking about should be that error. Instead, the process hangs indefinitely with no error emitted.

@lgray (Contributor) commented Jun 6, 2024

@ManasviGoyal to make the prioritization a little clearer, you can use this set of analysis functionality benchmarks:
https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb

These are what I am currently using to see what's possible on GPU.

Since this PR isn't merged yet, and we found some other issues today, I've only just finished Query 3. Query 4 requires this PR, since it contains a reduction that you've already implemented.

You can more or less look for the various awkward operations in these functionality tests and prioritize what is needed by that ordering!
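
To illustrate why Query 4 exercises this PR: a per-event reduction in awkward looks roughly like the following, where the field name and values are made up for illustration (not the benchmark's actual code) and the repr is approximate.

>>> import awkward as ak
>>> events = ak.Array({"Jet_pt": [[40.0, 25.0], [], [60.0]]}, backend="cuda")
>>> ak.sum(events.Jet_pt, axis=1)   # reduces within each event on the GPU
<Array [65, 0, 60] type='3 * float64'>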

@ManasviGoyal (Collaborator, Author) replied:

> @ManasviGoyal to make the prioritization a little clearer, you can use this set of analysis functionality benchmarks: https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb […] You can more or less look for the various awkward operations in these functionality tests and prioritize what is needed by that ordering!

Thanks! This helps a lot in prioritizing the kernels. I'll finish up all the reducers soon and start on combinations.

@lgray mentioned this pull request Jun 6, 2024
@lgray (Contributor) commented Jun 6, 2024

I was also trying to check this PR alone for the sum memory usage I brought up over in #3136, but it seems that isn't actually implemented here yet. Looking in the files, awkward_reduce_sum_int32_bool_64 and awkward_reduce_sum_int64_bool_64 seem to just reimplement awkward_reduce_countnonzero, probably a simple mistake.

But I can't really proceed with checking because of that.
