
chore: trying atomics and tree reduction for CUDA reducer kernels #3123

Draft · wants to merge 11 commits into main

Conversation

@ManasviGoyal (Collaborator) commented May 16, 2024

Kernels tested for different block sizes (a sketch of the general approach follows the list):

  • awkward_reduce_argmax
  • awkward_reduce_argmax_bool_64
  • awkward_reduce_argmin
  • awkward_reduce_argmin_bool_64
  • awkward_reduce_count_64
  • awkward_reduce_countnonzero
  • awkward_reduce_max
  • awkward_reduce_min
  • awkward_reduce_prod_bool
  • awkward_reduce_sum
  • awkward_reduce_sum_bool
  • awkward_reduce_sum_int32_bool_64
  • awkward_reduce_sum_int64_bool_64
  • awkward_reduce_argmax_complex
  • awkward_reduce_argmin_complex
  • awkward_reduce_countnonzero_complex
  • awkward_reduce_max_complex
  • awkward_reduce_min_complex
  • awkward_reduce_prod
  • awkward_reduce_prod_bool_complex
  • awkward_reduce_prod_complex
  • awkward_reduce_sum_bool_complex
  • awkward_reduce_sum_complex
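
As a point of reference for the two strategies named in the title, here is a minimal, hypothetical sketch of a block-level tree reduction whose per-block partial sums are folded into the output with one atomic per block, written as a standalone CuPy RawKernel. The kernel name, launch configuration, and everything below are illustrative assumptions, not this PR's actual kernel code; the real reducer kernels also have to handle awkward's segmented (per-sublist) reductions rather than one global sum.

import cupy as cp

# Hypothetical sketch -- not the PR's kernel code.
source = r'''
extern "C" __global__
void reduce_sum_sketch(const long long* in, long long* out, long long n) {
    extern __shared__ long long shared[];
    unsigned tid = threadIdx.x;
    long long i = (long long)blockIdx.x * blockDim.x + tid;
    shared[tid] = (i < n) ? in[i] : 0;   // 0 is the identity for sum
    __syncthreads();
    // tree reduction: halve the number of active threads each step
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) shared[tid] += shared[tid + s];
        __syncthreads();
    }
    // one atomic per block folds the partial sums into the result
    if (tid == 0)
        atomicAdd((unsigned long long*)out, (unsigned long long)shared[0]);
}
'''
kernel = cp.RawKernel(source, "reduce_sum_sketch")

data = cp.arange(1_000_000, dtype=cp.int64)
out = cp.zeros(1, dtype=cp.int64)
block = 256                                  # must be a power of two here
grid = (data.size + block - 1) // block
kernel((grid,), (block,), (data, out, cp.int64(data.size)),
       shared_mem=block * data.itemsize)
assert int(out[0]) == int(data.sum())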

@ManasviGoyal marked this pull request as draft May 16, 2024 15:51
@ManasviGoyal added the gpu label (Concerns the GPU implementation, backend="cuda") May 16, 2024
@lgray (Contributor) commented Jun 3, 2024

@ManasviGoyal I'm working on implementing https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb as much as possible in CUDA kernels to start benchmarking realistic throughput.

We've already put together CuPy-based histograms that conform to HEP expectations, so we can nominally do full analysis workflows on the GPU.

@nsmith- will be working on a first try at uproot-on-GPU using DMA over PCI Express.

I'll be working on a mock-up using Parquet and cuDF so we can understand the full workload's performance.

The first thing we're missing is the ability to slice arrays, which I understand from talking to Jim is intertwined with the reducer implementation. I'm happy to help test things in realistic use cases when you have implementations ready. Keep us in the loop and we'll be responsive!

@ManasviGoyal (Collaborator, Author) replied:

> @ManasviGoyal I'm working on implementing https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb as much as possible in CUDA kernels to start benchmarking realistic throughput. […] I'm happy to help test things in realistic use cases when you have implementations ready.

Sure, I'll keep you updated. I am still figuring out how to handle some cases for reducers. Are there any specific kernels that you need first for slicing? I can prioritize them. The best way to test would be to write the test with arrays in the CUDA backend and see what error message you get: it gives the name of the missing kernel that the function needs.

@lgray (Contributor) commented Jun 3, 2024

I only have access to virtualized GPUs (MIG-partitioned A100s at Fermilab), and for some reason, instead of giving me an error, it hangs forever! So that's a bit of a showstopper on my side.

As highest priority we need boolean slicing, then index-based slicing as the next highest priority. After that we'll need argmin and argmax on the reducer side! (A minimal illustration of the two slicing modes follows.)
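
For concreteness, the two slicing modes being requested look like this in awkward (a minimal illustration on the CPU backend, which already supports both; the CUDA backend aims for the same semantics):

>>> import awkward as ak
>>> a = ak.Array([[1, 2, 3], [], [4, 5]])
>>> a[a > 2]
<Array [[3], [], [4, 5]] type='3 * var * int64'>
>>> a[[2, 0]]
<Array [[4, 5], [1, 2, 3]] type='2 * var * int64'>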

@lgray (Contributor) commented Jun 3, 2024

If you have a FNAL computing account I can help you reproduce the failure mode I am seeing.

@ManasviGoyal (Collaborator, Author) commented Jun 3, 2024

> If you have a FNAL computing account I can help you reproduce the failure mode I am seeing.

I don't have a FNAL computing account, but in the current state it should give you a "kernel not implemented" error. If you get any other error, it is probably failing for a different reason. Maybe you can open an issue explaining the steps to reproduce and the error, and I can check it on my GPU.

@lgray (Contributor) commented Jun 3, 2024

The major problem blocking a simple, common reproducer is that it involves setting up Kubernetes and mounting a MIG-partitioned virtualized GPU into a container in order to trigger the faulty behavior. Some of these configuration options are not possible with a consumer GPU (particularly MIG partitioning), and I have no idea which component is causing the problem.

Do you have access to a cluster with such a setup through other means?

@jpivarski (Member) commented:

I thought the error we were talking about was just slicing ragged arrays:

>>> import awkward as ak
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]], backend="cuda")
>>> array > 3
<Array [[False, False, True], [], [True, True]] type='3 * var * bool'>
>>> array[array > 3]

although this does give the expected "kernel not found" error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward/src/awkward/highlevel.py", line 1065, in __getitem__
    prepare_layout(self._layout[where]),
                   ~~~~~~~~~~~~^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 512, in __getitem__
    return self._getitem(where)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 565, in _getitem
    return self._getitem(where.layout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 640, in _getitem
    return self._getitem((where,))
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/content.py", line 557, in _getitem
    out = next._getitem_next(nextwhere[0], nextwhere[1:], None)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/regulararray.py", line 698, in _getitem_next
    down = self._content._getitem_next_jagged(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/listoffsetarray.py", line 423, in _getitem_next_jagged
    return out._getitem_next_jagged(slicestarts, slicestops, slicecontent, tail)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/awkward/src/awkward/contents/listarray.py", line 546, in _getitem_next_jagged
    self._backend[
  File "/home/jpivarski/irishep/awkward/src/awkward/_backends/cupy.py", line 43, in __getitem__
    raise AssertionError(f"CuPyKernel not found: {index!r}")
AssertionError: CuPyKernel not found: ('awkward_ListArray_getitem_jagged_apply', <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>, <class 'numpy.int64'>)

I can ssh into pivarski@cmslpc-el8.fnal.gov and pivarski@cmslpc-el9.fnal.gov. Is this something that I could run and at least find the relevant parts for @ManasviGoyal to investigate?

@ManasviGoyal (Collaborator, Author) commented Jun 4, 2024

@lgray I have started working on the slicing kernels along with the reducers so that you can start testing. I tested the example @jpivarski gave in #3140 and it works. You can try it and see if this simple example works on your GPU. There is one more slicing kernel left now (I still need to test the ones I have added more extensively). I will add the rest of the kernels you mentioned as soon as possible.

>>> import awkward as ak
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]], backend="cuda")
>>> array > 3
<Array [[False, False, True], [], [True, True]] type='3 * var * bool'>
>>> array[array > 3]
<Array [[3.3], [], [4.4, 5.5]] type='3 * var * float64'>

@lgray (Contributor) commented Jun 4, 2024

awesome!

@lgray (Contributor) commented Jun 4, 2024

Ah, also: combinations/argcombinations are relatively high priority as well.
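
For reference, a minimal illustration of the operations being requested (the values are made up and the reprs are approximate):

>>> import awkward as ak
>>> a = ak.Array([[1, 2, 3], [4]])
>>> ak.combinations(a, 2)
<Array [[(1, 2), (1, 3), (2, 3)], []] type='2 * var * (int64, int64)'>
>>> ak.argcombinations(a, 2)
<Array [[(0, 1), (0, 2), (1, 2)], []] type='2 * var * (int64, int64)'>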

@lgray (Contributor) commented Jun 4, 2024

@jpivarski the error I was talking about should be that error. Instead, the process hangs indefinitely with no error emitted.

@lgray (Contributor) commented Jun 6, 2024

@ManasviGoyal to make the prioritization a little clearer, you can use this set of analysis functionality benchmarks:
https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb

These are what I am currently using to see what's possible on GPU.

Since this PR isn't merged yet, and we found some other issues today, I've only just finished Query 3. Query 4 requires this PR, since it contains a reduction that you've already implemented.

You can more or less look for the various awkward operations in these functionality tests and prioritize what is needed by that ordering!
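
To illustrate why Query 4 exercises this PR: a per-event reduction in awkward looks roughly like the following, where the field name and values are made up for illustration (not the benchmark's actual code) and the repr is approximate.

>>> import awkward as ak
>>> events = ak.Array({"Jet_pt": [[40.0, 25.0], [], [60.0]]}, backend="cuda")
>>> ak.sum(events.Jet_pt, axis=1)   # reduces within each event on the GPU
<Array [65, 0, 60] type='3 * float64'>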

@ManasviGoyal (Collaborator, Author) replied:

> @ManasviGoyal to make the prioritization a little clearer, you can use this set of analysis functionality benchmarks: https://github.com/CoffeaTeam/coffea-benchmarks/blob/master/coffea-adl-benchmarks.ipynb […] You can more or less look for the various awkward operations in these functionality tests and prioritize what is needed by that ordering!

Thanks! This helps a lot in prioritizing the kernels. I'll finish up all the reducers soon and start on combinations.

@lgray mentioned this pull request Jun 6, 2024
@lgray (Contributor) commented Jun 6, 2024

I was also trying to check this PR alone for the sum memory usage I brought up over in #3136, but it seems that isn't actually implemented here yet. Looking in the files, awkward_reduce_sum_int32_bool_64 and awkward_reduce_sum_int64_bool_64 seem to just reimplement awkward_reduce_countnonzero, probably a simple mistake.

But I can't really proceed with checking because of that.
