
Initial port of HECO auto-SIMD passes for BGV #471

Merged
merged 1 commit into google:main from simd-packing on Mar 11, 2024

Conversation


@j2kun j2kun commented Feb 29, 2024

Partial progress toward #475

@j2kun j2kun marked this pull request as draft February 29, 2024 23:57
@j2kun j2kun force-pushed the simd-packing branch 2 times, most recently from 89fd01b to 6eb3b9e on March 2, 2024 22:28

j2kun commented Mar 3, 2024

Some notes:

With this PR as of b1d3eec, I can use the following pipeline to fully unroll and vectorize the box_blur program from HECO:

bazel run //tools:heir-opt --  --secretize=entry-function=box_blur --wrap-generic --canonicalize --cse --full-loop-unroll --straight-line-vectorize $PWD/tests/simd/box_blur.mlir

However, after the full loop unroll, the program has some 500k lines, and if you try to canonicalize or cse, MLIR will hang (I let it run for ~3h and it didn't finish).

I also tried the following variant which, although it's specially crafted for box_blur, uses the existing MLIR passes (not my custom full-loop-unroll) to do the same thing, ruling out that my loop unroll is buggy:

bazel run //tools:heir-opt --  --secretize=entry-function=box_blur --wrap-generic --canonicalize --cse --affine-loop-unroll=unroll-factor=2 --affine-loop-unroll=unroll-factor=2 --canonicalize --cse --affine-loop-unroll=unroll-factor=64 --canonicalize --cse --affine-loop-unroll=unroll-factor=64 --canonicalize --cse $PWD/tests/simd/box_blur.mlir

If you remove the last three flags and run

bazel run //tools:heir-opt --  --secretize=entry-function=box_blur --wrap-generic --canonicalize --cse --affine-loop-unroll=unroll-factor=2 --affine-loop-unroll=unroll-factor=2 --canonicalize --cse --affine-loop-unroll=unroll-factor=64 --canonicalize --cse $PWD/tests/simd/box_blur.mlir

Then you get a more reasonable program: an affine loop of size 64 with 900-ish statements in the body: https://gist.github.com/j2kun/3d3bf0d41b1f503f42ecd54753b2bd31

The checked-in HECO MLIR files seem to be working with a smaller dimension than 64x64 (cf. https://github.com/MarbleHE/HECO/blob/main/test/filecheck/boxblur/boxblur_2_unrolled.mlir which has only ~1.1k ops compared to the ~100k that a full unroll gives when I run it). Perhaps @AlexanderViand-Intel can comment more on this discrepancy.

Either way, I think it is safe to say that if we want to test a BGV-to-OpenFHE path, it would make sense to stop short of a full unroll. But then we need some heuristic to determine when to stop unrolling.

After the above, applying --straight-line-vectorize gives this IR: https://gist.github.com/j2kun/a9d52c0bf0c5228dcb14f4571ed1f68c

Which, excluding all the tensor ops, reduces to:

module {
  func.func @box_blur(%arg0: !secret.secret<tensor<4096xi16>>) -> !secret.secret<tensor<4096xi16>> {
    %0 = secret.generic ins(%arg0 : !secret.secret<tensor<4096xi16>>) {
    ^bb0(%arg1: tensor<4096xi16>):
      %1 = affine.for %arg2 = 0 to 64 iter_args(%arg3 = %arg1) -> (tensor<4096xi16>) {
        %2 = arith.addi %arg2, %c-1 : index
        %3 = arith.muli %from_elements, %cst_3 : tensor<2xindex>
        %4 = arith.addi %arg2, %c1 : index
        %5 = arith.muli %4, %c64 : index
        %6 = arith.addi %from_elements_5, %cst_2 : tensor<126xindex>
        %7 = arith.remui %from_elements_132, %cst_1 : tensor<128xindex>
        %8 = arith.addi %from_elements_292, %cst_0 : tensor<63xindex>
        %9 = arith.remui %from_elements_356, %cst : tensor<64xindex>
        %10 = arith.addi %extracted, %c63 : index
        %11 = arith.remui %10, %c4096 : index
        %12 = arith.addi %extracted_4, %c63 : index
        %13 = arith.remui %12, %c4096 : index
        %14 = arith.addi %5, %c63 : index
        %15 = arith.remui %14, %c4096 : index
        %16 = arith.addi %extracted, %c64 : index
        %17 = arith.remui %16, %c4096 : index
        %18 = arith.addi %extracted_4, %c64 : index
        %19 = arith.remui %18, %c4096 : index
        %20 = arith.addi %5, %c64 : index
        %21 = arith.remui %20, %c4096 : index
        %22 = arith.addi %from_elements_588, %from_elements_589 : tensor<64xi16>
        %23 = arith.addi %from_elements_660, %from_elements_661 : tensor<64xi16>
        %24 = arith.addi %from_elements_726, %from_elements_654 : tensor<64xi16>
        %25 = arith.addi %from_elements_791, %from_elements_655 : tensor<64xi16>
        %26 = arith.addi %from_elements_856, %from_elements_656 : tensor<64xi16>
        %27 = arith.addi %from_elements_921, %from_elements_657 : tensor<64xi16>
        %28 = arith.addi %from_elements_986, %from_elements_658 : tensor<64xi16>
        %29 = arith.addi %from_elements_1051, %from_elements_659 : tensor<64xi16>
        affine.yield %inserted_1178 : tensor<4096xi16>
      }
      secret.yield %1 : tensor<4096xi16>
    } -> !secret.secret<tensor<4096xi16>>
    return %0 : !secret.secret<tensor<4096xi16>>
  }
}

As you can see, most of the ops naturally batch into size 64, but at the beginning some don't, and we'd need to account for that in the --straight-line-vectorize pass by giving it an allowed vector width (which itself raises the question of how to set the batch size at this stage, long before cryptographic parameters are decided).
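
To make that concrete, here is a rough, self-contained sketch (not the actual pass) of what an "allowed vector width" knob could do: bucket compatible ops, emit only full-width batches, and leave the stragglers scalar. Candidate, key, and batchByWidth are made-up names for illustration only.

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a batching candidate; `key` would encode
// whatever makes two ops batchable together (op name, element type, ...).
struct Candidate {
  std::string key;
  int id;
};

// Bucket candidates by compatibility, then emit only groups of exactly
// `width` ops; leftovers stay scalar instead of forming odd-sized vectors.
std::vector<std::vector<Candidate>> batchByWidth(
    const std::vector<Candidate> &ops, std::size_t width) {
  std::map<std::string, std::vector<Candidate>> buckets;
  for (const Candidate &op : ops) buckets[op.key].push_back(op);

  std::vector<std::vector<Candidate>> batches;
  for (auto &entry : buckets) {
    std::vector<Candidate> &group = entry.second;
    for (std::size_t i = 0; i + width <= group.size(); i += width)
      batches.emplace_back(group.begin() + i, group.begin() + i + width);
  }
  return batches;
}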

@AlexanderViand-Intel

Mh, the unrolling issues are odd! HECO's evaluation contains this very similar 64*64 Roberts cross example, which compiles in a few seconds (definitely not hours). I'll try and look into that before our meeting this week, but I don't see any obvious differences in what the HEIR version and the HECO version are doing that would explain this.

Btw, it's probably worth noting that HECO's heuristics conceptually don't require the unrolling, that's just a shortcut to simplify the implementation. The relative offset/target slot logic should all work as affine loop analyses, too.


j2kun commented Mar 4, 2024

Update:

I think we can do this entirely with DRR-generated patterns, but I found a bug in the way they handle variadic inputs, which is required to work with tensor extract ops. Jacques said he thinks he has a quick fix and will address it soon. https://discourse.llvm.org/t/compilation-failure-with-drr-generated-pattern/77385

In the meantime, the rotation alignment patterns seem to work, and while this PR needs a lot of cleanup, I think we can get something in before Thursday.


j2kun commented Mar 4, 2024

Btw, it's probably worth noting that HECO's heuristics conceptually don't require the unrolling, that's just a shortcut to simplify the implementation. The relative offset/target slot logic should all work as affine loop analyses, too.

I would love to try to build a more general affine loop analysis instead of unrolling, I'm just not quite sure how to do that.

@AlexanderViand-Intel

However, after the full loop unroll, the program has some 500k lines, and if you try to canonicalize or cse, MLIR will hang (I let it run for ~3h and it didn't finish).

Do you still experience the time-out issue with unrolling the 64*64 boxblur? I just reinstalled HECO and ran boxblur_1_ssa.mlir through the full pipeline (heco --full-pass boxblur_1_ssa.mlir) and it took less than a second if LLVM is built in Release mode (admittedly on a server with many cores, but this should take no more than 30s even on a laptop) and 27s if LLVM is built in Debug mode (again, on a powerful server, but that should be at most a few minutes, not hours).
I tried to have a look at what's going on in HEIR here, but the PR currently fails to build for me (and also in CI, so I guess it's not my machine).

The checked-in HECO MLIR files seem to be working with a smaller dimension than 64x64 (cf. https://github.com/MarbleHE/HECO/blob/main/test/filecheck/boxblur/boxblur_2_unrolled.mlir which has only ~1.1k ops compared to the ~100k that a full unroll gives when I run it). Perhaps @AlexanderViand-Intel can comment more on this discrepancy.

The boxblur example is indeed a 64x64 program: running just --unroll-loops gives you a file with just over 300,000 lines, but --canonicalize --cse then simplifies it back down to the ~1k lines in boxblur_2_unrolled.mlir (which really highlights how wasteful a "hack" the loop unrolling is).

Btw, it's probably worth noting that HECO's heuristics conceptually don't require the unrolling, that's just a shortcut to simplify the implementation. The relative offset/target slot logic should all work as affine loop analyses, too.

I would love to try to build a more general affine loop analysis instead of unrolling, I'm just not quite sure how to do that.

It's definitely not something I can do off the top of my head, either, but it should be pretty easy to translate the ideas from working on concrete indices to considering everything relative to the iteration variable. Unfortunately, unless the existing unroll approach & heuristic start becoming a performance bottleneck, it's hard to justify re-inventing the wheel here, especially given that we now also have Viaduct-HE and its more general approach to finding batching opportunities.


j2kun commented Mar 5, 2024

Update: The DRR patterns now work, and I've narrowed down the slow canonicalization on the huge IR to our secret.generic canonicalizers. It looks like HoistPlaintextOps is slow: we're doing a linear scan to find an op that can be hoisted, then only hoisting one. So we should hoist them all in a loop instead.
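
A minimal sketch of that fix, assuming a hypothetical canHoist predicate standing in for whatever check HoistPlaintextOps already performs (this is the shape of the change, not the actual HEIR canonicalizer):

#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/Operation.h"

// Collect every hoistable op in one walk of the secret.generic body, then
// move them all out, instead of rescanning the body after each single hoist.
void hoistAllPlaintextOps(mlir::Operation *genericOp,
                          llvm::function_ref<bool(mlir::Operation *)> canHoist) {
  llvm::SmallVector<mlir::Operation *> toHoist;
  genericOp->walk([&](mlir::Operation *op) {
    if (op != genericOp && canHoist(op)) toHoist.push_back(op);
  });
  // For a straight-line body, block order already places producers before
  // consumers, so moving the ops in collection order preserves dominance.
  for (mlir::Operation *op : toHoist) op->moveBefore(genericOp);
}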

@j2kun j2kun force-pushed the simd-packing branch 2 times, most recently from d702475 to 45ca376 on March 7, 2024 03:58
@AlexanderViand-Intel

Following up from the working group meeting earlier today, here's all the notes on "ordering" of the passes I found in the HECO paper and the code:

Collecting potentially "rotate-and-reduce"-able BinOps into variadic ops must happen before the main batching pass (this one's pretty obvious, not sure why this one made it into the paper but none of the other ones)

Then, there are a few comments in the pipeline; again, one related to rotate-and-reduce notes that we can only detect the operands of the variadic ops as being extracted from the same tensor after we CSE away the redundant versions of those tensors.

Regarding the main batching pass, there's this comment in the pipeline builder that says we must canonicalize just beforehand because it only handles constant indices (but that might be an implementation detail that doesn't apply to your DRR version).

More interestingly, there's this comment in the batching pass code itself that says "We must translate in order of appearance for this to work, so we walk manually". If this is correct, then the pattern rewriter approach might not work for this step (unless you can force the rewriter to do it top-down?). However, I'm not convinced that this comment is actually true...
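
For reference, the greedy driver does have a knob for this; a sketch, assuming the GreedyRewriteConfig API as of roughly this PR's vintage (worth double-checking against the in-tree headers):

#include "mlir/IR/Operation.h"
#include "mlir/Rewrite/FrozenRewritePatternSet.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

// Seed the worklist top-down so ops are visited roughly in order of
// appearance; whether that is enough to satisfy HECO's "translate in order
// of appearance" requirement is exactly the open question above.
mlir::LogicalResult applyTopDown(mlir::Operation *root,
                                 mlir::FrozenRewritePatternSet patterns) {
  mlir::GreedyRewriteConfig config;
  config.useTopDownTraversal = true;
  return mlir::applyPatternsAndFoldGreedily(root, std::move(patterns), config);
}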

@j2kun j2kun force-pushed the simd-packing branch 2 times, most recently from dabcbfd to 3a760c0 on March 8, 2024 15:26

j2kun commented Mar 8, 2024

At this point I have the box_blur (4x4 and 64x64) and hamming distance examples from HECO producing equivalent (but not identical) programs in HEIR. While it's not perfect, I think this is a good time to checkpoint the code for review.

For example, box-blur 64x64

bazel-bin/tools/heir-opt '--secretize=entry-function=box_blur' --wrap-generic --canonicalize --cse --full-loop-unroll \
 --insert-rotate \
 --cse --canonicalize \
 --collapse-insertion-chains --canonicalize \
 $PWD/tests/simd/box_blur_64x64.mlir

module {
  func.func @box_blur(%arg0: !secret.secret<tensor<4096xi16>>) -> !secret.secret<tensor<4096xi16>> {
    %c65_i32 = arith.constant 65 : i32
    %c127 = arith.constant 127 : index
    %c4032 = arith.constant 4032 : index
    %c3968 = arith.constant 3968 : index
    %c63 = arith.constant 63 : index
    %0 = secret.generic ins(%arg0 : !secret.secret<tensor<4096xi16>>) {
    ^bb0(%arg1: tensor<4096xi16>):
      %1 = tensor_ext.rotate %arg1, %c3968 : tensor<4096xi16>, index
      %2 = tensor_ext.rotate %arg1, %c4032 : tensor<4096xi16>, index
      %3 = arith.addi %1, %2 : tensor<4096xi16>
      %4 = arith.addi %3, %arg1 : tensor<4096xi16>
      %5 = tensor_ext.rotate %4, %c63 : tensor<4096xi16>, index
      %6 = arith.addi %5, %2 : tensor<4096xi16>
      %7 = arith.addi %6, %arg1 : tensor<4096xi16>
      %8 = tensor_ext.rotate %7, %c63 : tensor<4096xi16>, index
      %9 = tensor_ext.rotate %arg1, %c127 : tensor<4096xi16>, index
      %10 = arith.addi %8, %9 : tensor<4096xi16>
      %11 = arith.addi %10, %arg1 : tensor<4096xi16>
      %12 = tensor_ext.rotate %11, %c3968 : tensor<4096xi16>, index
      %13 = arith.addi %12, %2 : tensor<4096xi16>
      %14 = arith.addi %13, %arg1 : tensor<4096xi16>
      %15 = tensor_ext.rotate %14, %c65_i32 : tensor<4096xi16>, i32
      secret.yield %15 : tensor<4096xi16>
    } -> !secret.secret<tensor<4096xi16>>
    return %0 : !secret.secret<tensor<4096xi16>>
  }
}

Notice the above differs from HECO in that not all rotation operations are done before the addition operations. I don't know if this is good or bad for the ultimate program runtime, but my basis for calling them "equivalent" is merely counting ops.

The pass breaks down into three components:

InsertRotate

This is the bulk of the work, applying the patterns in TensorExtRotationAlignment.td to insert rotations and align slots according to a subset of the same rules as HECO.

What is NOT included is aligning slots to non-arithmetic-op downstream users (like extract, see CollapseInsertionChains).

CSE/canonicalize

InsertRotate alone does not put the IR into a form that is suitable for CSE/canonicalize to simplify, so I added patterns in TensorExtCanonicalization.td to account for this:

  • FactorParallelRotationsThroughOp: push parallel rotations through arithmetic ops, which removes one rotation and better aligns duplicate arith ops and duplicate rotation ops for cse
  • DropZeroRotation: eliminate a rotation by shift of zero
  • NormalizeRotationIndex: ensure all rotation shifts are positive and within tensor dim bounds
  • SumOfConstantRotations: combine sequences of rotations by constants (see the sketch below)
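
The last three patterns all reduce to modular arithmetic on the rotation amount; here is a standalone sketch of that arithmetic (plain C++, not the DRR patterns themselves):

#include <cassert>
#include <cstdint>

// Rotations by constants on an n-slot tensor compose additively modulo n.
int64_t normalizeRotation(int64_t shift, int64_t n) {
  int64_t r = shift % n;
  return r < 0 ? r + n : r;
}

int64_t composeRotations(int64_t first, int64_t second, int64_t n) {
  return normalizeRotation(first + second, n);
}

int main() {
  assert(normalizeRotation(0, 4096) == 0);       // a zero rotation is a no-op
  assert(normalizeRotation(-64, 4096) == 4032);  // negative shifts wrap around
  assert(composeRotations(63, 63, 4096) == 126); // two rotations fold into one
  return 0;
}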

CollapseInsertionChains

The above is enough to simplify the entire IR; what is left may be a final "rotation" implemented as a sequence of extract and insert ops into the final output tensor.

I could have added a pattern to InsertRotate that accounted for this by aligning rotations to downstream extract ops, but I felt I could handle this case more generally by matching any well-structured chain of insertions and replacing it with a rotate.
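
With the MLIR plumbing stripped out, the check such a chain-matching pass boils down to is roughly the following (a hypothetical helper for illustration, not the code in this PR): if every destination slot i receives src[(i + k) mod n] for one fixed k, the whole insertion chain is a rotation by k.

#include <cstdint>
#include <optional>
#include <vector>

// srcIndexOfSlot[i] is the source index extracted for destination slot i.
std::optional<int64_t> asSingleRotation(
    const std::vector<int64_t> &srcIndexOfSlot, int64_t n) {
  if (srcIndexOfSlot.empty()) return std::nullopt;
  int64_t k = ((srcIndexOfSlot[0] % n) + n) % n;  // candidate shift from slot 0
  for (int64_t i = 0; i < static_cast<int64_t>(srcIndexOfSlot.size()); ++i)
    if (srcIndexOfSlot[i] != (i + k) % n) return std::nullopt;
  return k;
}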

Performance

As for performance, the 64x64 example finishes in ~1 minute. Here's the timing breakdown. I think there are likely some performance improvements I could make to the insertion chain pass, since it should "just" be doing a linear scan of the IR. But since it's fast enough for now, I will leave that as a followup task. It may become irrelevant if we replace this with an affine analysis that avoids unrolling the loop.

===-------------------------------------------------------------------------===
                         ... Execution time report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 56.1786 seconds

  ----Wall Time----  ----Name----
    0.0042 (  0.0%)  Parser
    0.0004 (  0.0%)  Secretize
    0.0013 (  0.0%)  WrapGeneric
    0.0009 (  0.0%)  Canonicalizer
    0.0002 (  0.0%)  CSE
    0.0000 (  0.0%)    (A) DominanceInfo
    3.2923 (  5.9%)  FullLoopUnroll
   25.2879 ( 45.0%)  InsertRotate
    0.6241 (  1.1%)  CSE
    0.0000 (  0.0%)    (A) DominanceInfo
    0.2596 (  0.5%)  Canonicalizer
   26.7031 ( 47.5%)  CollapseInsertionChains
    0.0008 (  0.0%)  Canonicalizer
    0.0019 (  0.0%)  Output
    0.0018 (  0.0%)  Rest
   56.1786 (100.0%)  Total

@j2kun j2kun requested a review from asraa March 8, 2024 15:45
@j2kun j2kun marked this pull request as ready for review March 8, 2024 15:47
@j2kun j2kun changed the title WIP: simd-batching for BGV Initial port of HECO auto-SIMD passes for BGV Mar 8, 2024

@AlexanderViand-Intel left a comment

First of all, thanks for porting the HECO batching optimization! 🤩 Expressing this directly over tensor ops, with the rewrites in DRR instead of C++, makes this re-implementation a lot cleaner than the original (as evidenced by the fact that my review comments are mostly nitpicking).

I did notice one minor and one major departure from the HECO solution, though, which means the optimization falls a bit short in terms of final program performance: the first one (lack of a combine op, see comment below) is only really relevant for programs that don't batch perfectly, but the second one (issues with "target slot" selection, I think) also affects things like boxblur and means that the generated code would be significantly slower.

Notice the above differs from HECO in that not all rotation operations are done before the addition operations. I don't know if this is good or bad for the ultimate program runtime, but my basis of "equivalent" is merely in counting ops.

What do you mean by equivalent when counting ops here? For example, this is what HECO produces for boxblur
(heco --fhe-pass evaluation/comparison/heco_input/boxblur_64x64.mlir):

func.func private @encryptedBoxBlur_64x64(%arg0: !fhe.batched_secret<4096 x i16>) -> !fhe.batched_secret<4096 x i16> {
    %0 = fhe.rotate(%arg0) by 64 : <4096 x i16>
    %1 = fhe.rotate(%arg0) by 65 : <4096 x i16>
    %2 = fhe.rotate(%arg0) by 1 : <4096 x i16>
    %3 = fhe.add(%arg0, %0, %1, %2) : (!fhe.batched_secret<4096 x i16>, !fhe.batched_secret<4096 x i16>, !fhe.batched_secret<4096 x i16>, !fhe.batched_secret<4096 x i16>) -> !fhe.batched_secret<4096 x i16>
    return %3 : !fhe.batched_secret<4096 x i16>
}

Note that there are only three rotations: the kernel is 2x2, so we need to bring together four pixels, one of which we don't need to rotate. In comparison, the HEIR version needs seven rotations, which is obviously still a massive improvement over the baseline, but since rotations make up virtually all of the runtime of this program, it's still around 2x slower.
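
For intuition, the three rotation amounts fall straight out of the row-major layout (a standalone sketch, not HECO code, and up to the choice of kernel anchor): with the 64x64 image flattened into 4096 slots, aligning the pixel at kernel offset (di, dj) with the anchor pixel is a cyclic shift by di*64 + dj, and the (0, 0) term needs no rotation.

#include <cstdio>

int main() {
  const int width = 64;
  for (int di = 0; di <= 1; ++di)
    for (int dj = 0; dj <= 1; ++dj)
      std::printf("kernel offset (%d,%d) -> rotate by %d\n", di, dj,
                  di * width + dj);
  return 0;  // prints rotation amounts 0, 1, 64, 65 -- matching the IR above
}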

What is NOT included is aligning slots to non-arithmetic-op downstream users (like extract, see CollapseInsertionChains).

I think this might be the cause of these differences: without the lookahead ("target slot selection" in HECO parlance), you might not get the full "magic". Alternatively, it might be because the additions aren't grouped into a single op? I'll try and stare at the IR produced in the intermediate steps a bit more, since it would be nice to close the gap, but I don't think it's totally necessary, either.

EDIT: After looking at the IR after unrolling (before --insert-rotate) I don't think it's as simple as that, as you'd have to walk the def-use chain all the way until you find a tensor.insert to figure out the target slot. This is less of an issue in HECO, where the n-ary rewrite means that the insert op will usually be the immediate next op. Since this seems like an artifact of the unrolling approach, I'd ignore it for now and hope this goes away in an affine loop analysis-based implementation.

As far as [compile time] performance, the 64x64 example finishes in ~1 minute. [...] It may become irrelevant if we replace this with an affine analysis that avoids unrolling the loop.

It's hard to do an apples-to-apples comparison of the compile time between HECO and HEIR, because HEIR obviously has the whole secret.generic handling in there, too, but both take <1 min on my (server) machine which seems good enough for now, especially given that a non-unrolled version should be significantly faster anyway.


j2kun commented Mar 11, 2024

What do you mean by equivalent when counting ops here?

I was basing that claim on this file which contains eight rotations (but the last one is only necessary to extract the constant term) as do all subsequent snapshotted files in that directory. As I understand it, HECO is also applying the rotate-and-sum pattern to reduce that to log_2(8) = 3 rotations? So perhaps that was just omitted from those checked in files. I have a draft PR that I haven't yet pushed that implements a rotate-and-sum pattern in isolation, but now I see that this is why having rotations interspersed with addition ops is a problem: it will be harder to identify rotate-and-sum opportunities. However, if that's OK with you, I'd still like to check this PR in so I can have a good baseline to improve against.
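
To make that claim concrete, here is a plain-C++ simulation of the rotate-and-sum pattern (vectors standing in for ciphertexts; an illustration, not the draft pass): summing 8 shifted copies takes only log2(8) = 3 rotate-and-add rounds instead of 7 separate rotations.

#include <cstddef>
#include <cstdio>
#include <vector>

std::vector<int> rotate(const std::vector<int> &v, std::size_t k) {
  std::size_t n = v.size();
  std::vector<int> out(n);
  for (std::size_t i = 0; i < n; ++i) out[i] = v[(i + k) % n];
  return out;
}

int main() {
  std::vector<int> v = {1, 2, 3, 4, 5, 6, 7, 8};
  int rotations = 0;
  for (std::size_t step = v.size() / 2; step >= 1; step /= 2) {
    std::vector<int> r = rotate(v, step);  // one ciphertext rotation per round
    ++rotations;
    for (std::size_t i = 0; i < v.size(); ++i) v[i] += r[i];
  }
  // Every slot now holds 1 + ... + 8 = 36.
  std::printf("slot 0 = %d after %d rotations\n", v[0], rotations);
  return 0;
}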

as you'd have to walk the def-use chain all the way until you find a tensor.insert to figure out the target slot.

I think the slot target selection is actually fine, but what is not included is target slot selection when the downstream op is an insert. I could add this (it would just be a slight variation on the patterns already in this PR), but I'd want to find a counterexample program that shows it's necessary first. I will file an issue and put a TODO in the code.

@AlexanderViand-Intel

Thanks for the updates - once the build error is fixed, this should be good to merge from my perspective 👍

I was basing that claim on this file which contains eight rotations (but the last one is only necessary to extract the constant term) as do all subsequent snapshotted files in that directory. As I understand it, HECO is also applying the rotate-and-sum pattern to reduce that to log_2(8) = 3 rotations?

I don't think rotate-and-reduce gives you anything here, this is just an out-of-date file 🙈 I focused on evaluation/ examples for the USENIX artifact evaluation so there's clearly some inconsistent/outdated stuff in test/ (probably shouldn't have switched off CI after losing the university credits, even if it's a bit pricy 💸)

I think the slot target selection is actually fine, but what is not included is target slot selection when the downstream op is an insert.

Yes, in HECO the "target" slot is specifically the index at which a result is later inserted (or -1 if nothing is found). Though, iirc, there's a tradeoff: if you look ahead too far, you might remove rotate-and-reduce opportunities. I expect some (maybe all?) of this to be side effects of the unroll-and-cse approach, though, so it might not be worth investing too much effort to fix.


j2kun commented Mar 11, 2024

TODOs and new issues filed (along with minor cleanup and build fix)

- InsertRotate: insert rotations and apply target slot selection rules
- TensorExtCanonicalization: canonicalization patterns to enable
  cse to remove unnecessary rotations
- CollapseInsertionChains: identify extract/insert chains that can be
  converted to rotations

Additional followup issues identified in
google#471 for improvements.
@j2kun j2kun added the pull_ready label (indicates a PR is ready to pull; the copybara worker will import it for internal testing) Mar 11, 2024
@copybara-service copybara-service bot merged commit 3ffb592 into google:main Mar 11, 2024
6 of 7 checks passed