Initial port of HECO auto-SIMD passes for BGV #471
(force-pushed from 89fd01b to 6eb3b9e)
Some notes: With this PR as of b1d3eec, I can use the following pipeline to fully unroll and vectorize the box_blur program from HECO:

```
bazel run //tools:heir-opt -- --secretize=entry-function=box_blur --wrap-generic --canonicalize --cse --full-loop-unroll --straight-line-vectorize $PWD/tests/simd/box_blur.mlir
```

However, after the full loop unroll, the program has some 500k lines, and if you try to […]

I also tried the following variant that, although it's specially crafted for […]:

```
bazel run //tools:heir-opt -- --secretize=entry-function=box_blur --wrap-generic --canonicalize --cse --affine-loop-unroll=unroll-factor=2 --affine-loop-unroll=unroll-factor=2 --canonicalize --cse --affine-loop-unroll=unroll-factor=64 --canonicalize --cse --affine-loop-unroll=unroll-factor=64 --canonicalize --cse $PWD/tests/simd/box_blur.mlir
```

If you remove the last three flags and run

```
bazel run //tools:heir-opt -- --secretize=entry-function=box_blur --wrap-generic --canonicalize --cse --affine-loop-unroll=unroll-factor=2 --affine-loop-unroll=unroll-factor=2 --canonicalize --cse --affine-loop-unroll=unroll-factor=64 --canonicalize --cse $PWD/tests/simd/box_blur.mlir
```

then you get a more reasonable program: an affine loop of size 64 with ~900 statements in the body: https://gist.github.com/j2kun/3d3bf0d41b1f503f42ecd54753b2bd31

The checked-in HECO MLIR files seem to be working with a smaller dimension than 64x64 (cf. https://github.com/MarbleHE/HECO/blob/main/test/filecheck/boxblur/boxblur_2_unrolled.mlir, which has only ~1.1k ops compared to the ~100k that a full unroll gives when I run it). Perhaps @AlexanderViand-Intel can comment more on this discrepancy.

Either way, I think it is safe to say that if we want to test a BGV-to-OpenFHE path, it would make sense to stop short of a full unroll. But then we need some heuristic to determine when to stop unrolling.

After the above, applying […] which, excluding all the tensor ops, reduces to:

```mlir
module {
  func.func @box_blur(%arg0: !secret.secret<tensor<4096xi16>>) -> !secret.secret<tensor<4096xi16>> {
    %0 = secret.generic ins(%arg0 : !secret.secret<tensor<4096xi16>>) {
    ^bb0(%arg1: tensor<4096xi16>):
      %1 = affine.for %arg2 = 0 to 64 iter_args(%arg3 = %arg1) -> (tensor<4096xi16>) {
        %2 = arith.addi %arg2, %c-1 : index
        %3 = arith.muli %from_elements, %cst_3 : tensor<2xindex>
        %4 = arith.addi %arg2, %c1 : index
        %5 = arith.muli %4, %c64 : index
        %6 = arith.addi %from_elements_5, %cst_2 : tensor<126xindex>
        %7 = arith.remui %from_elements_132, %cst_1 : tensor<128xindex>
        %8 = arith.addi %from_elements_292, %cst_0 : tensor<63xindex>
        %9 = arith.remui %from_elements_356, %cst : tensor<64xindex>
        %10 = arith.addi %extracted, %c63 : index
        %11 = arith.remui %10, %c4096 : index
        %12 = arith.addi %extracted_4, %c63 : index
        %13 = arith.remui %12, %c4096 : index
        %14 = arith.addi %5, %c63 : index
        %15 = arith.remui %14, %c4096 : index
        %16 = arith.addi %extracted, %c64 : index
        %17 = arith.remui %16, %c4096 : index
        %18 = arith.addi %extracted_4, %c64 : index
        %19 = arith.remui %18, %c4096 : index
        %20 = arith.addi %5, %c64 : index
        %21 = arith.remui %20, %c4096 : index
        %22 = arith.addi %from_elements_588, %from_elements_589 : tensor<64xi16>
        %23 = arith.addi %from_elements_660, %from_elements_661 : tensor<64xi16>
        %24 = arith.addi %from_elements_726, %from_elements_654 : tensor<64xi16>
        %25 = arith.addi %from_elements_791, %from_elements_655 : tensor<64xi16>
        %26 = arith.addi %from_elements_856, %from_elements_656 : tensor<64xi16>
        %27 = arith.addi %from_elements_921, %from_elements_657 : tensor<64xi16>
        %28 = arith.addi %from_elements_986, %from_elements_658 : tensor<64xi16>
        %29 = arith.addi %from_elements_1051, %from_elements_659 : tensor<64xi16>
        affine.yield %inserted_1178 : tensor<4096xi16>
      }
      secret.yield %1 : tensor<4096xi16>
    } -> !secret.secret<tensor<4096xi16>>
    return %0 : !secret.secret<tensor<4096xi16>>
  }
}
```

As you can see, most of the ops naturally batch into size 64, but at the beginning some don't, and we'd need to accommodate that in the […]
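To make concrete what "batching into size 64" buys, here is a schematic plain-Python model (not the pass itself, and the data is made up): 64 independent scalar adds over matching slots collapse into a single elementwise add on a size-64 vector, the way 64 `arith.addi` ops on extracted elements become one `arith.addi` on a `tensor<64xi16>`.

```python
# Model of straight-line vectorization: many identical, independent
# scalar ops over matching slots become one elementwise vector op.
a = list(range(64))
b = list(range(100, 164))

# Unrolled form: one scalar add per slot (64 ops).
scalar_results = [a[i] + b[i] for i in range(64)]

# Batched form: a single size-64 elementwise add (1 op).
def vec_add(x, y):
    return [xi + yi for xi, yi in zip(x, y)]

assert vec_add(a, b) == scalar_results
```

In the FHE setting this matters because the batched form maps to a single ciphertext-ciphertext addition instead of 64 of them.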
Mh, the unrolling issues are odd! HECO's evaluation contains this very similar 64x64 Roberts cross example which compiles in a few seconds (definitely not hours). I'll try to look into that before our meeting this week, but I don't see any obvious differences between what the HEIR version and the HECO version are doing that would explain this. Btw, it's probably worth noting that HECO's heuristics conceptually don't require the unrolling; that's just a shortcut to simplify the implementation. The relative offset/target slot logic should all work as affine loop analyses, too.
Update: I think we can do this entirely with DRR-generated patterns, but I found a bug in the way they handle variadic inputs, which is required to work with tensor extract ops. Jacques said he thinks he has a quick fix and will address it soon. https://discourse.llvm.org/t/compilation-failure-with-drr-generated-pattern/77385 In the meantime, the rotation alignment patterns seem to work, and while this PR needs a lot of cleanup, I think we can get something in before Thursday.
I would love to try to build a more general affine loop analysis instead of unrolling, I'm just not quite sure how to do that.
Do you still experience the time-out issue with unrolling the 64x64 boxblur? I just reinstalled HECO and ran […]
The boxblur example is indeed a 64x64 program: doing just ([…]
It's definitely not something I can do off the top of my head, either, but it should be pretty easy to translate the ideas from working on concrete indices to considering everything relative to the iteration variable. Unfortunately, unless the existing unroll approach & heuristic start becoming a performance bottleneck, it's hard to justify re-inventing the wheel here, especially given that we now also have Viaduct-HE and its more general approach to finding batching opportunities.
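The "everything relative to the iteration variable" idea can be sketched in plain Python (a toy 1-D stencil model of my own, not HECO's or HEIR's actual analysis): the constant offsets a loop body applies to the induction variable become the rotation amounts, with no unrolling needed.

```python
# Toy model: infer batching rotations from the constant offsets a
# stencil applies to the loop induction variable.
N = 16

def rot(v, k):
    # Cyclic rotation: result[i] = v[(i + k) % len(v)].
    return [v[(i + k) % len(v)] for i in range(len(v))]

def stencil_loop(a):
    # Scalar loop: out[i] = a[i-1] + a[i] + a[i+1], cyclic indexing.
    return [a[(i - 1) % N] + a[i] + a[(i + 1) % N] for i in range(N)]

# The body's accesses are a[i + d] for d in {-1, 0, 1}; relative to the
# induction variable, each offset d becomes a single rotation by d.
offsets = [-1, 0, 1]

def stencil_batched(a):
    out = [0] * N
    for d in offsets:
        out = [o + r for o, r in zip(out, rot(a, d))]
    return out

a = [i * i for i in range(N)]
assert stencil_loop(a) == stencil_batched(a)
```

The point is that the offset set {-1, 0, 1} is visible directly in the affine accesses, so the same conclusion the unroll-based heuristic reaches per-iteration is available as a loop-level analysis.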
Update: The DRR patterns now work, and I've narrowed down the slow canonicalization on the huge IR to our secret.generic canonicalizers. It looks like […]
(force-pushed from d702475 to 45ca376)
Following up from the working group meeting earlier today, here are all the notes on "ordering" of the passes I found in the HECO paper and the code:

Collecting potentially "rotate-and-reduce"-able BinOps into variadic ops must happen before the main batching pass (this one's pretty obvious; not sure why this one made it into the paper but none of the other ones did).

Then, there are a few comments in the pipeline. Again, there's one related to rotate-and-reduce: it mentions that we can only detect the operands in the variadic ops as being extracted from the same tensor after we CSE to remove redundant versions of those tensors.

Regarding the main batching pass, there's this comment in the pipeline builder that says we must canonicalize just beforehand because it only handles constant indices (but that might be an implementation detail that doesn't apply to your DRR version).

More interestingly, there's this comment in the batching pass code itself that says "We must translate in order of appearance for this to work, so we walk manually". If this is correct, then the pattern rewriter approach might not work for this step (unless you can force the rewriter to do it top-down?). However, I'm not convinced that this comment is actually true…
(force-pushed from dabcbfd to 3a760c0)
At this point I have the box_blur (4x4 and 64x64) and hamming distance examples from HECO producing equivalent (but not identical) programs in HEIR. While it's not perfect, I think this is a good time to checkpoint the code for review. For example, box_blur 64x64:

```
bazel-bin/tools/heir-opt '--secretize=entry-function=box_blur' --wrap-generic --canonicalize --cse --full-loop-unroll \
  --insert-rotate \
  --cse --canonicalize \
  --collapse-insertion-chains --canonicalize \
  $PWD/tests/simd/box_blur_64x64.mlir
```
```mlir
module {
  func.func @box_blur(%arg0: !secret.secret<tensor<4096xi16>>) -> !secret.secret<tensor<4096xi16>> {
    %c65_i32 = arith.constant 65 : i32
    %c127 = arith.constant 127 : index
    %c4032 = arith.constant 4032 : index
    %c3968 = arith.constant 3968 : index
    %c63 = arith.constant 63 : index
    %0 = secret.generic ins(%arg0 : !secret.secret<tensor<4096xi16>>) {
    ^bb0(%arg1: tensor<4096xi16>):
      %1 = tensor_ext.rotate %arg1, %c3968 : tensor<4096xi16>, index
      %2 = tensor_ext.rotate %arg1, %c4032 : tensor<4096xi16>, index
      %3 = arith.addi %1, %2 : tensor<4096xi16>
      %4 = arith.addi %3, %arg1 : tensor<4096xi16>
      %5 = tensor_ext.rotate %4, %c63 : tensor<4096xi16>, index
      %6 = arith.addi %5, %2 : tensor<4096xi16>
      %7 = arith.addi %6, %arg1 : tensor<4096xi16>
      %8 = tensor_ext.rotate %7, %c63 : tensor<4096xi16>, index
      %9 = tensor_ext.rotate %arg1, %c127 : tensor<4096xi16>, index
      %10 = arith.addi %8, %9 : tensor<4096xi16>
      %11 = arith.addi %10, %arg1 : tensor<4096xi16>
      %12 = tensor_ext.rotate %11, %c3968 : tensor<4096xi16>, index
      %13 = arith.addi %12, %2 : tensor<4096xi16>
      %14 = arith.addi %13, %arg1 : tensor<4096xi16>
      %15 = tensor_ext.rotate %14, %c65_i32 : tensor<4096xi16>, i32
      secret.yield %15 : tensor<4096xi16>
    } -> !secret.secret<tensor<4096xi16>>
    return %0 : !secret.secret<tensor<4096xi16>>
  }
}
```

Notice the above differs from HECO in that not all rotation operations are done before the addition operations. I don't know if this is good or bad for the ultimate program runtime, but my basis of "equivalent" is merely in counting ops.

The pass breaks down into three components:

InsertRotate

This is the bulk of the work, applying the patterns in […]. What is NOT included is aligning slots to non-arithmetic-op downstream users (like extract; see CollapseInsertionChains).

CSE/canonicalize

InsertRotate alone does not put the IR into a form that is suitable for CSE/canonicalize to simplify, so I added patterns in […]
CollapseInsertionChains

The above is enough to simplify the entire IR, and what is left may be a final "rotation" as implemented by a sequence of extract and insert ops into the final output. I could have added a pattern to […]

Performance

As far as performance, the 64x64 example finishes in ~1 minute. Here's the timing breakdown: […]. I think there are likely some performance improvements I could make to the insertion chain pass, since it should "just" be doing a linear scan of the IR. But since it's fast enough for now, I will leave that as a followup task. It may become irrelevant if we replace this with an affine analysis that avoids unrolling the loop.
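As a rough illustration of the insertion-chain idea (a hypothetical model of my own, not the pass's actual code): if every output slot i is built by extracting slot (i + k) mod n of some tensor and inserting it at slot i, with one fixed k, the whole extract/insert chain is equivalent to a single rotation by k.

```python
def chain_as_rotation(pairs, n):
    """Given (dst, src) index pairs from an extract/insert chain over a
    size-n tensor, return the rotation amount k if the chain equals a
    single cyclic rotation, else None."""
    if len(pairs) != n:
        return None  # chain must cover every slot exactly once
    k = (pairs[0][1] - pairs[0][0]) % n
    for dst, src in pairs:
        if (src - dst) % n != k:
            return None  # offsets disagree: not one rotation
    return k

n = 8
# A chain inserting extract(t, (i + 3) % n) at slot i, for every i,
# collapses to rotate-by-3...
chain = [(i, (i + 3) % n) for i in range(n)]
assert chain_as_rotation(chain, n) == 3

# ...while a chain with inconsistent offsets is not a rotation.
assert chain_as_rotation([(0, 1), (1, 0)], 2) is None or True
```

The real pass works on IR def-use chains rather than index pairs, but the "constant relative offset over all slots" check is the core of it.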
First of all, thanks for porting the HECO batching optimization! 🤩 Expressing this directly over tensor, and expressing the rewrites in DRR instead of C++, makes this re-implementation a lot cleaner than the original (as evidenced by the fact that my review comments are mostly nitpicking). I did notice one minor and one major departure from the HECO solutions, though, which mean the optimization falls a bit short in terms of final program performance: the first one (lack of a combine op, see comment below) is only really relevant for programs that don't perfectly batch, but the second one (issues with "target slot" selection, I think) also affects things like boxblur, and means that the generated code would be significantly slower.
> Notice the above differs from HECO in that not all rotation operations are done before the addition operations. I don't know if this is good or bad for the ultimate program runtime, but my basis of "equivalent" is merely in counting ops.
What do you mean by equivalent when counting ops here? For example, this is what HECO produces for boxblur (heco --fhe-pass evaluation/comparison/heco_input/boxblur_64x64.mlir):

```mlir
func.func private @encryptedBoxBlur_64x64(%arg0: !fhe.batched_secret<4096 x i16>) -> !fhe.batched_secret<4096 x i16> {
  %0 = fhe.rotate(%arg0) by 64 : <4096 x i16>
  %1 = fhe.rotate(%arg0) by 65 : <4096 x i16>
  %2 = fhe.rotate(%arg0) by 1 : <4096 x i16>
  %3 = fhe.add(%arg0, %0, %1, %2) : (!fhe.batched_secret<4096 x i16>, !fhe.batched_secret<4096 x i16>, !fhe.batched_secret<4096 x i16>, !fhe.batched_secret<4096 x i16>) -> !fhe.batched_secret<4096 x i16>
  return %3 : !fhe.batched_secret<4096 x i16>
}
```
Note that there are only three rotations: the kernel is 2x2, so we need to bring together four pixels (one of which we don't need to rotate). In comparison, the HEIR version needs seven rotations, which is obviously still a massive improvement over the baseline, but since rotations make up virtually all of the runtime of this program, it's still around 2x slower.
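To sanity-check that arithmetic, here is a plain-Python model of SIMD slots (my own sketch, using the convention that rotate-by-k puts slot i+k into slot i): with a 64x64 image stored row-major across 4096 slots, rotations by 1, 64, and 65 align the four pixels of the 2x2 kernel, so one 4-way add produces every box sum at once.

```python
N = 4096  # 64x64 image, row-major: slot x*64 + y holds pixel (x, y)

def rot(v, k):
    # Cyclic rotation: result[i] = v[(i + k) % N].
    return [v[(i + k) % N] for i in range(N)]

img = [(7 * i + 3) % 251 for i in range(N)]

# The three-rotation program: out = img + rot(64) + rot(65) + rot(1).
parts = [img, rot(img, 64), rot(img, 65), rot(img, 1)]
out = [sum(p[i] for p in parts) for i in range(N)]

# Away from the wrap-around edges, slot (x*64 + y) now holds the 2x2
# box sum anchored at pixel (x, y).
x, y = 10, 20
i = x * 64 + y
expected = img[i] + img[i + 1] + img[i + 64] + img[i + 65]
assert out[i] == expected
```

This is why the rotation count is kernel-size minus one: each kernel offset other than (0, 0) costs exactly one rotation, regardless of image size.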
> What is NOT included is aligning slots to non-arithmetic-op downstream users (like extract, see CollapseInsertionChains).
I think this might be the cause of these differences: without the lookahead ("target slot selection" in HECO parlance), you might not get the full "magic". Alternatively, it might be because the additions aren't grouped into a single op? I'll try to stare at the IR produced in the intermediate steps a bit more, since it would be nice to close the gap, but I don't think it's totally necessary, either.
EDIT: After looking at the IR after unrolling (before --insert-rotate), I don't think it's as simple as that, as you'd have to walk the def-use chain all the way until you find a tensor.insert to figure out the target slot. This is less of an issue in HECO, where the n-ary rewrite means that the insert op will usually be the immediate next op. Since this seems like an artifact of the unrolling approach, I'd ignore it for now and hope this goes away in an affine-loop-analysis-based implementation.
> As far as [compile time] performance, the 64x64 example finishes in ~1 minute. [...] It may become irrelevant if we replace this with an affine analysis that avoids unrolling the loop.
It's hard to do an apples-to-apples comparison of the compile time between HECO and HEIR, because HEIR obviously has the whole secret.generic handling in there, too, but both take <1 min on my (server) machine, which seems good enough for now, especially given that a non-unrolled version should be significantly faster anyway.
(Resolved review thread on include/Dialect/TensorExt/Transforms/TensorExtRotationAlignment.td, now outdated.)
I was basing that claim on this file, which contains eight rotations (though the last one is only necessary to extract the constant term), as do all subsequent snapshotted files in that directory. As I understand it, HECO is also applying the rotate-and-sum pattern to reduce that to log_2(8) = 3 rotations? So perhaps that was just omitted from those checked-in files. I have a draft PR that I haven't yet pushed that implements a rotate-and-sum pattern in isolation, but now I see why having rotations interspersed with addition ops is a problem: it will be harder to identify rotate-and-sum opportunities. However, if that's OK with you, I'd still like to check this PR in so I can have a good baseline to improve against.
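The rotate-and-sum trick being referenced can be sketched in plain Python (a model of cyclic slot rotation, not HEIR code): summing 8 aligned copies needs only log_2(8) = 3 rotations, because each step doubles the span already accumulated.

```python
def rot(v, k):
    # Cyclic rotation: result[i] = v[(i + k) % n].
    n = len(v)
    return [v[(i + k) % n] for i in range(n)]

a = list(range(16))

# Naive form: 7 separate rotations to put the sum of 8 consecutive
# slots into each slot i.
naive = [sum(a[(i + d) % 16] for d in range(8)) for i in range(16)]

# Rotate-and-sum: 3 rotations (by 1, 2, 4). After the step with shift
# s, slot i holds the sum of 2*s consecutive original slots.
t = a
for shift in (1, 2, 4):
    t = [x + y for x, y in zip(t, rot(t, shift))]

assert t == naive
```

The same doubling argument is what reduces the 8-rotation snapshot above to 3 rotations when the pattern applies cleanly.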
I think the target slot selection is actually fine, but what is not included is target slot selection when the downstream op is an insert. I could add this (it would just be a slight variation on the patterns already in this PR), but I'd want to find a counterexample program that shows it's necessary first. I will file an issue and put a TODO in the code.
Thanks for the updates - once the build error is fixed, this should be good to merge from my perspective 👍
I don't think rotate-and-reduce gives you anything here; this is just an out-of-date file 🙈 I focused on […]
Yes, in HECO the "target" slot is specifically the index at which a result is later inserted (or […])
TODOs and new issues filed (along with minor cleanup and build fix)
- InsertRotate: insert rotations and apply target slot selection rules
- TensorExtCanonicalization: canonicalization patterns to enable CSE to remove unnecessary rotations
- CollapseInsertionChains: identify extract/insert chains that can be converted to rotations

Additional followup issues identified in google#471 for improvements.
Partial progress toward #475