Cranelift: Add a vcode peephole pass to the mach backend framework #8520
Comments
Thanks for writing this up! I'd add that some of these peephole opportunities are also available before register allocation, and might simplify the register allocation problem. (Bonus points if you can match patterns in reverse program order, so the pre-regalloc pass can be integrated with the backwards lowering pass.) But as you point out, regalloc can introduce more instances of these patterns, so it's good to do it afterwards as well. For example: when removing a

Your note about load-sinking is another example. It's useful to combine loads into memory operands of other instructions before register allocation because that reduces register pressure. Because this is useful, we currently do this optimization in ISLE, and by your guidelines it should still be in ISLE because liveness is a dataflow property. However, we have repeatedly gotten this optimization wrong, because it's a property of the dataflow of the lowered instructions, not the dataflow of the CLIF. So I actually think there are good reasons to do this kind of peephole optimization even when we need a dataflow analysis to justify it.

In particular, an analysis that "this use is the last use of this register" is necessary for load-sinking. Before regalloc we can rely on vcode being in SSA form, which means we can maintain the set of live virtual registers incrementally during the backward lowering pass. Unfortunately, I think this analysis requires a full backward dataflow fix-point if we want to do it reliably after register allocation, but your point that it would be nice to sink reloads generated by RA2 is a good one. We can approximate it by checking if there's another definition of the same physical register in the same basic block following the use we want to sink into. That same analysis would be useful for another pre-regalloc optimization I want.
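The incremental last-use computation described above could look something like the following sketch. The `Inst` type with explicit def/use lists is hypothetical (not Cranelift's actual vcode API), and for simplicity the live-out set is assumed empty; a real version would seed `live` with the block's live-out registers.

```rust
use std::collections::HashSet;

// Hypothetical, simplified instruction: defs and uses are virtual-register ids.
struct Inst {
    defs: Vec<u32>,
    uses: Vec<u32>,
}

// Walk a block backward, returning for each instruction which of its uses
// are last uses (no later instruction in the block uses that register).
// Under SSA this local, incremental computation is exact within the block.
fn last_uses(block: &[Inst]) -> Vec<Vec<u32>> {
    let mut live: HashSet<u32> = HashSet::new(); // assume empty live-out
    let mut out = vec![Vec::new(); block.len()];
    for (i, inst) in block.iter().enumerate().rev() {
        for &d in &inst.defs {
            live.remove(&d);
        }
        for &u in &inst.uses {
            // First time we see a use walking backward = last use forward.
            if live.insert(u) {
                out[i].push(u);
            }
        }
    }
    out
}
```

This is the sense in which SSA makes the pre-regalloc analysis cheap: one backward sweep, one live set, no fix-point.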
When we lower to

One extension to an optimization you mentioned is when

As far as implementation goes: I don't see that this has anything in common with ISLE. I'd lean toward a Thompson-style virtual machine simulating an NFA, because I think we can start by writing that in pure Rust without it being incomprehensible, and then decide later whether it's worth building a new DSL for it. That has the same property as Aho-Corasick that it doesn't backtrack. If we transform to a DFA, then when matching a set of fixed strings I believe the resulting automaton is identical to the one produced by Aho-Corasick, but determinization and minimization are complicated and I think we can build a useful first version without them. Further, I think we can generalize beyond finite automata in order to match patterns like "the same register I saw in an earlier instruction" (a limited form of back-references) while staying within the Thompson VM model; that means we don't have a constant bound on the number of live states, but everything else works.
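A Thompson-style VM over instructions, with the limited back-reference described above, might be sketched like this. The `Opcode`/`Inst`/`Step` types are toys invented for illustration; the point is only that we advance a *set* of threads per instruction with no backtracking, and each thread carries its register capture.

```rust
// A tiny Thompson-style VM over a stream of (opcode, dst, src) triples.
#[derive(Clone, Copy, PartialEq)]
enum Opcode { Mov, Add }

#[derive(Clone, Copy)]
struct Inst { op: Opcode, dst: u32, src: u32 }

enum Step {
    // Match this opcode and capture its destination register.
    CaptureDst(Opcode),
    // Match this opcode whose source equals the captured register
    // (the limited back-reference).
    SrcEqualsCapture(Opcode),
}

struct Thread { pc: usize, capture: Option<u32> }

// Advance all live threads over one instruction; returns true if some
// thread completed the whole pattern here. No backtracking: we keep a
// set of live states, Thompson-style. Because threads carry captures,
// there's no constant bound on how many are live.
fn step(pattern: &[Step], threads: &mut Vec<Thread>, inst: Inst) -> bool {
    // Every instruction can also start a fresh match attempt.
    threads.push(Thread { pc: 0, capture: None });
    let mut next = Vec::new();
    let mut matched = false;
    for t in threads.drain(..) {
        let ok = match pattern[t.pc] {
            Step::CaptureDst(op) => inst.op == op,
            Step::SrcEqualsCapture(op) => inst.op == op && Some(inst.src) == t.capture,
        };
        if ok {
            let capture = match pattern[t.pc] {
                Step::CaptureDst(_) => Some(inst.dst),
                _ => t.capture,
            };
            if t.pc + 1 == pattern.len() {
                matched = true;
            } else {
                next.push(Thread { pc: t.pc + 1, capture });
            }
        }
    }
    *threads = next;
    matched
}
```

With the pattern `[CaptureDst(Mov), SrcEqualsCapture(Add)]`, feeding `mov v1, v2` then `add v3, v1` reports a match on the second instruction, because the `add` reads the register the `mov` defined.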
Agreed that this could be useful both pre- and post-regalloc. It definitely seems doable to run the pass both forwards and backwards, integrating with the instruction-encoding traversal and the lowering pass respectively.
I agree that there are additional improvements we could do on vcode if we had a DFG. But it isn't clear to me that constructing a full DFG for vcode sits in the same "sweet spot" we've been trying to aim for with Cranelift in general with regards to balancing complexity, compile time, and code quality. Not saying it doesn't fit! I honestly don't know, and I think we'd have to investigate the costs and complexity some more to clarify that for me. (And I would just have to think about it some more.) I do think that a sliding-window style peephole pass is obviously consistent with the existing system with its existing costs and tradeoffs, in my eyes, because of its relative simplicity and how it can be integrated with existing traversals and data structures (or lack thereof). Are you imagining that we would have both DFG-based and sliding-window-based peephole passes? Or that the same pass could do either style of rewrite? Because (and I don't mean to imply that you said this) I don't think the DFG-based approach subsumes the sliding-window-based approach, because of how the latter can leverage instruction scheduling.
In general:
But I agree that it is not off-the-shelf reusable as-is without a new/subset prelude and some other tweaks.
I'm beginning to think we're talking about different things when we say "dataflow". It sounds like you mean it in the sense of our

I agree that we don't want that here. I was referring to "dataflow analysis" in the sense of the dragon book. We may need to compute some collection of facts about instructions in the program, and that computation may depend on control flow, but we don't need a graph representation of dataflow for it. The analysis results allow us to write patterns that are only correct if their surrounding context is okay, without needing to examine the rest of the function in the pattern. This is necessary for load sinking, for example: the pattern to match is simple, but the transformation is only valid if the load's result is not used again afterward.

In some cases we can do the analysis incrementally while evaluating the patterns in a single pass. That works for pre-regalloc liveness analysis as long as we're doing a backward pass in post-order, because of SSA form. In other cases we may need a full fix-point analysis or a heuristic approximation, such as for post-regalloc liveness analysis.
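For the post-regalloc case, the "full fix-point analysis" mentioned above is classic dragon-book backward liveness. A minimal sketch, assuming blocks are summarized by made-up per-block `uses`/`defs` sets and a `succs` successor list (not Cranelift's actual vcode representation):

```rust
use std::collections::HashSet;

// Backward liveness as a fix-point: iterate until the per-block live-in
// sets stop changing. live-in(b) = uses(b) ∪ (live-out(b) − defs(b)),
// live-out(b) = ∪ live-in(s) over successors s.
fn liveness_fixpoint(
    succs: &[Vec<usize>],
    uses: &[HashSet<u32>],
    defs: &[HashSet<u32>],
) -> Vec<HashSet<u32>> {
    let n = succs.len();
    let mut live_in: Vec<HashSet<u32>> = vec![HashSet::new(); n];
    let mut changed = true;
    while changed {
        changed = false;
        // Reverse order converges faster for a backward problem.
        for b in (0..n).rev() {
            let mut live_out: HashSet<u32> = HashSet::new();
            for &s in &succs[b] {
                live_out.extend(live_in[s].iter().copied());
            }
            let mut new_in: HashSet<u32> = uses[b].clone();
            new_in.extend(live_out.difference(&defs[b]).copied());
            if new_in != live_in[b] {
                live_in[b] = new_in;
                changed = true;
            }
        }
    }
    live_in
}
```

This is what makes the post-regalloc version more expensive than the pre-regalloc one: without SSA, loops force re-iteration until the sets stabilize, which is exactly the cost the heuristic approximation avoids.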
Ah, yeah, I must have been reading too fast, because you clearly said "dataflow analysis"; I assumed you were talking about DFG construction because I had originally been discussing avoiding the cost of constructing a DFG. Agreed that some dataflow analyses can be done incrementally in a single pass, and could integrate with a single traversal that does both the analysis and the peephole rewrites; agreed that it would be cool to do.

(And, of course, if/when we get around to actually prioritizing this stuff, we can talk about incremental steps: whether it makes sense to start with support for running these kinds of dataflow analyses right off the bat or not, when to implement what, and how to break all this apart into milestones at that time.)
I think this might subsume #4124.
(Fleshing out and jotting down for posterity an idea I floated and we collectively refined at a recent Cranelift meeting.)
This peephole pass could run after regalloc, just before encoding instructions. It wouldn't need to be a whole new traversal over the IR, it could run as we walk the vcode for encoding, essentially implemented as an iterator combinator.
This peephole pass would not be DFG-based. It would be a classic sliding-window-of-instructions peephole pass. This both allows us to avoid constructing a more-complete DFG for vcode like what we have for clif, which could be expensive, and it allows us to benefit from opportunities that arise only because of instruction scheduling of two different sequences that are otherwise disconnected in the DFG.
Put another way: if you want DFG-y rewrites, use the mid-end for target-agnostic rules and the clif-to-vcode instruction selection lowering for target-specific rules. We have good solutions for that kind of thing. If however you want non-DFG-y rewrites, we currently have nothing and this issue is proposing we add something for the target-specific dimension, which is where I suspect most of the opportunity exists.
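The "iterator combinator" shape of the pass could look something like this sketch. It is generic over a made-up instruction type and rewrite rule, not the actual `VCode` types: the adapter holds a small window (here, two instructions), lets a rule fuse adjacent pairs, and otherwise passes instructions through to encoding unchanged.

```rust
// A peephole pass as an iterator combinator: wrap the instruction stream,
// keep a two-instruction window, and let a rewrite rule either fuse the
// pair into one instruction or pass the first one through unchanged.
struct Peephole<I: Iterator<Item = T>, T, F> {
    inner: I,
    buf: Option<T>, // one instruction of lookahead
    rule: F,
}

impl<I, T, F> Iterator for Peephole<I, T, F>
where
    I: Iterator<Item = T>,
    F: FnMut(&T, &T) -> Option<T>, // Some(fused) replaces the pair.
{
    type Item = T;
    fn next(&mut self) -> Option<T> {
        let mut cur = self.buf.take().or_else(|| self.inner.next())?;
        // Greedily try to fuse with following instructions.
        while let Some(nxt) = self.inner.next() {
            match (self.rule)(&cur, &nxt) {
                Some(fused) => cur = fused,
                None => {
                    self.buf = Some(nxt);
                    break;
                }
            }
        }
        Some(cur)
    }
}

fn peephole<I, T, F>(inner: I, rule: F) -> Peephole<I, T, F>
where
    I: Iterator<Item = T>,
    F: FnMut(&T, &T) -> Option<T>,
{
    Peephole { inner, buf: None, rule }
}
```

Because it is just an iterator adapter, it composes with whatever traversal the encoding loop already does: no extra pass over the IR and no side data structures, which is the "sweet spot" argument above in code form.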
Implementation idea brain dump: … `islec` … `vcode[i]` that we've already looked at when we called into ISLE from `vcode[i - 1]`.

Between integrating with an existing traversal over the vcode, not constructing a DFG, and the string-matching sketch just above, this new peephole pass should impose extremely little overhead on compile times.
A few concrete example peephole optimizations we could do; I'm sure we can think of more, and will do so if we put this infra in place:

- `ldp`/`stp` on aarch64
- `add`s or whatever that produces/consumes the value on x64
- removing `cmp` and `test` instructions when we know that the last instruction in the stream happened to set identical flags (e.g. `and rax, r10; test rax, r10` or `sub rdi, rsi; cmp rdi, rsi`)
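The flags example could be recognized with a check along these lines. This is a toy mini-ISA, not Cranelift's `MachInst` types, and it only encodes the two opcode pairs from the examples with textually identical operands; a real rule would also need to account for `and` and `sub` overwriting their destination register (sound for the `and`/`test` pair since AND is idempotent, but something the `sub`/`cmp` case must check).

```rust
// Toy mini-ISA for illustration: two-operand flag-setting instructions.
#[derive(Clone, Copy, PartialEq)]
enum Op { And, Test, Sub, Cmp }

#[derive(Clone, Copy)]
struct Inst { op: Op, a: u8, b: u8 }

// Does `prev` already set the flags that `cur` would set, making `cur`
// removable? Matches the pairs from the examples above:
//   and r, r'; test r, r'   and   sub r, r'; cmp r, r'
fn flags_already_set(prev: Inst, cur: Inst) -> bool {
    let same_operands = prev.a == cur.a && prev.b == cur.b;
    same_operands
        && matches!(
            (prev.op, cur.op),
            (Op::And, Op::Test) | (Op::Sub, Op::Cmp)
        )
}
```

A sliding-window pass would call this on each adjacent pair and drop the second instruction when it returns true.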