
Poor compilation of dependent pattern-matching in bytecode #18933

Open
silene opened this issue Apr 15, 2024 · 2 comments
Labels
kind: performance Improvements to performance and efficiency. kind: wish Feature or enhancement requests. part: VM Virtual machine.

Comments


silene commented Apr 15, 2024

Is your feature request related to a problem?

A commonly encountered idiom in Coq is

match x return (P x -> _) with
| C1 => fun H: P C1 => ...
| C2 => fun H: P C2 => ...
| ...
end H (* <- note the application *)
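
For instance (a minimal hypothetical example, not taken verbatim from any library), the usual "convoy pattern" for a safe predecessor has exactly this shape:

(* Hypothetical example: the proof H is "convoyed" through the match
   so that each branch sees it at a refined type. *)
Definition pred_pos (n : nat) (H : n <> 0) : nat :=
  match n return (n <> 0 -> nat) with
  | O => fun H0 : O <> 0 => match H0 eq_refl return nat with end
  | S m => fun _ : S m <> 0 => m
  end H. (* <- the application *)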

The way it is compiled to bytecode is rather inefficient. First, H is pushed on the stack, then a continuation label is pushed too. Then the pattern matching is performed: every branch allocates a closure and fills it with the needed free variables, then pops the label and jumps to it. Finally, the closure is applied to the element that was pushed on the stack.
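
Schematically, using the same kind of notation as the bytecode listing further down (opcode names and frame details are simplified, and the accumulator branch is elided), the generated code looks roughly like:

PUSH                 (* H, the future argument of the application *)
PUSHRETADDR lbl_k    (* continuation of the match *)
SWITCH[table];
lbl_C1: CLOSURE nfv1, lbl_body1; RETURN 0;
lbl_C2: CLOSURE nfv2, lbl_body2; RETURN 0;
...
lbl_k: APPLY 1       (* immediately consume the fresh closure *)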

To summarize, during evaluation, a closure is allocated and immediately consumed.

Proposed solution

Ideally, no closure would be allocated or applied at all: the branches can be executed directly, which would bring a large speedup.

There is a snag with open terms, though. During strong normalization, the branches actually need to return closures. In other words, for normalization we still need the bytecode as it currently is, which means that different bytecode would be executed during evaluation and during normalization.

Moreover, there would be no sharing between the two bytecodes, since the one for evaluation would access the free variables directly on the stack (ACC opcode), while the one for normalization would access them from the closure environment (ENVACC opcode). So this would cause some bytecode duplication (and an exponential blowup in the case of nested matches). It is not obvious whether a well-thought-out prefix that copied the closure environment onto the stack would make it possible to delegate part of the normalization bytecode to the evaluation bytecode.
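
To see why no sharing is possible, consider a branch body that reads some free variable y; the two bytecodes would reach it through different access opcodes (schematic sketch, offsets i and j hypothetical):

(* evaluation bytecode: y sits directly on the stack *)
lbl_C1_eval: ... ACC i ...     (* i = stack offset of y *)
(* normalization bytecode: y sits in the closure environment *)
lbl_C1_norm: ... ENVACC j ...  (* j = environment slot of y *)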

Alternative solutions

No response

Additional context

This is remotely related to #14109.

@silene silene added kind: performance Improvements to performance and efficiency. part: VM Virtual machine. kind: wish Feature or enhancement requests. labels Apr 15, 2024

ppedrot commented Apr 17, 2024

The problem is not limited to this kind of commuting closure. More generally, in the Coq VM, any non-tailcall pattern matching has to push a dummy function frame because of the MAKESWITCHBLOCK instruction, and the branches perform a RETURN to that frame instead of jumping to the continuation. In other words, the bytecode of a non-tailcall match looks like

PUSHRETADDR; SWITCH[table];
lbl_accu: MAKESWITCHBLOCK; RETURN 0;
lbl_arg1: (* branch1 *); RETURN nargs1;
...
lbl_argn: (* branchn *); RETURN nargsn

This is in stark contrast to the OCaml VM, where there is no PUSHRETADDR and thus no function frame, and the RETURN operations are just unconditional jumps to the continuation. Once again, this scheme is designed to handle accumulators but has a non-zero cost for all users.
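
For comparison, a rough sketch of the corresponding OCaml VM scheme (same notation; each branch ends in a plain jump to the join point):

SWITCH[table];
lbl_arg1: (* branch1 *); BRANCH lbl_k;
...
lbl_argn: (* branchn *); BRANCH lbl_k;
lbl_k: (* continuation *)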


silene commented Apr 17, 2024

Note that PUSHRETADDR and RETURN are opcodes that do not perform any allocation. So, yes, Coq's VM executes two opcodes where the original OCaml VM executes just one, but I consider the overhead sufficiently low that it is not worth spending time on it. In fact, if the selected branch contains a let-in, then the OCaml VM also executes two opcodes, in which case there is not even an overhead for Coq's VM.
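
To make the opcode count concrete, here is a rough sketch (schematic, not actual OCaml compiler output) of a non-tail branch ending in a let-in on the OCaml VM:

(* branch "let x = e in body": the binding must be popped before the jump *)
lbl_arg1: (* e *); PUSH; (* body *); POP 1; BRANCH lbl_k
(* POP 1 + BRANCH: two opcodes, matching the PUSHRETADDR + RETURN pair *)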

In contrast, the issue I mention is a case where a bunch of costly opcodes are being executed for naught.
