check for minor GC on all loop iterations #11572

Closed
markghayden opened this issue Sep 27, 2022 · 9 comments

@markghayden

markghayden commented Sep 27, 2022

The optimizing compiler seems to generate a check for minor GC on every loop iteration, no matter how simple the loop: this happens for `while`, `for`, and `let rec` loops alike. See the examples below. This was tested with OCaml 4.14.0 on macOS. These checks seem to be unnecessary... maybe I'm missing something?

```ocaml
let rec_loop ofs =
  let rec loop ofs =
    loop ofs
  in
  loop ofs
;;

let while_loop ofs =
  while true do ()
  done
;;

let for_loop len =
  for ofs = 0 to len do ()
  done
;;
```

The resulting assembly looks like the following (shown here for the `for_loop` function); the full output is attached as loop.zip.

```asm
_camlLoop__for_loop_32:
	.cfi_startproc
	subq	$8, %rsp
	.cfi_adjust_cfa_offset 8
L115:
	movl	$1, %ebx
	cmpq	%rax, %rbx
	jg	L113
L114:
	movq	%rbx, %rdi
	addq	$2, %rbx
	cmpq	%rax, %rdi
	je	L113
	cmpq	(%r14), %r15
	ja	L114
	jmp	L116
L113:
	movl	$1, %eax
	addq	$8, %rsp
	.cfi_adjust_cfa_offset -8
	ret
	.cfi_adjust_cfa_offset 8
L116:
	call	_caml_call_gc
L117:
	jmp	L114
	.cfi_adjust_cfa_offset -8
	.cfi_endproc
```
@xavierleroy
Contributor

These checks are necessary for proper signal handling (in OCaml 4) and proper operation of the multicore GC (in OCaml 5). You'll be glad(?) to learn that OCaml 5 adds more of these checks on selected tail calls, so that no code can loop forever without checking for GC and signal requests. The cost is relatively low (a compare and a correctly-predicted conditional branch), so we're comfortable having one check per loop iteration. The only case where the checks could be omitted safely is a counted `for` loop where the compiler knows that the number of iterations is small. I don't think this is worth pursuing.
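As a minimal illustration of the signal-handling point (a sketch of my own, not code from this issue): in native code, the SIGINT handler below only gets a chance to run while `spin` is looping because the compiler inserted a poll on the loop's back edge.

```ocaml
(* Minimal sketch (illustration only): install a SIGINT handler, then
   spin in an allocation-free loop.  In native code, the handler can
   run during the loop only because the compiler inserted a poll on
   the loop's back edge. *)
let spin () = while true do () done

let () =
  Sys.set_signal Sys.sigint
    (Sys.Signal_handle (fun _ -> print_endline "interrupted"; exit 0));
  spin ()
```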

@markghayden
Author

Thanks for the explanation. That makes sense.

Wow: every loop iteration now has the overhead of a memory read (presumably cached), a comparison, and a conditional jump.

I think an annotation should be added for marking locations where these checks are not wanted.

@gasche
Member

gasche commented Sep 28, 2022

Actually, this "poll" operation is only inserted if there isn't already a poll or an allocation in the loop body, so the performance degradation would only affect very tight loops. Do you have some actual code that is strongly affected by the change?
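A small sketch of that rule, assuming the optimizer behaves as described: only the allocation-free loop below needs a dedicated poll; in the allocating loop, the heap-limit check performed at each allocation already serves as one.

```ocaml
(* Sketch, assuming the behaviour described above. *)

(* Allocation-free body: the compiler inserts a dedicated poll on the
   back edge, as in the assembly shown in the issue. *)
let busy n =
  for _i = 1 to n do () done

(* Allocating body: each iteration allocates a cons cell, and the
   allocation's heap-limit check doubles as the poll point, so no
   extra poll instruction is needed. *)
let build n =
  let acc = ref [] in
  for i = 1 to n do
    acc := i :: !acc
  done;
  !acc
```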

@gadmm
Contributor

gadmm commented Sep 28, 2022

> Wow: every loop iteration now has the overhead of a memory read (presumably cached), a comparison, and a conditional jump.

Well-predicted branches are extremely cheap on modern hardware. (In your example, `ja L114` is a backward jump, predicted taken in the absence of predictor history.) At the time of the polling PR, the main concern was the effect on code size (e.g. instruction decoding overhead).

I second the request for actual affected code. (An optimizing compiler could consider loop unrolling, etc.)
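For what it's worth, a rough source-level sketch of the unrolling idea (an illustration of mine, not code from this issue): processing four elements per back edge amortizes the one-poll-per-iteration cost.

```ocaml
(* Rough sketch of manual 4x unrolling (illustrative only): the poll on
   the main loop's back edge is now paid once per four array elements. *)
let sum4 (a : int array) =
  let n = Array.length a in
  let s = ref 0 in
  let i = ref 0 in
  while !i + 4 <= n do
    s := !s + a.(!i) + a.(!i + 1) + a.(!i + 2) + a.(!i + 3);
    i := !i + 4
  done;
  (* leftover elements *)
  while !i < n do
    s := !s + a.(!i);
    incr i
  done;
  !s
```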

@gadmm
Contributor

gadmm commented Sep 28, 2022

However, the generated code for the loop seems suboptimal: it changes the branch prediction in the absence of predictor history, with the "back edge" predicted as not-taken rather than taken. One way to avoid this would be to move the polling code from the bottom to the top of the loop (and perhaps add a one-use jump to skip polling on the first iteration).

The poll-at-top vs. poll-at-bottom strategies were thoroughly discussed at #10039 (comment), and benchmark results were used to choose poll-at-bottom (#10039 (comment)). However, this was decided using instruction counts, which do not account for branch prediction, and the benchmarks did not test a version with an additional (well-predicted) `jmp` to skip polling before the first iteration.

@xavierleroy
Contributor

> However, the generated code for the loop seems suboptimal: it changes the branch prediction in the absence of predictor history, with the "back edge" predicted as not-taken rather than taken.

Really? In the example, the "back edge" is `ja L114`, which is predicted taken as it is a backward branch.

@gadmm
Contributor

gadmm commented Sep 28, 2022

What counts is that `je L113` predicts an early exit from the loop (hence my quotes around "back edge"). The jump following the loop condition is a backward jump without polling (or with polling-at-top), but a forward jump with polling-at-bottom.

@gadmm
Contributor

gadmm commented Sep 28, 2022

Sorry, I confused myself: the forward jump `je L113` is correctly predicted as not-taken, so both versions are equivalent in terms of branch prediction. Sorry for the noise.

@markghayden
Author

Good points. It is hard to detect much difference with the added polling instructions, at least on AMD64.
