check for minor GC on all loop iterations #11572

Closed
markghayden opened this issue Sep 27, 2022 · 9 comments

@markghayden

markghayden commented Sep 27, 2022

The optimizing compiler seems to generate a check for minor GC on every loop iteration, no matter how simple the loop: this happens for `while`, `for`, and `let rec` loops alike. See the examples below. This was tested with OCaml 4.14.0 on macOS. These checks seem to be unnecessary... maybe I'm missing something?

```ocaml
let rec_loop ofs =
  let rec loop ofs =
    loop ofs
  in
  loop ofs
;;

let while_loop ofs =
  while true do ()
  done
;;

let for_loop len =
  for ofs = 0 to len do ()
  done
;;
```

The resulting assembly looks like the following (shown here for the `for_loop` function); the full output is attached as loop.zip.

```asm
_camlLoop__for_loop_32:
	.cfi_startproc
	subq	$8, %rsp
	.cfi_adjust_cfa_offset 8
L115:
	movl	$1, %ebx
	cmpq	%rax, %rbx
	jg	L113
L114:
	movq	%rbx, %rdi
	addq	$2, %rbx
	cmpq	%rax, %rdi
	je	L113
	cmpq	(%r14), %r15
	ja	L114
	jmp	L116
L113:
	movl	$1, %eax
	addq	$8, %rsp
	.cfi_adjust_cfa_offset -8
	ret
	.cfi_adjust_cfa_offset 8
L116:
	call	_caml_call_gc
L117:
	jmp	L114
	.cfi_adjust_cfa_offset -8
	.cfi_endproc
```
@xavierleroy
Contributor

These checks are necessary for proper signal handling (in OCaml 4) and proper operation of the multicore GC (in OCaml 5). You'll be glad(?) to learn that OCaml 5 adds more of these checks on selected tail calls, so that no code can loop forever without checking for GC and signal requests. The cost is relatively low (a compare and a correctly-predicted conditional branch), so we're comfortable having one check per loop iteration. The only case where the checks could be omitted safely is a counted `for` loop where the compiler knows that the number of iterations is small. I don't think this is worth pursuing.
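As a minimal illustration of the signal-handling point (a sketch of my own, not code from this issue): in native code, the SIGINT handler below only gets a chance to run while `spin` is looping because the compiler inserted a poll on the loop's back edge.

```ocaml
(* Minimal sketch (illustration only): install a SIGINT handler, then
   spin in an allocation-free loop.  In native code, the handler can
   run during the loop only because the compiler inserted a poll on
   the loop's back edge. *)
let spin () = while true do () done

let () =
  Sys.set_signal Sys.sigint
    (Sys.Signal_handle (fun _ -> print_endline "interrupted"; exit 0));
  spin ()
```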

@markghayden
Author

Thanks for the explanation. That makes sense.

Wow: every loop iteration now has the overhead of a memory read (presumably cached), a comparison, and a conditional jump.

I think an annotation should be added for marking locations where these checks are not wanted.

@gasche
Member

gasche commented Sep 28, 2022

Actually, this "poll" operation is only inserted if there isn't already a poll or an allocation in the loop body, so the performance degradation would only affect very tight loops. Do you have some actual code that is strongly affected by the change?
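A small sketch of that rule, assuming the optimizer behaves as described: only the allocation-free loop below needs a dedicated poll; in the allocating loop, the heap-limit check performed at each allocation already serves as one.

```ocaml
(* Sketch, assuming the behaviour described above. *)

(* Allocation-free body: the compiler inserts a dedicated poll on the
   back edge, as in the assembly shown in the issue. *)
let busy n =
  for _i = 1 to n do () done

(* Allocating body: each iteration allocates a cons cell, and the
   allocation's heap-limit check doubles as the poll point, so no
   extra poll instruction is needed. *)
let build n =
  let acc = ref [] in
  for i = 1 to n do
    acc := i :: !acc
  done;
  !acc
```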

@gadmm
Contributor

gadmm commented Sep 28, 2022

> Wow: every loop iteration now has the overhead of a memory read (presumably cached), a comparison, and a conditional jump.

Well-predicted branches are extremely cheap on modern hardware. (In your example, `ja L114` is a backward jump, predicted taken in the absence of predictor history.) At the time of the polling PR, the main concern was the effect on code size (e.g. instruction decoding overhead).

I second the request for actual affected code. (An optimizing compiler could consider loop unrolling, etc.)
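For what it's worth, a rough source-level sketch of the unrolling idea (an illustration of mine, not code from this issue): processing four elements per back edge amortizes the one-poll-per-iteration cost.

```ocaml
(* Rough sketch of manual 4x unrolling (illustrative only): the poll on
   the main loop's back edge is now paid once per four array elements. *)
let sum4 (a : int array) =
  let n = Array.length a in
  let s = ref 0 in
  let i = ref 0 in
  while !i + 4 <= n do
    s := !s + a.(!i) + a.(!i + 1) + a.(!i + 2) + a.(!i + 3);
    i := !i + 4
  done;
  (* leftover elements *)
  while !i < n do
    s := !s + a.(!i);
    incr i
  done;
  !s
```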

@gadmm
Contributor

gadmm commented Sep 28, 2022

However, the generated code for the loop seems suboptimal: it changes the branch prediction in the absence of predictor history, with the "back edge" predicted as not-taken rather than taken. One way to avoid this would be to move the polling code from the bottom to the top of the loop (and perhaps add a one-use jump to skip polling on the first iteration).

The poll-at-top vs. poll-at-bottom strategies were thoroughly discussed at #10039 (comment), and benchmark results were used to choose poll-at-bottom (#10039 (comment)). However, this was decided using instruction counts, which do not account for branch prediction, and the benchmarks did not test a version with an additional (well-predicted) `jmp` to skip polling before the first iteration.

@xavierleroy
Contributor

> However, the generated code for the loop seems suboptimal: it changes the branch prediction in the absence of predictor history, with the "back edge" predicted as not-taken rather than taken.

Really? In the example, the "back edge" is `ja L114`, which is predicted taken as it is a backward branch.

@gadmm
Contributor

gadmm commented Sep 28, 2022

What counts is that `je L113` predicts an early exit from the loop (hence my quotes around "back edge"). The jump following the loop condition is a backward jump without polling (or with polling-at-top), but a forward jump with polling-at-bottom.

@gadmm
Contributor

gadmm commented Sep 28, 2022

Sorry, I confused myself: the forward jump `je L113` is correctly predicted as not-taken, so both versions are equivalent in terms of branch prediction. Sorry for the noise.

@markghayden
Author

Good points. It is hard to detect much difference with the added polling instructions, at least on AMD64.
