Segfault on fac test with OCaml 4.12.0 #175

jmid · 2021-09-11T11:26:40Z

I'm having problems with test_fac_issue59 from test/core/QCheck_expect_test.ml and ... issue #59.
Stripping away the expect tests I'm down to this:

open QCheck

let rec fac n = match n with
  | 0 -> 1
  | n -> n * fac (n - 1)

let test_fac =
  Test.make
    (make (Gen.return 600_000))
    (fun n ->
       (*Printf.printf "%i\n%!" n;*)
       try
         0 = fac n
       with Stack_overflow   -> false
    )

let () = QCheck_base_runner.set_seed 1234
let _ = QCheck_base_runner.run_tests [test_fac]

I've also cut away dune to reduce the number of moving parts:

$  ocamlbuild -clean && ocamlbuild -pkg qcheck fac_test.native
Finished, 0 targets (0 cached) in 00:00:00.
Finished, 4 targets (0 cached) in 00:00:00.
$ ./fac_test.native 
random seed: 1234
Segmentation fault (core dumped)

Compiling with debug information and running under gdb pinpoints the fac function and can show a long stack trace of fac frames. I have not been able to get the debug-executable to print a stack trace with OCAMLRUNPARAM=b (or variants).

The segfault disappears and the test behaves as expected if I

comment in the Printf.printf or
comment out the exception handler.

I get this behaviour with OCaml 4.12.0 across QCheck versions 0.18, 0.17, 0.15, 0.9 (I just tried a selection).
I cannot reproduce it with OCaml 4.11.2 on either of these QCheck versions.
I don't experience the problem if I instead compile and run with the bytecode backend.

I'd be grateful if others could confirm this behaviour to help understand if OS and hardware play in.
The above could indicate an OCaml issue - but it could affect QCheck when we include and trigger the test in our CI.
This is on a Linux machine, kernel 5.4.0-81, hardware is a Thinkpad with a 64-bit, dual-core Intel i5 CPU.
ulimit -s reports a stack limit of 8192.

I sometimes experience the bug as somewhat flaky (caveat: may be coffee/sleep underflow on my part... 😄)
A variant (probably increasing the required stack height) just repeats the test twice:

let _ = QCheck_base_runner.run_tests [test_fac; test_fac]

I have sometimes experienced this variant to segfault when the first one stopped doing so.

(I'll tag @gasche as he has had his hands in both QCheck and the OCaml compiler...)

The text was updated successfully, but these errors were encountered:

gasche · 2021-09-12T20:59:44Z

One thing I know is that stack overflow detection is dirty business; you are trying to catch an OS-level failure and turn it back into an in-language exception that is catchable. Stackoverflow detection is known to be flaky on some platform; I don't know much about this part of the runtime, and in particular I don't know how stack-overflow detection is implemented on amd64. My vague recollection wa that it should be reliable, except for functions with an artificially-large stack size. But here you seem to observe flakiness (you manage to turn stack overflows back into segfaults some of the time), and this may be a 4.12 regression.

cc @stedolan ; stack overflow detection becoming flaky on amd64 in 4.12, would that by chance ring a bell?

stedolan · 2021-09-13T10:02:11Z

I can't think of any particular reason why stack overflow would be less reliable in 4.12. I can reproduce this, so I'm digging in a bit. So far it's looking like a bug in 4.12.

(Incidentally, on my machine only the run_tests [test_fac; test_fac] version seems to segfault)

jmid · 2021-09-13T13:50:28Z

Here is another variant also triggering a segfault, now over lists (this one doesn't use an explicit Stack_overflow handler:

open QCheck

let list_equal_dupl =
  Test.make
    (make
       (*~shrink:(fun x -> Printf.printf ".%!"; Shrink.list_spine x)*)
       ~shrink:Shrink.list_spine
       (Gen.list_size (Gen.return 650_000) (Gen.return 0)))
    (fun xs ->
       (*Printf.printf ".%!";*)
       xs = xs @ xs)
;;

Test.check_exn list_equal_dupl
(* or use: *)
(* QCheck_base_runner.run_tests ~verbose:true ~rand [list_equal_dupl] *)

The symptoms are the same: segfault on 4.12.0 regardless of QCheck version, not reproducable on 4.11.2,
Commenting in the .-printing in the property makes the segfault go away (instead it enters a terribly long shrink loop 😬 ).
Replacing the shrinker with the .-printing one indicates the segfault happens during shrinking.
Without a ~shrinker it also behaves as expected.

gasche · 2021-09-13T14:05:26Z

@stedolan wild ideas: stackoverflow handling uses signals, 4.12 has your EINTR-based signal work (ocaml#9722), that work changes stuff in input/output channel primitives, printing to stdout seems involved in the issue here.

stedolan · 2021-09-13T14:16:02Z

The bug seems to be that the runtime doesn't restore the GC state correctly when recovering from a stack overflow. In particular, young allocations made before the overflow was caught can sometimes be incorrectly overwritten by future allocations. There's a fix in ocaml/ocaml#10633

(I'm now in the "how did this ever work?!" phase of debugging: the dodgy code seems to be older than 4.12)

stedolan · 2021-09-13T14:52:47Z

Ah, it turns out this was broken in 4.11 as well, but this particular case didn't segfault. The testcase in ocaml/ocaml#10633 is broken even in 4.11, I haven't checked how far back the bug goes.

jmid · 2021-10-11T17:20:23Z

Closing as this was an OCaml issue (with a merged fix).

stedolan mentioned this issue Sep 13, 2021

Restore minor heap pointer after a Stack_overflow ocaml/ocaml#10633

Merged

jmid mentioned this issue Sep 13, 2021

Shrinker performance benchmark #177

Merged

jmid closed this as completed Oct 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segfault on fac test with OCaml 4.12.0 #175

Segfault on fac test with OCaml 4.12.0 #175

jmid commented Sep 11, 2021

gasche commented Sep 12, 2021

stedolan commented Sep 13, 2021

jmid commented Sep 13, 2021

gasche commented Sep 13, 2021

stedolan commented Sep 13, 2021

stedolan commented Sep 13, 2021

jmid commented Oct 11, 2021

Segfault on fac test with OCaml 4.12.0 #175

Segfault on fac test with OCaml 4.12.0 #175

Comments

jmid commented Sep 11, 2021

gasche commented Sep 12, 2021

stedolan commented Sep 13, 2021

jmid commented Sep 13, 2021

gasche commented Sep 13, 2021

stedolan commented Sep 13, 2021

stedolan commented Sep 13, 2021

jmid commented Oct 11, 2021