Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault on fac test with OCaml 4.12.0 #175

Closed
jmid opened this issue Sep 11, 2021 · 7 comments
Closed

Segfault on fac test with OCaml 4.12.0 #175

jmid opened this issue Sep 11, 2021 · 7 comments

Comments

@jmid
Copy link
Collaborator

jmid commented Sep 11, 2021

I'm having problems with test_fac_issue59 from test/core/QCheck_expect_test.ml and ... issue #59.
Stripping away the expect tests I'm down to this:

open QCheck

let rec fac n = match n with
  | 0 -> 1
  | n -> n * fac (n - 1)

let test_fac =
  Test.make
    (make (Gen.return 600_000))
    (fun n ->
       (*Printf.printf "%i\n%!" n;*)
       try
         0 = fac n
       with Stack_overflow   -> false
    )

let () = QCheck_base_runner.set_seed 1234
let _ = QCheck_base_runner.run_tests [test_fac]

I've also cut away dune to reduce the number of moving parts:

$  ocamlbuild -clean && ocamlbuild -pkg qcheck fac_test.native
Finished, 0 targets (0 cached) in 00:00:00.
Finished, 4 targets (0 cached) in 00:00:00.
$ ./fac_test.native 
random seed: 1234
Segmentation fault (core dumped)

Compiling with debug information and running under gdb pinpoints the fac function and can show a long stack trace of fac frames. I have not been able to get the debug-executable to print a stack trace with OCAMLRUNPARAM=b (or variants).

The segfault disappears and the test behaves as expected if I

  • comment in the Printf.printf or
  • comment out the exception handler.

I get this behaviour with OCaml 4.12.0 across QCheck versions 0.18, 0.17, 0.15, 0.9 (I just tried a selection).
I cannot reproduce it with OCaml 4.11.2 on either of these QCheck versions.
I don't experience the problem if I instead compile and run with the bytecode backend.

I'd be grateful if others could confirm this behaviour to help understand if OS and hardware play in.
The above could indicate an OCaml issue - but it could affect QCheck when we include and trigger the test in our CI.
This is on a Linux machine, kernel 5.4.0-81, hardware is a Thinkpad with a 64-bit, dual-core Intel i5 CPU.
ulimit -s reports a stack limit of 8192.

I sometimes experience the bug as somewhat flaky (caveat: may be coffee/sleep underflow on my part... 😄)
A variant (probably increasing the required stack height) just repeats the test twice:

let _ = QCheck_base_runner.run_tests [test_fac; test_fac]

I have sometimes experienced this variant to segfault when the first one stopped doing so.

(I'll tag @gasche as he has had his hands in both QCheck and the OCaml compiler...)

@gasche
Copy link
Contributor

gasche commented Sep 12, 2021

One thing I know is that stack overflow detection is dirty business; you are trying to catch an OS-level failure and turn it back into an in-language exception that is catchable. Stackoverflow detection is known to be flaky on some platform; I don't know much about this part of the runtime, and in particular I don't know how stack-overflow detection is implemented on amd64. My vague recollection wa that it should be reliable, except for functions with an artificially-large stack size. But here you seem to observe flakiness (you manage to turn stack overflows back into segfaults some of the time), and this may be a 4.12 regression.

cc @stedolan ; stack overflow detection becoming flaky on amd64 in 4.12, would that by chance ring a bell?

@stedolan
Copy link

I can't think of any particular reason why stack overflow would be less reliable in 4.12. I can reproduce this, so I'm digging in a bit. So far it's looking like a bug in 4.12.

(Incidentally, on my machine only the run_tests [test_fac; test_fac] version seems to segfault)

@jmid
Copy link
Collaborator Author

jmid commented Sep 13, 2021

Here is another variant also triggering a segfault, now over lists (this one doesn't use an explicit Stack_overflow handler:

open QCheck

let list_equal_dupl =
  Test.make
    (make
       (*~shrink:(fun x -> Printf.printf ".%!"; Shrink.list_spine x)*)
       ~shrink:Shrink.list_spine
       (Gen.list_size (Gen.return 650_000) (Gen.return 0)))
    (fun xs ->
       (*Printf.printf ".%!";*)
       xs = xs @ xs)
;;

Test.check_exn list_equal_dupl
(* or use: *)
(* QCheck_base_runner.run_tests ~verbose:true ~rand [list_equal_dupl] *)

The symptoms are the same: segfault on 4.12.0 regardless of QCheck version, not reproducable on 4.11.2,
Commenting in the .-printing in the property makes the segfault go away (instead it enters a terribly long shrink loop 😬 ).
Replacing the shrinker with the .-printing one indicates the segfault happens during shrinking.
Without a ~shrinker it also behaves as expected.

@gasche
Copy link
Contributor

gasche commented Sep 13, 2021

@stedolan wild ideas: stackoverflow handling uses signals, 4.12 has your EINTR-based signal work (ocaml#9722), that work changes stuff in input/output channel primitives, printing to stdout seems involved in the issue here.

@stedolan
Copy link

The bug seems to be that the runtime doesn't restore the GC state correctly when recovering from a stack overflow. In particular, young allocations made before the overflow was caught can sometimes be incorrectly overwritten by future allocations. There's a fix in ocaml/ocaml#10633

(I'm now in the "how did this ever work?!" phase of debugging: the dodgy code seems to be older than 4.12)

@stedolan
Copy link

Ah, it turns out this was broken in 4.11 as well, but this particular case didn't segfault. The testcase in ocaml/ocaml#10633 is broken even in 4.11, I haven't checked how far back the bug goes.

@jmid
Copy link
Collaborator Author

jmid commented Oct 11, 2021

Closing as this was an OCaml issue (with a merged fix).

@jmid jmid closed this as completed Oct 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants