Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ocaml5-issue] Segfault in STM Domain.DLS test sequential on 32-bit trunk #446

Open
jmid opened this issue Mar 26, 2024 · 18 comments
Open
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

@jmid
Copy link
Collaborator

jmid commented Mar 26, 2024

In the CI-run for #445 on 32-bit trunk the STM Domain.DLS test sequential triggered a segfault
https://github.com/ocaml-multicore/multicoretests/actions/runs/8436771284/job/23104952265?pr=445

random seed: 107236932
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
File "src/domain/dune", line 31, characters 7-20:
31 |  (name stm_tests_dls)
            ^^^^^^^^^^^^^
(cd _build/default/src/domain && ./stm_tests_dls.exe --verbose)
Command got signal SEGV.
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)

This may be another case of a 32-bit/bytecode issue showing up in a couple of different tests:

Surprisingly this case however triggered in a sequential (single-domain) test! 😮

@gasche
Copy link

gasche commented Mar 30, 2024

I wonder if it is related to #12889, the only recent change to Domain.DLS that I can think of. (I hope not!)

@jmid
Copy link
Collaborator Author

jmid commented May 21, 2024

This just triggered again on 32-bit 5.3.0+trunk by the merge to main of #460:
https://github.com/ocaml-multicore/multicoretests/actions/runs/9169655398/job/25210472949

random seed: 103830913
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
File "src/domain/dune", line 31, characters 7-20:
31 |  (name stm_tests_dls)
            ^^^^^^^^^^^^^
(cd _build/default/src/domain && ./stm_tests_dls.exe --verbose)
Command got signal SEGV.
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)

@gasche
Copy link

gasche commented May 21, 2024

Is there more information that we can use to try to investigate this? "There is a segfault somewhere in Domain.DLS on 32bit" is not that much.

@jmid
Copy link
Collaborator Author

jmid commented May 21, 2024

First off, this is a collection of failures we observe.
Once we have fleshed out reproducible steps, these are reported upstream.
Help is very welcome, snarky remarks less so.

"There is a segfault somewhere in Domain.DLS on 32bit" is not that much.

Come on, there are QCheck seeds that caused the failures, GA workflows listing the steps taken, and links to 2 CI run logs, with full information about versions.

Run opam exec -- ocamlc -config
  opam exec -- ocamlc -config
  opam config list
  opam exec -- dune printenv
  opam list --columns=name,installed-version,repository,synopsis-or-target
  opam clean --all-switches --unused-repositories --logs --download-cache --repo-cache
  shell: /usr/bin/bash -e {0}
  env:
    QCHECK_MSG_INTERVAL: 60
    DUNE_PROFILE: dev
    OCAMLRUNPARAM: 
    DUNE_CI_ALIAS: runtest
    COMPILER: ocaml-variants.5.3.0+trunk,ocaml-option-32bit
    OCAML_COMPILER_GIT_REF: refs/heads/trunk
    CUSTOM_COMPILER_VERSION: 
    CUSTOM_COMPILER_SRC: 
    CUSTOM_OCAML_PKG_VERSION: 
    OPAMCLI: 2.0
    OPAMCOLOR: always
    OPAMERRLOGLEN: 0
    OPAMJOBS: 4
    OPAMPRECISETRACKING: 1
    OPAMSOLVERTIMEOUT: 1000
    OPAMYES: 1
    DUNE_CACHE: enabled
    DUNE_CACHE_TRANSPORT: direct
    DUNE_CACHE_STORAGE_MODE: copy
    CLICOLOR_FORCE: 1
version: 5.3.0+dev0-2023-12-22
standard_library_default: /home/runner/work/multicoretests/multicoretests/_opam/lib/ocaml
standard_library: /home/runner/work/multicoretests/multicoretests/_opam/lib/ocaml
ccomp_type: cc
c_compiler: gcc -m32
ocamlc_cflags:  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread
ocamlc_cppflags:  -D_FILE_OFFSET_BITS=64 
ocamlopt_cflags:  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread
ocamlopt_cppflags:  -D_FILE_OFFSET_BITS=64 
bytecomp_c_compiler: gcc -m32  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 
native_c_compiler: gcc -m32  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 
bytecomp_c_libraries: -lzstd  -latomic -lm  -lpthread
native_c_libraries:  -latomic -lm  -lpthread
native_ldflags: 
native_pack_linker: ld -r -o 
native_compiler: false
architecture: i386
model: default
int_size: 31
word_size: 32
system: linux
asm: i386-linux-as
asm_cfi_supported: false
with_frame_pointers: false
ext_exe: 
ext_obj: .o
ext_asm: .s
ext_lib: .a
ext_dll: .so
os_type: Unix
default_executable_name: a.out
systhread_supported: true
host: i386-pc-linux-gnu
target: i386-pc-linux-gnu
flambda: false
safe_string: true
default_safe_string: true
flat_float_array: true
function_sections: false
afl_instrument: false
tsan: false
windows_unicode: false
supports_shared_libraries: true
native_dynlink: false
naked_pointers: false
exec_magic_number: Caml1999X035
cmi_magic_number: Caml1999I035
cmo_magic_number: Caml1999O035
cma_magic_number: Caml1999A035
cmx_magic_number: Caml1999Y035
cmxa_magic_number: Caml1999Z035
ast_impl_magic_number: Caml1999M035
ast_intf_magic_number: Caml1999N035
cmxs_magic_number: Caml1999D035
cmt_magic_number: Caml1999T035
linear_magic_number: Caml1999L035

<><> Global opam variables ><><><><><><><><><><><><><><><><><><><><><><><><><><>
arch              x86_64                                          # Inferred from system
exe                                                               # Suffix needed for executable filenames (Windows)
jobs              4                                               # The number of parallel jobs set up in opam configuration
make              make                                            # The 'make' command to use
opam-version      2.1.6                                           # The currently running opam version
os                linux                                           # Inferred from system
os-distribution   ubuntu                                          # Inferred from system
os-family         debian                                          # Inferred from system
os-version        22.04                                           # Inferred from system
root              /home/runner/.opam                              # The current opam root directory
switch            /home/runner/work/multicoretests/multicoretests # The identifier of the current switch
sys-ocaml-arch                                                    # Target architecture of the OCaml compiler present on your system
sys-ocaml-cc                                                      # Host C Compiler type of the OCaml compiler present on your system
sys-ocaml-libc                                                    # Host C Runtime Library type of the OCaml compiler present on your system
sys-ocaml-version                                                 # OCaml version present on your system independently of opam, if any

<><> Configuration variables from the current switch ><><><><><><><><><><><><><>
prefix   /home/runner/work/multicoretests/multicoretests/_opam
lib      /home/runner/work/multicoretests/multicoretests/_opam/lib
bin      /home/runner/work/multicoretests/multicoretests/_opam/bin
sbin     /home/runner/work/multicoretests/multicoretests/_opam/sbin
share    /home/runner/work/multicoretests/multicoretests/_opam/share
doc      /home/runner/work/multicoretests/multicoretests/_opam/doc
etc      /home/runner/work/multicoretests/multicoretests/_opam/etc
man      /home/runner/work/multicoretests/multicoretests/_opam/man
toplevel /home/runner/work/multicoretests/multicoretests/_opam/lib/toplevel
stublibs /home/runner/work/multicoretests/multicoretests/_opam/lib/stublibs
user     runner
group    docker

<><> Package variables ('opam var --package PKG' to show) <><><><><><><><><><><>
PKG:name       # Name of the package
PKG:version    # Version of the package
PKG:depends    # Resolved direct dependencies of the package
PKG:installed  # Whether the package is installed
PKG:enable     # Takes the value "enable" or "disable" depending on whether the package is installed
PKG:pinned     # Whether the package is pinned
PKG:bin        # Binary directory for this package
PKG:sbin       # System binary directory for this package
PKG:lib        # Library directory for this package
PKG:man        # Man directory for this package
PKG:doc        # Doc directory for this package
PKG:share      # Share directory for this package
PKG:etc        # Etc directory for this package
PKG:build      # Directory where the package was built
PKG:hash       # Hash of the package archive
PKG:dev        # True if this is a development package
PKG:build-id   # A hash identifying the precise package version with all its dependencies
PKG:opamfile   # Path of the curent opam file
(flags
 (-w
  @1..3@5..28@30..39@43@46..47@49..57@61..62-40
  -strict-sequence
  -strict-formats
  -short-paths
  -keep-locs))
(ocamlc_flags (-g))
(ocamlopt_flags (-g))
(melange.compile_flags (-g))
(c_flags
 (-m32
  -O2
  -fno-strict-aliasing
  -fwrapv
  -pthread
  -fPIC
  -pthread
  -m32
  -D_FILE_OFFSET_BITS=64
  -fdiagnostics-color=always))
(cxx_flags
 (-x
  c++
  -m32
  -O2
  -fno-strict-aliasing
  -fwrapv
  -pthread
  -fPIC
  -pthread
  -fdiagnostics-color=always))
(link_flags ())
(menhir_flags ())
(menhir_explain ())
(coq_flags (-q))
(coqdoc_flags (--toc))
(js_of_ocaml_flags
 (--pretty --source-map-inline))
(js_of_ocaml_build_runtime_flags
 (--pretty --source-map-inline))
(js_of_ocaml_link_flags (--source-map-inline))
# Packages matching: installed
# Name                     # Installed # Repository # Synopsis
base-bigarray              base        default
base-domains               base        default
base-nnp                   base        default      Naked pointers prohibited in the OCaml heap
base-threads               base        default
base-unix                  base        default
dune                       3.15.2      default      Fast, portable, and opinionated build system
ocaml                      5.3.0       default      The OCaml compiler (virtual package)
ocaml-config               3           default      OCaml Switch Configuration
ocaml-option-32bit         1           default      Set OCaml to be compiled in 32-bit mode for 64-bit Linux and OS X hosts
ocaml-option-bytecode-only 1           default      Compile OCaml without the native-code compiler
ocaml-variants             5.3.0+trunk default      Current trunk
qcheck-core                0.21.3      default      Core qcheck library

@gasche
Copy link

gasche commented May 21, 2024

No snark intended, I genuinely wonder how you work with these failures. For example I'm not sure if it is reasonably easy to extract a backtrace, and/or to observe the same failure within the debug runtime. (Is this segfault due to a memory corruption, or an assert failure?)

If you prefer to work on this without upstream looking over your shoulder for now, I am happy to let you work your magic and wait for easier reproduction instructions.

@jmid
Copy link
Collaborator Author

jmid commented May 21, 2024

OK, fair enough.
Some of these remaining ones are just hard to reproduce - I suspect because they are timing or signal related.

I've been trying today for this one: https://github.com/ocaml-multicore/multicoretests/actions?query=branch%3Adomain-dls-32-bit-focus

  • for one, we are using focused tests on the CI, that lets us iterate a single test, rather than a full test suite run
  • I failed to reproduce the sequential failure, then started thinking that the crash could be caused by a parallel one, if they just happen to crash quickly before either one of them prints anything
  • Rerunning both tests didn't help reproducing
  • I then tried hard-coding the last known problematic seed
  • Lastly, I've tried adding a stress-test-mode for STM similar to Add Lin_domain.stress_test #443, so far without luck

@jmid
Copy link
Collaborator Author

jmid commented May 21, 2024

I finally managed to reproduce this one - on 5.2.0 - and only once for now. It is indeed a sequential fault! 😮
https://github.com/ocaml-multicore/multicoretests/actions/runs/9180414541/job/25244781126#step:18:762

Starting 74-th run

random seed: 103830913
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
/usr/bin/bash: line 1: 197189 Segmentation fault      (core dumped) ./focusedtest.exe -v -s 103830913
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)

@jmid
Copy link
Collaborator Author

jmid commented May 21, 2024

Switching hard-coded seed to 107236932 (the first one) works much better!
Across 500 repetitions this triggered 56 segfaults on 5.2.0
https://github.com/ocaml-multicore/multicoretests/actions/runs/9181424522/job/25248130724
and 50 segfaults on 5.3.0+trunk
https://github.com/ocaml-multicore/multicoretests/actions/runs/9181424534/job/25248130821

@jmid
Copy link
Collaborator Author

jmid commented Jun 3, 2024

I've made a bit of progress on this.

  • The issue still triggers on 5.2.0 and a fresh trunk as of this morning on the CI
  • I'm still unable to reproduce locally
  • I tried running under the debug runtime in the CI to see if an assertion would catch the issue before causing a crash, and indeed it does - 6-8 times out of 200:
    random seed: 107236932
    generated error fail pass / total     time test name
    
    [ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
    [00] file runtime/shared_heap.c; line 787 ### Assertion failed: Has_status_val(v, caml_global_heap_state.UNMARKED)
    /usr/bin/bash: line 1: 394730 Aborted                 (core dumped) ./focusedtest.exe -v -s 107236932
    [ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
    

The assertion in question is this one:
https://github.com/ocaml/ocaml/blob/23d896786adb694d39785bd7770c537a6d8c6fe6/runtime/shared_heap.c#L787
used to verify the heap at the end of a STW.

@jmid
Copy link
Collaborator Author

jmid commented Jun 3, 2024

I've tried testing previous versions (5.0.0, 5.1.0, 5.1.1) too - both with a hard-coded seed 107236932 and with random ones. Result:

  • 5.0.0 - not able to trigger
  • 5.1.0 - not able to trigger
  • 5.1.1 - not able to trigger
  • 5.2.0 - triggering in 7-12 cases out of 200
  • trunk - triggering in 9-12 cases out of 200

@gasche
Copy link

gasche commented Jun 3, 2024

How easy/hard would it be for you to run the testsuite on an arbitrary patch, if I send you changes to the runtime that might be related to the crash?

gasche added a commit to gasche/ocaml that referenced this issue Jun 3, 2024
The call was placed here for thread-safety, to work correctly if
another thread (on the same domain) resizes the DLS array under our
feet. It is also the only notable change in terms of access to
`Caml_state->dls_root` since 5.1, so it might be involved in the
mysterious crash being investigated at

  ocaml-multicore/multicoretests#446
@gasche
Copy link

gasche commented Jun 3, 2024

Here is a proposed small patch for example: https://github.com/gasche/ocaml/commits/mysterious-dls-crash-1/

(Another obvious idea would be to revert #12889 and see whether you can still reproduce the crash. I don't see any other relevant change in the runtime, but I also read this change carefully several times without noticing anything that could explain the current observations.)

@jmid
Copy link
Collaborator Author

jmid commented Jun 3, 2024

How easy/hard would it be for you to run the testsuite on an arbitrary patch, if I send you changes to the runtime that might be related to the crash?

I should be able to do that for a feature branch like the proposed one 👍
Thanks for looking into this - I'll keep you posted.

@gasche
Copy link

gasche commented Jun 3, 2024

Thanks! This change is really a blind move, so it is unlikely to work. I think the reasonable next step is to revert #12889. Let me know if you need help doing that -- it should revert cleanly from 5.2 in particular, but I haven't actually tried.

@jmid
Copy link
Collaborator Author

jmid commented Jun 3, 2024

No cigar unfortunately:
https://github.com/ocaml-multicore/multicoretests/tree/domain-dls-32-bit-gabriel
https://github.com/ocaml-multicore/multicoretests/actions/runs/9352909686/job/25742143625

On that one I also saw this (even rarer) non-crashing misbehaviour:

Starting 89-th run

random seed: 135228812
generated error fail pass / total     time test name
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)
[✗]  136    1    0  135 / 1000     0.3s STM Domain.DLS test sequential
=== Error ======================================================================
Test STM Domain.DLS test sequential errored on (21 shrink steps):
   Get (-81244680)
exception Invalid_argument("List.nth")
================================================================================
failure (0 tests failed, 1 tests errored, ran 1 tests)

Despite being generated as a number between 0 and length=4 (and not performing any shrinking)

let index = Gen.int_bound (length-1) in

the Get-constructor's argument somehow ends up being -81244680...

That signals some form of heap corruption - like the assertion failure indicates. What makes you suspect #12889?

@gasche
Copy link

gasche commented Jun 3, 2024

My reasoning is that your test exercises the DLS primitives, and started failing in 5.2 and no older realease. #12889 is the only substantial change to the implementation of DLS that happened between 5.1 and 5.2, and it touches an unsafe part of the language (a mix of C runtime code and Obj.magic on the OCaml side). This could, of course, be an entirely unrelated issue, but then I wonder why it would only fail on this test precisely -- maybe the sheer luck of picking a favorable seed?

@jmid
Copy link
Collaborator Author

jmid commented Jun 4, 2024

This could, of course, be an entirely unrelated issue, but then I wonder why it would only fail on this test precisely -- maybe the sheer luck of picking a favorable seed?

With the debug-runtime strategy to trigger this, in the CI I'm now repeating 200 times a QCheck-test with count=1000 - with no hard-coded seeds. That makes for 200.000 arbitrary tests and gives a pretty clear signal.

@jmid
Copy link
Collaborator Author

jmid commented Jun 4, 2024

I've now completed a round of git bisect CI golf, and the finger points at:

  • bdd8d96 - Make the GC compact again (#12193)

with the latest run available here:
https://github.com/ocaml-multicore/multicoretests/actions/runs/9368325592/job/25790072295
Highlights:

  • 4 assertion failures file runtime/shared_heap.c; line 778 ### Assertion failed: Has_status_val(v, caml_global_heap_state.UNMARKED)
  • 1 clean segfault
  • 1 memory(?) corruption causing Get 78

Here's the log score-card from the golf round:

  • f87ea0b - first commit on branch 5.2 - BAD
  • cbc5a0c - Scan runtime/*.c for primitives less often (#12753) - BAD
  • df663ca - github-linguist: use M4Sugar for autoconf macros - BAD
  • 6601f1b - Merge pull request #12714 from MisterDA/cc-Wundef - BAD
  • bdd8d96 - Make the GC compact again (#12193) - BAD
  • 44d2127 - Warn on use of undefined macros - GOOD
  • 3d76183 - Guard against use of undefined macros in headers - GOOD
  • c6eec0c - Improve the documentation of {String,Byte}.split_on_char (#12717) - GOOD
  • d77bc97 - minor code cleanup - GOOD
  • 8cc728b - Merge pull request #12707 from jmid/domain-assert-failure - GOOD
  • ebae0d1 - Merge pull request #12461 from panglesd/parsetree-change-in-ci - GOOD
  • 560c444 - Merge pull request #12560 from OlivierNicole/improve_tsan_comments - GOOD
  • 6d0c551 - first commit after branching 5.1 - GOOD

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime
Projects
None yet
Development

No branches or pull requests

2 participants