[ocaml5-issue] Windows trunk bytecode domain_spawntree crash or deadlock #354

Open
jmid opened this issue Jun 2, 2023 · 7 comments
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

Collaborator

jmid commented Jun 2, 2023

A Windows trunk bytecode crash surfaced today in src/domain/domain_spawntree.ml:
https://github.com/ocaml-multicore/multicoretests/actions/runs/5154525696/jobs/9283085877

random seed: 502502158
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)
@jmid jmid added the ocaml5-issue A potential issue in the OCaml5 compiler/runtime label Jun 2, 2023
Collaborator Author

jmid commented Jun 15, 2023

Found another occurrence of this, this time showing up as a livelock or deadlock:
https://github.com/ocaml-multicore/multicoretests/actions/runs/5242663626/jobs/9466351902

random seed: 320533040
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)Terminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

Collaborator Author

jmid commented Aug 14, 2023

Observed another variant of this on the Mingw Windows 5.0.0 workflow
https://github.com/ocaml-multicore/multicoretests/actions/runs/5565429087/job/15072781545

random seed: 221745155
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
Fatal error: no domain lock held
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code 3.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)

Collaborator Author

jmid commented Sep 7, 2023

Crash seen again on Mingw bytecode trunk:
https://github.com/ocaml-multicore/multicoretests/actions/runs/6093487146/job/16533243147

random seed: 373304996
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)

Collaborator Author

jmid commented Sep 13, 2023

Just saw this as a deadlock on Mingw 5.1.0~rc3 (native, not bytecode):
https://github.com/ocaml-multicore/multicoretests/actions/runs/6160240834/job/16716723779?pr=395

random seed: 238601704
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)
[ ]   13    0    0   13 /  100    91.4s domain_spawntree - with Atomic
[ ]   21    0    0   21 /  100   199.6s domain_spawntree - with AtomicTerminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

jmid changed the title from "[ocaml5-issue] Windows trunk bytecode domain_spawntree crash" to "[ocaml5-issue] Windows trunk bytecode domain_spawntree crash or deadlock" on Oct 11, 2023
Collaborator

shym commented Dec 11, 2023

Observed on an MSVC-restoring branch (so on current trunk):
https://github.com/shym/multicoretests/actions/runs/7169794449/job/19520999732#step:17:92

random seed: 529644456
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
Fatal error: Failed to create domain
Fatal error: File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073740791.
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)

Collaborator

shym commented Dec 19, 2023

Error -1073740791 seems to happen very consistently on the MSVC port, the latest instance being:

random seed: 405994358
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)
Fatal error: Failed to create domain
File "src/domain/dune", line 14, characters 7-23:
14 |  (name domain_spawntree)
            ^^^^^^^^^^^^^^^^
(cd _build/default/src/domain && ./domain_spawntree.exe --verbose)
Command exited with code -1073740791.

but also with seeds 437567822, 428257872,...

According to MS documentation, -1073740791 (aka 0xc0000409) is:

STATUS_STACK_BUFFER_OVERRUN: The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of this application.

and -1073741819 (aka 0xc0000005) is:

STATUS_ACCESS_VIOLATION: The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

Those sound like two flavors of segfault.
Could the difference in error codes shed any light on the cause or, on the contrary, suggest that these are separate issues?
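
To double-check the mapping between the signed exit codes printed by dune and the hexadecimal NTSTATUS values quoted from the MS documentation, here is a minimal OCaml sketch. The function names `ntstatus_of_exit_code` and `describe` are hypothetical helpers introduced for illustration, and the code assumes a 64-bit OCaml, where `int` is wide enough to hold an unsigned 32-bit value. It only knows the two status codes mentioned in this thread.

```ocaml
(* Reinterpret a signed 32-bit exit code as the unsigned NTSTATUS value.
   Assumption: 64-bit OCaml, so the mask 0xFFFF_FFFF fits in an int. *)
let ntstatus_of_exit_code (code : int) : int = code land 0xFFFF_FFFF

(* Name the two status codes discussed in this issue; anything else is
   reported as unknown rather than guessed. *)
let describe (code : int) : string =
  match ntstatus_of_exit_code code with
  | 0xC0000005 -> "STATUS_ACCESS_VIOLATION"
  | 0xC0000409 -> "STATUS_STACK_BUFFER_OVERRUN"
  | _ -> "unknown"

let () =
  List.iter
    (fun code ->
      Printf.printf "%d = 0x%08X (%s)\n" code
        (ntstatus_of_exit_code code) (describe code))
    [ -1073741819; -1073740791 ]
```

Run as a script, this should print `-1073741819 = 0xC0000005 (STATUS_ACCESS_VIOLATION)` followed by `-1073740791 = 0xC0000409 (STATUS_STACK_BUFFER_OVERRUN)`, matching the two codes seen in the CI logs.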

Collaborator

shym commented Dec 21, 2023

Debugging this further, it seems that the 0xC0000409 errors I saw on the MSVC port were caused by the abort tracked in #428. So these would indeed be two different things.

@jmid jmid mentioned this issue Mar 5, 2024