New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a -timeout
option to ocamltest
#10113
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The feature looks very sensible, and the implementation is simple given that there is already all the plumbing in ocamltest
to support timeouts.
I'm not fond of the use of global mutable state for the default_timeout
variable: it sticks out in a codebase that was careful to avoid it so far. My understanding is that ocamltest
has two different mechanism for parametrization:
- if a parameter is set at invocation-time, it is exposed as a constant in the
Options
module - if a parameter is meant to be overriden by test files, it uses the "variables" machinery (
Variables
,Environment
)
My understanding is that you don't want to follow the longer route of (2); this is fine, but I think that (1) would be nicer (closer to the existing code) than the current patch.
DWORD elapsed = cur - stamp; | ||
timeout = (timeout > elapsed ? timeout - elapsed : 0); | ||
stamp = cur; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am assuming that this is a refactoring. I'm not sure because I am not brave enough this afternoon to read the previous version carefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the commit message says, it's a bug fix. The previous code would set timeout
to 0 at the first iteration of the loop, causing all commands to timeout almost immediately. I didn't try to understand what was going wrong in the old code, just rewrote it to do what I think it should do, and voila, timeout is working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the propensity of our CI systems to witness random events, the original code was attempting to deal with GetTickCount
wrapping every other month! I wrote that a few years ago when we still cared about Windows XP - we could just switch to GetTickCount64
now.
I think timeout
only becomes zero on the first iteration in the old code if cur == stamp
... isn't the error just that the first deleted line should be cur >= stamp
?). I think in your proposed version when GetTickCount
wraps the timeout becomes effectively INFINITE
because stamp
never updates?
I did follow (1) initially, but it causes dependency issues. Currently, the Options module is linked late, after the "core" and "ocaml_plugin" subsets of the source files. To get (1) working, Options would need to be inserted somewhere within the "core" subset, and it didn't feel right to me. A global reference used to set a global default is not evil. Concerning (2), I'm lost in the Variables machinery. It seems that only strings are supported as values of variables (here we need an integer), and setting the default value of a variable via a command-line flag seems at least as hard as (1). |
407132a
to
35aff0c
Compare
I pushed a commit to your branch that does (2). It is more code, and I would not argue that it improves the implementation of this feature, but I think that consistency with the rest of the codebase is important. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I respect your concern for consistency in the code base. Let's wait for @shindere 's review.
One suggestion for simplification below.
ocamltest/environments.ml
Outdated
begin match int_of_string value with | ||
| exception _ -> None | ||
| n -> Some n | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since 4.05 we have int_of_string_opt
in the standard library...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And Option.bind
since 4.08 too!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I amended my patch (still in Xavier's branch) to use int_of_string_opt
as suggested, but not Option.bind
. With the explicit pattern-match the code looks similar to the other functions around and, for example, it is clear that the | None -> ...
case corresponds to what happens if the variable is not found. The Option.bind
rewrite is more compact but makes code reasoning less obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good change! The part in run_win32.c comes from #1739 - I'm not sure it's the right fix, but equally, I've been staring at Makefile
s for much of today, so am a bit cross-eyed...
I don't think having a per-test timeout would be a good feature, so thank you for not implementing it! The testsuite occasionally gets run on some slow platforms (e.g. esp. in qemu) where you don't "mind" that it's slow, so it's much easier to be able to set a global timeout and know that no test messes around with it.
In theory, this could have been combined with the recently deceased MAX_TESTSUITE_RETRIES
(#10068) - have a much lower timeout the first time around and then just re-run the tests which failed with a higher one, but #10071 has put me off ever having automatic re-runs again!
ocamltest/actions_helpers.ml
Outdated
match timeout with | ||
| Some timeout -> timeout | ||
| None -> | ||
begin match | ||
Environments.lookup_as_int Builtin_variables.timeout env | ||
with | ||
| Some timeout -> timeout | ||
| None -> 0 | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
match timeout with | |
| Some timeout -> timeout | |
| None -> | |
begin match | |
Environments.lookup_as_int Builtin_variables.timeout env | |
with | |
| Some timeout -> timeout | |
| None -> 0 | |
end | |
let timeout = | |
if timeout = None then | |
Environments.lookup_as_int Builtin_variables.timeout env | |
else | |
timeout | |
in | |
Option.value timeout ~default:0 |
I don't think this would ever be called without an environment based on the initial environment, so it could be Option.get
at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used Option.value
as suggested, but kept a layer of match timeout with
; if we had a version of Option.value
that would not force the "default" case, I would use it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume you mean force in the lazy sense rather than the writing sense (i.e. not Option.get
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, "forcing" is not the best word here, but value : 'a option -> default:'a
will evaluate the "default" parameter even when it is not necessary, which is unpleasant for larger or potentially-failing computations. (Here the code is not performance-sensitive and the computation happens to be failure-free, but those are not obvious properties and I would rather write simple code than rely on them.)
timeout = timeout; | ||
log = log; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timeout = timeout; | |
log = log; | |
timeout; | |
log; |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meh. Apparently this code was written before record punning was popular, and I don't think that using it only some of the definitions of the record would improve the codebase. (But feel free to help review existing ocamltest refactoring PRs (#9614) or provide your own!)
ocamltest/environments.ml
Outdated
begin match int_of_string value with | ||
| exception _ -> None | ||
| n -> Some n | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And Option.bind
since 4.08 too!
DWORD elapsed = cur - stamp; | ||
timeout = (timeout > elapsed ? timeout - elapsed : 0); | ||
stamp = cur; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the propensity of our CI systems to witness random events, the original code was attempting to deal with GetTickCount
wrapping every other month! I wrote that a few years ago when we still cared about Windows XP - we could just switch to GetTickCount64
now.
I think timeout
only becomes zero on the first iteration in the old code if cur == stamp
... isn't the error just that the first deleted line should be cur >= stamp
?). I think in your proposed version when GetTickCount
wraps the timeout becomes effectively INFINITE
because stamp
never updates?
I will push a commit that uses GetTickCount64 soon.
No. |
8eaa673
to
b460e16
Compare
Many thanks for all the hard work done in this PR, both by the author
and the reviewers.
The feature looks of course sensible to me. I also very much like the
use of the environment that has been proposed by @gasche. That's what
@damiendoligez and I had in mind when ocamltest was originally designed,
with the idea to be able to set timeouts on a per-test basis, although
it does indeed not make much sense, as was noted in the discussion.
My only suggestion is to squash the first and fifth commits. It's not
only nitpicking. Squashing the two commits will make it un-necessary to
touch the `.depend` file, I think.
Perhaps the commit message could be improved by making its title more
precise (mention that the option is added to ocamltest). "in second"
could also become "in seconds" (plural) as in the documentation string.
And finally, regarding the fourth commit, how about also including the
explanations about `_WIN32_WINNT` and the meaning of the comparison in
terms of Windows versions also in the code?
|
b460e16
to
63ee9f4
Compare
This option sets a timeout (in seconds) that is applied to every external command run from ocamltest. It is useful to catch test that don't terminate and report them as failing. This is implemented by an ocamltest variable called "timeout" whose initial value is that given by the -timeout option, or 0 (no timeout) otherwise. Co-authored-by: Gabriel Scherer <gabriel.scherer@gmail.com>
The original code caused the timeout duration to drop to 0 at some of the spurious wake-ups, hence the command would time out early. The new code uses GetTickCount64() instead of GetTickCount() to avoid issues with GetTickCount() overflowing and wrapping around. (As suggested by David Allsopp.)
By default, we allow a generous 10 minutes (600 seconds). Can be overriden with `make TIMEOUT=...`.
bfddbc9
to
a00308f
Compare
Thank you @shindere for the review. I squashed and cleaned up as suggested, then ran CI precheck to check the Windows changes again. All 6 Windows ports are happy. I'll wait for the other CI and merge when they are happy. |
Xavier Leroy (2021/01/05 07:01 -0800):
Thank you @shindere for the review. I squashed and cleaned up as
suggested, then ran CI precheck to check the Windows changes again.
All 6 Windows ports are happy. I'll wait for the other CI and merge
when they are happy.
Thank you Xavier for the nice work and for having taken the comments
into account!
|
CI is green everywhere, merging! |
This PR adds a
-timeout
N option toocamltest
, whose effect is to abort any external command that has not completed after N seconds, and report the corresponding test action as "failed".It then uses this option to enforce a timeout of 10 minutes (per external command) when running the tests in
testsuite/
.To accelerate the discussion, the remainder of this PR is in Q&A style.
What's the motivation?
Some of our regression tests involving threads have been known to deadlock in the past, and might still deadlock in the future, e.g. after updates to the C standard library that changes the corner-case behavior of some POSIX threads operations.
Deadlocks are obvious when running the test suite interactively, but wreck havoc on automated CI.
What about putting a timeout on the CI runs themselves?
That's what we do for the Inria CI (thanks @shindere), but I'm not fully satisfied.
make parallel
in the test suite. So, it is necessary to re-run the test suite manually to identify the problematic test.Should the timeout be configurable within each test, by setting an ocamltest variable in the ocamltest test header?
That might be nice, even though I found no use for it. But a global timeout is a must-have: we don't know in advance which tests may not terminate, and we don't want to add a
timeout=600
line to every test file.Why 10 minutes?
This timeout was chosen to be well above the execution time of any of our tests on any of our testing platforms, yet short enough that timeouts are reported relatively quickly. 5 minutes would probably be enough. A different timeout can be selected via
make TIMEOUT=N all
intestsuite/
.Is this real time or CPU time?
Real time. CPU time would require a rather different implementation under Unix, and may not be implementable under Windows.