Add a `-timeout` option to ocamltest #10113

xavierleroy · 2021-01-02T10:02:35Z

This PR adds a -timeout N option to ocamltest, whose effect is to abort any external command that has not completed after N seconds, and report the corresponding test action as "failed".

It then uses this option to enforce a timeout of 10 minutes (per external command) when running the tests in testsuite/.

To accelerate the discussion, the remainder of this PR is in Q&A style.

What's the motivation?

Some of our regression tests involving threads have been known to deadlock in the past, and might still deadlock in the future, e.g. after updates to the C standard library that changes the corner-case behavior of some POSIX threads operations.

Deadlocks are obvious when running the test suite interactively, but wreck havoc on automated CI.

What about putting a timeout on the CI runs themselves?

That's what we do for the Inria CI (thanks @shindere), but I'm not fully satisfied.

The log just says "Testing killed on a timeout", but doesn't clearly say which test caused the timeout. This is especially true with make parallel in the test suite. So, it is necessary to re-run the test suite manually to identify the problematic test.
Testing stops at the first test that times out, while there may be several such tests.
The timeout must be pretty long (typically 45 minutes) because it must be bigger than the total running time of the CI job. So, deadlocks take longer to be noticed.

Should the timeout be configurable within each test, by setting an ocamltest variable in the ocamltest test header?

That might be nice, even though I found no use for it. But a global timeout is a must-have: we don't know in advance which tests may not terminate, and we don't want to add a timeout=600 line to every test file.

Why 10 minutes?

This timeout was chosen to be well above the execution time of any of our tests on any of our testing platforms, yet short enough that timeouts are reported relatively quickly. 5 minutes would probably be enough. A different timeout can be selected via make TIMEOUT=N all in testsuite/.

Is this real time or CPU time?

Real time. CPU time would require a rather different implementation under Unix, and may not be implementable under Windows.

gasche

The feature looks very sensible, and the implementation is simple given that there is already all the plumbing in ocamltest to support timeouts.

I'm not fond of the use of global mutable state for the default_timeout variable: it sticks out in a codebase that was careful to avoid it so far. My understanding is that ocamltest has two different mechanism for parametrization:

if a parameter is set at invocation-time, it is exposed as a constant in the Options module
if a parameter is meant to be overriden by test files, it uses the "variables" machinery (Variables, Environment)

My understanding is that you don't want to follow the longer route of (2); this is fine, but I think that (1) would be nicer (closer to the existing code) than the current patch.

gasche · 2021-01-02T14:18:45Z

ocamltest/run_win32.c

+       DWORD elapsed = cur - stamp;
+       timeout = (timeout > elapsed ? timeout - elapsed : 0);
+       stamp = cur;
+      }


I am assuming that this is a refactoring. I'm not sure because I am not brave enough this afternoon to read the previous version carefully.

As the commit message says, it's a bug fix. The previous code would set timeout to 0 at the first iteration of the loop, causing all commands to timeout almost immediately. I didn't try to understand what was going wrong in the old code, just rewrote it to do what I think it should do, and voila, timeout is working.

Given the propensity of our CI systems to witness random events, the original code was attempting to deal with GetTickCount wrapping every other month! I wrote that a few years ago when we still cared about Windows XP - we could just switch to GetTickCount64 now.

I think timeout only becomes zero on the first iteration in the old code if cur == stamp... isn't the error just that the first deleted line should be cur >= stamp?). I think in your proposed version when GetTickCount wraps the timeout becomes effectively INFINITE because stamp never updates?

xavierleroy · 2021-01-02T14:41:29Z

if a parameter is set at invocation-time, it is exposed as a constant in the Options module

if a parameter is meant to be overriden by test files, it uses the "variables" machinery (Variables, Environment)

My understanding is that you don't want to follow the longer route of (2); this is fine, but I think that (1) would be nicer (closer to the existing code) than the current patch.

I did follow (1) initially, but it causes dependency issues. Currently, the Options module is linked late, after the "core" and "ocaml_plugin" subsets of the source files. To get (1) working, Options would need to be inserted somewhere within the "core" subset, and it didn't feel right to me. A global reference used to set a global default is not evil.

Concerning (2), I'm lost in the Variables machinery. It seems that only strings are supported as values of variables (here we need an integer), and setting the default value of a variable via a command-line flag seems at least as hard as (1).

gasche · 2021-01-02T20:10:50Z

I pushed a commit to your branch that does (2). It is more code, and I would not argue that it improves the implementation of this feature, but I think that consistency with the rest of the codebase is important.

xavierleroy

I respect your concern for consistency in the code base. Let's wait for @shindere 's review.

One suggestion for simplification below.

xavierleroy · 2021-01-03T17:48:40Z

ocamltest/environments.ml

+      begin match int_of_string value with
+      | exception _ -> None
+      | n -> Some n
+      end


Since 4.05 we have int_of_string_opt in the standard library...

And Option.bind since 4.08 too!

I amended my patch (still in Xavier's branch) to use int_of_string_opt as suggested, but not Option.bind. With the explicit pattern-match the code looks similar to the other functions around and, for example, it is clear that the | None -> ... case corresponds to what happens if the variable is not found. The Option.bind rewrite is more compact but makes code reasoning less obvious.

dra27

It's a good change! The part in run_win32.c comes from #1739 - I'm not sure it's the right fix, but equally, I've been staring at Makefiles for much of today, so am a bit cross-eyed...

I don't think having a per-test timeout would be a good feature, so thank you for not implementing it! The testsuite occasionally gets run on some slow platforms (e.g. esp. in qemu) where you don't "mind" that it's slow, so it's much easier to be able to set a global timeout and know that no test messes around with it.

In theory, this could have been combined with the recently deceased MAX_TESTSUITE_RETRIES (#10068) - have a much lower timeout the first time around and then just re-run the tests which failed with a higher one, but #10071 has put me off ever having automatic re-runs again!

dra27 · 2021-01-03T20:40:32Z

ocamltest/actions_helpers.ml

+    match timeout with
+    | Some timeout -> timeout
+    | None ->
+        begin match
+          Environments.lookup_as_int Builtin_variables.timeout env
+        with
+        | Some timeout -> timeout
+        | None -> 0
+        end


Suggested change

match timeout with

| Some timeout -> timeout

| None ->

begin match

Environments.lookup_as_int Builtin_variables.timeout env

with

| Some timeout -> timeout

| None -> 0

end

let timeout =

if timeout = None then

Environments.lookup_as_int Builtin_variables.timeout env

else

timeout

in

Option.value timeout ~default:0

I don't think this would ever be called without an environment based on the initial environment, so it could be Option.get at the end

I used Option.value as suggested, but kept a layer of match timeout with; if we had a version of Option.value that would not force the "default" case, I would use it here.

I presume you mean force in the lazy sense rather than the writing sense (i.e. not Option.get?)

Yes, "forcing" is not the best word here, but value : 'a option -> default:'a will evaluate the "default" parameter even when it is not necessary, which is unpleasant for larger or potentially-failing computations. (Here the code is not performance-sensitive and the computation happens to be failure-free, but those are not obvious properties and I would rather write simple code than rely on them.)

ocamltest/actions_helpers.ml

dra27 · 2021-01-03T20:42:27Z

ocamltest/actions_helpers.ml

+    timeout = timeout;
    log = log;


Suggested change

timeout = timeout;

log = log;

timeout;

log;

?

Meh. Apparently this code was written before record punning was popular, and I don't think that using it only some of the definitions of the record would improve the codebase. (But feel free to help review existing ocamltest refactoring PRs (#9614) or provide your own!)

dra27 · 2021-01-03T20:48:45Z

ocamltest/environments.ml

+      begin match int_of_string value with
+      | exception _ -> None
+      | n -> Some n
+      end


And Option.bind since 4.08 too!

dra27 · 2021-01-03T21:12:12Z

ocamltest/run_win32.c

+       DWORD elapsed = cur - stamp;
+       timeout = (timeout > elapsed ? timeout - elapsed : 0);
+       stamp = cur;
+      }


Given the propensity of our CI systems to witness random events, the original code was attempting to deal with GetTickCount wrapping every other month! I wrote that a few years ago when we still cared about Windows XP - we could just switch to GetTickCount64 now.

I think timeout only becomes zero on the first iteration in the old code if cur == stamp... isn't the error just that the first deleted line should be cur >= stamp?). I think in your proposed version when GetTickCount wraps the timeout becomes effectively INFINITE because stamp never updates?

xavierleroy · 2021-01-04T09:24:00Z

Given the propensity of our CI systems to witness random events, the original code was attempting to deal with GetTickCount wrapping every other month! I wrote that a few years ago when we still cared about Windows XP - we could just switch to GetTickCount64 now.

I will push a commit that uses GetTickCount64 soon.

I think in your proposed version when GetTickCount wraps the timeout becomes effectively INFINITE because stamp never updates?

No. stamp never updates but timeout never updates either, so it remains finite, and GetQueuedCompletionStatus will eventually run out of events to report and time out. It's just that the time out may be bigger than specified.

shindere · 2021-01-05T11:09:36Z

Many thanks for all the hard work done in this PR, both by the author and the reviewers. The feature looks of course sensible to me. I also very much like the use of the environment that has been proposed by @gasche. That's what @damiendoligez and I had in mind when ocamltest was originally designed, with the idea to be able to set timeouts on a per-test basis, although it does indeed not make much sense, as was noted in the discussion. My only suggestion is to squash the first and fifth commits. It's not only nitpicking. Squashing the two commits will make it un-necessary to touch the `.depend` file, I think. Perhaps the commit message could be improved by making its title more precise (mention that the option is added to ocamltest). "in second" could also become "in seconds" (plural) as in the documentation string. And finally, regarding the fourth commit, how about also including the explanations about `_WIN32_WINNT` and the meaning of the comparison in terms of Windows versions also in the code?

This option sets a timeout (in seconds) that is applied to every external command run from ocamltest. It is useful to catch test that don't terminate and report them as failing. This is implemented by an ocamltest variable called "timeout" whose initial value is that given by the -timeout option, or 0 (no timeout) otherwise. Co-authored-by: Gabriel Scherer <gabriel.scherer@gmail.com>

The original code caused the timeout duration to drop to 0 at some of the spurious wake-ups, hence the command would time out early. The new code uses GetTickCount64() instead of GetTickCount() to avoid issues with GetTickCount() overflowing and wrapping around. (As suggested by David Allsopp.)

By default, we allow a generous 10 minutes (600 seconds). Can be overriden with `make TIMEOUT=...`.

xavierleroy · 2021-01-05T15:00:43Z

Thank you @shindere for the review. I squashed and cleaned up as suggested, then ran CI precheck to check the Windows changes again. All 6 Windows ports are happy. I'll wait for the other CI and merge when they are happy.

shindere · 2021-01-05T15:43:10Z

Xavier Leroy (2021/01/05 07:01 -0800):

Thank you @shindere for the review. I squashed and cleaned up as suggested, then ran CI precheck to check the Windows changes again. All 6 Windows ports are happy. I'll wait for the other CI and merge when they are happy.

Thank you Xavier for the nice work and for having taken the comments into account!

xavierleroy · 2021-01-05T17:48:31Z

CI is green everywhere, merging!

gasche reviewed Jan 2, 2021

View reviewed changes

gasche force-pushed the ocamltest-timeout branch 2 times, most recently from 407132a to 35aff0c Compare January 2, 2021 20:09

xavierleroy commented Jan 3, 2021

View reviewed changes

dra27 reviewed Jan 3, 2021

View reviewed changes

gasche force-pushed the ocamltest-timeout branch 2 times, most recently from 8eaa673 to b460e16 Compare January 4, 2021 20:50

xavierleroy force-pushed the ocamltest-timeout branch from b460e16 to 63ee9f4 Compare January 5, 2021 13:31

xavierleroy added a commit to xavierleroy/ocaml that referenced this pull request Jan 5, 2021

Changes for ocaml#10113

bfddbc9

xavierleroy and others added 4 commits January 5, 2021 15:59

Set a timeout for ocamltest invocations

c569899

By default, we allow a generous 10 minutes (600 seconds). Can be overriden with `make TIMEOUT=...`.

Changes for ocaml#10113

a00308f

xavierleroy force-pushed the ocamltest-timeout branch from bfddbc9 to a00308f Compare January 5, 2021 14:59

shindere approved these changes Jan 5, 2021

View reviewed changes

xavierleroy merged commit be48244 into ocaml:trunk Jan 5, 2021

xavierleroy deleted the ocamltest-timeout branch January 5, 2021 17:48

dbuenzli pushed a commit to dbuenzli/ocaml that referenced this pull request Mar 25, 2021

Changes for ocaml#10113

71511b1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a `-timeout` option to ocamltest #10113

Add a `-timeout` option to ocamltest #10113

xavierleroy commented Jan 2, 2021

gasche left a comment

gasche Jan 2, 2021

xavierleroy Jan 2, 2021

dra27 Jan 3, 2021

xavierleroy commented Jan 2, 2021

gasche commented Jan 2, 2021

xavierleroy left a comment

xavierleroy Jan 3, 2021

dra27 Jan 3, 2021

gasche Jan 4, 2021

dra27 left a comment

dra27 Jan 3, 2021

gasche Jan 4, 2021

dra27 Jan 4, 2021

gasche Jan 4, 2021

dra27 Jan 3, 2021

gasche Jan 4, 2021

dra27 Jan 3, 2021

dra27 Jan 3, 2021

xavierleroy commented Jan 4, 2021

shindere commented Jan 5, 2021 via email

xavierleroy commented Jan 5, 2021

shindere commented Jan 5, 2021 via email

xavierleroy commented Jan 5, 2021

Add a -timeout option to ocamltest #10113

Add a -timeout option to ocamltest #10113

Conversation

xavierleroy commented Jan 2, 2021

What's the motivation?

What about putting a timeout on the CI runs themselves?

Should the timeout be configurable within each test, by setting an ocamltest variable in the ocamltest test header?

Why 10 minutes?

Is this real time or CPU time?

gasche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xavierleroy commented Jan 2, 2021

gasche commented Jan 2, 2021

xavierleroy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dra27 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xavierleroy commented Jan 4, 2021

shindere commented Jan 5, 2021 via email

xavierleroy commented Jan 5, 2021

shindere commented Jan 5, 2021 via email

xavierleroy commented Jan 5, 2021

Add a `-timeout` option to ocamltest #10113

Add a `-timeout` option to ocamltest #10113