Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a -timeout option to ocamltest #10113

Merged
merged 4 commits into from Jan 5, 2021

Conversation

xavierleroy
Copy link
Contributor

This PR adds a -timeout N option to ocamltest, whose effect is to abort any external command that has not completed after N seconds, and report the corresponding test action as "failed".

It then uses this option to enforce a timeout of 10 minutes (per external command) when running the tests in testsuite/.

To accelerate the discussion, the remainder of this PR is in Q&A style.

What's the motivation?

Some of our regression tests involving threads have been known to deadlock in the past, and might still deadlock in the future, e.g. after updates to the C standard library that changes the corner-case behavior of some POSIX threads operations.

Deadlocks are obvious when running the test suite interactively, but wreck havoc on automated CI.

What about putting a timeout on the CI runs themselves?

That's what we do for the Inria CI (thanks @shindere), but I'm not fully satisfied.

  • The log just says "Testing killed on a timeout", but doesn't clearly say which test caused the timeout. This is especially true with make parallel in the test suite. So, it is necessary to re-run the test suite manually to identify the problematic test.
  • Testing stops at the first test that times out, while there may be several such tests.
  • The timeout must be pretty long (typically 45 minutes) because it must be bigger than the total running time of the CI job. So, deadlocks take longer to be noticed.

Should the timeout be configurable within each test, by setting an ocamltest variable in the ocamltest test header?

That might be nice, even though I found no use for it. But a global timeout is a must-have: we don't know in advance which tests may not terminate, and we don't want to add a timeout=600 line to every test file.

Why 10 minutes?

This timeout was chosen to be well above the execution time of any of our tests on any of our testing platforms, yet short enough that timeouts are reported relatively quickly. 5 minutes would probably be enough. A different timeout can be selected via make TIMEOUT=N all in testsuite/.

Is this real time or CPU time?

Real time. CPU time would require a rather different implementation under Unix, and may not be implementable under Windows.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feature looks very sensible, and the implementation is simple given that there is already all the plumbing in ocamltest to support timeouts.

I'm not fond of the use of global mutable state for the default_timeout variable: it sticks out in a codebase that was careful to avoid it so far. My understanding is that ocamltest has two different mechanism for parametrization:

  1. if a parameter is set at invocation-time, it is exposed as a constant in the Options module
  2. if a parameter is meant to be overriden by test files, it uses the "variables" machinery (Variables, Environment)

My understanding is that you don't want to follow the longer route of (2); this is fine, but I think that (1) would be nicer (closer to the existing code) than the current patch.

DWORD elapsed = cur - stamp;
timeout = (timeout > elapsed ? timeout - elapsed : 0);
stamp = cur;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming that this is a refactoring. I'm not sure because I am not brave enough this afternoon to read the previous version carefully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the commit message says, it's a bug fix. The previous code would set timeout to 0 at the first iteration of the loop, causing all commands to timeout almost immediately. I didn't try to understand what was going wrong in the old code, just rewrote it to do what I think it should do, and voila, timeout is working.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the propensity of our CI systems to witness random events, the original code was attempting to deal with GetTickCount wrapping every other month! I wrote that a few years ago when we still cared about Windows XP - we could just switch to GetTickCount64 now.

I think timeout only becomes zero on the first iteration in the old code if cur == stamp... isn't the error just that the first deleted line should be cur >= stamp?). I think in your proposed version when GetTickCount wraps the timeout becomes effectively INFINITE because stamp never updates?

@xavierleroy
Copy link
Contributor Author

  1. if a parameter is set at invocation-time, it is exposed as a constant in the Options module

  2. if a parameter is meant to be overriden by test files, it uses the "variables" machinery (Variables, Environment)

My understanding is that you don't want to follow the longer route of (2); this is fine, but I think that (1) would be nicer (closer to the existing code) than the current patch.

I did follow (1) initially, but it causes dependency issues. Currently, the Options module is linked late, after the "core" and "ocaml_plugin" subsets of the source files. To get (1) working, Options would need to be inserted somewhere within the "core" subset, and it didn't feel right to me. A global reference used to set a global default is not evil.

Concerning (2), I'm lost in the Variables machinery. It seems that only strings are supported as values of variables (here we need an integer), and setting the default value of a variable via a command-line flag seems at least as hard as (1).

@gasche gasche force-pushed the ocamltest-timeout branch 2 times, most recently from 407132a to 35aff0c Compare January 2, 2021 20:09
@gasche
Copy link
Member

gasche commented Jan 2, 2021

I pushed a commit to your branch that does (2). It is more code, and I would not argue that it improves the implementation of this feature, but I think that consistency with the rest of the codebase is important.

Copy link
Contributor Author

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I respect your concern for consistency in the code base. Let's wait for @shindere 's review.

One suggestion for simplification below.

Comment on lines 69 to 72
begin match int_of_string value with
| exception _ -> None
| n -> Some n
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since 4.05 we have int_of_string_opt in the standard library...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Option.bind since 4.08 too!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I amended my patch (still in Xavier's branch) to use int_of_string_opt as suggested, but not Option.bind. With the explicit pattern-match the code looks similar to the other functions around and, for example, it is clear that the | None -> ... case corresponds to what happens if the variable is not found. The Option.bind rewrite is more compact but makes code reasoning less obvious.

Copy link
Member

@dra27 dra27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good change! The part in run_win32.c comes from #1739 - I'm not sure it's the right fix, but equally, I've been staring at Makefiles for much of today, so am a bit cross-eyed...

I don't think having a per-test timeout would be a good feature, so thank you for not implementing it! The testsuite occasionally gets run on some slow platforms (e.g. esp. in qemu) where you don't "mind" that it's slow, so it's much easier to be able to set a global timeout and know that no test messes around with it.

In theory, this could have been combined with the recently deceased MAX_TESTSUITE_RETRIES (#10068) - have a much lower timeout the first time around and then just re-run the tests which failed with a higher one, but #10071 has put me off ever having automatic re-runs again!

Comment on lines 155 to 163
match timeout with
| Some timeout -> timeout
| None ->
begin match
Environments.lookup_as_int Builtin_variables.timeout env
with
| Some timeout -> timeout
| None -> 0
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
match timeout with
| Some timeout -> timeout
| None ->
begin match
Environments.lookup_as_int Builtin_variables.timeout env
with
| Some timeout -> timeout
| None -> 0
end
let timeout =
if timeout = None then
Environments.lookup_as_int Builtin_variables.timeout env
else
timeout
in
Option.value timeout ~default:0

I don't think this would ever be called without an environment based on the initial environment, so it could be Option.get at the end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used Option.value as suggested, but kept a layer of match timeout with; if we had a version of Option.value that would not force the "default" case, I would use it here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume you mean force in the lazy sense rather than the writing sense (i.e. not Option.get?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, "forcing" is not the best word here, but value : 'a option -> default:'a will evaluate the "default" parameter even when it is not necessary, which is unpleasant for larger or potentially-failing computations. (Here the code is not performance-sensitive and the computation happens to be failure-free, but those are not obvious properties and I would rather write simple code than rely on them.)

ocamltest/actions_helpers.ml Outdated Show resolved Hide resolved
Comment on lines +300 to 296
timeout = timeout;
log = log;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
timeout = timeout;
log = log;
timeout;
log;

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh. Apparently this code was written before record punning was popular, and I don't think that using it only some of the definitions of the record would improve the codebase. (But feel free to help review existing ocamltest refactoring PRs (#9614) or provide your own!)

Comment on lines 69 to 72
begin match int_of_string value with
| exception _ -> None
| n -> Some n
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Option.bind since 4.08 too!

DWORD elapsed = cur - stamp;
timeout = (timeout > elapsed ? timeout - elapsed : 0);
stamp = cur;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the propensity of our CI systems to witness random events, the original code was attempting to deal with GetTickCount wrapping every other month! I wrote that a few years ago when we still cared about Windows XP - we could just switch to GetTickCount64 now.

I think timeout only becomes zero on the first iteration in the old code if cur == stamp... isn't the error just that the first deleted line should be cur >= stamp?). I think in your proposed version when GetTickCount wraps the timeout becomes effectively INFINITE because stamp never updates?

@xavierleroy
Copy link
Contributor Author

Given the propensity of our CI systems to witness random events, the original code was attempting to deal with GetTickCount wrapping every other month! I wrote that a few years ago when we still cared about Windows XP - we could just switch to GetTickCount64 now.

I will push a commit that uses GetTickCount64 soon.

I think in your proposed version when GetTickCount wraps the timeout becomes effectively INFINITE because stamp never updates?

No. stamp never updates but timeout never updates either, so it remains finite, and GetQueuedCompletionStatus will eventually run out of events to report and time out. It's just that the time out may be bigger than specified.

@gasche gasche force-pushed the ocamltest-timeout branch 2 times, most recently from 8eaa673 to b460e16 Compare January 4, 2021 20:50
@shindere
Copy link
Contributor

shindere commented Jan 5, 2021 via email

xavierleroy added a commit to xavierleroy/ocaml that referenced this pull request Jan 5, 2021
xavierleroy and others added 4 commits January 5, 2021 15:59
This option sets a timeout (in seconds) that is applied to every external
command run from ocamltest.

It is useful to catch test that don't terminate and report them as
failing.

This is implemented by an ocamltest variable called "timeout"
whose initial value is that given by the -timeout option,
or 0 (no timeout) otherwise.

Co-authored-by: Gabriel Scherer <gabriel.scherer@gmail.com>
The original code caused the timeout duration to drop to 0 at some
of the spurious wake-ups, hence the command would time out early.

The new code uses GetTickCount64() instead of GetTickCount() to avoid
issues with GetTickCount() overflowing and wrapping around.
(As suggested by David Allsopp.)
By default, we allow a generous 10 minutes (600 seconds).

Can be overriden with `make TIMEOUT=...`.
@xavierleroy
Copy link
Contributor Author

Thank you @shindere for the review. I squashed and cleaned up as suggested, then ran CI precheck to check the Windows changes again. All 6 Windows ports are happy. I'll wait for the other CI and merge when they are happy.

@shindere
Copy link
Contributor

shindere commented Jan 5, 2021 via email

@xavierleroy
Copy link
Contributor Author

CI is green everywhere, merging!

@xavierleroy xavierleroy merged commit be48244 into ocaml:trunk Jan 5, 2021
@xavierleroy xavierleroy deleted the ocamltest-timeout branch January 5, 2021 17:48
dbuenzli pushed a commit to dbuenzli/ocaml that referenced this pull request Mar 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants