twister: handle exceptions when running binaries #72893
Conversation
Similar to #72892, but the fix in this PR looks more considered.
If I read this correctly, this solution is even less general than the one in #72892: at least that one caught exceptions in any Handler, while this one is specific to the BinaryHandler.
There's something missing further up the call/exec stack. Without the try/except, the exception propagated up to wherever it got printed; that's where a check is needed.
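For illustration, a minimal sketch of what such a check further up the stack could look like; run_with_guard and handler.handle() are assumed names, not twister's actual call sites. The point is that the guard sits where any handler is invoked, not inside one specific Handler subclass:

```python
# Hypothetical sketch only -- not twister's actual code. 'run_with_guard' and
# 'handler.handle()' are assumed names used for illustration.
import logging
import traceback

logger = logging.getLogger(__name__)

def run_with_guard(handler, instance):
    """Run any handler and convert an unhandled exception into an error result."""
    try:
        handler.handle()
    except Exception as e:
        # Record the failure on the instance instead of letting it vanish.
        instance.status = "error"
        instance.reason = f"unhandled exception: {e}"
        logger.error(traceback.format_exc())
        raise  # re-raise so the caller, and eventually CI, sees the failure
```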
OK, makes sense. Updated to be more general.
I still feel like there's something missing. Previously
Maybe it helps if I reword it: somewhere, an error condition is being swallowed, i.e. the error is noticed, but not passed on to eventually fail the CI run. This PR does not address that: the error wasn't being swallowed in
Please also add writing the error to device.log:
logger.error(f"Error: {e}")
self.instance.status = "error"
self.instance.reason = "general runtime exception"
d_log = f"{self.instance.build_dir}/device.log"
with open(d_log, "a") as dlog_fp:
    dlog_fp.write(str(e))
Without it, the reports contain only a useless build.log.
I am still not sure what you are referring to. With the current version, twister will exit with exitcode==1, because there were errors in the run: https://github.com/zephyrproject-rtos/zephyr/blob/main/scripts/pylib/twister/twisterlib/twister_main.py#L236
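(For context, the linked check roughly boils down to deriving the process exit code from the run's error counters. The sketch below is a paraphrase with assumed attribute names, not the actual twister_main.py code.)

```python
# Rough paraphrase with assumed counter names ('results.error', 'results.failed');
# the real logic lives in scripts/pylib/twister/twisterlib/twister_main.py.
def return_code(results) -> int:
    if results.error or results.failed:
        return 1  # any counted error or failure makes the whole run fail
    return 0
```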
The original problem was that the runner crashed (the reason doesn't matter), half the tests weren't run, and CI was still completely green. With the changes in this PR, an exception being thrown by
If the runner crashes in a way that it can't handle and report by itself, then that is exactly what needs to happen. Catastrophic failure needs to result in more errors, not no errors. Let's approach it like this: which piece of code starts the runner? Why doesn't it notice and report when the runner crashes?
I believe I finally understood what you mean. I saw the original error in our CI, but since there are several workers running (I think this is what you call the runner?), it didn't look like anything was missing besides those unit tests. In the logs I saw errors (not reported), but it didn't look like anything else was skipped. However, the scope was huge and varied between runs, so it was impossible to tell whether something was not executed.
After the worker failed, nothing more was executed, the summary was printed without reporting errors (although passed was also 0, but who looks at that), and twister exited with 0.
I think what should happen is that twister simply exits with a nonzero exit code. If a worker crashes so hard that it can't even report back its failure, it's fine to kill the CI run, investigate and fix the problem, then retry the CI run. It shouldn't happen often.
@arbrauns What about this? #72945 When I also added #72893 into the mix, I was no longer getting pipeline errors, since those unit_testing errors were already handled.
Well, those are not pipeline errors; I'm not sure what that means exactly. As the bug fix for the binary shows, it was an issue in twister itself trying to run the wrong binary, something we should abort on and stop execution ASAP. So I agree with @arbrauns completely, and the latest revision in this PR does just that: stop twister and do not try to push this into logs and pretend all is good.
Of course it's better if it's handled gracefully, but the core problem here is that errors which aren't handled gracefully (of which there are infinitely many; you can't possibly cover all error scenarios) simply vanish without a trace. #72945 is yet another band-aid that gracefully handles one error case, but it doesn't fix the root problem. @nashif after looking into it deeper, I think what's actually missing is a check for
Yes, but this will be very difficult to handle with multiprocessing. I could check for this and exit, but it would also be nice to capture the backtrace and error, not only react on an exit code.
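One way to get both, sketched below under assumed names (worker, process_pipeline, run_workers): the worker pushes its formatted traceback onto the results queue before exiting nonzero, and the parent checks Process.exitcode after join() and aborts the run if any worker crashed. This is an illustration, not twister's actual worker code.

```python
# Illustrative sketch only, not twister's actual worker handling.
import multiprocessing as mp
import sys
import traceback

def process_pipeline(pipeline, results_queue):
    """Placeholder for the real per-worker test-processing loop."""

def worker(pipeline, results_queue):
    try:
        process_pipeline(pipeline, results_queue)
    except Exception:
        # Capture the backtrace before dying, so it is not lost.
        results_queue.put({"op": "crash", "traceback": traceback.format_exc()})
        sys.exit(1)

def run_workers(pipeline, results_queue, jobs):
    procs = [mp.Process(target=worker, args=(pipeline, results_queue)) for _ in range(jobs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # React on the exit code: any crashed worker fails the whole run.
    if any(p.exitcode != 0 for p in procs):
        raise SystemExit("a twister worker crashed; see the captured traceback in the results queue")
```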
It is not a band-aid for one error case, but for the whole set of exceptions you want to catch if you don't want to be specific. The solution from this PR uses the exact same mechanism at the same line, i.e. wrapping pb.process from L1325 with try/except Exception. My PR detects the same set of errors and marks the tests causing these errors accordingly, allowing execution to go forward. This PR instead adds a result to the results_queue and adds a new mechanism of scanning the results_queue for errors and exiting when one is found. @nashif what is the benefit of doing this extra queuing and scanning, instead of just making twister exit with an error, as a part of
It will go over all 62 configurations, erroring on each, and only then error-exit (without any summary). My PR will set the results to errors for all those cases and put the corresponding error traces in the reports (one can see the full error in the twister logs with
The current implementation won't exit until the whole scope is finished.
Also, please have a look at #72945. I think it is a better approach (but of course I'm biased).
The main problem here is that the "errors" or failures are not test related but issues in the runner, and a test should not generate an exception in the Python code of the runner. Just counting those as test errors because of wrong logic or something gone wrong in twister is not a good idea; it should just quit, and a fix needs to be provided.
@nashif I am fine with such an approach; I just proposed a way to see what the problem was and still be able to execute the whole scope. My +/-1 refers to the fact that your current implementation checks the queue only after every test in the scope is evaluated. So instead of quitting on the first pipeline error, one potentially has to wait until the end just to see twister crashing.
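For comparison, a sketch of draining the queue as results arrive and aborting on the first entry flagged as an internal error, rather than scanning only once the whole scope is done. The queue item layout and helper names below are assumptions for illustration, not twister's actual structures.

```python
# Illustrative only; the item layout ({"op": ..., "reason": ...}) and
# handle_result() are assumed, not taken from twister.
import queue

def handle_result(item):
    """Placeholder for normal per-result processing and reporting."""

def drain_results(results_queue, timeout=1.0):
    while True:
        try:
            item = results_queue.get(timeout=timeout)
        except queue.Empty:
            return
        if item.get("op") == "error":
            # Abort on the first internal twister error instead of finishing the scope.
            raise SystemExit(f"internal error, aborting run: {item.get('reason')}")
        handle_result(item)
```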
LGTM
@LukaszMrugala @KamilxPaszkiet can you help with the failing test please?
When something goes wrong with execution, due to twister issues and bugs, do not continue with execution; abort the process, report, and return an error code.

Fixes zephyrproject-rtos#72807

Signed-off-by: Anas Nashif <anas.nashif@intel.com>
@gchwier please review.
Deal with binaries not found. Also set the default return code to something other than zero to catch issues and report the correct status.

Fixes #72807

Signed-off-by: Anas Nashif <anas.nashif@intel.com>