Inconsistent exit codes for fatal root errors and xrootd errors #44878
cms-bot internal usage
A new Issue was created by @kpedro88. @smuzaffar, @sextonkennedy, @antoniovilela, @Dr15Jones, @rappoccio, @makortel can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign core
New categories assigned: core. @Dr15Jones, @makortel, @smuzaffar, you have been requested to review this Pull request/Issue and eventually sign. Thanks.
Whose exit codes are these 84, 85, 86? (i.e. are they the exit code from cmsRun reported in the framework job report XML file, or an error code from some intermediate WM tool?) I'm asking because these values do not correspond to the codes the framework would report in the framework job report. In https://twiki.cern.ch/twiki/bin/view/CMSPublic/StandardExitCodes I noticed
and to me these seem to be reported by some other component than the framework. In order to improve the framework's exit code treatment we'd need to know the framework's exit codes in these situations. |
Assuming 84, 85, 86 are the process return values seen by the running script, the corresponding framework exit codes could be …
This is back in 10_6_X (processing UL samples). It doesn't look like #42179 was backported that far, but it's good to know about it. These exit codes are the result of …
Reminding myself more of that issue, it seems 10_6_X was not affected (i.e. 10_6_X was reporting 8021 / 85 properly). |
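The 8021 / 85 pairing mentioned above follows from how exit statuses work: the shell only sees the low 8 bits of a process's exit status, so a framework code like 8021 shows up as 8021 % 256 = 85. A minimal sketch (the neighboring values 8020 and 8022 mapping to 84 and 86 is an assumption; only the 8021 / 85 pair is confirmed in this discussion):

```cpp
#include <cassert>

// A process exit status is truncated to its low 8 bits by the shell, so a
// framework exit code like 8021 is observed as 8021 % 256 = 85. The 8020 -> 84
// and 8022 -> 86 pairings shown in the tests are an assumption based on the
// same arithmetic, not confirmed framework codes.
int shellVisibleExitCode(int frameworkCode) { return frameworkCode % 256; }
```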
The behavior in 4 of the 5 cases in the description appears to be as expected, the exception being the 3rd case, which we'd like to be reported as …. We could think of adding a new error code …. The error in case 5 is a bit complicated: it occurs at a time when the framework expects the Source to be reading data from a file whose open succeeded, the xrootd connection goes bad, and AAA (…). At the framework level we can't really do a detailed analysis of the various causes of the errors. Any adjustment of the present mapping of error causes to categories (except for case 3) would also need to be communicated and agreed with WM and CRAB.
Clearly there is some ability to distinguish between these cases within the software: the exception messages for cases 1 and 2 contain the phrase "Fatal Root Error", while the exception messages for cases 4 and 5 do not. What is the benefit of reporting these exceptions, which have vastly different causes, with the same code? Alternatively, what is the cost of correctly classifying these exceptions with different codes, using the information already available? The existing …. Regarding case 5, we have a …
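The message-based distinction described above could be sketched as follows (a hypothetical helper, not CMSSW API; as noted in the replies, matching on message strings is fragile across external software updates):

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: classify an exception by whether its message contains
// "Fatal Root Error", which is reported above to appear for cases 1 and 2
// (file corruption) but not for the xrootd-layer errors of cases 4 and 5.
// The enum and function names are illustrative only.
enum class ErrorClass { FileCorruption, XrootdIssue };

ErrorClass classifyMessage(const std::string& msg) {
  return msg.find("Fatal Root Error") != std::string::npos
             ? ErrorClass::FileCorruption
             : ErrorClass::XrootdIssue;
}
```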
Figuring out the causes reliably is difficult, especially when all we get is a string whose stability across external software updates is not guaranteed.
The
Presently the places that call reads do not know about the file being a "fallback file". What would be the benefit of …
Coming back to this point, as I mentioned above, the …
Yes, bad bits could be caused by a network error. But often it is caused by actual file corruption at the source. Erring on the side of highlighting possible file corruption seems like a better balance to me than grouping this in with obvious network errors that aren't even coming from ROOT. (My suggestion for case 5 was just based on your description of why it's complicated. My main goal here is to differentiate "file corrupted" from "transient network issue", which seems achievable at some level of accuracy already.)
For the errors originating from the xrootd layer we could think of adding …
Kevin pointed me to this. FWIW I have been tracking input file read errors for a while now and have seen all sorts of errors that are either persistent (an actually bad file) or transient (an error in the network and/or the storage server). It does not help that in many cases (e.g. a truncated file) xrootd does not say where it was trying to read from. A …
Thanks @belforte for the feedback. Just to summarize, the present proposal would be to
In addition, just to make it clear (from @kpedro88's request), the …. While discussing with @kpedro88 I began to wonder about the fate of …
Wow! Can you really differentiate "not found" from "file open error"? I think we should keep this simple. I was about to propose a detailed classification scheme, but in the end what matters is to identify the problem well enough to decide which corrective action to take. When we end up with "verify checksum and possibly re-transfer", it matters little whether the problem was in the open or in a read in the middle of the file. We also need reliable and easy-to-find information about the name of the file being read: right now my code in CRAB "scrolls the log up" looking for the last "open", which is ugly and fragile. That a file was not found via the primary catalog entry is not so important; there are many cases where jobs run at a site which does not have the input data locally. We have outgrown considerably the initial "xrootd as fallback" and I do not see us turning back. And think about disk-less sites (we have some T3s like that already).
At least in some cases we should be able to, e.g. in jobs that use our own …. I'm not sure though how reliably we could do that for all the protocols we register a handler for in …
This should already be the case to a large degree. The "branch not found" should be …
I fully agree.
This error message looks like something we might be able to improve (although it's also tricky, the layers above the xrootd do not know anything about the "actual file name", i.e. the PFN is the file name as far as they are concerned). Does that file still exist? (on the other hand it should be fairly easy to create a new truncated ROOT file)
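As a generic way to reproduce such a test case (one possible method, not necessarily how the file mentioned above was produced), a truncated copy of any existing ROOT file can be made by keeping only its first N bytes:

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Sketch: create a truncated copy of an existing file by copying only the
// first keepBytes bytes. File names are placeholders; applied to a valid
// ROOT file, the result is a truncated file suitable for read-error tests.
void truncatedCopy(const std::string& src, const std::string& dst, std::size_t keepBytes) {
  std::ifstream in(src, std::ios::binary);
  std::ofstream out(dst, std::ios::binary);
  std::string buf(keepBytes, '\0');
  in.read(&buf[0], static_cast<std::streamsize>(keepBytes));
  out.write(buf.data(), in.gcount());  // write only what was actually read
}
```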
I think we should have something in the framework job report too, but that might be limited to LFN or PFN. In this case, are you after the PFN, or the "actual URL xrootd was reading from"?
Ok, thanks. So if we were to change the xrootd-related errors to be reported via a new …
indeed one can argue forever about "handling many/most while possibly obfuscating few" and get into a huge todo list with no major benefits. Let's stick to clear action items.
IIUC the new error will cover both …. As to "which file gave this error", anything which contains an LFN is good. The actual xrootd URL with a remote-site indication would be even better, but may not make a dramatic difference. Unrelated: quite possibly some protocols can be deprecated now. Facilities and SiteSupport (aka Stephan Lammel) is a good candidate to review the list. P.S. My truncated file is still there. FWIW I simply used …
When diagnosing failures in grid jobs, precise exit code meanings are very valuable. Currently, several exit codes are overloaded or used (in my opinion) inconsistently.
One category of errors is "Fatal Root Error". This occurs when there is some kind of file corruption. Corrupt files have to be fixed centrally, so it is very important to be able to isolate these. Here are some examples of the exit codes that can occur with "Fatal Root Error" and related exception messages:
exit code 84:
exit code 85:
exit code 86:
Another category of errors is xrootd issues. These are usually transient, though they can indicate that a file is not accessible anywhere on disk. Here are some examples:
exit code 84:
exit code 85:
I propose reserving exit code 84 for xrootd file open errors (indicating a file missing from disk), exit code 85 for xrootd file read errors (transient), and exit code 86 for all fatal root errors (indicating file corruption). I am open to other proposals to resolve these ambiguities.
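The proposal above amounts to a three-way mapping, sketched here for clarity (this is the proposal under discussion, not an implemented CMSSW mapping; the names are illustrative):

```cpp
#include <cassert>

// Proposed exit-code assignment from this issue:
//   84 - xrootd file open error (file missing from disk)
//   85 - xrootd file read error (transient)
//   86 - any fatal root error (file corruption)
enum class FailureKind { XrootdOpenError, XrootdReadError, FatalRootError };

int proposedExitCode(FailureKind k) {
  switch (k) {
    case FailureKind::XrootdOpenError: return 84;
    case FailureKind::XrootdReadError: return 85;
    case FailureKind::FatalRootError:  return 86;
  }
  return -1;  // unreachable with the enum values above
}
```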