TNetXNGFile::Open error when running xAH with AnalysisBase 25.2.9 on lxplus EL9 node #1688

Closed
mamerl opened this issue May 6, 2024 · 9 comments

@mamerl
Contributor

mamerl commented May 6, 2024

Hi,

When running xAH_run.py on an lxplus EL9 node I encountered the following error messages at the end of the job:

Package.EventLoop        INFO    worker finished successfully
TNetXNGFile::Open         ERROR   [ERROR] Server responded with an error: [3011] Unable to open file /hist-data23_13p6TeV.00456225.physics_Main.deriv.DAOD_JETM1.f1369_m2185_p5994.root; No such file or directory
TNetXNGFile::Open         ERROR   [ERROR] Server responded with an error: [3011] Unable to open file /hist-data23_13p6TeV.00456225.physics_Main.deriv.DAOD_JETM1.f1369_m2185_p5994.root; No such file or directory
TNetXNGFile::Open         ERROR   [ERROR] Server responded with an error: [3011] Unable to open file /hist-data23_13p6TeV.00456225.physics_Main.deriv.DAOD_JETM1.f1369_m2185_p5994.root; No such file or directory
Package.EventLoop        INFO    done
TNetXNGFile::Open         ERROR   [ERROR] Server responded with an error: [3011] Unable to open file /driver.root; No such file or directory
Package.EventLoop        INFO    done

I'm not sure whether this is an error in our analysis framework or a configuration issue, for example something related to the location we are running from.

@mdhank, @tofitsch, do you have any suggestions as to where this may be coming from and, if it poses a problem, how we can solve it?

From what I can see in the test job I ran, we get the usual output files containing the histograms we book with our custom algorithms and these do get filled.

Thanks,
Max

@mdhank
Contributor

mdhank commented May 6, 2024

Hi @mamerl ,

Do you know if hist-data23_13p6TeV.00456225.physics_Main.deriv.DAOD_JETM1.f1369_m2185_p5994.root is an input file you are running on, or one of the output files? Or potentially something else?

Best,
Michael

@mamerl
Contributor Author

mamerl commented May 7, 2024

Hi @mdhank,

From what I understand, the hist file is the output file containing the histograms, located in the submission directory here:

[mamerl@lxplus920 submitDir_TestR25Code_Data23Run456225_OnlineOverOfflineResponse_TLAJetCalibTest_withMC23OfflineJetRec]$ ll
total 958
drwxr-xr-x. 2 mamerl zp   4096 May  6 18:01 ..
-rw-r--r--. 1 mamerl zp   1390 May  6 18:01 driver.root
-rw-r--r--. 1 mamerl zp    135 May  6 18:01 location
drwxr-xr-x. 2 mamerl zp   4096 May  6 18:01 input
drwxr-xr-x. 2 mamerl zp   4096 May  6 18:01 hist
drwxr-xr-x. 2 mamerl zp   4096 May  6 18:01 data-cutflow
drwxr-xr-x. 2 mamerl zp   4096 May  6 18:01 data-duplicates_tree
-rw-r--r--. 1 mamerl zp 921430 May  6 18:01 hist-data23_13p6TeV.00456225.physics_Main.deriv.DAOD_JETM1.f1369_m2185_p5994.root
-rw-r--r--. 1 mamerl zp      0 May  6 18:01 submitted
drwxr-xr-x. 2 mamerl zp   4096 May  6 18:01 data-metadata
-rw-r--r--. 1 mamerl zp  15750 May  6 18:01 xAH_run.log
drwxr-xr-x. 2 mamerl zp   4096 May  6 18:01 output-cutflow
drwxr-xr-x. 2 mamerl zp   4096 May  6 18:01 output-metadata
drwxr-xr-x. 2 mamerl zp   4096 May  6 18:01 output-duplicates_tree
drwxr-xr-x. 2 mamerl zp   4096 May  6 18:04 .

Thanks,
Max

@mdhank
Contributor

mdhank commented May 7, 2024

Hi Max,

I believe TNetXNGFile is for xrootd, but I'm not sure why it would need xrootd to open the output files. Could you send your full command and config file?

Best,
Michael

@mamerl
Contributor Author

mamerl commented May 8, 2024

Hi Michael,

Thanks for clarifying that.

The config file we use is here: https://gitlab.cern.ch/tla-atlas-run3/tla-steering-run-3/-/blob/TLA-25.2.9-mamerl-dev/configs/onlineOverOffline/base_config_run3_withargs.py?ref_type=heads (I have added you to our TLA steering framework as a reporter so you can view the code).

The command we run is:

xAH_run.py --files /eos/user/m/mamerl/PhD/TLA/TestFiles/data23_13p6TeV.00456225.physics_Main.deriv.DAOD_JETM1.f1369_m2185_p5994/DAOD_JETM1.36218304._001074.pool.root.1 --config ../../configs/onlineOverOffline/base_config_run3_withargs.py --extraOptions=' --isData --dataYear 2023 --triggerList HLT_j0_perf_pf_ftf_L1RD0_FILLED HLT_j25_pf_ftf_L1RD0_FILLED HLT_j35_pf_ftf_L1RD0_FILLED --applyTriggerCut --runResponse --probeJetContainerNames HLT_AntiKt4EMPFlowJets_subresjesgscIS_ftf --probeJetCalibrations TLAMC23aMCCalibWithLArPileupBug --refJetCalibration Run3PhIIOfflinePreRec' --submitDir /eos/home-m/mamerl/tlarun3/hists/submitDir_TestR25Code_Data23Run456225_OnlineOverOfflineResponse_TLAJetCalibTest_withMC23OfflineJetRec --force direct

Thanks,
Max

@mdhank
Contributor

mdhank commented May 8, 2024

Hi @mamerl ,

I ran some tests and was able to reproduce the error. It seems to occur whenever submitDir is on EOS (/eos/user/m/ vs. /eos/home-m/ makes no difference). If I use an AFS directory, there is no error. It's unclear whether the error actually causes any problems, but I would recommend changing the output location to be on the safe side.
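
One way to change the output location without changing where the results end up is to run with submitDir in a scratch directory outside EOS and copy the finished directory to EOS afterwards. The following is only a minimal sketch of that idea, not something tested in this thread. The helper names `build_command` and `run_then_copy` are hypothetical, it assumes the system temp directory is not on EOS, and it uses only the xAH_run.py flags that appear in the command quoted above.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_command(files, config, submit_dir, extra_options=""):
    """Assemble an xAH_run.py call using only the flags from the command above."""
    cmd = ["xAH_run.py", "--files", *files, "--config", config]
    if extra_options:
        # Passed as a single argument, like --extraOptions='...' on the shell.
        cmd.append(f"--extraOptions={extra_options}")
    cmd += ["--submitDir", str(submit_dir), "--force", "direct"]
    return cmd


def run_then_copy(files, config, final_dir, extra_options="", scratch_root=None):
    """Run with a scratch submitDir, then copy its contents to final_dir."""
    # Assumption: scratch_root (default: the system temp dir) is not on EOS.
    scratch = Path(tempfile.mkdtemp(dir=scratch_root)) / "submitDir"
    subprocess.run(build_command(files, config, scratch, extra_options), check=True)
    # dirs_exist_ok needs Python 3.8 or newer.
    shutil.copytree(scratch, final_dir, dirs_exist_ok=True)
    return scratch
```

Here `final_dir` would be the EOS directory currently passed to --submitDir. The copy happens only if the run exits successfully, because `check=True` raises otherwise.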

Best,
Michael

@mamerl
Contributor Author

mamerl commented May 10, 2024

Hi @mdhank,

Thanks for clarifying that. At the moment, we only run locally from EOS for testing, so we won't be relying on any results from the local runs; our large-scale jobs will be run on the Grid. Would this error also affect submitting jobs to the Grid with the prun driver when running from a repo stored on EOS? (Maybe this is something I can test.)

Thanks,
Max

@mdhank
Contributor

mdhank commented May 10, 2024

Hi @mamerl ,

I don't think that would be a problem: it only gives the error when the output is on EOS, not the input. I'll also note that even having the output on EOS worked fine with a different file I tested, though I'm not sure why that makes a difference.
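
Since only the output location seems to matter, a wrapper script could check the submitDir before launching a job. This is a rough sketch under the assumption that EOS is mounted at /eos, as in the paths in this thread. The function names `is_on_eos` and `check_submit_dir` are hypothetical, and the check looks only at the path string, so it does not follow symlinks.

```python
import os


def is_on_eos(path):
    """Return True if the absolute, normalised path lies under /eos."""
    # Assumption: EOS is mounted at /eos (covers /eos/user/... and /eos/home-m/...).
    absolute = os.path.abspath(path)
    return absolute == "/eos" or absolute.startswith("/eos/")


def check_submit_dir(submit_dir):
    """Return a warning string if submit_dir is on EOS, otherwise None."""
    if is_on_eos(submit_dir):
        return (
            f"submitDir {submit_dir} is on EOS; "
            "TNetXNGFile::Open errors may appear at the end of the job"
        )
    return None
```

Input files on EOS would not need this check, given that the error was only reproduced with the output there.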

Best,
Michael

@mamerl
Contributor Author

mamerl commented May 10, 2024

Hi @mdhank,

Thanks for pointing that out. In that case, we'll keep our workflow as it is, and I'll test a few different jobs to see whether I can reproduce the behaviour where the error does not appear when running on other files.

Thanks for your help!

Cheers,
Max

@mamerl
Contributor Author

mamerl commented May 15, 2024

Hi,

Just a follow-up for documentation purposes. When I run jobs on the Grid with this setup (I checked one job), I don't see the same error:

Package.EventLoop        INFO    worker finished successfully
Package.EventLoop        INFO    Loop finished.
Package.EventLoop        INFO    Read/processed 1009682 events.
Package.EventLoop        INFO    EventLoop Grid worker finished
Package.EventLoop        INFO    Saving output
xAOD::TFileAccessTracer   INFO    Sending file access statistics to http://rucio-lb-prod.cern.ch:18762/traces/
Finished executing root

So this seems to suggest that the issue is just linked to accessing files on EOS.

There was a similar error message for ROOT that was recently resolved here as well, but I'm not sure whether that can explain the error we see here since the file paths don't contain //.
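
For comparing logs from different runs, a small script can pull out the TNetXNGFile::Open errors and check whether any of the reported paths contain //. This is just a sketch: the regular expression is written against the error lines quoted at the top of this issue and may not match other variants of the message, and the function names are hypothetical.

```python
import re

# Matches the error lines quoted at the top of this issue.
ERROR_RE = re.compile(
    r"TNetXNGFile::Open\s+ERROR\s+\[ERROR\] Server responded with an error: "
    r"\[(?P<code>\d+)\] Unable to open file (?P<path>\S+);"
)


def find_open_errors(log_text):
    """Return a sorted list of distinct (code, path) pairs found in log_text."""
    found = {(int(m["code"]), m["path"]) for m in ERROR_RE.finditer(log_text)}
    return sorted(found)


def has_double_slash(path):
    """Return True if the path contains a doubled slash."""
    return "//" in path
```

Running `find_open_errors` on the contents of xAH_run.log from a local run and on a Grid job log would show directly which files, if any, triggered the error in each case.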

Cheers,
Max

mamerl closed this as completed on Jun 3, 2024