T0 prompt-RECO memory problem in run 379524 #44795
Comments
assign reconstruction |
New categories assigned: reconstruction @jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks |
cms-bot internal usage |
A new Issue was created by @Dr15Jones. @smuzaffar, @makortel, @rappoccio, @Dr15Jones, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
One can run the job by replacing the output modules with AsciiOutputModules (to save time and space). If one un-tars the job files and copies the input file to a local directory, one can use the following script to run the job with the replaced output modules:
from PSet import process
import FWCore.ParameterSet.Config as cms
process.source.fileNames[0] = 'file:10365d29-a39a-4e8a-83aa-2d0fe1a638a2.root'
# using the following will allow the job to succeed
#process.source.eventsToSkip = cms.untracked.VEventRange( [cms.EventRange("379524:75:178862574")])
# this should jump right to the problem event
#process.source.skipEvents = cms.untracked.uint32(12661)
# swap every output module for an AsciiOutputModule that keeps the same outputCommands
for name, mod in process.outputModules.items():
    n = cms.OutputModule("AsciiOutputModule", outputCommands = mod.outputCommands, verbosity = cms.untracked.uint32(0))
    setattr(process, name, n) |
Thanks @Dr15Jones. The reco is getting stuck in the following module, which is new for this year IIRC: @cms-sw/tracking-pog-l2 any ideas how to properly protect this module, e.g., from beam backgrounds? |
@aehart FYI |
@mandrenguyen Here is some of the info I found from the memory monitor for this event: DisplacedRegionSeedingVertexProducer:displacedRegionalStepSeedingVertices. The next largest memory users were TrackProducer:jetCoreRegionalStepEndcapTracks |
By adding the following lines to PSet.py the reconstruction finishes successfully: There are plenty of messages like the following, which might be a clue how to get the
|
@Dr15Jones Yes, I also notice Although it takes a long time, |
If there's an event that takes a huge amount of resources, it very likely will be the last to complete, so you get a memory spike just before the endjob can complete. We have other cases of RECO taking ridiculous resources for beam splash events, see #37362. Is this similar? |
doesn't this just disable the tracking in this iteration? As in a few other cases, there should be just some cutoff to combinatorics. |
type tracking |
In this case, the event appeared early enough to basically kill the job two-thirds of the way through the LuminosityBlock. But if such an event appeared closer to the end, it would also spike just before the end job transition. |
tracker is off during splashes |
Right, I just wanted to check whether there was another problematic module downstream, but it appears not to be the case. |
as coded, |
So grepping the log for that module's memory use, it looks like the vast majority of Events use at one time 260,000 bytes with one event using 5,686,768 and then the one pathological event using 5,697,859,040. |
seems to be doing N^2 sorts on a potentially N^2 length list of something that is O(100) bytes. Since what's being done is to find the best pair of candidates in each iteration, one could probably consider doing this list management with a few arrays (indices, is_valid) |
The use of std::list probably isn't helping much either as it is probably the cause of the huge number of allocations/deallocations. |
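The array-based bookkeeping suggested above (indices plus an is_valid flag, instead of a linked list of pair objects) can be sketched as follows. This is only an illustration in Python, not the CMSSW implementation: the function name `merge_closest_pairs`, the weighted-centroid merge rule, and the `max_dist` cut are all assumptions made for the example. Distances are recomputed on the fly, so memory stays O(N) while the running time can still be O(N^3) in the worst case.

```python
def merge_closest_pairs(points, max_dist):
    """Greedy pairwise merging with flat arrays (hypothetical sketch).

    points   : sequence of (x, y, z) tuples standing in for vertex candidates
    max_dist : stop once the closest surviving pair is farther apart than this
    Returns the positions of the surviving clusters.
    """
    n = len(points)
    pos = [tuple(float(c) for c in p) for p in points]
    weight = [1.0] * n          # number of candidates merged into each slot
    is_valid = [True] * n       # False once a slot has been merged away
    n_live = n
    max_d2 = max_dist ** 2

    while n_live > 1:
        best = None  # (squared distance, i, j) of the closest live pair
        for i in range(n):
            if not is_valid[i]:
                continue
            for j in range(i + 1, n):
                if not is_valid[j]:
                    continue
                d2 = sum((a - b) ** 2 for a, b in zip(pos[i], pos[j]))
                if best is None or d2 < best[0]:
                    best = (d2, i, j)
        if best is None or best[0] > max_d2:
            break
        _, i, j = best
        # merge j into i as a weighted centroid; no allocation or list erase
        w = weight[i] + weight[j]
        pos[i] = tuple((weight[i] * a + weight[j] * b) / w
                       for a, b in zip(pos[i], pos[j]))
        weight[i] = w
        is_valid[j] = False
        n_live -= 1

    return [pos[k] for k in range(n) if is_valid[k]]
```

Merging only flips a flag and overwrites one slot, so there are no per-iteration allocations or deallocations of the kind a std::list of pair objects would incur.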
The algorithm looks similar (in intent) to FastJet one... see https://fastjet.fr/repo/doxygen-3.4.2/classfastjet_1_1MinHeap.html |
Of course, CMSSW has plenty of other, better-behaved clustering algorithms; reusing one of them may be worthwhile. |
I looked a bit more - in the event in question
so changing from this Distance struct to just a vector of floats to cache the distances is probably already sufficient memory-wise. Anyway, even after solving any memory issue, it looks like the algorithm will take many hours/days to run this event (limited by https://github.com/cms-sw/cmssw/blob/master/RecoTracker/DisplacedRegionalTracking/plugins/DisplacedVertexCluster.cc#L12 iiuc) |
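To make the memory argument concrete, here is a minimal sketch of caching pairwise distances in one flat array of 4-byte floats, indexed by the unordered pair. The helper names (`pair_index`, `build_distance_cache`, `cache_bytes`) are hypothetical and the Euclidean distance is an assumption; the actual Distance struct in DisplacedVertexCluster is not reproduced here.

```python
import math
from array import array


def pair_index(i, j, n):
    """Slot of the unordered pair (i, j), i != j, in a flat upper-triangle array."""
    if i == j:
        raise ValueError("a pair needs two distinct indices")
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def build_distance_cache(points):
    """Store all N*(N-1)/2 pairwise distances as single-precision floats."""
    n = len(points)
    cache = array("f", [0.0]) * (n * (n - 1) // 2)
    for i in range(n):
        for j in range(i + 1, n):
            cache[pair_index(i, j, n)] = math.dist(points[i], points[j])
    return cache


def cache_bytes(n, bytes_per_entry):
    """Memory needed to hold one entry per unordered pair of n candidates."""
    return (n * (n - 1) // 2) * bytes_per_entry
```

As an illustration with a hypothetical 10,000 candidates, there are 49,995,000 pairs: at 4 bytes per entry that is about 200 MB, while at the O(100) bytes per entry mentioned earlier in the thread it would be about 5 GB.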
@aehart |
I got a response by email from Yuri: Since @davidlange6 you already have this setup in place, perhaps you can check if a truncation at 10K solves the memory/timing problem. |
ha - i've rewritten a bunch of stuff... can have a look tomorrow I think. The first 10k is sufficient? [that should certainly solve this..] |
it seems a limit of 10k means 1 hr for the plugin to process this event. Perhaps lower is more appropriate? |
If the preceding parts take 20 mins, I'd aim for something smaller but perhaps not too greedy, so, aiming for 2 mins, perhaps. |
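A rough way to translate a target running time into a cap is to assume a power-law scaling of the running time with the number of candidates kept. The exponent is an assumption (the thread does not measure it), and `cap_for_target` is a hypothetical helper, so the numbers below are only indicative.

```python
def cap_for_target(current_cap, current_seconds, target_seconds, exponent):
    """Cap that would hit target_seconds if time scales as N**exponent (assumed)."""
    scale = (target_seconds / current_seconds) ** (1.0 / exponent)
    return int(current_cap * scale)


# Figures quoted in the thread: a cap of 10k takes about 1 hr; the aim is about 2 min.
cap_quadratic = cap_for_target(10_000, 3600, 120, exponent=2)
cap_cubic = cap_for_target(10_000, 3600, 120, exponent=3)
```

Under these assumptions the cap would land at roughly 1,800 candidates for quadratic scaling and roughly 3,200 for cubic scaling; a measurement on the actual plugin should take precedence over this estimate.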
I'm not sure I would use the running time of preceding modules to decide where to put the limit. It appears there is a problem with |
ok, my 1 hr is now like 1 minute after merging some code improvements. |
+1 |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
The prompt reco for run 379524 was failing for luminosity block 75 due to excessive memory usage:
https://cms-talk.web.cern.ch/t/high-memory-usage-for-promptreco-specialzerobias-pd/39326/6
After some investigation, the problem was isolated to a single Event: run 379524, lumi 75, event 178862574.