Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

T0 prompt-RECO memory problem in run 379524 #44795

Closed
Dr15Jones opened this issue Apr 21, 2024 · 34 comments
Closed

T0 prompt-RECO memory problem in run 379524 #44795

Dr15Jones opened this issue Apr 21, 2024 · 34 comments

Comments

@Dr15Jones
Copy link
Contributor

The prompt reco for run 379524 was failing for luminosity block 75 due to excessive memory usage:

https://cms-talk.web.cern.ch/t/high-memory-usage-for-promptreco-specialzerobias-pd/39326/6

After some investigation, type problem was isolated to one Event causing the problem
run 379524, lumi 75, event 178862574

.

@Dr15Jones
Copy link
Contributor Author

assign reconstruction

@cmsbuild
Copy link
Contributor

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 21, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

A new Issue was created by @Dr15Jones.

@smuzaffar, @makortel, @rappoccio, @Dr15Jones, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Copy link
Contributor Author

One can run the job by replacing the output modules with AsciiOutputModules (to save time and space). If one un-tars the job files and copies the input file to a local directory, one can use the following script to run the job with the replaced output modules

from PSet import process
import FWCore.ParameterSet.Config as cms

process.source.fileNames[0] = 'file:10365d29-a39a-4e8a-83aa-2d0fe1a638a2.root'
#using the following will allow the job to succeed
#process.source.eventsToSkip = cms.untracked.VEventRange( [cms.EventRange("379524:75:178862574")])

# this should jump right to the problem event
#process.source.skipEvents = cms.untracked.uint32(12661)

for name,mod in process.outputModules.items():
    n = cms.OutputModule("AsciiOutputModule", outputCommands = mod.outputCommands, verbosity = cms.untracked.uint32(0))
    setattr(process,name,n)

@mandrenguyen
Copy link
Contributor

Thanks @Dr15Jones
I confirm that I can reproduce the problem with the Pkl.
The memory is exceeding 15 GB RSS.

The reco is getting stuck in the following module, which is new for this year IIRC:
process.displacedRegionalStepSeedingVertices cms.EDProducer("DisplacedRegionSeedingVertexProducer",

@cms-sw/tracking-pog-l2 any ideas how to properly protect this module, e.g., from beam backgrounds?

@mmusich
Copy link
Contributor

mmusich commented Apr 22, 2024

@aehart FYI

@Dr15Jones
Copy link
Contributor Author

@mandrenguyen
Looking at the log I generated (which includes a prototype per module memory monitor) I see that the job also got stuck for 20 minutes in CkfTrackCandidateMaker:jetCoreRegionalStepEndcapTrackCandidates.

Here is some of the info I found from the memory monitor for this event

DisplacedRegionSeedingVertexProducer:displacedRegionalStepSeedingVertices
# allocs: 4,867,272,868
# dealocs 4,867,272,858
memory added to job after produce finished (bytes): 4,194,496
max memory allocated by the module during the event at one time (bytes): 5,697,859,040

The next largest memory users were
V0Producer:displacedRegionalStepSeedingV0Candidates
# allocs: 12,492,305
# deallocs: 12,100,008
memory added to job after produce finished (bytes):69,122,400
max memory allocated by the module during the event at one time (bytes):138,244,544

TrackProducer:jetCoreRegionalStepEndcapTracks
# allocs: 4,720,722
# deallocs: 4,605,171
memory added to job after produce finished (bytes):15,400,392
max memory allocated by the module during the event at one time (bytes):127,930,264

@mandrenguyen
Copy link
Contributor

By adding the following lines to PSet.py the reconstruction finishes successfully:
process.displacedRegionalStepSeedingV0Candidates.tkPtCut = 9999. process.displacedRegionalStepSeedingV0Candidates.tkNHitsCut = 99

There are plenty of messages like the following, which might be a clue how to get the displacedRegionalStepSeedingVertices to quit early:

%MSG-w TooManyPFDVCandidates: PFDisplacedVertexProducer:particleFlowDisplacedVertex 22-Apr-2024 15:14:27 CEST Run: 379524 Event: 178862574 gave up vertex reco for 1254 tracks

@mandrenguyen
Copy link
Contributor

@Dr15Jones Yes, I also notice jetCoreRegionalStepEndcapTrackCandidates

Although it takes a long time, displacedRegionalStepSeedingVertices looks even more problematic, so I figured we should start there.

@dan131riley
Copy link

If there's an event that takes a huge amount of resources, it very likely will be the last to complete, so you get a memory spike just before the endjob can complete.

We have other cases of RECO taking ridiculous resources for beam splash events, see #37362. Is this similar?

@slava77
Copy link
Contributor

slava77 commented Apr 22, 2024

By adding the following lines to PSet.py the reconstruction finishes successfully:
process.displacedRegionalStepSeedingV0Candidates.tkPtCut = 9999. process.displacedRegionalStepSeedingV0Candidates.tkNHitsCut = 99

doesn't this just disable the tracking in this iteration?

As in a few other cases, there should be just some cutoff to combinatorics.

@slava77
Copy link
Contributor

slava77 commented Apr 22, 2024

type tracking

@Dr15Jones
Copy link
Contributor Author

We have other cases of RECO taking ridiculous resources for beam splash events, see #37362. Is this similar?

In this case, the event appeared early enough to basically kill the job 2/3rds the way through the LuminosityBlock. But if such an event appeared closer to the end, it would also spike just before the end job transition.

@slava77
Copy link
Contributor

slava77 commented Apr 22, 2024

We have other cases of RECO taking ridiculous resources for beam splash events, see #37362. Is this similar?

tracker is off during splashes

@mandrenguyen
Copy link
Contributor

mandrenguyen commented Apr 22, 2024

By adding the following lines to PSet.py the reconstruction finishes successfully:
process.displacedRegionalStepSeedingV0Candidates.tkPtCut = 9999. process.displacedRegionalStepSeedingV0Candidates.tkNHitsCut = 99

doesn't this just disable the tracking in this iteration?

As in a few other cases, there should be just some cutoff to combinatorics.

Right, I just wanted to check whether there was another problematic module downstream, but it appears not to be the case.
Of course you're right that some cutoff is needed for combinatorics. You're the expert :-)

@davidlange6
Copy link
Contributor

as coded, DisplacedRegionSeedingVertexProducer is going to do a lot of allocations given big events... what is its typical cpu fraction in a normal pu60 event?

@Dr15Jones
Copy link
Contributor Author

as coded, DisplacedRegionSeedingVertexProducer is going to do a lot of allocations given big events... what is its typical cpu fraction in a normal pu60 event?

So grepping the log for that module's memory use, it looks like the vast majority of Events use at one time 260,000 bytes with one event using 5,686,768 and then the one pathological event using 5,697,859,040.

@davidlange6
Copy link
Contributor

https://github.com/cms-sw/cmssw/blob/master/RecoTracker/DisplacedRegionalTracking/plugins/DisplacedRegionSeedingVertexProducer.cc#L125

seems to be doing N^2 sorts on a potentially N^2 length list of something this O(100) bytes. Since whats being done is to find best pair of candidates in each iteration, one could probably could consider doing this list management with a few arrays (indices, is_valid)

@Dr15Jones
Copy link
Contributor Author

So I did a log/log histogram of the max amount of memory used at one time by DisplacedRegionSeedingVertexProducer for each event (x axis) vs the number of events with that memory usage:
image

@Dr15Jones
Copy link
Contributor Author

seems to be doing N^2 sorts on a potentially N^2 length list of something this O(100) bytes. Since whats being done is to find best pair of candidates in each iteration, one could probably could consider doing this list management with a few arrays (indices, is_valid)

The use of std::list probably isn't helping much either as it is probably the cause of the huge number of allocations/deallocations.

@VinInn
Copy link
Contributor

VinInn commented Apr 23, 2024

The algorithm looks similar (in intent) to FastJet one...
Maybe somebody can investigate if using the fastJet Heap (that allows removal of any element not just pop)
would not be more efficient

see https://fastjet.fr/repo/doxygen-3.4.2/classfastjet_1_1MinHeap.html
how to use it?
https://fastjet.fr/repo/doxygen-3.2.0/ClusterSequence__TiledN2_8cc_source.html line 820
(compare with the slower methods above)

@VinInn
Copy link
Contributor

VinInn commented Apr 23, 2024

Of course CMSSW is plenty of other, better behaving, cluster algorithms: reuse maybe worth.

@davidlange6
Copy link
Contributor

I looked a bit more - in the event in question

  • it takes 20 minutes on a pretty fast machine to even get to the problematic module (so there are other issues for this event)
  • there are 130k pseudoROIs (as each TrackCluster becomes a pseudoROI) (*88 bytes each is 10MB)
  • then there are 300M combos of pseudoROIs considered in the distance calculation (*16 bytes each is 5GB)

so changing from this Distance struct to just a vector of floats to cache the distances is probably already sufficient memory wise..

anyway, even after solving any memory issue, it looks like the algorithm will take many hours/days to run this event (limited by https://github.com/cms-sw/cmssw/blob/master/RecoTracker/DisplacedRegionalTracking/plugins/DisplacedVertexCluster.cc#L12 iiuc)

@slava77
Copy link
Contributor

slava77 commented Apr 23, 2024

I looked a bit more - in the event in question

* it takes 20 minutes on a pretty fast machine to even get to the problematic module (so there are other issues for this event)

* there are 130k `pseudoROIs` (as each `TrackCluster` becomes a `pseudoROI`) (*88 bytes each is 10MB)

* then there are 300M combos of `pseudoROIs` considered in the distance calculation (*16 bytes each is 5GB)

so changing from this Distance struct to just a vector of floats to cache the distances is probably already sufficient memory wise..

anyway, even after solving any memory issue, it looks like the algorithm will take many hours/days to run this event (limited by https://github.com/cms-sw/cmssw/blob/master/RecoTracker/DisplacedRegionalTracking/plugins/DisplacedVertexCluster.cc#L12 iiuc)

@aehart
is it practical to truncate pseudoROIs at 10K or will it affect some possible physics of (your) interest?

@slava77
Copy link
Contributor

slava77 commented Apr 23, 2024

is it practical to truncate pseudoROIs at 10K or will it affect some possible physics of (your) interest?

I got a response by email from Yuri: we are never supposed to have so many ROIs! A mild cut in it would be just fine.

Since @davidlange6 you already have this setup in place, perhaps you can check if a truncation at 10K solves the memory/timing problem.

@davidlange6
Copy link
Contributor

ha - i've rewritten a bunch of stuff... can have a look tomorrow I think.

The first 10k is sufficient? [that should certainly solve this..]

@davidlange6
Copy link
Contributor

it seems a limit of 10k means 1hr for the plugin to process this event.. Perhaps lower is more appropriate?

@slava77
Copy link
Contributor

slava77 commented Apr 23, 2024

it seems a limit of 10k means 1hr for the plugin to process this event.. Perhaps lower is more appropriate?

If the preceding parts take 20 mins, I'd aim for something smaller but perhaps not too greedy, so, aiming for 2 mins, perhaps.
I'm not sure which power is the explosion. Would 3K be enough?

@mandrenguyen
Copy link
Contributor

it seems a limit of 10k means 1hr for the plugin to process this event.. Perhaps lower is more appropriate?

If the preceding parts take 20 mins, I'd aim for something smaller but perhaps not too greedy, so, aiming for 2 mins, perhaps. I'm not sure which power is the explosion. Would 3K be enough?

I'm not sure I would use the running time of preceding modules to decide where to put the limit. It appears there is a problem with jetCoreRegionalStepEndcapTracks that presumably should also be fixed by a seeding limit of one type or another. Barring that module the event is processed much faster.
Shouldn't we instead decide on the limit based on the distribution of pseudoROIs in normal collision events?
I think it's reasonably safe to assume this is some sort of beam background event.

@davidlange6
Copy link
Contributor

ok, my 1 hr is now like 1 minute after merging some code improvements.

#44829

@mandrenguyen
Copy link
Contributor

+1
I think this is issue can be closed now after the fix from David:
#44829

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

@makortel
Copy link
Contributor

@cmsbuild, please close

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants