T0 prompt-RECO memory problem in run 379524 #44795

Dr15Jones · 2024-04-21T19:40:55Z

The prompt reco for run 379524 was failing for luminosity block 75 due to excessive memory usage:

https://cms-talk.web.cern.ch/t/high-memory-usage-for-promptreco-specialzerobias-pd/39326/6

After some investigation, type problem was isolated to one Event causing the problem
run 379524, lumi 75, event 178862574

.

Dr15Jones · 2024-04-21T19:41:03Z

assign reconstruction

cmsbuild · 2024-04-21T19:41:15Z

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2024-04-21T19:41:16Z

cms-bot internal usage

cmsbuild · 2024-04-21T19:41:17Z

A new Issue was created by @Dr15Jones.

@smuzaffar, @makortel, @rappoccio, @Dr15Jones, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones · 2024-04-21T19:45:23Z

One can run the job by replacing the output modules with AsciiOutputModules (to save time and space). If one un-tars the job files and copies the input file to a local directory, one can use the following script to run the job with the replaced output modules

from PSet import process
import FWCore.ParameterSet.Config as cms

process.source.fileNames[0] = 'file:10365d29-a39a-4e8a-83aa-2d0fe1a638a2.root'
#using the following will allow the job to succeed
#process.source.eventsToSkip = cms.untracked.VEventRange( [cms.EventRange("379524:75:178862574")])

# this should jump right to the problem event
#process.source.skipEvents = cms.untracked.uint32(12661)

for name,mod in process.outputModules.items():
    n = cms.OutputModule("AsciiOutputModule", outputCommands = mod.outputCommands, verbosity = cms.untracked.uint32(0))
    setattr(process,name,n)

mandrenguyen · 2024-04-22T11:53:19Z

Thanks @Dr15Jones
I confirm that I can reproduce the problem with the Pkl.
The memory is exceeding 15 GB RSS.

The reco is getting stuck in the following module, which is new for this year IIRC:
process.displacedRegionalStepSeedingVertices cms.EDProducer("DisplacedRegionSeedingVertexProducer",

@cms-sw/tracking-pog-l2 any ideas how to properly protect this module, e.g., from beam backgrounds?

mmusich · 2024-04-22T11:56:11Z

@aehart FYI

Dr15Jones · 2024-04-22T13:17:38Z

@mandrenguyen
Looking at the log I generated (which includes a prototype per module memory monitor) I see that the job also got stuck for 20 minutes in CkfTrackCandidateMaker:jetCoreRegionalStepEndcapTrackCandidates.

Here is some of the info I found from the memory monitor for this event

DisplacedRegionSeedingVertexProducer:displacedRegionalStepSeedingVertices
# allocs: 4,867,272,868
# dealocs 4,867,272,858
memory added to job after produce finished (bytes): 4,194,496
max memory allocated by the module during the event at one time (bytes): 5,697,859,040

The next largest memory users were
V0Producer:displacedRegionalStepSeedingV0Candidates
# allocs: 12,492,305
# deallocs: 12,100,008
memory added to job after produce finished (bytes):69,122,400
max memory allocated by the module during the event at one time (bytes):138,244,544

TrackProducer:jetCoreRegionalStepEndcapTracks
# allocs: 4,720,722
# deallocs: 4,605,171
memory added to job after produce finished (bytes):15,400,392
max memory allocated by the module during the event at one time (bytes):127,930,264

mandrenguyen · 2024-04-22T13:18:40Z

By adding the following lines to PSet.py the reconstruction finishes successfully:
process.displacedRegionalStepSeedingV0Candidates.tkPtCut = 9999. process.displacedRegionalStepSeedingV0Candidates.tkNHitsCut = 99

There are plenty of messages like the following, which might be a clue how to get the displacedRegionalStepSeedingVertices to quit early:

%MSG-w TooManyPFDVCandidates: PFDisplacedVertexProducer:particleFlowDisplacedVertex 22-Apr-2024 15:14:27 CEST Run: 379524 Event: 178862574 gave up vertex reco for 1254 tracks

mandrenguyen · 2024-04-22T13:20:01Z

@Dr15Jones Yes, I also notice jetCoreRegionalStepEndcapTrackCandidates

Although it takes a long time, displacedRegionalStepSeedingVertices looks even more problematic, so I figured we should start there.

dan131riley · 2024-04-22T13:23:55Z

If there's an event that takes a huge amount of resources, it very likely will be the last to complete, so you get a memory spike just before the endjob can complete.

We have other cases of RECO taking ridiculous resources for beam splash events, see #37362. Is this similar?

slava77 · 2024-04-22T13:24:47Z

By adding the following lines to PSet.py the reconstruction finishes successfully:
process.displacedRegionalStepSeedingV0Candidates.tkPtCut = 9999. process.displacedRegionalStepSeedingV0Candidates.tkNHitsCut = 99

doesn't this just disable the tracking in this iteration?

As in a few other cases, there should be just some cutoff to combinatorics.

slava77 · 2024-04-22T13:26:25Z

type tracking

Dr15Jones · 2024-04-22T13:27:11Z

We have other cases of RECO taking ridiculous resources for beam splash events, see #37362. Is this similar?

In this case, the event appeared early enough to basically kill the job 2/3rds the way through the LuminosityBlock. But if such an event appeared closer to the end, it would also spike just before the end job transition.

slava77 · 2024-04-22T13:28:06Z

We have other cases of RECO taking ridiculous resources for beam splash events, see #37362. Is this similar?

tracker is off during splashes

mandrenguyen · 2024-04-22T13:29:24Z

By adding the following lines to PSet.py the reconstruction finishes successfully:
process.displacedRegionalStepSeedingV0Candidates.tkPtCut = 9999. process.displacedRegionalStepSeedingV0Candidates.tkNHitsCut = 99

doesn't this just disable the tracking in this iteration?

As in a few other cases, there should be just some cutoff to combinatorics.

Right, I just wanted to check whether there was another problematic module downstream, but it appears not to be the case.
Of course you're right that some cutoff is needed for combinatorics. You're the expert :-)

davidlange6 · 2024-04-22T13:49:20Z

as coded, DisplacedRegionSeedingVertexProducer is going to do a lot of allocations given big events... what is its typical cpu fraction in a normal pu60 event?

Dr15Jones · 2024-04-22T13:54:25Z

as coded, DisplacedRegionSeedingVertexProducer is going to do a lot of allocations given big events... what is its typical cpu fraction in a normal pu60 event?

So grepping the log for that module's memory use, it looks like the vast majority of Events use at one time 260,000 bytes with one event using 5,686,768 and then the one pathological event using 5,697,859,040.

davidlange6 · 2024-04-22T14:16:58Z

https://github.com/cms-sw/cmssw/blob/master/RecoTracker/DisplacedRegionalTracking/plugins/DisplacedRegionSeedingVertexProducer.cc#L125

seems to be doing N^2 sorts on a potentially N^2 length list of something this O(100) bytes. Since whats being done is to find best pair of candidates in each iteration, one could probably could consider doing this list management with a few arrays (indices, is_valid)

Dr15Jones · 2024-04-22T14:46:53Z

So I did a log/log histogram of the max amount of memory used at one time by DisplacedRegionSeedingVertexProducer for each event (x axis) vs the number of events with that memory usage:

Dr15Jones · 2024-04-22T14:50:15Z

seems to be doing N^2 sorts on a potentially N^2 length list of something this O(100) bytes. Since whats being done is to find best pair of candidates in each iteration, one could probably could consider doing this list management with a few arrays (indices, is_valid)

The use of std::list probably isn't helping much either as it is probably the cause of the huge number of allocations/deallocations.

VinInn · 2024-04-23T09:33:06Z

The algorithm looks similar (in intent) to FastJet one...
Maybe somebody can investigate if using the fastJet Heap (that allows removal of any element not just pop)
would not be more efficient

see https://fastjet.fr/repo/doxygen-3.4.2/classfastjet_1_1MinHeap.html
how to use it?
https://fastjet.fr/repo/doxygen-3.2.0/ClusterSequence__TiledN2_8cc_source.html line 820
(compare with the slower methods above)

VinInn · 2024-04-23T09:35:26Z

Of course CMSSW is plenty of other, better behaving, cluster algorithms: reuse maybe worth.

davidlange6 · 2024-04-23T13:11:23Z

I looked a bit more - in the event in question

it takes 20 minutes on a pretty fast machine to even get to the problematic module (so there are other issues for this event)
there are 130k pseudoROIs (as each TrackCluster becomes a pseudoROI) (*88 bytes each is 10MB)
then there are 300M combos of pseudoROIs considered in the distance calculation (*16 bytes each is 5GB)

so changing from this Distance struct to just a vector of floats to cache the distances is probably already sufficient memory wise..

anyway, even after solving any memory issue, it looks like the algorithm will take many hours/days to run this event (limited by https://github.com/cms-sw/cmssw/blob/master/RecoTracker/DisplacedRegionalTracking/plugins/DisplacedVertexCluster.cc#L12 iiuc)

slava77 · 2024-04-23T13:30:50Z

I looked a bit more - in the event in question
* it takes 20 minutes on a pretty fast machine to even get to the problematic module (so there are other issues for this event)

* there are 130k `pseudoROIs` (as each `TrackCluster` becomes a `pseudoROI`) (*88 bytes each is 10MB)

* then there are 300M combos of `pseudoROIs` considered in the distance calculation (*16 bytes each is 5GB)
so changing from this Distance struct to just a vector of floats to cache the distances is probably already sufficient memory wise..

anyway, even after solving any memory issue, it looks like the algorithm will take many hours/days to run this event (limited by https://github.com/cms-sw/cmssw/blob/master/RecoTracker/DisplacedRegionalTracking/plugins/DisplacedVertexCluster.cc#L12 iiuc)

@aehart
is it practical to truncate pseudoROIs at 10K or will it affect some possible physics of (your) interest?

slava77 · 2024-04-23T14:48:48Z

is it practical to truncate pseudoROIs at 10K or will it affect some possible physics of (your) interest?

I got a response by email from Yuri: we are never supposed to have so many ROIs! A mild cut in it would be just fine.

Since @davidlange6 you already have this setup in place, perhaps you can check if a truncation at 10K solves the memory/timing problem.

davidlange6 · 2024-04-23T16:34:46Z

ha - i've rewritten a bunch of stuff... can have a look tomorrow I think.

The first 10k is sufficient? [that should certainly solve this..]

davidlange6 · 2024-04-23T20:48:24Z

it seems a limit of 10k means 1hr for the plugin to process this event.. Perhaps lower is more appropriate?

slava77 · 2024-04-23T21:44:53Z

it seems a limit of 10k means 1hr for the plugin to process this event.. Perhaps lower is more appropriate?

If the preceding parts take 20 mins, I'd aim for something smaller but perhaps not too greedy, so, aiming for 2 mins, perhaps.
I'm not sure which power is the explosion. Would 3K be enough?

mandrenguyen · 2024-04-24T06:19:56Z

it seems a limit of 10k means 1hr for the plugin to process this event.. Perhaps lower is more appropriate?

If the preceding parts take 20 mins, I'd aim for something smaller but perhaps not too greedy, so, aiming for 2 mins, perhaps. I'm not sure which power is the explosion. Would 3K be enough?

I'm not sure I would use the running time of preceding modules to decide where to put the limit. It appears there is a problem with jetCoreRegionalStepEndcapTracks that presumably should also be fixed by a seeding limit of one type or another. Barring that module the event is processed much faster.
Shouldn't we instead decide on the limit based on the distribution of pseudoROIs in normal collision events?
I think it's reasonably safe to assume this is some sort of beam background event.

davidlange6 · 2024-04-24T08:20:26Z

ok, my 1 hr is now like 1 minute after merging some code improvements.

#44829

mandrenguyen · 2024-05-16T18:35:21Z

+1
I think this is issue can be closed now after the fix from David:
#44829

cmsbuild · 2024-05-16T18:35:46Z

This issue is fully signed and ready to be closed.

makortel · 2024-05-16T18:38:35Z

@cmsbuild, please close

cmsbuild added reconstruction-pending pending-signatures labels Apr 21, 2024

cmsbuild added the tracking label Apr 22, 2024

davidlange6 mentioned this issue Apr 24, 2024

optimize and protect DisplacedRegionSeedingVertexProducer against busy events #44829

Merged

cmsbuild added reconstruction-approved fully-signed and removed reconstruction-pending pending-signatures labels May 16, 2024

cmsbuild closed this as completed May 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

T0 prompt-RECO memory problem in run 379524 #44795

T0 prompt-RECO memory problem in run 379524 #44795

Dr15Jones commented Apr 21, 2024

Dr15Jones commented Apr 21, 2024

cmsbuild commented Apr 21, 2024

cmsbuild commented Apr 21, 2024 •

edited

cmsbuild commented Apr 21, 2024

Dr15Jones commented Apr 21, 2024

mandrenguyen commented Apr 22, 2024

mmusich commented Apr 22, 2024

Dr15Jones commented Apr 22, 2024

mandrenguyen commented Apr 22, 2024

mandrenguyen commented Apr 22, 2024

dan131riley commented Apr 22, 2024

slava77 commented Apr 22, 2024

slava77 commented Apr 22, 2024

Dr15Jones commented Apr 22, 2024

slava77 commented Apr 22, 2024 •

edited

mandrenguyen commented Apr 22, 2024 •

edited

davidlange6 commented Apr 22, 2024

Dr15Jones commented Apr 22, 2024

davidlange6 commented Apr 22, 2024

Dr15Jones commented Apr 22, 2024

Dr15Jones commented Apr 22, 2024

VinInn commented Apr 23, 2024 •

edited

VinInn commented Apr 23, 2024

davidlange6 commented Apr 23, 2024

slava77 commented Apr 23, 2024

slava77 commented Apr 23, 2024

davidlange6 commented Apr 23, 2024

davidlange6 commented Apr 23, 2024

slava77 commented Apr 23, 2024

mandrenguyen commented Apr 24, 2024

davidlange6 commented Apr 24, 2024

mandrenguyen commented May 16, 2024

cmsbuild commented May 16, 2024

makortel commented May 16, 2024

T0 prompt-RECO memory problem in run 379524 #44795

T0 prompt-RECO memory problem in run 379524 #44795

Comments

Dr15Jones commented Apr 21, 2024

Dr15Jones commented Apr 21, 2024

cmsbuild commented Apr 21, 2024

cmsbuild commented Apr 21, 2024 • edited

cmsbuild commented Apr 21, 2024

Dr15Jones commented Apr 21, 2024

mandrenguyen commented Apr 22, 2024

mmusich commented Apr 22, 2024

Dr15Jones commented Apr 22, 2024

mandrenguyen commented Apr 22, 2024

mandrenguyen commented Apr 22, 2024

dan131riley commented Apr 22, 2024

slava77 commented Apr 22, 2024

slava77 commented Apr 22, 2024

Dr15Jones commented Apr 22, 2024

slava77 commented Apr 22, 2024 • edited

mandrenguyen commented Apr 22, 2024 • edited

davidlange6 commented Apr 22, 2024

Dr15Jones commented Apr 22, 2024

davidlange6 commented Apr 22, 2024

Dr15Jones commented Apr 22, 2024

Dr15Jones commented Apr 22, 2024

VinInn commented Apr 23, 2024 • edited

VinInn commented Apr 23, 2024

davidlange6 commented Apr 23, 2024

slava77 commented Apr 23, 2024

slava77 commented Apr 23, 2024

davidlange6 commented Apr 23, 2024

davidlange6 commented Apr 23, 2024

slava77 commented Apr 23, 2024

mandrenguyen commented Apr 24, 2024

davidlange6 commented Apr 24, 2024

mandrenguyen commented May 16, 2024

cmsbuild commented May 16, 2024

makortel commented May 16, 2024

cmsbuild commented Apr 21, 2024 •

edited

slava77 commented Apr 22, 2024 •

edited

mandrenguyen commented Apr 22, 2024 •

edited

VinInn commented Apr 23, 2024 •

edited