Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup (grouped)limitedCandidates by changing the implementation of the candidates' number limitation. #44904

Merged
merged 11 commits into from
May 22, 2024

Conversation

VinInn
Copy link
Contributor

@VinInn VinInn commented May 5, 2024

Currenlty, in both (grouped)limitedCandidates, for each "step" candidates are first all collected and then the size is reduced based on some score.
This is equivalent to not "pushing" a new candidate if worse than the current worst once the collection has reached the max allowed size. (but for candidates with equal score).
The implementation has been therefore modified to this latter mechanism that is more performant.

It should also be noticed that the final sorting is not required and indeed in groupedLimitedCandidates there is NO final sort.

The Size of TempTrajectory has also been reduced to speed up move and swap.
Took the opportunity to reduce the size of Trajectory as well
and to cleanup in general CkfTrajectoryBuilder code.

In a HLT menu
time of limitedCandidates improves of 20%
while the one of groupedLimitedCandidates improves of a bit more than 10%

HLT throughput improves of a solid 1.5% at least.

Purely technical. Negligible regressions may appear in case of "candidates" with identical score.

@cmsbuild
Copy link
Contributor

cmsbuild commented May 5, 2024

cms-bot internal usage

@VinInn
Copy link
Contributor Author

VinInn commented May 5, 2024

@cmsbuild please test

@cmsbuild
Copy link
Contributor

cmsbuild commented May 5, 2024

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44904/40174

  • This PR adds an extra 28KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented May 5, 2024

A new Pull Request was created by @VinInn for master.

It involves the following packages:

  • DataFormats/TrackCandidate (reconstruction)
  • TrackingTools/PatternTools (reconstruction)

@mandrenguyen, @jfernan2 can you please review it and eventually sign? Thanks.
@HuguesBrun, @jhgoh, @felicepantaleo, @gpetruc, @abbiendi, @VinInn, @mtosi, @mmusich, @bellan, @dgulhan, @andrea21z, @GiacomoSguazzoni, @CeliaFernandez, @missirol, @Fedespring, @cericeci, @rovere, @trocino, @VourMa, @JanFSchulte this is something you requested to watch as well.
@antoniovilela, @sextonkennedy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

cmsbuild commented May 5, 2024

-1

Failed Tests: UnitTests RelVals RelVals-INPUT AddOn
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2dcb23/39240/summary.html
COMMIT: 9469803
CMSSW: CMSSW_14_1_X_2024-05-05-0000/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44904/39240/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 6 errors in the following unit tests:

---> test testTauEmbeddingWorkflow2016postVFP had ERRORS
---> test testTauEmbeddingWorkflow2016preVFP had ERRORS
---> test testTauEmbeddingWorkflow2017 had ERRORS
and more ...

RelVals

  • 4.53A fatal system signal has occurred: segmentation violation
  • 139.001A fatal system signal has occurred: segmentation violation
  • 140.023A fatal system signal has occurred: segmentation violation
Expand to see more relval errors ...

RelVals-INPUT

  • 4.64.6_MinimumBias2010A/step2_MinimumBias2010A.log
  • 138.4138.4_PromptCollisions2021/step2_PromptCollisions2021.log
  • 138.5138.5_ExpressCollisions2021/step2_ExpressCollisions2021.log
Expand to see more relval errors ...

AddOn Tests

A fatal system signal has occurred: segmentation violation
A fatal system signal has occurred: segmentation violation
A fatal system signal has occurred: segmentation violation
Expand to see more addon errors ...

@slava77
Copy link
Contributor

slava77 commented May 5, 2024

Took the opportunity to reduce the size of Trajectory as well.

is this from reordering the data members?

@slava77
Copy link
Contributor

slava77 commented May 5, 2024

Sorting of TempTrajectory in CkfTrajectoryBuilder::limitedCandidates takes almost 1% of HLT time.
Reducing it's size to two pointers this timing is reduced by half.

wouldn't it be more practical to update CkfTrajectoryBuilder::limitedCandidates to sort indices to trajectories (or did I misunderstand the comment/code)?

What's the cost of now having more scattered data in memory? Is it negligible? It would be nice to see some profiler or at least the timing piechart.

@VinInn
Copy link
Contributor Author

VinInn commented May 5, 2024

@slava77 , sorting indices and caching score. maybe yes. Was an option I considered. Not sure it is more practical.
The data are not really much more scattered in memory, there is just one more (fast) indirection.

As said the code is now faster at least for HLT. In reco it crashes for reasons to be understood (not obvious at all: it seems that not everything has ben recompiled?)
will now try some relval.

@VinInn
Copy link
Contributor Author

VinInn commented May 5, 2024

Took the opportunity to reduce the size of Trajectory as well.

is this from reordering the data members?

In part. also moving from short (int16) to uint8 (nhits is definetively less than 255. HitPattern is limited to 76)

@VinInn
Copy link
Contributor Author

VinInn commented May 5, 2024

crash reason understood: comes from a test of TempTrajectory validity.
A clever fix did not work. Will try another or just revert to a full default constructor...

@VinInn VinInn changed the title Speedup handling of (Temp)Trajectory by reducing their size Speedup (grouped)limitedCandidates by changing the implementation of the candidates' number's limitation. May 14, 2024
@VinInn VinInn changed the title Speedup (grouped)limitedCandidates by changing the implementation of the candidates' number's limitation. Speedup (grouped)limitedCandidates by changing the implementation of the candidates' number limitation. May 14, 2024
@VinInn
Copy link
Contributor Author

VinInn commented May 15, 2024

For what I'm concerned this PR is final.
Does TRK-POG or Reco have more questions?
(BTW I do not see them in the list of people notified....)

@slava77
Copy link
Contributor

slava77 commented May 15, 2024

Does TRK-POG or Reco have more questions?

I looked at the 1K events tracking comparisons in #44904 (comment) and had no comments, sorry for not being explicit.

I'm not sure I've seen a profiling/timing result re my comment from May 5

@VinInn
Copy link
Contributor Author

VinInn commented May 15, 2024

I reported on time improvements in the (updated) first comment

In a HLT menu
time of limitedCandidates improves of 20%
while the one of groupedLimitedCandidates improves of a bit more than 10%

HLT throughput improves of a solid 1.5% at least.

I can post results from perf later on.

@slava77
Copy link
Contributor

slava77 commented May 15, 2024

I'm not sure I've seen a profiling/timing result

I reported on time improvements in the (updated) first comment

thanks; apparently I was looking in time order and didn't check the PR description

@VinInn
Copy link
Contributor Author

VinInn commented May 15, 2024

this is the perf report for the release

image

and this other one for this PR

image

the 20% improvement for limitedCandidate is pretty evident
for the Grouped version is a 5%

@slava77
Copy link
Contributor

slava77 commented May 15, 2024

the 20% improvement for limitedCandidate is pretty evident

Do I understand correctly that it is closer to 12-15%, if the overall 5-7% decrease in unrelated modules like muonId or seedCreator are considered?

Is it fair to conclude also from looking at the unrelated modules that the grouped version is about the same if not slower?

@VinInn
Copy link
Contributor Author

VinInn commented May 15, 2024

@slava77 I can try to run with more events to make initialization count less

@VinInn
Copy link
Contributor Author

VinInn commented May 17, 2024

repeated the time measurements with 10K events and no GPU (one will notice now patatrack showing up)

This is the Release
image

and this the PR
image

I would say that the speed improvement in limitedCandidate is obvious and traceable to the reduction of "heap" operations.
In the grouped version the difference goes "in the noise".

@slava77
Copy link
Contributor

slava77 commented May 17, 2024

I would say that the speed improvement in limitedCandidate is obvious and traceable to the reduction of "heap" operations.

+1

In the grouped version the difference goes "in the noise".

both examples were worse. This time the unrelated modules are not visible to guess which direction the baseline is moving.
I don't see the *_heap operations in the baseline in the grouped, while it shows up at the bottom with the PR.

Is the test from HLT?
IIUC, the offline has a much larger weight in the grouped (even more so in a no-mkFit workflow/era). Perhaps a test there will be more convincing that the cost is at least not increasing.

@VinInn
Copy link
Contributor Author

VinInn commented May 17, 2024

this is offline: first column Release, second colum PR
[innocent@patatrack02 13034.0_TTbar_14TeV+2024PU]$ paste TimeOriTraj.txt TimeNewTraj.txt | awk '{print $1, $2, $4}'
convTrackCandidates 0.207506 0.206641
conversionTrackCandidates 0.157866 0.144111
cosmicsVetoTrackCandidates 0.001399 0.001399
detachedQuadStepTrackCandidates 0.001174 0.001139
detachedQuadStepTrackCandidatesMkFit 0.012637 0.012584
detachedQuadStepTrackCandidatesMkFitSeeds 0.000278 0.000279
detachedTripletStepTrackCandidates 0.007312 0.007274
detachedTripletStepTrackCandidatesMkFit 0.340068 0.340722
detachedTripletStepTrackCandidatesMkFitSeeds 0.003492 0.003338
displacedRegionalStepTrackCandidates 0.181286 0.180114
duplicateDisplacedTrackCandidates 0.000020 0.000020
duplicateTrackCandidates 0.104824 0.105132
electronCkfTrackCandidates 0.011372 0.011065
highPtTripletStepTrackCandidates 0.035885 0.035877
highPtTripletStepTrackCandidatesMkFit 0.096393 0.096400
highPtTripletStepTrackCandidatesMkFitSeeds 0.001586 0.001528
initialStepTrackCandidates 0.006030 0.006038
initialStepTrackCandidatesMkFit 0.087755 0.087670
initialStepTrackCandidatesMkFitPreSplitting 0.092760 0.092999
initialStepTrackCandidatesMkFitSeeds 0.001295 0.001252
initialStepTrackCandidatesMkFitSeedsPreSplitting 0.001381 0.001337
initialStepTrackCandidatesPreSplitting 0.006307 0.006293
jetCoreRegionalStepBarrelTrackCandidates 0.054636 0.054550
jetCoreRegionalStepEndcapTrackCandidates 0.092552 0.091719
lowPtGsfEleCkfTrackCandidates 0.009014 0.008760
lowPtQuadStepTrackCandidates 0.704589 0.700201
lowPtTripletStepTrackCandidates 0.554903 0.552126
mixedTripletStepTrackCandidates 0.032825 0.032708
muonSeededTrackCandidatesInOut 0.003372 0.003375
muonSeededTrackCandidatesOutIn 0.000696 0.000695
muonSeededTrackCandidatesOutInDisplaced 0.000559 0.000560
pixelLessStepTrackCandidates 0.011386 0.010974
pixelLessStepTrackCandidatesMkFit 0.706225 0.707222
pixelLessStepTrackCandidatesMkFitSeeds 0.021475 0.020879
pixelPairStepTrackCandidates 0.152920 0.152784
tobTecStepTrackCandidates 0.502028 0.503019
uncleanedOnlyConversionTrackCandidates 0.002595 0.002350
uncleanedOnlyElectronCkfTrackCandidates 0.000094 0.000091

@VinInn
Copy link
Contributor Author

VinInn commented May 17, 2024

Fro what concern unrelated modules:
there are two from Pixels, plus all the functions below AdvanceOneLayer such as gropuedMeasurements.
It's not easy to get stabler results

@slava77
Copy link
Contributor

slava77 commented May 17, 2024

this is offline: first column Release, second colum PR

totals after greping out initialStep\|detached\|highPt\|pixelLess and the rest would be more clear.

The results are pretty stable; it looks like the 3rd significant digit varies (things often vary more).

@VinInn
Copy link
Contributor Author

VinInn commented May 17, 2024

I leave it as exercise to the reader...
Do you agree that is not worse

@VinInn
Copy link
Contributor Author

VinInn commented May 17, 2024

For sustainability, I think it is better to keep the same implementation in both

@slava77
Copy link
Contributor

slava77 commented May 17, 2024

I leave it as exercise to the reader... Do you agree that is not worse

OK, only convTrack\|displacedRegional\|jetCore\|lowPt\|mixed\|muon\|pixelPair\|tobTec are using grouped; they add up to old new new/old: 2.48787 2.47849 0.99623 so, around 0.4% better.
The mkfit reference ratio is 1.00026.

For sustainability, I think it is better to keep the same implementation in both

fine

thanks for the inputs.
+1

@mmusich
Copy link
Contributor

mmusich commented May 21, 2024

+hlt

@jfernan2
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@rappoccio
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 5132df0 into cms-sw:master May 22, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants