Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alpaka implementation of Hcal Local Reconstruction (Mahi) #44910

Open
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

kakwok
Copy link
Contributor

@kakwok kakwok commented May 6, 2024

PR description:

This PR port the CUDA implementation of Hcal Local Reconstruction (Mahi) to using Alpaka.
Custom SoA data structure used in CUDA for HCAL condition data and rechits are replaced with PortableCollection.
The current Alpaka implementation aims at reproducing the results from the CUDA implementation, no algorithmic changes are made.

There are 4 main pieces involved in the migration:

  • digiConverter(hcalDigisProducerPortable): Convert CPU digis into SoA format
  • Produce HCAL condition data in SoA format (Multiple producers)
  • Mahi kernels (Mahi.dev.cc)
  • Convert rechits from SoA to legacy format (HcalRecHitSoAToLegacy)

More details on code design are presented in the recent HLT GPU development meetings:
https://indico.cern.ch/event/1350955/
https://indico.cern.ch/event/1350953/
https://indico.cern.ch/event/1350952/
https://indico.cern.ch/event/1230377/
https://indico.cern.ch/event/1230374/

PR validation:

Alpaka GPU and CPU rechit energies are validated with slightly modified workflow 12434.523 (ttbar simulation).
More validation comparisons are being worked on.

@cmsbuild
Copy link
Contributor

cmsbuild commented May 6, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

cmsbuild commented May 6, 2024

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44910/40188

  • This PR adds an extra 248KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Copy link
Contributor

cmsbuild commented May 6, 2024

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44910/40189

  • This PR adds an extra 248KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented May 6, 2024

A new Pull Request was created by @kakwok for master.

It involves the following packages:

  • CondFormats/DataRecord (db, alca)
  • CondFormats/HcalObjects (db, alca)
  • DQM/HcalTasks (dqm)
  • DataFormats/HcalDigi (simulation)
  • DataFormats/HcalRecHit (reconstruction)
  • EventFilter/HcalRawToDigi (reconstruction)
  • HeterogeneousCore/AlpakaInterface (heterogeneous)
  • RecoLocalCalo/Configuration (reconstruction)
  • RecoLocalCalo/HcalRecAlgos (reconstruction)
  • RecoLocalCalo/HcalRecProducers (reconstruction)

@syuvivida, @francescobrivio, @makortel, @rvenditti, @mandrenguyen, @fwyzard, @tjavaid, @mdhildreth, @saumyaphor4252, @perrotta, @consuegs, @jfernan2, @cmsbuild, @civanch, @nothingface0, @antoniovagnerini can you please review it and eventually sign? Thanks.
@argiro, @DryRun, @rsreds, @Martin-Grunewald, @PonIlya, @JanChyczynski, @rovere, @makortel, @missirol, @yuanchao, @tocheng, @bsunanda, @apsallid, @mmusich, @ReyerBand, @wang0jin, @sameasy, @thomreis, @mariadalfonso, @youyingli, @abdoulline, @rchatter, @seemasharmafnal this is something you requested to watch as well.
@sextonkennedy, @antoniovilela, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Copy link
Contributor

fwyzard commented May 6, 2024

type hcal

@cmsbuild cmsbuild added the hcal label May 6, 2024
if (sampleWithinWindow == 0) {
shrMethod0EnergyAccum[lch] = 0;
//TODO: check this conversion
shrMethod0EnergySamplePair[lch] = int(std::numeric_limits<float>::min());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::numeric_limits<float>::min() is 1.17549e-38f.
(int) std::numeric_limits<float>::min() is 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the original line from CUDA version

shrMethod0EnergySamplePair[lch] = __float_as_uint(std::numeric_limits<float>::min());

so also this line should use __float_as_uint / edm::bit_cast. Good catch!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching it. I've implemented it in dceaf88

} // loop for channgel groups
}
}; //Kernel_prep1d_sameNumberOfSamples
class Kernel_prep_pulseMatrices_sameNumberOfSamples {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class Kernel_prep_pulseMatrices_sameNumberOfSamples {
class Kernel_prep_pulseMatrices_sameNumberOfSamples {

@fwyzard
Copy link
Contributor

fwyzard commented May 24, 2024

After going again through the Kernel_prep1d_sameNumberOfSamples kernel, I'm now worried that the problem may be due to a missing or incorrect synchronisation among the threads between filling and using the shared memory.

This is actually more important on CPU, where the elements are processed sequentially by a singe thread.

@fwyzard
Copy link
Contributor

fwyzard commented May 24, 2024

For example, using

alpaka::syncBlockThreads(acc);

inside the double loop

//Loop over all groups of channels
for (auto group : uniform_groups_y(acc, nchannels)) {
    //Loop over each channel, first compute soiSamples for all channels
    for (auto channel : uniform_group_elements_y(acc, group, nchannels)) {

may or may not work for the CUDA backend, but it definitely does not work for the serial CPU backend.

The correct way should be to split the loop into N parts, and put the N-1 synchronisations between each pair of loops.

@makortel
Copy link
Contributor

After going again through the Kernel_prep1d_sameNumberOfSamples kernel, I'm now worried that the problem may be due to a missing or incorrect synchronisation among the threads between filling and using the shared memory.

There are indeed some problems of that kind (and syncBlockThreads() at incorrect loop level) that @kakwok is working on to fix those.

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44910/40357

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@kakwok
Copy link
Contributor Author

kakwok commented May 27, 2024

I have added the float_as_uint fix and the attempt for fixing the issue with syncBlockThreads.
The updated code passes the racecheck of the compute-sanitizer
compute-sanitizer --target-processes all --require-cuda-init=no --tool=racecheck cmsRun step3_RAW2DIGI_RECO_VALIDATION_DQM.py

Unfortunately, the discrepancy between Alpaka GPU and legacy CPU remains.
image
image
But since I'll be on vacation this week, I would like to share the progress nonetheless.

I have not implemented the initialising of the shared memory.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44910/40358

@cmsbuild
Copy link
Contributor

@fwyzard
Copy link
Contributor

fwyzard commented May 27, 2024

With the latest changes, I don't get any errors from valgrind either.

@mmusich
Copy link
Contributor

mmusich commented May 27, 2024

in relation to #44910 (comment), I just re-tested the code (as of kakwok@9ac4309) with:

#!/bin/bash -ex

scram p CMSSW_14_0_7_MULTIARCHS
cd  CMSSW_14_0_7_MULTIARCHS/src
eval `scramv1 runtime -sh`

git cms-addpkg CondFormats/DataRecord
git cms-addpkg CondFormats/HcalObjects 
git cms-addpkg DQM/HcalTasks 
git cms-addpkg DataFormats/HcalDigi
git cms-addpkg DataFormats/HcalRecHit 
git cms-addpkg EventFilter/HcalRawToDigi 
git cms-addpkg HeterogeneousCore/AlpakaInterface 
git cms-addpkg RecoLocalCalo/Configuration 
git cms-addpkg RecoLocalCalo/HcalRecAlgos 
git cms-addpkg RecoLocalCalo/HcalRecProducers
git cms-addpkg Configuration/PyReleaseValidation
wget https://patch-diff.githubusercontent.com/raw/cms-sw/cmssw/pull/44910.diff
git apply 44910.diff
git cherry-pick 15ea9750359193d75e9e406d80d17d7c48b78788
git cms-checkdeps -a
scram b -j 20

hltGetConfiguration /dev/CMSSW_14_0_0/GRun \
		    --unprescale \
		    --output minimal \
		    --globaltag auto:phase1_2024_realistic \
		    --mc \
		    --max-events 100 \
		    --eras Run3 \
		    --l1-emulator uGT --l1 L1Menu_Collisions2024_v1_2_0_xml \
		    --input /store/mc/Run3Winter24Digi/TT_TuneCP5_13p6TeV_powheg-pythia8/GEN-SIM-RAW/133X_mcRun3_2024_realistic_v8-v2/80000/dc984f7f-2e54-48c4-8950-5daa848b6db9.root \
		    --customise HLTrigger/Configuration/customizeHLTforAlpaka.customizeHLTforAlpaka > hlt_mc.py

cmsRun hlt_mc.py &> hlt.log&

and I don't observe any residual crash.

@mandrenguyen
Copy link
Contributor

please test

@fwyzard
Copy link
Contributor

fwyzard commented May 28, 2024

Unfortunately, the discrepancy between Alpaka GPU and legacy CPU remains.

Indeed, I still see large discrepancies comparing the HLT results running on GPU with the CUDA vs Alpaka version.

Here are some of the biggest discrepancies:

      Events    Accepted      Gained        Lost       Other  Trigger
       99252        3935        +214         -85       ~7381  HLT_HT300_Beamspot_v20
       99252        1507        +246        -223       ~3417  HLT_CaloMHT90_v12
       99252        1034         +68         -46       ~3772  HLT_CaloMET90_NotCleaned_v12
       99252        2550        +164         -78       ~6796  HLT_PFHT350_v28
       99252        1202         +87         -51       ~8171  HLT_QuadPFJet103_88_75_15_v14
       99252        3902         +50        -156      ~17260  HLT_DiPhoton10_CaloIdL_v8
       99252         394          +1         -18       ~1333  HLT_SingleEle8_v7

@fwyzard
Copy link
Contributor

fwyzard commented May 28, 2024

On the other hand, the alpaka GPU results themselves are fairly reproducible, to +/- 1 event out of 100k:

      Events    Accepted      Gained        Lost       Other  Trigger
       99252          41           -          -1           -  HLT_AK8DiPFJet250_250_SoftDropMass40_v3
       99252          32           -          -1           -  HLT_AK8DiPFJet250_250_SoftDropMass50_v3
       99252         115          +1           -           -  HLT_DoubleMu4_MuMuTrk_Displaced_v24
       99252          99          +1           -          ~2  HLT_PFMET120_PFMHT120_IDTight_v29
       99252          65          +1           -           -  HLT_PFMET130_PFMHT130_IDTight_v29
       99252          34          +1           -          ~1  HLT_PFMET120_PFMHT120_IDTight_PFHT60_v18
       99252          39          +1           -           -  HLT_PFMETNoMu120_PFMHTNoMu120_IDTight_PFHT60_v18
       99252         105          +1           -          ~1  HLT_PFMETNoMu120_PFMHTNoMu120_IDTight_v29
       99252          69          +1           -           -  HLT_PFMETNoMu130_PFMHTNoMu130_IDTight_v28
       99252         168          +1          -1           -  HLT_PFMETNoMu110_PFMHTNoMu110_IDTight_FilterHF_v9
       99252          86          +1           -          ~1  HLT_PFMETNoMu120_PFMHTNoMu120_IDTight_FilterHF_v9
       99252          55          +1           -           -  HLT_PFMETNoMu130_PFMHTNoMu130_IDTight_FilterHF_v9
       99252          12           -          -1           -  HLT_Ele14_eta2p5_IsoVVVL_Gsf_PFHT200_PNetBTag0p53_v2
       99252         883          +1           -           -  HLT_PFHT330PT30_QuadPFJet_75_60_45_40_v18
       99252          25          +1           -           -  HLT_PFHT450_SixPFJet36_PNetBTag0p35_v6
       99252        2636           -          -2           -  HLT_PFHT350_v28
       99252          84          +1          -1           -  HLT_PFHT250_QuadPFJet25_PNet1BTag0p20_PNet1Tauh0p50_v3
       99252          74          +1          -1           -  HLT_PFHT250_QuadPFJet30_PNet1BTag0p20_PNet1Tauh0p50_v3
       99252          68          +1          -1           -  HLT_PFHT280_QuadPFJet30_PNet1BTag0p20_PNet1Tauh0p50_v3
       99252         528          +1           -          ~2  HLT_AK8PFJet220_SoftDropMass40_v10
       99252         430          +1           -           -  HLT_AK8PFJet230_SoftDropMass40_v10
       99252         459           -          -1           -  HLT_VBF_DiPFJet125_45_Mjj1200_v3

Which is in line with the results using I get using the CUDA version of the HCAL reconstruction.

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ddea1b/39567/summary.html
COMMIT: 9ac4309
CMSSW: CMSSW_14_1_X_2024-05-27-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44910/39567/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 3 lines from the logs
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3338862
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3338836
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants