Alpaka implementation of Hcal Local Reconstruction (Mahi) #44910
base: master
Conversation
-code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44910/40188
Code check has found code style and quality issues which could be resolved by applying the following patch(es).

+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44910/40189
A new Pull Request was created by @kakwok for master. It involves the following packages:
@syuvivida, @francescobrivio, @makortel, @rvenditti, @mandrenguyen, @fwyzard, @tjavaid, @mdhildreth, @saumyaphor4252, @perrotta, @consuegs, @jfernan2, @cmsbuild, @civanch, @nothingface0, @antoniovagnerini can you please review it and eventually sign? Thanks. cms-bot commands are listed here.
type hcal
if (sampleWithinWindow == 0) {
  shrMethod0EnergyAccum[lch] = 0;
  //TODO: check this conversion
  shrMethod0EnergySamplePair[lch] = int(std::numeric_limits<float>::min());
std::numeric_limits<float>::min() is 1.17549e-38f.
(int) std::numeric_limits<float>::min() is 0.
Here is the original line from the CUDA version:
shrMethod0EnergySamplePair[lch] = __float_as_uint(std::numeric_limits<float>::min());
so this line should also use __float_as_uint / edm::bit_cast. Good catch!
Thanks for catching it. I've implemented it in dceaf88
  } // loop for channel groups
}
}; //Kernel_prep1d_sameNumberOfSamples
class Kernel_prep_pulseMatrices_sameNumberOfSamples {
After going through the code again: this is actually more important on CPU, where the elements are processed sequentially by a single thread.
For example, using alpaka::syncBlockThreads(acc); inside the double loop

//Loop over all groups of channels
for (auto group : uniform_groups_y(acc, nchannels)) {
  //Loop over each channel, first compute soiSamples for all channels
  for (auto channel : uniform_group_elements_y(acc, group, nchannels)) {

may or may not work for the CUDA backend, but it definitely does not work for the serial CPU backend. The correct way is to split the loop into N parts, and put the N-1 synchronisations between each pair of loops.
There are indeed some problems of that kind (and ...)
-code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44910/40357
Code check has found code style and quality issues which could be resolved by applying the following patch(es).

+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44910/40358
Pull request #44910 was updated. @mdhildreth, @cmsbuild, @saumyaphor4252, @makortel, @sunilUIET, @subirsarkar, @srimanob, @AdrianoDee, @nothingface0, @francescobrivio, @rvenditti, @syuvivida, @consuegs, @tjavaid, @antoniovagnerini, @perrotta, @jfernan2, @fwyzard, @miquork, @mandrenguyen, @civanch can you please check and sign again.
With the latest changes, I don't get any errors from valgrind either.
In relation to #44910 (comment), I just re-tested the code (as of kakwok@9ac4309) with:

#!/bin/bash -ex
scram p CMSSW_14_0_7_MULTIARCHS
cd CMSSW_14_0_7_MULTIARCHS/src
eval `scramv1 runtime -sh`
git cms-addpkg CondFormats/DataRecord
git cms-addpkg CondFormats/HcalObjects
git cms-addpkg DQM/HcalTasks
git cms-addpkg DataFormats/HcalDigi
git cms-addpkg DataFormats/HcalRecHit
git cms-addpkg EventFilter/HcalRawToDigi
git cms-addpkg HeterogeneousCore/AlpakaInterface
git cms-addpkg RecoLocalCalo/Configuration
git cms-addpkg RecoLocalCalo/HcalRecAlgos
git cms-addpkg RecoLocalCalo/HcalRecProducers
git cms-addpkg Configuration/PyReleaseValidation
wget https://patch-diff.githubusercontent.com/raw/cms-sw/cmssw/pull/44910.diff
git apply 44910.diff
git cherry-pick 15ea9750359193d75e9e406d80d17d7c48b78788
git cms-checkdeps -a
scram b -j 20
hltGetConfiguration /dev/CMSSW_14_0_0/GRun \
--unprescale \
--output minimal \
--globaltag auto:phase1_2024_realistic \
--mc \
--max-events 100 \
--eras Run3 \
--l1-emulator uGT --l1 L1Menu_Collisions2024_v1_2_0_xml \
--input /store/mc/Run3Winter24Digi/TT_TuneCP5_13p6TeV_powheg-pythia8/GEN-SIM-RAW/133X_mcRun3_2024_realistic_v8-v2/80000/dc984f7f-2e54-48c4-8950-5daa848b6db9.root \
--customise HLTrigger/Configuration/customizeHLTforAlpaka.customizeHLTforAlpaka > hlt_mc.py
cmsRun hlt_mc.py &> hlt.log &

and I don't observe any residual crash.
please test
Indeed, I still see large discrepancies comparing the HLT results running on GPU with the CUDA vs Alpaka version. Here are some of the biggest discrepancies:
On the other hand, the alpaka GPU results themselves are fairly reproducible, to +/- 1 event out of 100k:
This is in line with the results I get using the CUDA version of the HCAL reconstruction.
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ddea1b/39567/summary.html
Comparison Summary:
PR description:
This PR ports the CUDA implementation of the Hcal Local Reconstruction (Mahi) to Alpaka.
The custom SoA data structures used in CUDA for the HCAL condition data and rechits are replaced with PortableCollection.
The current Alpaka implementation aims at reproducing the results of the CUDA implementation; no algorithmic changes are made.
There are 4 main pieces involved in the migration:
- hcalDigisProducerPortable: converts CPU digis into the SoA format
- Mahi.dev.cc: the Mahi kernels
- HcalRecHitSoAToLegacy: converts the SoA rechits back to the legacy format
)More details on code design are presented in the recent HLT GPU development meetings:
https://indico.cern.ch/event/1350955/
https://indico.cern.ch/event/1350953/
https://indico.cern.ch/event/1350952/
https://indico.cern.ch/event/1230377/
https://indico.cern.ch/event/1230374/
PR validation:
Alpaka GPU and CPU rechit energies are validated with a slightly modified workflow 12434.523 (ttbar simulation).
More validation comparisons are being worked on.