CatchCN CLM4.5 on SLES15 #896

Open
biljanaorescanin opened this issue Jan 31, 2024 · 5 comments

Comments

@biljanaorescanin
Contributor

During SLES15 testing, our nightly test for CatchCN CLM4.5 fails with a floating-point overflow error.

@mathomp4 traced it to:
==== backtrace (tid: 172041) ====
0 0x0000000000016910 __funlockfile() ???:0
1 0x0000000000008e30 _ZGVbN4v_expf_sse4() ???:0
2 0x0000000000059378 __cnfiremod_MOD_cnfirearea() /discover/nobackup/mathomp4/SystemTests/builds/LDAS_GNUCONUS_SLES15/CURRENT/GEOSldas/src/Components/GEOSldas_GridComp/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSphysics_GridComp/GEOSsurface_GridComp/GEOSland_GridComp/GEOScatchCN_GridComp/GEOScatchCNCLM45_GridComp/CLM45/CNFireMod.F90:697

It looks like the exponential call (`expf`) at CNFireMod.F90:697 overflowed.

@gmao-rreichle @weiyuan-jiang @gmao-jkolassa

@mathomp4
Member

Note: We build differently on Milan than on Intel chips. Essentially, the build "targets" the processor. So on Intel chips with GNU we use:

-O3 -march=haswell ...

which is honestly a bit of an old target, and I should look at moving it up.

On the Milans it is built as:

-O3 -march=znver2 ...

So you'll most likely never get zero-diff between the two chips with GNU. The GCM does not seem to crash, but then again, I've never run CatchCN 4.5 with GNU on SLES15. Let me try that now...

@mathomp4
Member

Okay. I can confirm the GCM shows the same thing:

[borgl167:16582:0:16582] Caught signal 8 (Floating point exception: floating-point overflow)
==== backtrace (tid:  16564) ====
 0 0x0000000000016910 __funlockfile()  ???:0
 1 0x0000000000008e30 _ZGVbN4v_expf_sse4()  ???:0
 2 0x0000000000059378 __cnfiremod_MOD_cnfirearea()  /discover/nobackup/mathomp4/SystemTests/builds/AGCM_GNUSLES15/CURRENT/GEOSgcm/src/Components/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSphysics_GridComp/GEOSsurface_GridComp/GEOSland_GridComp/GEOScatchCN_GridComp/GEOScatchCNCLM45_GridComp/CLM45/CNFireMod.F90:697
 3 0x00000000000b10fc __cnecosystemdynmod_MOD_cnecosystemdyn()  /discover/nobackup/mathomp4/SystemTests/builds/AGCM_GNUSLES15/CURRENT/GEOSgcm/src/Components/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSphysics_GridComp/GEOSsurface_GridComp/GEOSland_GridComp/GEOScatchCN_GridComp/GEOScatchCNCLM45_GridComp/CLM45/CNEcosystemDynMod.F90:255

I can think of flag tricks to try, though they'll all probably be non-zero-diff...

@gmao-rreichle
Contributor

@mathomp4 : Thanks for looking into this. We are confident that we understand the reason for the CLM4.5 crash. It's almost certainly caused by a calculation that involves exp(-1e15)*something and should be easy to fix. I'm surprised we didn't run into this issue before. It will probably require a hotfix of the GCM GridComp develop branch to get this to behave on SLES15.
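As an illustration of the kind of fix described above, here is a minimal sketch of guarding an exponential against extreme arguments. It is written in Python for brevity, not in the Fortran of CNFireMod.F90, and it is not the actual patch: the function name `safe_exp` and the bounds `EXP_ARG_MIN` and `EXP_ARG_MAX` are hypothetical, chosen only so that the result stays well inside the single-precision range.

```python
import math

# Hypothetical bounds, chosen for illustration only. exp(80) and exp(-80)
# are both comfortably representable in single precision.
EXP_ARG_MIN = -80.0
EXP_ARG_MAX = 80.0


def safe_exp(x: float) -> float:
    """Return exp(x) while never passing an extreme argument to exp()."""
    if x < EXP_ARG_MIN:
        # The true value is negligibly small, so treat it as zero.
        return 0.0
    # Cap large positive arguments so exp() cannot overflow.
    return math.exp(min(x, EXP_ARG_MAX))


if __name__ == "__main__":
    # An argument like the -1e15 mentioned above never reaches exp().
    print(safe_exp(-1e15))
    print(safe_exp(-1.0))
```

A guard like this changes results slightly whenever an argument falls outside the bounds, so it would probably not be zero-diff.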

@mathomp4
Member

Oh. Okay. Well, I have a possible workaround if not: you can compile CNFireMod.F90 at -O1 for GNU Release builds. But a fix is better! 😄

@biljanaorescanin
Contributor Author

biljanaorescanin commented Feb 14, 2024

This issue will be addressed by PR #900.

Until then, we should disable CatchCN CLM4.5 in both the GEOSldas nightly tests and the GCM ones.
