
Issue with 2 moment cloud microphysics, code shows grid or fails layout regression (but does not show grid pattern) #847

bena-nasa opened this issue Oct 24, 2023 · 15 comments
@bena-nasa (Collaborator) commented Oct 24, 2023

Just making an issue for this problem since one had not been opened yet (sorry if the title is not right, but I'm not an atmospheric scientist; all these different "moist" schemes are just gobbledygook to me). This is not a new issue; user @atrayano has been going over it with a fine-tooth comb, I'm told.
Two users, @dbarahon and @wmputman, have been seeing "odd" patterns in some of the moist exports when running the 2 moment moist physics (I assume it is two moment because they have CLDMICR_OPTION: MGB2_2M in their AGCM.rc file).
Both of them are seeing things like this when examining output; this is "IWP" for example (this picture was made with ncview, using the "low" option for the scale, which @dbarahon said is the easiest way to spot this):
[Image: bad_iwp — IWP field with a blocky, grid-aligned pattern]
The domain decomposition is clearly appearing in the field, which is just not physical (I don't know how else to describe it).
Here is what you have to do to reproduce it:
First check out v11.1.1 of the model. Then update the GEOSgcm_GridComp repo to the feature/donifan/KrokMG3 branch. When I did the test, the GEOSgcm_GridComp SHA I used was 41d6f34f035193794b454c807e8a6a61bc7f9610.
Once built, clone the experiment on discover at /gpfsm/dnb78s1/bmauer/donifan_weirdness. The AGCM.rc has this for the moist options, which I assume is critical:

###########################################################
# BACM_1M microphysics options
# ----------------------------------------
#CLDMICR_OPTION: BACM_1M
###########################################################

###########################################################
# GFDL_1M microphysics options
# ----------------------------------------
#CLDMICR_OPTION: GFDL_1M
#HYDROSTATIC: .TRUE.
###########################################################

###########################################################
# MGB2_2M microphysics options
# ----------------------------------------
CLDMICR_OPTION: MGB2_2M
#MGVERSION: 3
ENTR_DP: 2.e-4
USE_FCT: 0.
###########################################################

###########################################################
# shallow cumulus options
# ----------------------------------------
SHALLOW_OPTION: UW
###########################################################

###########################################################
# convection scheme options
# ----------------------------------------
CONVPAR_OPTION: GF
USE_GF2020: 1
# Convective plumes to be activated (1 true, 0 false)
DEEP: 1
SHALLOW: 0
CONGESTUS: 1
# Choice for the closures:
# deep : 0 ensemble (all) , 1 GR, 4 ll omega, 7 moist conv, 10 PB
# shallow : 0 ensemble (Wstar/BLQE) , 1 Wstar, 4 heat-engine or 7 BLQE
# congestus: 0 ensemble (Wstar/BLQE/PB), 1 Wstar, 2 BLQE, 3 PB, 4 PB_BL
CLOSURE_DEEP: 0
CLOSURE_SHALLOW: 7
CLOSURE_CONGESTUS: 3

Note I'm also told that @wmputman has a feature branch that works off the develop branches of the GEOSgcm fixture: feature/wmputman/hwt_spring_exp, but I don't want to record anything here for reproducibility that depends on a branch, since branches change...

I thought user @dbarahon claimed that downgrading the optimization to -O1 in the moist component fixes this (maybe I misunderstood), but that's not what my testing showed, so if there is a workaround, that was not it. Whether I compiled those moist files at -O3 or -O1, I got these weird patterns clearly showing the domain.

@bena-nasa added the bug (Something isn't working) label Oct 24, 2023
@bena-nasa (Collaborator, Author):
I did confirm that running the same experiment with the "debug" build makes the issue go away. Here is the same picture of "IWP" after 24 hours with the debug build; it clearly looks much more reasonable.
[Image: IWP after 24 hours with the debug build — no grid pattern]

@tclune (Collaborator) commented Oct 24, 2023

This is scary. It does seem to eliminate the suggestion by @wmputman that this is possibly due to an errant formula applying to the whole array.

I see 2 possibilities from this:

  1. There is a compiler bug triggered by optimizations
  2. There is an out-of-bounds reference in the model and the side effects depend on the optimization level.

(2) is more plausible in my estimation.
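
To illustrate (2), here is a toy example (not GEOS code): the victim of an out-of-bounds store depends on how the compiler lays out locals, so the symptom can shift or vanish with optimization level. Building with -fcheck=bounds (gfortran) or -check bounds (Intel) traps this class of bug immediately.

! Toy illustration only: an out-of-bounds write whose side effects
! depend on stack layout and optimization level.
program oob_demo
  implicit none
  real :: a(4), b(4)
  integer :: i
  a = 0.0
  b = 1.0
  do i = 1, 5            ! bug: writes one element past the end of a
     a(i) = 2.0
  end do
  print *, b             ! may or may not show the corruption
end program oob_demo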

@mathomp4 (Member):
I see some odd code in the interface:

!!!!================Estimate qcvar following Xie and Zhang, JGR, 2015
HMOIST_950 = 0.0
HSMOIST_500 = 0.0
IF (PLmb(I, J, LM) .le. 500.0) then
   qcvarr8 = 2.0
ELSEIF (PLmb(I, J, LM) .lt. 950.0) then
   DO K = LM, 1, -1
      if (PLmb(I,J,K) .lt. 500.0) exit
      HSMOIST_500 = MAPL_CP*T(I, J, K) + GZLO(I, J, K) + QST3(I, J, K)*MAPL_ALHL
   END DO
   HMOIST_950 = MAPL_CP*T(I, J, LM) + GZLO(I, J, LM) + Q(I, J, LM)*MAPL_ALHL
   SINST = (HMOIST_950 - HSMOIST_500)/(PLmb(I,J,LM)*100.0 - 50000.0)
ELSE
   DO K = LM, 1, -1
      if (PLmb(I,J,K) .lt. 500.0) exit
      HSMOIST_500 = MAPL_CP*T(I, J, K) + GZLO(I, J, K) + QST3(I, J, K)*MAPL_ALHL
   END DO
   DO K = LM, 1, -1
      if (PLmb(I,J,K) .lt. 950.0) exit
      HMOIST_950 = MAPL_CP*T(I, J, K) + GZLO(I, J, K) + Q(I, J, K)*MAPL_ALHL
   END DO
   SINST = (HMOIST_950 - HSMOIST_500)/45000.0
ENDIF
xscale = (9000.0/imsize)**(-0.666)
qcvarr8 = 0.67 - 0.38*SINST + 4.96*xscale - 8.32*SINST*xscale
qcvarr8 = min(max(qcvarr8, 0.5), 50.0)
if (associated(QCVAR_EXP)) QCVAR_EXP(I, J) = real(qcvarr8)
relvarr8 = qcvarr8

From my reading of that, SINST is only defined for PLmb greater than 500 mb; below that it just does:

qcvarr8  = 2.0

but then later on this:

qcvarr8 =  0.67 -0.38*SINST +  4.96*xscale - 8.32*SINST*xscale  

occurs for all pressures, which would just replace the sub-500 mb value. But since SINST isn't defined in that case, qcvarr8 might be filled with garbage.

Still, I can't see how this could cause the MPI layout to appear in the output. And I have no idea whether qcvarr8 or relvarr8 matter later on or if they are just diagnostic.

It does look like QCVAR_EXP is a possible export to look at, but I don't think it's one of the default ones. Maybe that looks weird?

@mathomp4 (Member):
@bena-nasa Tested setting SINST to 0 in the <500 mb case, but it didn't help the weird pattern. Still, @dbarahon should look at this code and try to figure out what it should be doing.

@bena-nasa (Collaborator, Author) commented Oct 24, 2023

Another data point. I tried building this model version (v11.1.1 with Donifan's branch) with gfortran. A c90 stock experiment (the 1 moment moist) runs fine. However, if I turn on the 2 moment physics it just crashes in the first run step with either the debug or optimized gfortran build, so apparently something in the 2 moment code is so screwy that even gfortran with no optimization can't handle it.
Who knows if it is related to the original issue at hand, but it is clearly another red flag that something is not well in this 2 moment code.

@dbarahon (Contributor):
> @mathomp4: I see some odd code in the interface: […] From my reading of that, SINST is only defined for PLmb greater than 500 mb […]

relvarr8 and qcvarr8 are not diagnostic; however, the line qcvarr8 = min(max(qcvarr8, 0.5), 50.0) is a limiter. It might be that without a value set for SINST, qcvarr8 becomes NaN and bypasses that line. However, I haven't seen any weirdness in the QCVAR_EXP export. Initializing SINST to zero should fix it.
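
A minimal sketch of that failure mode (toy code, not the model; a quiet NaN stands in for whatever garbage an unassigned SINST happens to hold): comparisons against NaN are false, so the min/max clamp is not guaranteed to clip the value.

! Toy sketch: NaN can slip through a min/max limiter.
program nan_limiter
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan
  implicit none
  real :: sinst, xscale, qcvarr8
  sinst  = ieee_value(sinst, ieee_quiet_nan)  ! stand-in for "never assigned"
  xscale = (9000.0/90.0)**(-0.666)            ! e.g. imsize = 90 (c90)
  qcvarr8 = 0.67 - 0.38*sinst + 4.96*xscale - 8.32*sinst*xscale
  ! MIN/MAX with a NaN operand is processor dependent: the clamp below
  ! may return 0.5, 50.0, or pass the NaN straight through.
  qcvarr8 = min(max(qcvarr8, 0.5), 50.0)
  print *, 'qcvarr8 =', qcvarr8
end program nan_limiter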

@bena-nasa (Collaborator, Author) commented Nov 2, 2023

Summary of the facts, to the best of my knowledge, as of November 2nd, 2023:
The main issue, as well as the required tag (which of course specifies the compiler), is described in the first post of this issue, so I will not repeat it here.
The TLDR is: with Intel, if you choose the release build with the 2 moment physics, you get these weird patterns in certain moist variables. Another way to see this is that the model does not pass layout regression with the 2 moment microphysics.

Ways to fix this:
I have found that building GEOS_MGB2_2M_InterfaceMod.F90 with -O0 seems to fix it. Also, just commenting out mmicro_pcond/micro_mg_tend_interface (which I think ARE the 2 moment computational core) bypasses the code, which "fixes" it by virtue of the fact that you don't exercise the code. In a quick check of those interfaces (which is quite daunting because they have 100+ and 200+ arguments) I have not seen anything like mismatched bounds for, say, center/edge variables.

The debug build appears to pass regression/not show these weird patterns.

What is confusing is that I took the debug build and selectively compiled just the moist library with the release flags, and that did not show the bad behavior. So it is as if memory is just being moved around; it is not JUST that one file being built with optimization that triggers this.

gfortran (using gcc 12.1 and openmpi 4.1.3):
The code does appear to pass layout regression (which we assume means it is behaving right) with optimized gfortran. There is one caveat. By default, when you choose the 2 moment physics you get a timestep of 1800 seconds, with an 1800 second component timestep for CHEMISTRY/GOCART/HEMCO/GF/UW. If you choose the 1 moment, the default appears to be a 450 second timestep, with 450 seconds for those components.

For some reason, if you use this 1800 second timestep the model just crashes in the dynamics right away, so to do the gfortran tests I had to set the timestep back to 450 seconds; note that Intel has no trouble with this timestep. I have made a separate issue for that:
#850

@bena-nasa (Collaborator, Author) commented Nov 6, 2023

@wmputman @dbarahon
Is the 2 moment broken in some tags? If I use 11.3.1 (one of the latest model releases) and choose the 2 moment option in gcm_setup, NOTHING runs. I've tried our stock testing restarts at c90 and c24, and in both cases even the debug build just segfaults in the first timestep in that 2 moment code.

I've been trying it outside of the 11.1.1 + Donifan's 2 moment branch, to see if that also shows this, but at least in 11.3.1, if you choose 2 moment it seems not to run.

GEOS-ESM deleted a comment from tclune Nov 6, 2023
@bena-nasa (Collaborator, Author) commented Nov 6, 2023

My earlier observation that there was an out-of-bounds issue with c24 was not correct (I had done something to disable the callback while futzing with the component testing framework, which was causing that erroneous error). I was able to set up a c24 experiment following Donifan's c90, using v11.1.1 + Donifan's 2 moment branch, and that also does not pass layout regression.

So at least the "2 moment problem" seems to be reproducible at lower resolution with the specific model release/branch version I have been testing with. The 2 moment in v11.3.1 still seems broken, though...

@bena-nasa (Collaborator, Author) commented Nov 7, 2023

A few more data points.

I redid the build with only moist optimized and the rest of GEOS built with the debug flags; unfortunately, I must not have done this test right before. The rationale for this test is that if this is a compiler optimization problem, this executable would also fail layout regression. This is with v11.1.1 and the Donifan branch. Unfortunately, this crashes with the following error even before reaching the 2 moment code:

libGEOSmoist_Grid  00002AF037985384  convpar_gf2020_mp        2435  ConvPar_GF2020.F90
libGEOSmoist_Grid  00002AF037977F7E  convpar_gf2020_mp        1591  ConvPar_GF2020.F90
libGEOSmoist_Grid  00002AF037966332  convpar_gf2020_mp         633  ConvPar_GF2020.F90
libGEOSmoist_Grid  00002AF0378DD22F  geos_gf_interface         517  GEOS_GF_InterfaceMod.F90
libGEOSmoist_Grid  00002AF037A61AEE  geos_moistgridcom        5311  GEOS_MoistGridComp.F90

which implies there are even more issues here.

2nd data point. Apparently all these changes to the 2 moment have NOT been merged into any stable tag, to the best of my knowledge, and the 2 moment option in any v11.x.y tag should be considered non-functional. One must either use the Donifan branch referenced in the first post OR use the feature/wmputman/hwt_spring_exp branch from @wmputman. I tried cloning v11.3.2 (the latest GEOSgcm fixture release) and updating GEOSgcm_GridComp to feature/wmputman/hwt_spring_exp, which at the time of cloning was on commit 5cd9926. I did the standard release build, then used that executable in the same experiment I have been running all along at c24. That also did not pass layout regression, so clearly this branch has the same issue.

I am concerned that no one can give me a stable tag to reproduce this, only branches, which of course can change. Hence I have been referencing the SHAs in this issue.

@bena-nasa (Collaborator, Author) commented Nov 8, 2023

And yet another disturbing thing. As I reported in the previous post, with @wmputman's branch I can reproduce the problem we are after using the release build. BUT, if I use the DEBUG build of the model configuration described in the previous post, the model crashes with a floating point exception in ConvPar_GF2020.F90, rather than running and passing layout regression like Donifan's branch does.

@dbarahon (Contributor) commented Nov 10, 2023

I tried cloning v11.3.2. Running MG1 out of the box without updates, setting CLDMICR_OPTION: MGB2_2M, MGVERSION: 1, CONVPAR_OPTION: GF, USE_GF2020: 1 and USE_FCT: 0., does not show the weird behavior. This seems to conflict with @bena-nasa's results.
The run can be found at /gpfsm/dnb34/dbarahon/DEV/MG1_v1132
This would of course point to the latest updates as the source of the problem. However, in my prior experience the layout issue would show up for seemingly unrelated changes, like adding a constant or so to the code. In any case, it seems like the course of action would be to add the latest changes one by one and test each time.

@bena-nasa (Collaborator, Author) commented Nov 13, 2023

@dbarahon @wmputman So I thought that Bill's branch could reproduce the problem; well, I was using layout reproducibility as a proxy. Donifan's branch, when showing these bad patterns, was not layout reproducible, so I assumed the two were one and the same. But in the experiment run with Bill's branch, I'm not seeing the "grid" pattern when I actually look.

So yay, but even after a single step, 2 runs at different layouts diverged, and they were diverging after GF ran, not after the 2 moment code.

Likewise, I also cloned stock v11.3.2 and used the same AGCM.rc settings as @dbarahon used in the experiment in the previous post, in my own c90 experiment. That also did not show any "grid" pattern in the moist fields, but it also did not pass layout regression, so it is absolutely showing weird behavior.

So clearly there is another issue with this code, since it absolutely should pass layout regression. The question is: is this related to the "grid pattern" problem, i.e. another manifestation of a deeper issue?

@bena-nasa changed the title from "Issue with 2 moment cloud microphysics" to "Issue with 2 moment cloud microphysics, code shows grid or fails layout regression (but does not show grid pattern)" Nov 13, 2023
@bena-nasa (Collaborator, Author) commented Nov 15, 2023

@wmputman this 2 moment code that I'm testing with your branch (and that I guess Scott is making a release of) is not right. Besides the fact that it doesn't layout reproduce, the debug build of the model just crashes in GF. Worse, if I take the release build of the model but build just the moist library with the debug flags, it fails too, but fails differently, also in GF! I'll pursue these, but in my mind I would consider the 2 moment option to be non-functional...

@bena-nasa (Collaborator, Author) commented Nov 22, 2023

Update on the situation as of November 22nd, 2023, as far as I can see:

Scott has released a new version of the GEOSgcm fixture, v11.3.3, that has all of Bill's and Donifan's changes they wanted to see get into a release (Bill had broken layout regression in GF2020 too in his development branch, which confused me for a while, but he fixed that in what was passed to Scott, and GF2020 passes layout regression for me in this release), so we now have a stable release to serve as the baseline for tracking the issue.

In the discussion below I will refer to the file GEOS_MGB2_2M_InterfaceMod.F90 as the "MG2 interface" for brevity. My test to find where the code was diverging between layouts was to add multiple calls to "MAPL_CheckpointState" and dump out the internal state of moist at various strategic points in the MG2 interface the first time the run method is executed. I simply ran the code at 2 layouts and found the first dumped checkpoint in the sequence that differed.
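
For the comparison half of that workflow, a hedged sketch: assuming each strategic point also dumps a field as a flat stream of 4-byte reals (the real test diffs the MAPL checkpoint files themselves), a small utility can report the first element at which two layouts diverge. The file format and names here are illustrative, not MAPL tooling.

! Illustrative only: compare two flat binary dumps element by element.
program diff_ckpt
  implicit none
  real(4), allocatable :: a(:), b(:)
  integer :: n, i, u1, u2, nbytes
  character(len=256) :: f1, f2
  call get_command_argument(1, f1)
  call get_command_argument(2, f2)
  inquire(file=f1, size=nbytes)   ! assumes both files are the same size
  n = nbytes/4                    ! assumes 4-byte reals
  allocate(a(n), b(n))
  open(newunit=u1, file=f1, form='unformatted', access='stream')
  open(newunit=u2, file=f2, form='unformatted', access='stream')
  read(u1) a
  read(u2) b
  do i = 1, n
     if (a(i) /= b(i)) then
        print *, 'first difference at element', i, ':', a(i), 'vs', b(i)
        stop
     end if
  end do
  print *, 'dumps are bitwise identical'
end program diff_ckpt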

Using this release, the following can be stated when MG2 is enabled in a C90 experiment. All these tests were performed on either Skylake or Cascade Lake nodes. I have not done this experiment on SCU17.
With Intel (both compiler and MPI 2021.6.0, the stock compiler used by v11.3.3 of the fixture on sles12):

  • The debug build passes layout regression; no weird patterns are seen
  • The release build does not pass layout regression; results diverge on the first step when mmicro_pcond is called. However, I do not see the "grid imprint" pattern in the IWP output after that first step. The two layouts simply give different answers, but neither shows the odd pattern that I could only see with Donifan's branch after a step.
  • If the MG2 interface is built at -O0 in the release build, the code can pass layout regression

With gfortran (openmpi 4.1.3, gcc 12.1.0, same baselibs version as in the stock tag):

  • The release build does not pass layout regression, but things diverge earlier, in a call to the "hystpdf" procedure in the MG2 interface. Once again this happens on the first step.
  • The debug build also does not pass layout regression; the results diverge on the first step after calling mmicro_pcond. Once again no grid imprint is seen.

To summarize: the new release does not pass layout regression when MG2 is turned on, either because of the call to hystpdf or because of mmicro_pcond, depending on the compiler/optimization.

I cannot reproduce the grid imprint Donifan was seeing, but in my mind this is irrelevant. The code is not passing layout regression with 2 different compilers. As far as where to investigate, mmicro_pcond seems to be the place (although the gfortran results suggest that the memory corruption, if that is what it is, may move around).

I am simply concerned that mmicro_pcond and micro_mg_tend_interface (an alternative to mmicro_pcond) have a ridiculous number of arguments. We have seen an Intel bug that was related to the number of arguments, and the sheer number, together with the fact that they are called in a per-column I-J loop, makes this incredibly hard to debug/investigate.
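
To make that concern concrete, a toy sketch (invented names, nothing like the real mmicro_pcond signature): when a routine takes long runs of same-type, same-rank array arguments, transposing two actuals at the call site is perfectly legal Fortran and compiles without a peep.

! Toy sketch: positional argument hazards with many identical dummies.
module mg_iface
contains
  subroutine tend_core(qc, qi, nc, ni)
    real, intent(inout) :: qc(:), qi(:), nc(:), ni(:)
    qc = qc + 1.0   ! placeholder "physics"
  end subroutine tend_core
end module mg_iface

program call_site
  use mg_iface
  implicit none
  real :: qc(3), qi(3), nc(3), ni(3)
  qc = 0.0; qi = 0.0; nc = 0.0; ni = 0.0
  call tend_core(qc, nc, qi, ni)   ! bug: qi and nc swapped, still compiles
end program call_site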
