
Allow_Overwrite option in History does not do what name and code suggests #2653

Open
bena-nasa opened this issue Mar 18, 2024 · 8 comments
Labels: 🪲 Bug (Something isn't working), ⌛ Long Term (Long term issues), ❓ Question (Further information is requested)

@bena-nasa
Collaborator

bena-nasa commented Mar 18, 2024

TL;DR
Allow_Overwrite is NOT clobbering an existing file when one exists, despite what the code in History would seem to imply should happen.

Long Version
The UFS model, which uses GOCART and MAPL, exposed (or re-iterated) a long-standing issue with the interaction between History and PFIO, which was causing the UFS folks a problem we were asked to look at. The fundamental issue is: how does the server know when to create a file versus open an existing one, and if it opens an existing file, how does it know the file is actually "good"? Also, how does the server know whether you really want to clobber an existing file rather than append to it if the file exists? Spoiler: it can't.

The problem arose because UFS was, to my understanding, before running the executable, either creating a symlink to an empty file at the path History would want to write to, or perhaps that symlink was broken. Either way it is bad. Let's review the current behavior.

In History

In History there is a global Allow_Overwrite option. It does 2 things.
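For reference, the option sits in the global section of HISTORY.rc, roughly like this (the experiment and collection names are made up here and the syntax is quoted from memory, so treat it as illustrative only):

```
# Global History options (illustrative)
EXPID:  myexp
Allow_Overwrite: .true.    # default is .false.

COLLECTIONS: 'example_collection'
             ::
```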

First, if set to true, it sets the mode of the collection to clobber. The problem is that the server, the way it is designed, simply cannot clobber an existing file, so setting this is pointless. The client may not even transmit this mode to the server.

Second, every time History decides to write, it first checks whether this is a new file (in which case it generates the filename) or the same file as before, because multiple time steps are being appended to it.

Either way, when it is time to write to a NEW file, History does an inquire on the file path. If the file exists, one of two things can happen: either we abort or we carry on, based on Allow_Overwrite (default false). This is basically a safety check so that we crash if the file exists, or just continue, depending on what Allow_Overwrite is set to.
Note that a broken symlink apparently returns false to an inquire of its existence, so History will always get past this point if a broken symlink were somehow created before execution.

Now, if we have gotten past this point, History makes the request to write to a file path at a given time index with no further checks, and no instructions, such as whether to open or clobber/re-create, are passed to the server.
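To make that concrete, here is a minimal standalone sketch of the client-side check just described. It is not the actual History code; the filename and the surrounding structure are assumptions based on the description above.

```fortran
program history_client_check
   implicit none
   logical :: file_exists
   logical, parameter :: allow_overwrite = .false.   ! the History default
   ! Hypothetical output path; in History this comes from the collection template.
   character(len=*), parameter :: filename = 'example_collection.20240318_0000z.nc4'

   ! A broken symlink reports .false. here, so History sails right past this check.
   inquire(file=filename, exist=file_exists)

   if (file_exists .and. .not. allow_overwrite) then
      error stop 'file already exists and Allow_Overwrite is false'
   end if

   ! At this point History would issue the write request to the server; note that
   ! no open-vs-clobber/re-create instruction accompanies that request.
end program history_client_check
```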

On the Server Side

Now consider what happens on the server side. The server only knows that the client requested a filename and a time index to write to, but that is it.
The server does another inquire to see if the file exists. If the file exists, it currently assumes the file is "good", in that it has been created and populated with the right metadata. It simply opens the file and assumes it can do a PUT_VAR to that file. Most likely this would be because the client has requested multiple time slices for the file, so the file was previously created in the current execution.

So now we see one problem. In the case where there were empty symlinks and History was set with Allow_Overwrite to true, the server just tries to write to the file, which of course fails. In general, if the server had not previously created the file with metadata consistent with the request, the write will clearly fail, and we have no mechanism to check for this, if it is even possible to do so robustly.

A second problem: what if the user on the client side really did want to clobber any file that was there? That just won't happen with the current logic.

The DAS exposed another example, where the same file was created in a previous execution without a certain variable; then on the next execution, if the file is not moved away, the write fails when trying to write the variable that does not exist in the file left sitting there.

Now suppose the file does not exist. The server then simply tries to create the file with a NO_CLOBBER option. If something weird like a broken symlink is in the way, netCDF is smart enough to fail at this point. This seems to be what was happening with UFS.
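A rough sketch of that server-side decision, written directly against the netCDF-Fortran API (the real PFIO code is of course more involved, and the filename here is hypothetical):

```fortran
program pfio_server_sketch
   use netcdf
   implicit none
   integer :: status, ncid
   logical :: file_exists
   ! Hypothetical path; the real server gets this from the client request.
   character(len=*), parameter :: filename = 'example_collection.20240318_0000z.nc4'

   inquire(file=filename, exist=file_exists)

   if (file_exists) then
      ! Assumed "good": open for writing; a later nf90_put_var fails if the file
      ! is empty (the broken-symlink case) or is missing the requested variable.
      status = nf90_open(filename, NF90_WRITE, ncid)
   else
      ! NO_CLOBBER create; a broken symlink in the way makes netCDF fail here,
      ! which appears to be what the UFS runs were hitting.
      status = nf90_create(filename, ior(NF90_NETCDF4, NF90_NOCLOBBER), ncid)
   end if

   if (status /= NF90_NOERR) then
      print *, trim(nf90_strerror(status))
      error stop 1
   end if
   status = nf90_close(ncid)
end program pfio_server_sketch
```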

Clearly there are a lot of ways to make things go haywire if you are not careful, as the examples referenced here demonstrate.

What can be done?

At least within a single execution, the server could track whether it has already created a given file. Then, if it gets a request to write to a file it knows it did not create during that execution, it could decide how to handle it (clobber it or fail).

We should also allow a file to be clobbered if that is what the client side asked for.
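A minimal sketch of what that bookkeeping plus a client-supplied clobber flag could look like. None of these names exist in PFIO today; this only illustrates the proposed behavior within a single execution.

```fortran
module created_files_mod
   implicit none
   private
   public :: note_created, was_created_here

   ! Files this server has created during the current execution.
   character(len=1024), allocatable :: created(:)

contains

   subroutine note_created(filename)
      character(len=*), intent(in) :: filename
      if (.not. allocated(created)) allocate(created(0))
      created = [character(len=1024) :: created, filename]
   end subroutine note_created

   logical function was_created_here(filename)
      character(len=*), intent(in) :: filename
      was_created_here = .false.
      if (allocated(created)) was_created_here = any(created == filename)
   end function was_created_here

end module created_files_mod

program proposed_server_behavior
   use created_files_mod
   implicit none
   logical :: file_exists
   logical, parameter :: client_requested_clobber = .false.   ! would be sent with the client request
   character(len=*), parameter :: filename = 'example_collection.20240318_0000z.nc4'

   inquire(file=filename, exist=file_exists)

   if (file_exists .and. .not. was_created_here(filename)) then
      if (client_requested_clobber) then
         print *, 'clobbering pre-existing file: ', filename
         ! ... re-create the file from scratch ...
      else
         error stop 'refusing to write to a file this server did not create'
      end if
   end if

   ! ... create the file if needed, remember it, and write ...
   call note_created(filename)
end program proposed_server_behavior
```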

If we want to allow a file to be appended to between executions, that gets harder; it is beyond the scope of this issue and is a whole other discussion.

Links to old issues

Here are links to issues that are related to this.
#1620
GEOS-ESM/GEOSadas#188
#2638

@bena-nasa bena-nasa added ❓ Question Further information is requested ⌛ Long Term Long term issues labels Mar 18, 2024
@bena-nasa bena-nasa added the 🪲 Bug Something isn't working label Mar 18, 2024
@weiyuan-jiang
Contributor

What can be done?

  1. An empty file with nothing in it does not make sense. Can we specify in HISTORY where we create a file? That may lead to writing files outside of scratch.

  2. Are we going to support this? "If we want to allow a file to be appended to between executions, this gets harder..."

@bena-nasa
Collaborator Author

bena-nasa commented Mar 18, 2024

@weiyuan-jiang The issue is more basic. History specifies a file path. We have no control over what someone may or may not have put at that path. If the user does something silly like putting a file there that is wrong/empty/whatever, we can't control that.

What we can do, within a single execution, is this:

  1. Have the server remember when it creates a file for a collection. That way, if the user put something at that path before execution, the server can say "I didn't create that" and take appropriate action, such as aborting or clobbering (see the 2nd point). Or maybe it did create the file and there are multiple time steps being written; then it can say "well, I already created that file and I got a request to write to it, so the user must want to write to it again, and that is safe since I know I created it and not something else..."
  2. If the user truly wants to play it dangerous, we could transmit the user's intention to the server (right now we don't). So if the user says on the History side "clobber any file there", send that to the server, and any time we write to a file that was not created by the server in that execution, just clobber it with no questions asked.

If we are talking about multiple executions, I don't think we can/should do anything to allow the same file to be written to across executions. That just gets dangerous/difficult.

@tclune
Collaborator

tclune commented Mar 18, 2024

@bena-nasa I'm concerned that we should be able to support a collection that has, say, 4 time slices per day while the run executes only 12 hours in a segment. Granted, this is rare at the moment, but I would not want to toss out the GEOS use case just to support external people.

@bena-nasa
Collaborator Author

bena-nasa commented Mar 26, 2024

@bena-nasa I'm concerned that we should be able to support a collection that has, say, 4 time slices per day while the run executes only 12 hours in a segment. Granted, this is rare at the moment, but I would not want to toss out the GEOS use case just to support external people.

@tclune, I don't understand the comment or what you are getting at here. What you are describing is not a use case or capability History currently has...

Right now, if you ask for output every 6 hours but set the duration to 24 hours, and you only run 12 hours in the segment (say you start at 0z), you would get the 6z and 12z output in the file.
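For concreteness, such a collection looks roughly like this in HISTORY.rc (the collection name and field list are made up, and the syntax is approximate):

```
  example_collection.template:   '%y4%m2%d2_%h2%n2z.nc4',
  example_collection.frequency:  060000,
  example_collection.duration:   240000,
  example_collection.fields:     'T', 'DYN',
  ::
```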

What does not work is if you start at 0z and stop at 12z in the above scenario (6-hour frequency, 24-hour duration): we've never supported appending to an existing file, and if it somehow worked, it was just dumb luck. Nor is anyone actually trying to do this, as far as I know.

Are you thinking about how you want this to work in the new History in 3G, perhaps...?

@tclune
Collaborator

tclune commented Mar 26, 2024

I had assumed your intent was to extend the file during the next segment. (Regardless of how it may or may not work now.)

@tclune
Collaborator

tclune commented Mar 26, 2024

I see two problems with not supporting this "extend the file":

  1. If the segment duration varies (e.g., due to a startup issue or possibly some weird restart issue), then post processing tools must allow for that. I presume ExtData does not assume the same number of slices in each file?
  2. We may end up with 2 files with the same name. If the template does not have hours in your scenario above, then the restart writes a file with the same name as the 1st segment, right?

@bena-nasa bena-nasa changed the title History, PFIO, and file existence Allow_Overwrite option in History does not do what name suggests Mar 26, 2024
@bena-nasa
Collaborator Author

I had assumed your intent was to extend the file during the next segment. (Regardless of how it may or may not work now.)

@tclune, yes, your point 2 is right, but I don't really have an intent here beyond what was in the original issue. Extending files across segments is not what I was worrying about when opening the issue.
My point was that the Allow_Overwrite option is not quite working the way I think everyone thinks it does, and this is because of how the server works.

@bena-nasa bena-nasa changed the title Allow_Overwrite option in History does not do what name suggests Allow_Overwrite option in History does not do what name and code suggests Mar 26, 2024
@tclune
Collaborator

tclune commented Mar 26, 2024

In theory (2) could be handled by History refusing to run if it sees that a subsequent continuation segment will reuse the same file name. Force the user to either align History with segments or ensure segments are longer than any specific file duration. (Except things that will only have one slice of course.)
