
Support for Non-CSV Metadata / Front Matter / Comments in CSV Files #31

Open
aidanmontare-edu opened this issue May 25, 2020 · 5 comments

Comments

@aidanmontare-edu

Sorry for the long message, I guess I've been thinking a lot about CSVs...

This issue is to suggest support for CSV files which contain non-CSV metadata or front matter at the top of the file, as well as to raise the issue of comments within CSV files.

Although CSV files that begin with non-CSV metadata are beyond the type described in RFC 4180, they are quite common. Non-CSV data is typically used to include metadata about the data in the file, such as the equipment and parameters that went into an experiment.

I work with earth science data, where the idea of including multiple-line front matter in the file is quite common. I've attached a sample file from NASA as an example.

Supporting these kinds of files fully could entail a number of smaller changes, each of which might be considered independently. However, I've created one issue for the topic to try to unify discussion, at least at the initial stages.

Standards and Common Practices

There does not seem to be a widely-accepted standard for such files. I've run across a few attempts at defining a standard, but they don't seem to have caught on widely:

https://csvy.org/ (looks more mature, though I don't think many libraries for CSV interaction support it)
https://github.com/csvspecs (looks to be work-in-progress)

As for common practices, I can speak to the spaces I'm familiar with, which are (mostly Python-based) tools for data processing used in the sciences and in data science.

The Pandas library supports specifying a comment character (e.g. '#') that denotes either whole lines or end-of-line comments:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#comments-and-empty-lines

Pandas is widely used, so this gives me the idea that at least some people use these types of comments.
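To make the pandas behaviour concrete, here is a small sketch using `read_csv`'s `comment` parameter; the sample data is made up for illustration:

```python
import io

import pandas as pd

# Hypothetical sample: a leading comment line plus an end-of-line comment
data = io.StringIO(
    "# instrument: magnetometer\n"
    "time,bx\n"
    "1,2.5\n"
    "2,2.7 # questionable reading\n"
)

# comment='#' drops whole-line comments and truncates each line
# at the first '#' character
df = pd.read_csv(data, comment="#")
```

After parsing, the frame contains only the two data rows, with the end-of-line comment stripped from the second one.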

The NASA Space Physics Data Facility (https://cdaweb.gsfc.nasa.gov/) uses the '#' comment character and the formatting of the file I attached. The website allows you to download any of the measurements in their database in this format. But it also has several other export options, including a "normal" CSV with the metadata in a separate JSON file, as well as the raw data (in netCDF, which isn't a type of CSV at all). So perhaps they expect that people who are going to do lots of analysis will use the "normal" CSV files. This is to say that, while I think CSV Schema should support CSV files with metadata, I imagine some people would argue that real-world data collection should not be done using them.

Support within CSV Schema

As for the schema:

Ignoring Comments / Metadata

@adamretter suggested adding directives to ignore leading lines when validating CSV files (text is modified from his):

  • @IgnoreLeadingLines '#', which would simply ignore all lines from line 0 that start with a '#' character up until the first line that does not start with that character.
  • @IgnoreCommentLines '#', which would just ignore any line which starts with a '#' character.
  • other options, e.g. @IgnoreLeadingLinesMatching "regular expression"

I think it would be useful to be able to ignore the leading lines, and I like these directives. The difference between @IgnoreLeadingLines and @IgnoreCommentLines is helpful, since I could see situations that call for one but not the other.
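To make that difference concrete, here is a minimal sketch of the two proposed behaviours (the semantics follow the directive descriptions above; this is not an existing implementation):

```python
def ignore_leading_lines(lines, comment_char="#"):
    """@IgnoreLeadingLines: skip comment lines only at the top of the file."""
    i = 0
    while i < len(lines) and lines[i].startswith(comment_char):
        i += 1
    return lines[i:]


def ignore_comment_lines(lines, comment_char="#"):
    """@IgnoreCommentLines: skip any line starting with the comment character."""
    return [line for line in lines if not line.startswith(comment_char)]


lines = ["# front matter", "a,b", "# note in the middle", "1,2"]
# ignore_leading_lines keeps the mid-file comment;
# ignore_comment_lines drops every commented line
```

A file with a commented header but no mid-file comments would validate the same way under either directive; they diverge only when comments appear after the data starts.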

Validating Comments / Metadata

I think there also should be a way to validate the contents of the non-CSV lines, as well as the CSV data itself. But I'm not sure if this is something the CSV Schema itself should support, or if this would be better handled by a more general system that supports files with multiple parts (and might make use of CSV schema to describe the CSV part). I'm not sure whether such a system exists.

On the other hand, there definitely are CSV files like this out there, so one argument is that the CSV Schema should be able to describe them.

If this is something the CSV Schema might support, it would be helpful to have multiple options:

  • Directives like those above to ignore commented lines, for files that are allowed to contain comments, but the comments can be anything.
  • A way to validate comments in some potentially-not CSV format, for files where the comments must meet certain requirements.

What seems ideal for the purpose of validating files with metadata is a way to say "this kind of header isn't CSV, but needs to be validated with X", where X is some external schema / tool. For instance, I might pass the metadata to a JSON validator or compare it with a YAML schema.

I think it would be ideal to be able to specify the type of non-CSV data in a flexible way that does not require the CSV Schema to maintain a list of supported metadata types. This would also be useful for people (such as myself) who have CSV files with metadata that is not in any standard format, but that they nonetheless may wish to use.

It would also be helpful to do what can be done to reduce the work for those implementing the language. Someone who is creating a CSV validator may have to explicitly include support for various metadata types, but hopefully this could be as simple as piping the data to existing JSON/YAML/whatever validators in their language, rather than expecting them to include their own support for each metadata type. I'm not versed enough in this area to give detailed recommendations, but it's a point to consider.
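One way an implementation could delegate without maintaining its own list of metadata formats is to pipe the stripped front matter to an arbitrary external command and check its exit status. A rough sketch (the validator command shown is hypothetical):

```python
import subprocess


def validate_front_matter(front_matter: str, validator_cmd: list) -> bool:
    """Pipe front matter to an external validator command.

    By the usual Unix convention, an exit code of 0 means the
    front matter passed validation.
    """
    result = subprocess.run(
        validator_cmd,
        input=front_matter,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# e.g. validate_front_matter(yaml_text, ["my-yaml-validator"])
# where "my-yaml-validator" is whatever external tool the user configures
```

The CSV validator then only needs to know how to find and strip the front matter, not how to interpret it.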

Other thoughts

Another issue to consider is end-of-line comments that occur in the data. I'm not sure how many people have files like this, but as I mentioned above, Pandas includes support for these comments. There's also the possibility of inline comments (between data elements), but that seems really far-fetched (I don't know why someone would try to create a CSV file like that).

Yet another issue is leading lines that are not marked with a comment character at all (the only way to tell is to look where the data starts). I happen to have some unfortunately-formatted files like this. Actually, if people were to adopt the CSVY standard (first link above), this would be a problem. The YAML header in CSVY could be any length, and it isn't marked by a comment character at the beginning of each line. (The end of the YAML block has the standard "---" that denotes the end of a document in YAML.)
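For CSVY-style files, the only reliable way to find the boundary is to scan for the closing "---" marker. A rough sketch of that split, assuming the header layout described above (a leading "---", YAML, then a closing "---" before the CSV data):

```python
def split_csvy(text: str):
    """Split a CSVY-style file into (yaml_header, csv_body).

    Assumes the YAML front matter is delimited by '---' lines.
    Returns an empty header if no front matter is found.
    """
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text


sample = "---\nname: demo\n---\na,b\n1,2"
```

Note this still can't handle unmarked front matter with no delimiter at all; for those files, heuristics (e.g. looking for the first line that parses as the expected number of columns) seem unavoidable.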

Uploading OMNI_HRO_1MIN_27555.csv.txt…

@aidanmontare-edu
Author

The file didn't upload correctly, so here are the first hundred lines or so as an example.

You can download data like this from NASA's CDAWeb.

#              ************************************
#              *****    GLOBAL ATTRIBUTES    ******
#              ************************************
#
#     PROJECT                         NSSDC
#     DISCIPLINE                      Space Physics>Interplanetary Studies
#     SOURCE_NAME                     OMNI (1AU IP Data)>Merged 1 minute Interplantary OMNI data
#     DATA_TYPE                       HRO>Definitive 1minute
#     DESCRIPTOR                      IMF and Plasma data
#     DATA_VERSION                    1
#     TITLE                           Near-Earth Heliosphere Data (OMNI)
#     TEXT                            1minute averaged definitive multispacecraft interplanetary parameters data
#                                     Additional information for all parameters are available from OMNI Data Documentation:                  
#                                      https://omniweb.gsfc.nasa.gov/html/HROdocum.html                                                      
#                                     Additional data access options available at  SPDF's OMNIWeb Service: https://omniweb.gsfc.nasa.gov/ow_m
#                                     Recent omni high resolution updates Release Notes: https://omniweb.gsfc.nasa.gov/html/hro_news.html    
#     MODS                            created November 2006;
#                                     conversion to ISTP/IACG CDFs via SKTEditor Feb 2000
#                                     Time tags in CDAWeb version were modified in March 2005 to use the
#                                     CDAWeb convention of having mid-average time tags rather than OMNI's
#                                     original convention of start-of-average time tags.
#     LOGICAL_FILE_ID                 omni_hro_1min_00000000_v01
#     PI_NAME                         J.H. King, N. Papatashvilli
#     PI_AFFILIATION                  AdnetSystems, NASA GSFC
#     GENERATION_DATE                 Ongoing
#     ACKNOWLEDGEMENT                 NSSDC
#     ADID_REF                        NSSD0110
#     RULES_OF_USE                    Public
#     INSTRUMENT_TYPE                 Plasma and Solar Wind
#                                     Magnetic Fields (space)
#                                     Electric Fields (space)
#     GENERATED_BY                    King/Papatashvilli
#     TIME_RESOLUTION                 1 minute
#     LOGICAL_SOURCE                  omni_hro_1min
#     LOGICAL_SOURCE_DESCRIPTION      OMNI Combined, Definitive, 1-minute IMF and Plasma Data Time-Shifted to the                            
#                                      Nose of the Earth's Bow Shock, plus Magnetic Indices                                                  
#     LINK_TEXT                       Additional information for all parameters are available from
#                                     Additional data access options available at
#                                     Recent omni high resolution updates
#     LINK_TITLE                      OMNI Data documentation
#                                     SPDF's OMNIWeb Service
#                                     Release Notes
#     HTTP_LINK                       https://omniweb.gsfc.nasa.gov/html/HROdocum.html
#                                     https://omniweb.gsfc.nasa.gov/ow_min.html
#                                     https://omniweb.gsfc.nasa.gov/html/hro_news.html
#     ALT_LOGICAL_SOURCE              Combined_OMNI_1AU-MagneticField-Plasma-HRO_1min_cdf
#     MISSION_GROUP                   OMNI (Combined 1AU IP Data; Magnetic and Solar Indices)
#                                     ACE
#                                     Wind
#                                     IMP (All)
#                                     !___Interplanetary Data near 1 AU
#     SPASE_DATASETRESOURCEID         spase://VSPO/NumericalData/OMNI/PT1M
#     CDAWEB_PARENTS                  omni_hro_1min_00000000_v01
#                                     omni_hro_1min_20180401_v01, omni_hro_1min_20180501_v01
#     CDFMAJOR                        COL_MAJOR
#
#              ************************************
#              ****  RECORD VARYING VARIABLES  ****
#              ************************************
#
#  1. Epoch Time
#  2. Bx (nT), GSE
#  3. By (nT), GSE
#  4. Bz (nT), GSE
#
EPOCH_TIME_yyyy-mm-ddThh:mm:ss.sssZ,BX__GSE_nT,BY__GSE_nT,BZ__GSE_nT
 2018-04-30T00:00:00.000Z,-4.08000,3.15000,3.38000
 2018-04-30T00:01:00.000Z,-4.10000,3.18000,3.40000
 2018-04-30T00:02:00.000Z,-4.08000,3.09000,3.50000
 2018-04-30T00:03:00.000Z,-3.95000,3.08000,3.69000
 2018-04-30T00:04:00.000Z,-3.96000,3.12000,3.61000
 2018-04-30T00:05:00.000Z,-3.81000,3.11000,3.79000
 2018-04-30T00:06:00.000Z,-4.48000,3.06000,3.05000
 2018-04-30T00:07:00.000Z,-4.26000,3.15000,3.29000
 2018-04-30T00:08:00.000Z,-4.40000,3.21000,3.04000
 2018-04-30T00:09:00.000Z,-4.19000,3.09000,3.46000
 2018-04-30T00:10:00.000Z,-4.16000,2.92000,3.65000
 2018-04-30T00:11:00.000Z,-4.18000,3.00000,3.53000
 2018-04-30T00:12:00.000Z,-4.03000,3.18000,3.54000
 2018-04-30T00:13:00.000Z,-3.94000,3.13000,3.63000
 2018-04-30T00:14:00.000Z,-3.54000,3.05000,3.92000
 2018-04-30T00:15:00.000Z,-3.44000,3.06000,3.96000
 2018-04-30T00:16:00.000Z,-3.54000,3.25000,3.71000
 2018-04-30T00:17:00.000Z,-3.76000,3.16000,3.60000
 2018-04-30T00:18:00.000Z,-3.70000,3.14000,3.58000
 2018-04-30T00:19:00.000Z,-3.21000,3.72000,3.28000
 2018-04-30T00:20:00.000Z,-3.89000,2.71000,3.70000
 2018-04-30T00:21:00.000Z,-4.00000,2.46000,3.81000
 2018-04-30T00:22:00.000Z,-3.83000,2.79000,3.66000
 2018-04-30T00:23:00.000Z,-4.03000,2.52000,3.81000
 2018-04-30T00:24:00.000Z,-3.42000,3.38000,3.31000
 2018-04-30T00:25:00.000Z,-3.16000,3.75000,3.12000
 2018-04-30T00:26:00.000Z,-3.22000,3.81000,3.13000
 2018-04-30T00:27:00.000Z,-2.87000,4.10000,3.05000
 2018-04-30T00:28:00.000Z,-2.58000,4.59000,2.52000
 2018-04-30T00:29:00.000Z,-2.90000,4.16000,2.81000
 2018-04-30T00:30:00.000Z,-4.10000,2.37000,3.81000
 2018-04-30T00:31:00.000Z,-4.05000,2.53000,3.77000
 2018-04-30T00:32:00.000Z,9999.99,9999.99,9999.99

@grv87

grv87 commented May 25, 2020

+1 for this.

Some known Java implementations supporting comments in CSV:

@DavidUnderdown
Contributor

Arguably it would be better to move the comments into an appropriate official CSV Schema (and such comments are not allowed by the CSV RFC). That said, it is quite common for CSV processing libraries to have a way of saying "skip/ignore x lines", so another option (which might map quite well to the underlying libraries) would be to have

@IgnoreLeadingLines 5

Or whatever numeric value would be appropriate

@adamretter
Contributor

adamretter commented May 27, 2020

A way to validate comments in some potentially-not CSV format

I think for simple comments this could make sense to allow something like @IgnoreCommentLines to take a regular expression, or add a second directive @ValidateComments "some-regex". In this manner if the comment doesn't match your regular expression then CSV validation would fail.
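A sketch of what regex-based comment validation might look like in an implementation (the directive names above are proposals; the pattern here is just an example):

```python
import re


def validate_comments(lines, comment_char="#", pattern=r"#\s*\w+.*"):
    """Approximate @ValidateComments: fail if any comment line
    does not fully match the given regular expression."""
    regex = re.compile(pattern)
    for line in lines:
        if line.startswith(comment_char) and not regex.fullmatch(line):
            return False
    return True
```

So a bare "#" line would fail the example pattern (it requires at least one word character), while "# PROJECT NSSDC" would pass.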

What seems ideal for the purpose of validating files with metadata is a way to say "this kind of header isn't CSV, but needs to be validated with X", where X is some external schema / tool. For instance, I might pass the metadata to a JSON validator or compare it with a YAML schema.

I like this idea. I think that should be part of the tool rather than the Schema though. So in our CSV Validator tool, we could add a command line arg like: --front-matter-validator. When used, this would take an argument to an executable which has to return a zero exit code to pass validation, e.g. --front-matter-validator /usr/bin/my-yaml-format-validator.

I do think we have to be careful about not overloading the CSV Schema. It works pretty-well at the moment because it does one thing and does it quite well. Certainly there is some scope for expansion, but some of that could be in the tool rather than the Schema.

@aidanmontare-edu
Author

I like this idea. I think that should be part of the tool rather than the Schema though. So in our CSV Validator tool, we could add a command line arg like: --front-matter-validator. When used, this would take an argument to an executable which has to return a zero exit code to pass validation, e.g. --front-matter-validator /usr/bin/my-yaml-format-validator.

I do think we have to be careful about not overloading the CSV Schema. It works pretty-well at the moment because it does one thing and does it quite well. Certainly there is some scope for expansion, but some of that could be in the tool rather than the Schema.

That makes a lot of sense. It also makes for a nice separation between the tasks of validating the front matter and validating the CSV data.

One issue is metadata that the CSV validator cannot validate without the help of the external tool. For instance, in the CSVY format, the YAML part doesn't have a comment character, and it isn't a fixed number of lines, so I'm not sure how the CSV validator would know which lines to skip without the help of the YAML validator.

I managed to write my own validator program for some custom CSV formats. What I did was have the part that validates the front matter return the line numbers corresponding to the front matter. The part that validates the CSV then takes those line numbers as input and skips them.
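A minimal sketch of that two-stage handoff (the function names are illustrative, not from any existing tool):

```python
import csv


def find_front_matter_lines(lines, comment_char="#"):
    """Stage 1: report which leading lines are front matter (0-based)."""
    skip = []
    for i, line in enumerate(lines):
        if not line.startswith(comment_char):
            break
        skip.append(i)
    return skip


def read_csv_skipping(lines, skip):
    """Stage 2: parse the CSV, skipping the reported line numbers."""
    remaining = [line for i, line in enumerate(lines) if i not in skip]
    return list(csv.reader(remaining))


lines = ["# experiment: demo", "# date: 2020-05-25", "a,b", "1,2"]
```

In a CSVY-style setup, stage 1 would instead be the YAML validator reporting where the "---" block ends, but the interface stays the same: a list of line numbers for the CSV stage to skip.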

Perhaps there are other approaches?
