Question about elaboration tolerance and ordering #29

Open
rpgoldman opened this issue Jul 11, 2019 · 2 comments

Comments

@rpgoldman

  1. Maybe I missed it, but I couldn't tell if the columns in a CSV file one is checking must come in the same order as they are listed in the body of a CSV schema.
  2. Assuming that the prolog does not specify the column count, is it acceptable to have additional columns that do not match a column entry in the body, and have them just be unchecked?

I am interested in using the validator for some scientific data where there is a known set of columns that should be checked for reasonable contents, but where I'm not sure that the ordering of columns will be consistent, and where some data providers might have added additional columns of computed values to the raw values that my schema should check.

Thank you

@DavidUnderdown
Contributor

The ordering must match. The totalColumns directive makes the validator check, when the schema is parsed, that the schema contains the expected number of column definitions. If you do not specify it, there will still be a validation error once the CSV file is actually read if the number of column definitions does not match the number of columns in the file.

There are some similar issues already, #21 and #13, but I'm afraid we've not had resource available to work on further developments recently, though we would welcome pull requests from others.

@rpgoldman
Author

rpgoldman commented Jul 12, 2019

Thanks for the response.

I suggested making the order optional because CSVs are often read by tools like Python's pandas, in which columns are addressed by name, so column ordering does not matter for correct operation.

And as I mentioned in my original comment, scientific data often includes additional columns of derived quantities that don't interfere with correct processing of the data, assuming name-based addressing.

I imagine that these additional features could add substantially to the difficulty of validation, though.

Maybe this should be tagged as "question-edging-into-enhancement-request"!
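
In the meantime, one possible workaround would be to normalize each file before validating it: keep only the columns the schema knows about and write them out in the schema's order. Below is a minimal sketch of that idea using only Python's standard `csv` module. The helper name `project_columns` and the example column names are made up for illustration; nothing here is part of the validator, and it assumes the file has a header row.

```python
import csv
import io


def project_columns(src, dst, expected):
    """Copy CSV rows from src to dst, keeping only the columns named in
    `expected`, in that order. Extra columns are dropped silently.

    Hypothetical pre-processing helper, not part of the CSV validator.
    """
    reader = csv.DictReader(src)
    present = reader.fieldnames or []
    missing = [name for name in expected if name not in present]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    # extrasaction="ignore" drops any column that is not in `expected`.
    writer = csv.DictWriter(
        dst, fieldnames=expected, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Example: columns arrive out of order, with one extra derived column.
raw = io.StringIO("temp_c,derived_k,sample_id\n21.5,294.65,A1\n")
normalized = io.StringIO()
project_columns(raw, normalized, ["sample_id", "temp_c"])
print(normalized.getvalue())
```

The normalized output could then be checked against a schema that lists `sample_id` and `temp_c` in that order. The obvious cost is that the unchecked columns are discarded from the validated copy, so this only helps when the original file is kept alongside it.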
