delimiteds

Using property testing to feel out the input space of "delimited text file"

Delimited Text File Tests

CommaDelimitedWithHeaderTrueAndOtherwiseDataFrameReaderDefaults

Tests for processing comma-delimited files that Spark's DataFrameReader defaults should handle well, i.e. no multiline values in a record, no special characters to escape, no whitespace trimming, and no dates or timestamps to format. The first two tests in the file are illustrative, with handwritten datasets and assertions that can be followed by eye. The third and last test uses ScalaCheck to generate Unicode string data of random sizes and asserts that our processing succeeds without knowing what the data actually looks like.

For examples of what this test's "unicode string data of random sizes" looks like, check out this repository's generated-samples directory.

This test surfaced a good fail case that enabling multiline sometimes fixed. The generator builds strings from arbitrary Unicode chars, and when excluding delimiters from the input, I was checking that my arbitrary string wasn't equal to the excluded chars instead of checking that it didn't contain them. This wasn't a bug in the "app" code, it was a bug in my assumptions, something property tests are good at shining light on.
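The difference between the two checks is easy to miss, so here is a minimal sketch of it. This is not the repository's actual generator: the object name, the exclusion list, and both predicate names are hypothetical, and it assumes ScalaCheck is on the classpath.

```scala
import org.scalacheck.{Arbitrary, Gen}

object DelimiterFreeStrings {
  // Hypothetical exclusion list: the comma delimiter plus line breaks.
  val excluded: Seq[String] = Seq(",", "\n", "\r")

  // The mistaken check: only rejects a string that IS one of the excluded strings.
  // "a,b" passes, even though it has a delimiter in the middle.
  def notEqualToExcluded(s: String): Boolean =
    !excluded.contains(s)

  // The intended check: rejects a string that CONTAINS an excluded string anywhere.
  def containsNoExcluded(s: String): Boolean =
    !excluded.exists(e => s.contains(e))

  // Arbitrary strings, keeping only those that pass the intended check.
  val genValue: Gen[String] =
    Arbitrary.arbitrary[String].suchThat(containsNoExcluded)
}
```

With the mistaken predicate, nearly every generated string is accepted, so delimiters and newlines leak into the data and only show up as occasional failures.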

TabDelimitedWithHeaderTrueMultiLineAndNestedDoubleQuotesEscapedEnabled

Tests for processing tab-delimited files that Spark's DataFrameReader should handle well when multiline values in a record are enabled. The first two tests in the file are illustrative, with handwritten datasets and assertions that can be followed by eye. The third and last test uses ScalaCheck to generate Unicode string data of random sizes and asserts that our processing succeeds without knowing what the data actually looks like.

For examples of what this test's "unicode string data of random sizes" looks like, check out this repository's generated-samples directory.

This surfaced some very good fail cases that helped me decide on hard rules about the data if multiline is going to be enabled: values with newlines in them must be wrapped in a quotation delimiter, and any quotation delimiters nested inside such a value must be escaped.
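As a rough illustration of those two rules on the writing side, here is a sketch that renders a tab-delimited record. Everything in it is an assumption for illustration: the object and method names are hypothetical, and the choice of a double quote as the quotation delimiter and a backslash as the escape character is a guess rather than something taken from this repository's tests.

```scala
object QuotingRules {
  // Assumed characters; the repository's tests may use different ones.
  val Quote = "\""
  val Escape = "\\"
  val Delimiter = "\t"

  // A value needs quoting if it holds a line break, the delimiter, or a quote.
  def needsQuoting(value: String): Boolean =
    Seq(Quote, Delimiter, "\n", "\r").exists(s => value.contains(s))

  // Escape nested quotes, then wrap the whole value in quotes.
  // Note: this sketch does not handle values that contain the escape character itself.
  def quoteValue(value: String): String =
    Quote + value.replace(Quote, Escape + Quote) + Quote

  def renderField(value: String): String =
    if (needsQuoting(value)) quoteValue(value) else value

  def renderRecord(values: Seq[String]): String =
    values.map(renderField).mkString(Delimiter)
}
```

On the reading side, the Spark CSV options that correspond to these rules are `multiLine`, `quote`, and `escape`. The backslash used here happens to match Spark's default escape character.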

TabDelimitedWithHeaderTrueAndQuotationDisabled

Tests for processing tab-delimited files that Spark's DataFrameReader should handle well when quotation is disabled. The first two tests in the file are illustrative, with handwritten datasets and assertions that can be followed by eye. The third and last test uses ScalaCheck to generate Unicode string data of random sizes and asserts that our processing succeeds without knowing what the data actually looks like.

For examples of what this test's "unicode string data of random sizes" looks like, check out this repository's generated-samples directory.
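For reference, one way to read a tab-delimited file with quotation turned off is sketched below. The object and method names are hypothetical and this is not necessarily how the tests in this repository configure the reader. It relies on Spark's documented behavior that setting the CSV `quote` option to an empty string disables quotation.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object QuotationDisabledRead {
  // Reads a tab-delimited file with a header row, treating quote characters
  // as ordinary data rather than as value delimiters.
  def read(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("sep", "\t")
      .option("quote", "") // empty string turns quotation off
      .csv(path)
}
```

With quotation off, a value that starts with a double quote is kept as-is instead of being read as the start of a quoted value.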

shnewto/delimiteds
