
(#5296) Support Parquet predicates/projections in tests #5309

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

@clairemcginty (Contributor) commented Mar 19, 2024

WIP of Parquet projection/predicate support in JobTest.

Alternatively, we could provide custom assertions similar to CoderAssertions, i.e.:

val record: AvroType = ???
record withPredicate(FilterApi.and(...)) withProjection(...schema...) should eq(...)

but I think that's overall harder for users to work with.

The downside of this approach is that I'll have to implement separately for ParquetAvroIO, ParquetTypeIO, ParquetExampleIO, and SmbIO (TypeIO/ExampleIO won't need projections support but they could support filtering).

any feedback on the different possible approaches here is welcome!

@@ -74,4 +74,28 @@ package object parquet {
private[parquet] def ofNullable(conf: Configuration): Configuration =
Option(conf).getOrElse(empty())
}

private[parquet] def inMemoryOutputFile(baos: ByteArrayOutputStream): OutputFile =
clairemcginty (Contributor, Author) commented on the diff:
feels a bit hack-ish to have these in src/main, but I don't see a way around it :(

clairemcginty (Contributor, Author) added:

the alternative is to actually write a temp file for every record we roundtrip, which would allow us to use all built-in Parquet IOs
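The in-memory roundtrip idea can be sketched without any Parquet classes at all: write the record to a ByteArrayOutputStream and read it back from a ByteArrayInputStream, so no temp file is ever created. This is a simplified stand-in (plain Java serialization instead of Parquet's OutputFile/InputFile interfaces), just to illustrate the approach:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Roundtrip a serializable value entirely in memory, standing in for the
// Parquet write-then-read-back that the test helpers perform.
def roundtrip[T <: Serializable](record: T): T = {
  val baos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(baos)
  oos.writeObject(record)
  oos.close()

  val ois = new ObjectInputStream(new ByteArrayInputStream(baos.toByteArray))
  val result = ois.readObject().asInstanceOf[T]
  ois.close()
  result
}

val original = "test-record"
val copied = roundtrip(original)
```

The actual helpers wrap the same byte array in Parquet's OutputFile/InputFile interfaces, so the built-in Parquet writers and readers (including predicate and projection handling) can operate on it directly.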

codecov bot commented Mar 19, 2024

Codecov Report

Attention: Patch coverage is 86.36364%, with 9 lines in your changes missing coverage. Please review.

Project coverage is 61.23%. Comparing base (79a0ecb) to head (78dd162).

Files Patch % Lines
...potify/scio/testing/parquet/ParquetTestUtils.scala 85.71% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5309      +/-   ##
==========================================
+ Coverage   61.08%   61.23%   +0.15%     
==========================================
  Files         306      308       +2     
  Lines       10993    11059      +66     
  Branches      774      785      +11     
==========================================
+ Hits         6715     6772      +57     
- Misses       4278     4287       +9     


@clairemcginty clairemcginty marked this pull request as draft March 20, 2024 13:07
@kellen (Contributor) commented Mar 20, 2024

This kind of breaks the boundary between the read and the test mocks. We don't, for example, implement the filtering for the BQ storage API. What we could potentially do instead is recommend that users implement their filters separately, and if they need to test them, do:

val allMocks: Seq[T] = ???
val filteredMocks = allMocks.parquetFilter(MyJob.MY_FILTER_API)
// ...
.input(SmbIO[K, T]("foo", _.getKey), filteredMocks)
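kellen's suggestion could be sketched roughly like this, with a plain Scala predicate standing in for Parquet's FilterApi (the ParquetMockOps name and the parquetFilter signature here are hypothetical, not the actual scio-test API):

```scala
// Hypothetical helper: apply a test-time filter to mock records, mimicking
// what a Parquet FilterPredicate would do at read time.
implicit class ParquetMockOps[T](private val records: Seq[T]) {
  def parquetFilter(predicate: T => Boolean): Seq[T] =
    records.filter(predicate)
}

case class TestRecord(intField: Int)

val allMocks = Seq(TestRecord(1), TestRecord(7), TestRecord(10))
// Stand-in for something like FilterApi.gt(FilterApi.intColumn("int_field"), 5)
val filteredMocks = allMocks.parquetFilter(_.intField > 5)
```

The pre-filtered `filteredMocks` could then be passed as a JobTest input, keeping the filter logic testable without changing the test IO itself.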

@clairemcginty (Contributor, Author) replied:

> This kind of breaks the boundary between the read and the test mocks. [...]

yeah, I do see what you mean! It's tough because making it part of the TestIO itself is the simplest thing from a usability perspective, but it does deviate from the typical expectation of how JobTest works.

We could implement a parquetFilter-type method, as you suggested, in the scio-test artifact, to keep it from being used in production (my initial concern) 🤔

@clairemcginty (Contributor, Author) replied:

> This kind of breaks the boundary between the read and the test mocks. [...]

Re-implemented as a set of helpers in scio-test

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import scala.reflect.ClassTag

object ParquetHelpers {
clairemcginty (Contributor, Author) commented:

I don't love the names for any of these objects/classes; better ideas welcome.

build.sbt Outdated
@@ -704,6 +704,14 @@ lazy val `scio-test` = project
"org.scalactic" %% "scalactic" % scalatestVersion,
"org.scalatest" %% "scalatest" % scalatestVersion,
"org.typelevel" %% "cats-kernel" % catsVersion,
// provided
"com.spotify" %% "magnolify-parquet" % magnolifyVersion % Provided,
clairemcginty (Contributor, Author) commented:

I'm assuming that anyone using these helpers already has compile-time Parquet dependencies... 🤷‍♀️

roundtripped
}

private def inMemoryOutputFile(baos: ByteArrayOutputStream): OutputFile = new OutputFile {
A reviewer (Contributor) commented:

Looks like a class would be more appropriate.

Suggested change:
- private def inMemoryOutputFile(baos: ByteArrayOutputStream): OutputFile = new OutputFile {
+ private class InMemoryOutputFile(baos: ByteArrayOutputStream) extends OutputFile {

Comment on lines 47 to 52
.parquetFilter(
FilterApi.gt(FilterApi.intColumn("int_field"), 5.asInstanceOf[java.lang.Integer])
)
.parquetProject(
SchemaBuilder.record("TestRecord").fields().optionalInt("int_field").endRecord()
)
A reviewer (Contributor) commented:

I'm not 100% convinced by this syntax. IMHO it would be nicer to develop custom scalatest matchers instead. WDYT?

clairemcginty (Contributor, Author) replied:

agree it's a bit clunky! Plus, I'm working on the TFExample integration now and discovering that the parquetFilter/parquetProject APIs don't work for this case, as Examples require an explicit Schema to be passed, so it's getting messy

Custom matchers would be nice syntactically. Something like:

records withProjection(...) should ...
records withFilter(...) should ...

?

clairemcginty (Contributor, Author) added:

The only issue is that, IMO, many users will want to plug this directly into their JobTests, to verify that the projection won't generate an NPE in the Scio job logic, for example. This would be a bit harder to do with the scalatest matcher approach.

A reviewer (Contributor) replied:

Yeah this is the primary use-case IMO; users want to generate unfiltered data and simulate the filter/projection being applied in the pipeline

clairemcginty (Contributor, Author) commented:

Reviving this thread! I guess a custom matcher could work if it supported Iterables, i.e.:

withPredicate(records, predicate) should { filteredRecords =>
  JobTest[T]
    .input(ParquetAvroIO(path), filteredRecords)
}

but it is a bit awkward to use. I think the primary use case of these helpers is to simply make sure that the projection/predicate is compatible with the Scio job logic. Which actually points back to us supporting it natively in Parquet*IO in JobTest 😅😅

clairemcginty (Contributor, Author) commented:

I was thinking about this more... maybe it makes the most sense to implement this as a non-default Coder in scio-test-parquet, that can be constructed by the user with a projection/predicate:

def parquetTestCoder[T <: SpecificRecord](projection: Schema, predicate: Option[FilterPredicate]): Coder[T] = ...

That way, it could be declared in whatever scope the user needs it, either in JobTest:

implicit val parquetCoder: Coder[MyRecord] = parquetTestCoder[MyRecord](projection, Some(FilterApi.lt(...)))

JobTest[MyJob.type]
  .input(ParquetAvroIO[MyRecord]("path"), records)
...

Or just ad-hoc:

implicit val parquetCoder: Coder[MyRecord] = parquetTestCoder[MyRecord](projection, Some(FilterApi.lt(...)))

val record: MyRecord = ???

record coderShould ...

wdyt?

clairemcginty (Contributor, Author) followed up:

I briefly tried out this idea and realized it won't work well as a Coder that's applied per-element: if applying the FilterPredicate filters out the record, Coder#decode will return null. Thus, the user code would have to be written to handle null records:

implicit val parquetCoder: Coder[MyRecord] = parquetTestCoder[MyRecord](projection, Some(FilterApi.lt(...)))

sc
  .parquetAvroFile[T](path)
  .filter(_ != null)

which is a bad pattern to enforce, so I think this idea is out.

A reviewer (Contributor) commented:

I think we need to update the test ParquetIOs to accept projection and predicate if we want to enable this testing in JobTest

clairemcginty (Contributor, Author) replied:

I think this is getting too complicated. IMO, let's not wire it into JobTest by default, but keep these as test helpers that play nicely with scalatest (I think the custom scalatest matcher route doesn't 100% work here because withProjection/withPredicate aren't strictly Matchers or Assertions). I refactored the API to work like:

records withFilter filter withProjection projection should have size(...)
records withFilter filter withProjection projection should containInAnyOrder(...)

It can also be used explicitly with JobTest, by just applying it to a test input.
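The refactored helper shape can be approximated in plain Scala, using ordinary functions in place of a real FilterPredicate and projection Schema (ParquetTestRecords and its methods are illustrative names, not the actual scio-test-parquet API):

```scala
// Illustrative builder: apply a test-time filter, then a projection, to mock
// records, approximating `records withFilter filter withProjection projection`.
final case class ParquetTestRecords[T](records: Seq[T]) {
  // Stand-in for applying a Parquet FilterPredicate at read time
  def withFilter(predicate: T => Boolean): ParquetTestRecords[T] =
    ParquetTestRecords(records.filter(predicate))

  // Stand-in for reading back only the projected columns
  def withProjection[U](projection: T => U): Seq[U] =
    records.map(projection)
}

case class TestRecord(intField: Int, name: String)

val input = Seq(TestRecord(1, "a"), TestRecord(7, "b"), TestRecord(10, "c"))
val projected = ParquetTestRecords(input)
  .withFilter(_.intField > 5)   // like FilterApi.gt(FilterApi.intColumn("int_field"), 5)
  .withProjection(_.intField)   // like projecting to just the int_field column
```

The filtered and projected output can then be asserted on directly with scalatest matchers, or fed to a JobTest input as the comment above describes.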

@clairemcginty clairemcginty marked this pull request as ready for review May 30, 2024 18:44
@clairemcginty clairemcginty changed the title (WIP) (#5296) Support Parquet predicates/projections in tests (#5296) Support Parquet predicates/projections in tests May 30, 2024