
Add support for Zstd coders #5321

Merged
merged 26 commits into from
May 20, 2024
Conversation

kellen (Contributor) commented Apr 2, 2024

Adds

  • saveAsZstdDictionary to train a Zstd dictionary on an arbitrary SCollection[T]. It estimates the average size of the T elements, collects n elements based on a target training-set size, then trains and saves the Zstd dictionary.
  • A Scala ZstdCoder object with transform Coders for a simple T or for each side of a (K, V).
  • A command-line argument mapping a type to a dictionary, so that instances of MyClass get Zstd compression automagically (this probably fails if the type is parameterized): --zstdDictionary=com.spotify.scio.MyClass:gs://bucket/path/dict.bin

codecov bot commented Apr 2, 2024

Codecov Report

Attention: Patch coverage is 58.82353%, with 28 lines in your changes missing coverage. Please review.

Project coverage is 61.08%. Comparing base (6ebf9c4) to head (60afcc4).
Report is 52 commits behind head on main.

Files Patch % Lines
...rc/main/scala/com/spotify/scio/io/ZstdDictIO.scala 0.00% 26 Missing ⚠️
...la/com/spotify/scio/coders/CoderMaterializer.scala 92.30% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5321      +/-   ##
==========================================
- Coverage   62.69%   61.08%   -1.61%     
==========================================
  Files         301      306       +5     
  Lines       10848    10993     +145     
  Branches      773      774       +1     
==========================================
- Hits         6801     6715      -86     
- Misses       4047     4278     +231     

☔ View full report in Codecov by Sentry.

className.replaceAll("\\$", ".") -> path
kellen (Contributor, Author) commented Apr 2, 2024

Maybe missing some other cases here?

The current version exists because Class.forName requires the binary name (OuterClass$InnerClass), while the type name in the coders uses the canonical form (OuterClass.InnerClass).
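The binary-vs-canonical name mismatch can be reproduced with plain JDK classes, no Scio needed; a minimal sketch:

```scala
// Class.forName wants the *binary* name, where nested classes use '$'.
val entry = Class.forName("java.util.Map$Entry") // succeeds

// The canonical (source-code) form is not resolvable:
val canonicalFails =
  try { Class.forName("java.util.Map.Entry"); false }
  catch { case _: ClassNotFoundException => true }

// Converting binary -> canonical, as done for the coder type names:
val canonical = "java.util.Map$Entry".replaceAll("\\$", ".")

assert(entry.isInterface)
assert(canonicalFails)
assert(canonical == "java.util.Map.Entry")
```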

kellen (Contributor, Author) commented:

A blacklist based on package name may also make sense

kellen (Contributor, Author) commented:

Added blacklist

Comment on lines 59 to 60
s.split(":", 2).toList match {
  case className :: path :: Nil =>
A reviewer (Contributor) commented:

The default case should catch inputs with more than one : separator.

Suggested change
s.split(":", 2).toList match {
  case className :: path :: Nil =>
s.split(":") match {
  case Array(className, path) =>

kellen (Contributor, Author) commented:

The path part can contain a :, a la gs://.

I suppose we could use a variety of other characters, e.g. ,, but : makes the most sense as a 'mapping' separator IMO.
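The limit argument is exactly what keeps the scheme separator intact; a minimal sketch of the two variants:

```scala
val arg = "com.spotify.scio.MyClass:gs://bucket/path/dict.bin"

// split with a limit of 2: only the first ':' separates; the path keeps its ':'
val limited = arg.split(":", 2).toList
assert(limited == List("com.spotify.scio.MyClass", "gs://bucket/path/dict.bin"))

// the suggested unlimited split tears the gs:// URI apart
val unlimited = arg.split(":").toList
assert(unlimited == List("com.spotify.scio.MyClass", "gs", "//bucket/path/dict.bin"))
```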

Comment on lines +38 to +44
private def unwrapZstd[T](options: CoderOptions, coder: BCoder[T]): BCoder[T] =
  coder match {
    case c: BZstdCoder[T] =>
      val underlying = c.getCoderArguments.get(0).asInstanceOf[BCoder[T]]
      unwrap(options, underlying)
    case _ => coder
  }
A reviewer (Contributor) commented:

Why not factor this into unwrap, since it is always chained?

kellen (Contributor, Author) commented:

In e.g. getTupleCoders, we unwrap only the top-level Zstd coder for Tuple2[K, V] but want to retain any Zstd coder on K. Putting the Zstd unwrapping in unwrap leads to the dict being discarded on many or every unwrap call.
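The selective-unwrap behavior can be illustrated with a toy model (the Toy, Plain, Zstd, and Tup types below are hypothetical stand-ins for the Beam coder classes, not the real API):

```scala
// Toy stand-ins for BCoder / BZstdCoder / Tuple2Coder.
sealed trait Toy
case class Plain(name: String) extends Toy
case class Zstd(underlying: Toy) extends Toy
case class Tup(k: Toy, v: Toy) extends Toy

// Mirrors unwrapZstd: strip Zstd wrappers at the top level only,
// without descending into a tuple's component coders.
def unwrapZstd(c: Toy): Toy = c match {
  case Zstd(u) => unwrapZstd(u)
  case other   => other
}

// A Zstd-wrapped tuple coder whose key coder is itself Zstd-wrapped:
val coder = Zstd(Tup(Zstd(Plain("K")), Plain("V")))

// Top-level wrapper is removed; the key's dictionary-bearing wrapper survives.
assert(unwrapZstd(coder) == Tup(Zstd(Plain("K")), Plain("V")))
```

Recursing into the children here would also strip Zstd(Plain("K")), which is exactly the dict-discarding behavior the comment describes.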

Comment on lines +30 to +31
keyDict: Array[Byte] = null,
valueDict: Array[Byte] = null
A reviewer (Contributor) commented:

I'm not a fan of null defaults.
Also, we used kv for beam.KV in many places, not tuples. This signature is a bit confusing.

Do we actually need this API, since this is a regular Tuple2Coder with implicit (un)compressed key and value coders?

kellen (Contributor, Author) commented:

This is for user ergonomics if setting the coder manually.

sc.parallelize[(String, String)](
  List(...)
)(ZstdCoder.kv(valueDict = dictBytes))

Not providing this means users need to manually lift their dict into a Beam coder and then into a Scio coder. And using optionals here makes the API unpleasant IMO, even though I generally agree with you.

Is ZstdCoder.tuple2 less annoying/confusing?

A reviewer (Contributor) commented:

I'm lightly in favor of tuple2 (unfortunately) -- I think kv is already semantically "taken" by Beam's KV class.

// training bytes may not exceed 2GiB a.k.a. the max value of an Int
val trainingBytesTargetActual: Int = Option(trainingBytesTarget).getOrElse {
  val computed =
    Try(Math.multiplyExact(zstdDictSizeBytes, 100)).toOption.getOrElse(Int.MaxValue)
A reviewer (Contributor) commented:

Do we want to catch this or let it fail?

kellen (Contributor, Author) commented:

Hmm, actually I am unsure what happens in this case, i.e. whether the training size relates to the number of elements added or to the dict size itself. 2 GiB / 100 implies that the max Zstd dictionary size should be about 20 MB, so maybe we should throw on that.

A reviewer (Contributor) commented:

+1 on throwing an exception

kellen (Contributor, Author) commented:

Re-reading, ~100x is just a recommendation, so using fewer items would probably decrease the dictionary's effectiveness but wouldn't be catastrophic. Going to throw anyway, since the recommendations argue against large dictionaries.
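The overflow guard in the quoted snippet can be exercised in isolation; a small sketch of the boundary (sizes in bytes, as Ints):

```scala
import scala.util.Try

// Mirrors the quoted logic: target ~100x the dictionary size,
// capped at Int.MaxValue when the multiplication overflows.
def trainingTarget(dictSizeBytes: Int): Int =
  Try(Math.multiplyExact(dictSizeBytes, 100)).toOption.getOrElse(Int.MaxValue)

// A 1 MiB dictionary -> 100 MiB of training data, well within Int range.
assert(trainingTarget(1 << 20) == 100 * (1 << 20))

// Just past Int.MaxValue / 100, multiplyExact throws and we cap at Int.MaxValue.
assert(trainingTarget(Int.MaxValue / 100 + 1) == Int.MaxValue)
```

This also shows why ~21 MB is the practical ceiling for zstdDictSizeBytes under the 100x rule, which motivates throwing on larger requests.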

Comment on lines +81 to +108
// estimate the sample rate we need by examining numElementsForSizeEstimation elements
val streamsCntSI = scoll.count.asSingletonSideInput(0L)
val sampleRateSI = scoll
  .take(numElementsForSizeEstimation)
  .map(v => toBytes(v).length)
  .sum
  .withSideInputs(streamsCntSI)
  .map { case (totalSize, ctx) =>
    val avgSize = totalSize / numElementsForSizeEstimation
    val targetNumElements = trainingBytesTargetActual / avgSize
    val sampleRate = targetNumElements.toDouble / ctx(streamsCntSI)
    logger.info(s"Computed sample rate for Zstd dictionary: ${sampleRate}")
    sampleRate
  }
  .toSCollection
  .asSingletonSideInput

scoll
  .withSideInputs(sampleRateSI)
  .flatMap {
    case (s, ctx) if new Random().nextDouble() <= ctx(sampleRateSI) =>
      Some(toBytes(s))
    case _ => None
  }
  .toSCollection
  .keyBy(_ => ())
  .groupByKey
  .map { case (_, elements) =>
A reviewer (Contributor) commented:

Why not use sample with sampleSize, since the result should fit in memory anyway?
We'd just have to use an Int for numElementsForSizeEstimation.

Suggested change
// estimate the sample rate we need by examining numElementsForSizeEstimation elements
val streamsCntSI = scoll.count.asSingletonSideInput(0L)
val sampleRateSI = scoll
  .take(numElementsForSizeEstimation)
  .map(v => toBytes(v).length)
  .sum
  .withSideInputs(streamsCntSI)
  .map { case (totalSize, ctx) =>
    val avgSize = totalSize / numElementsForSizeEstimation
    val targetNumElements = trainingBytesTargetActual / avgSize
    val sampleRate = targetNumElements.toDouble / ctx(streamsCntSI)
    logger.info(s"Computed sample rate for Zstd dictionary: ${sampleRate}")
    sampleRate
  }
  .toSCollection
  .asSingletonSideInput
scoll
  .withSideInputs(sampleRateSI)
  .flatMap {
    case (s, ctx) if new Random().nextDouble() <= ctx(sampleRateSI) =>
      Some(toBytes(s))
    case _ => None
  }
  .toSCollection
  .keyBy(_ => ())
  .groupByKey
  .map { case (_, elements) =>
scoll
  .sample(numElementsForSizeEstimation)
  .map { elements =>

kellen (Contributor, Author) commented:

This is so users don't need to know the average size of elements in the pipeline; we estimate it based on numElementsForSizeEstimation. Taking only those elements is not useful, since you may need many more elements to actually train.
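The estimation arithmetic can be sketched in isolation (the numbers below are hypothetical, mirroring the avgSize, targetNumElements, and sampleRate steps in the quoted snippet):

```scala
// Hypothetical pipeline stats
val numElementsForSizeEstimation = 100L
val totalSizeOfSample            = 100L * 1024      // 100 sampled elements, 1 KiB each
val totalElementCount            = 1000000L         // full SCollection size
val trainingBytesTarget          = 10 * 1024 * 1024 // 10 MiB of training data

val avgSize           = totalSizeOfSample / numElementsForSizeEstimation
val targetNumElements = trainingBytesTarget / avgSize
val sampleRate        = targetNumElements.toDouble / totalElementCount

assert(avgSize == 1024L)             // ~1 KiB per element
assert(targetNumElements == 10240L)  // far more than the 100 sized elements
assert(sampleRate == 0.01024)        // probability applied to every element
```

This illustrates why taking only the numElementsForSizeEstimation elements would not suffice: hitting the training-bytes target here requires sampling roughly 10k elements, two orders of magnitude more than were sized.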

A reviewer (Contributor) commented:

Maybe something like this can help: #5352?

kellen (Contributor, Author) commented May 3, 2024

Yeah. I imagine doing a bunch of priority-queue merges is somewhat less efficient (though more accurate), but that perhaps doesn't matter in this case, since we have small amounts of data in the end.

@kellen kellen merged commit fe93831 into main May 20, 2024
12 checks passed
@kellen kellen deleted the kellen/shufflezip branch May 20, 2024 19:48
clairemcginty pushed a commit that referenced this pull request May 30, 2024
3 participants