
Utilities for custom partitioning #31

Open
rnowling opened this issue Sep 3, 2015 · 7 comments

Comments

@rnowling
Contributor

rnowling commented Sep 3, 2015

Custom partitioning can make certain operations easier (e.g., grouping data to control the mapping between data and files). We should evaluate the space of how custom partitioning can be used and provide utilities to make it easier. It may also be good to define high-level tasks that need custom partitioning and present interfaces for those as well.

@erikerlandson
Collaborator

With respect to Spark, you can define custom partitioners, like I did here:
https://github.com/roofmonkey/cambio/blob/eje/common/src/main/scala/com/redhat/et/cambio/common/trees.scala#L1469

I'm not sure if there is any added abstraction you can layer on top of Partitioner that would make it easier.

Another way to do it is to use Spark's native key-based partitioning and manipulate the keys themselves strategically. My intuition is that this might be preferable, since you sit on top of any lower-level optimizations that the Spark community comes up with.
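A minimal sketch of the key-manipulation idea, in plain Scala with no Spark dependency so it stands alone. The `nonNegativeMod` helper reproduces how Spark's `HashPartitioner` maps `key.hashCode` to a partition index; the `Record` type and the file-based key are hypothetical illustrations, not anything from this project:

```scala
// Spark's HashPartitioner assigns a key to partition
// nonNegativeMod(key.hashCode, numPartitions), so choosing keys
// deliberately controls which records end up co-located.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val r = x % mod
  if (r < 0) r + mod else r
}

// Hypothetical example: key each record by the output file it should
// land in, so key-based partitioning groups records file-by-file.
case class Record(file: String, payload: String)

val records = Seq(Record("a.txt", "x"), Record("b.txt", "y"), Record("a.txt", "z"))
val keyed = records.map(r => (r.file, r))
// In Spark (assumed usage, n partitions):
//   sc.parallelize(keyed).partitionBy(new HashPartitioner(n))
```

The point is that no custom `Partitioner` subclass is needed; all the control lives in how the synthetic key is computed.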

@willb
Member

willb commented Sep 4, 2015

Another way to do it is to use Spark's native key-based partitioning, and manipulate the keys themselves strategically.

This is what I've suggested in the past as well.

@erikerlandson
Collaborator

Potential efficiency benefits aside, manipulating keys is an easier way to think about the problem, and it is totally general, since there is no restriction on key data types except an ordering relation.

Since the key space is unrestricted, there are going to be all manner of possible formalisms you can embed within that space.

@willb
Member

willb commented Sep 4, 2015

@erikerlandson Agreed; the issue AIUI is that we'd need to guarantee (for the use cases that @rnowling immediately cares about, at least) that each partition has exactly one key. So the naïve solution is a little clunky: count the number of keys, repartition to that number, and make a HashPartitioner that assigns each key to a unique partition. I think the easiest way to do the second step is rdd.intersection(rdd, partitioner), but there might be a better option.

@erikerlandson
Collaborator

Well, maybe the cut example is informative for that. You can set up a key space, and instantiate a partitioner that maps each key to its own partition.

@erikerlandson
Collaborator

In that case, the map would send each key to a unique partition, and numPartitions would just return the size of that map.
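A minimal sketch of that one-partition-per-key partitioner, assuming the map-based scheme described above. `ExactPartitioner` is a hypothetical name, and the `extends org.apache.spark.Partitioner` it would need in Spark is left as a comment so the sketch is self-contained:

```scala
// Hypothetical ExactPartitioner: exactly one partition per distinct key.
// In Spark this would extend org.apache.spark.Partitioner; that
// dependency is elided here so the sketch compiles on its own.
class ExactPartitioner[K](keys: Seq[K]) /* extends Partitioner */ {
  // Map each distinct key to its own partition index.
  private val index: Map[K, Int] = keys.distinct.zipWithIndex.toMap

  def numPartitions: Int = index.size

  // Partitioner's contract takes Any; cast back to K for the lookup.
  def getPartition(key: Any): Int = index(key.asInstanceOf[K])
}
```

In Spark the assumed usage would be to collect the distinct keys first (e.g. rdd.keys.distinct.collect) and then call rdd.partitionBy(new ExactPartitioner(keys)).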

@erikerlandson
Collaborator

Maybe what we want is a formalism built around RDDs that specifically map each key to a unique partition, using a partitioner built on the ideas above, combined with abstractions for manipulating the keys themselves; that is where the interesting action would be, and where the programming effort would focus.
