
Utilities for custom partitioning #31

Open
rnowling opened this issue Sep 3, 2015 · 7 comments

Comments

@rnowling
Contributor

rnowling commented Sep 3, 2015

Custom partitioning can make certain operations easier (e.g., grouping data to control the mapping between data and files). We should evaluate the space of how custom partitioning can be used and provide utilities to make it easier. It may also be good to define high-level tasks that need custom partitioning and present interfaces for those as well.

@erikerlandson
Collaborator

With respect to Spark, you can define custom partitioners, like I did here:
https://github.com/roofmonkey/cambio/blob/eje/common/src/main/scala/com/redhat/et/cambio/common/trees.scala#L1469

I'm not sure if there is any added abstraction you can layer on top of Partitioner that would make it easier.

Another way to do it is to use Spark's native key-based partitioning and manipulate the keys themselves strategically. My intuition is that this might be preferable, since you sit on top of any lower-level optimizations that the Spark community comes up with.
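A minimal sketch of the key-manipulation idea, in plain Scala with no Spark dependency so it stands alone. The `nonNegativeMod` helper reproduces how Spark's `HashPartitioner` maps `key.hashCode` to a partition index; the `Record` type and the file-based key are hypothetical illustrations, not anything from this project:

```scala
// Spark's HashPartitioner assigns a key to partition
// nonNegativeMod(key.hashCode, numPartitions), so choosing keys
// deliberately controls which records end up co-located.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val r = x % mod
  if (r < 0) r + mod else r
}

// Hypothetical example: key each record by the output file it should
// land in, so key-based partitioning groups records file-by-file.
case class Record(file: String, payload: String)

val records = Seq(Record("a.txt", "x"), Record("b.txt", "y"), Record("a.txt", "z"))
val keyed = records.map(r => (r.file, r))
// In Spark (assumed usage, n partitions):
//   sc.parallelize(keyed).partitionBy(new HashPartitioner(n))
```

The point is that no custom `Partitioner` subclass is needed; all the control lives in how the synthetic key is computed.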

@willb
Member

willb commented Sep 4, 2015

Another way to do it is to use Spark's native key-based partitioning, and manipulate the keys themselves strategically.

This is what I've suggested in the past as well.

@erikerlandson
Collaborator

Potential efficiency benefits aside, manipulating keys is an easier way to think about the problem, and it is totally general, since there is no restriction on key data types except an ordering relation.

Since the key space is unrestricted, there are going to be all manner of possible formalisms you can embed within that space.

@willb
Member

willb commented Sep 4, 2015

@erikerlandson Agreed; the issue AIUI is that we'd need to guarantee (for the use cases that @rnowling immediately cares about, at least) that each partition has exactly one key. So the naïve solution is a little clunky: count the number of keys, repartition to that number, and make a HashPartitioner that assigns each key to a unique partition. I think the easiest way to do the second step is rdd.intersection(rdd, partitioner), but there might be a better option.

@erikerlandson
Collaborator

Well, maybe the cut example is informative for that. You can set up a key space, and instantiate a partitioner that maps each key to its own partition.

@erikerlandson
Collaborator

In that case, the map would send each key to a unique partition, and numPartitions would just return the size of that map.
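A minimal sketch of that one-partition-per-key partitioner, assuming the map-based scheme described above. `ExactPartitioner` is a hypothetical name, and the `extends org.apache.spark.Partitioner` it would need in Spark is left as a comment so the sketch is self-contained:

```scala
// Hypothetical ExactPartitioner: exactly one partition per distinct key.
// In Spark this would extend org.apache.spark.Partitioner; that
// dependency is elided here so the sketch compiles on its own.
class ExactPartitioner[K](keys: Seq[K]) /* extends Partitioner */ {
  // Map each distinct key to its own partition index.
  private val index: Map[K, Int] = keys.distinct.zipWithIndex.toMap

  def numPartitions: Int = index.size

  // Partitioner's contract takes Any; cast back to K for the lookup.
  def getPartition(key: Any): Int = index(key.asInstanceOf[K])
}
```

In Spark the assumed usage would be to collect the distinct keys first (e.g. rdd.keys.distinct.collect) and then call rdd.partitionBy(new ExactPartitioner(keys)).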

@erikerlandson
Collaborator

Maybe what we want is a formalism built around RDDs that specifically map each key to a unique partition, using a partitioner built on the ideas above, combined with abstractions for manipulating the keys themselves; that is where the interesting action would be, and where the programming effort would focus.
