Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Make Expression Serializable #37

Open
bitdivine opened this issue Sep 26, 2018 · 1 comment
Open

Feature request: Make Expression Serializable #37

bitdivine opened this issue Sep 26, 2018 · 1 comment

Comments

@bitdivine
Copy link

bitdivine commented Sep 26, 2018

It would be nice to be able to use jmespath in Spark. The main requirement for this is that code needs to be serializable so that it can be sent to worker nodes. Jackson is serializable, for example, so one can define a "user defined function" or udf like this:

import org.apache.spark.sql.functions.udf

val mapper = new ObjectMapper()

// Function that will se serialised and sent to the worker nodes:
val processJson = udf((json: String) => mapper.readValue[SomeClass](json, classOf(SomeClass)) ....

someDataFrame.withColumn("processedJson", processJson(someExistingColumn))

But the same fails for jmespath with a serialisation error:

val runtime = new JacksonRuntime()
val query = runtime.compile("""[?type=='pineapple']""")
val jmespathSelection = udf((json: String): Seq[String] =
   query.search(runtime.parseString(json)) match { case an: ArrayNode => an.elements.asScala.toSeq.map(runtime.toString)}

Happy to work on this when I have time or buy a decent amount of beer for anyone else who gets there first. It is often just a matter of marking the classes as serialisable.

@iconara
Copy link
Collaborator

iconara commented Sep 26, 2018

Getting it to work on Spark seems like a good thing to do, and it shouldn't be too hard. I don't think there is anything that wouldn't be serializable.

The best thing for me would obviously be if you had time to work on it, since you have the code and environment to test it properly. I can probably whip up a Spark environment and get something working, but I couldn't promise that I'd cover all cases. If you gave me some example code that should work but doesn't that would be a start, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants