It would be nice to be able to use jmespath in Spark. The main requirement for this is that code needs to be serializable so that it can be sent to worker nodes. Jackson is serializable, for example, so one can define a "user defined function" or udf like this:
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.functions.udf
val mapper = new ObjectMapper()
// Function that will be serialised and sent to the worker nodes:
val processJson = udf((json: String) => mapper.readValue[SomeClass](json, classOf[SomeClass]) ....
someDataFrame.withColumn("processedJson", processJson(someExistingColumn))
But the same fails for jmespath with a serialisation error:
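import com.fasterxml.jackson.databind.node.ArrayNode
import io.burt.jmespath.jackson.JacksonRuntime
import scala.collection.JavaConverters._
val runtime = new JacksonRuntime()
val query = runtime.compile("""[?type=='pineapple']""")
val jmespathSelection = udf((json: String) =>
  query.search(runtime.parseString(json)) match { case an: ArrayNode => an.elements.asScala.toSeq.map(n => runtime.toString(n)) })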
Happy to work on this when I have time or buy a decent amount of beer for anyone else who gets there first. It is often just a matter of marking the classes as serialisable.
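In the meantime, one workaround I'd expect to work is to ship only the expression string and rebuild the non-serialisable object lazily on each worker. Below is a minimal, Spark-free sketch of that idea using plain Java serialisation. Note that `Engine` and `EngineHolder` are hypothetical stand-ins I made up for illustration (they are not jmespath classes), and `roundTrip` only approximates what Spark does when it ships a closure:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a non-serialisable object such as a compiled query.
class Engine(val expression: String) {
  def search(input: String): Boolean = input.contains(expression)
}

// Serialisable holder: only the expression string is written out. The engine
// field is transient, so it is skipped on write and rebuilt on first access
// after deserialisation.
class EngineHolder(expression: String) extends Serializable {
  @transient lazy val engine: Engine = new Engine(expression)
}

object SerializationSketch {
  // Write the value to bytes and read it back, roughly as Spark would
  // when sending a closure to a worker node.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val holder = roundTrip(new EngineHolder("pineapple"))
    println(holder.engine.search("""{"type":"pineapple"}""")) // prints true
  }
}
```

In a real UDF the holder would presumably wrap the runtime and the compiled query in the same way, so that the closure captures only the holder. I haven't tried that against Spark itself, so treat it as an assumption.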
Getting it to work on Spark seems like a good thing to do, and it shouldn't be too hard. I don't think there is anything that wouldn't be serializable.
The best thing for me would obviously be if you had time to work on it, since you have the code and environment to test it properly. I can probably whip up a Spark environment and get something working, but I couldn't promise that I'd cover all cases. If you gave me some example code that should work but doesn't, that would be a start, though.