Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

euphoria-flink: data type transparency #45

Open
2 tasks
xitep opened this issue Mar 3, 2017 · 1 comment
Open
2 tasks

euphoria-flink: data type transparency #45

xitep opened this issue Mar 3, 2017 · 1 comment

Comments

@xitep
Copy link
Contributor

xitep commented Mar 3, 2017

Execution engines can do much better at optimizations if they transparently know what types they are working with. Efforts in the Spark as well as the Flink community proof optimization potentials in this regard. As a layer above such execution engines Euphoria must provide type specific information to executors in order to stay relevant (in terms of performance). In this ticket we'll focus on Flink.

Motivation

The primary background for this ticket was triggered by an attempt to shave off some of the overhead mentioned in #13 and #14.

Experiment

Using opaque data types with general purpose serialization has considerable, negative effects on optimizations that Flink tries to apply by default. I was able to see the effect in an experiment, where the "core" operation of the flow is the following (basically just a windowed word-count):

      ReduceByKey
            .of(input)
            .keyBy(Pair::getSecond)
            .valueBy(e -> 1L)
            .combineBy(Sums.ofLongs())
            .windowBy(Time.of(shortInterval), Pair::getFirst)
            .output();

Changing Euphoria's flink batch executor such that it uses Flink's native and Flink's pojo based serializers for the types involved during the reduce operation, squeezed out about half of the original execution time of the program. The amount of data shuffled was approximately the same. Unfortunately, such an approach required me to explicitly provide the return type of the .keyBy function to Euphoria's FlinkExecutor's internals. I was not able to derive the return type of a lambda in an automatic manner without the user having to explicitly state it.

What to do next

  • Find a non-verbose way for clients to provide required type information to the executor translators, i.e. return types of UDFs (we might not really succeed here and might end up with some verbose construct; maybe we can make the verbose constructs and the implied optimizations merely an opt-in?)
  • Leverage the types in Euphoria's Flink batch and stream executors.
    • Leverage type information for objects used internally in both flink executors to avoid general purpose serialization by kryo when possible.
    • Using TupleX instead of Pair and Tripple is a good start.
    • For internal types, utilizing POJOs instead of opaque data types is another easy gain.
    • Note: it will not help much if we switch over to Tuple but fail to provide type information of the actually used fields.
    • Attention: proper delegation of the window types from one operator to another due to attached windowing may require substantial changes to the current executor code.
@horkyada
Copy link
Contributor

We shouldn't forget that to allow Flink to fully operate on serialized data, it needs to determine the right offset of the desired data. Flink does it through the possibility of setting the number of the field(s) in tuples and names of the filed(s) in POJO. E.g.:

 DataStream<MyPojo<String, Integer>> longTrends = prepared
        .keyBy("query")
        .timeWindow(longInterval, shortInterval)
        .sum("count");

This should be probably covered by Euphoria too.

xitep pushed a commit that referenced this issue Apr 12, 2017
xitep pushed a commit that referenced this issue Apr 12, 2017
xitep pushed a commit that referenced this issue Apr 24, 2017
xitep pushed a commit that referenced this issue Apr 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants