Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

euphoria-core: map-side join support #41

Open
xitep opened this issue Feb 24, 2017 · 1 comment
Open

euphoria-core: map-side join support #41

xitep opened this issue Feb 24, 2017 · 1 comment
Labels

Comments

@xitep
Copy link
Contributor

xitep commented Feb 24, 2017

Map-side join is a special derivation of the Join operator, which can be turned into a plain MapElements operation, looking up the other side of the data to join in through a random-access API.

We'd like the write down such join operations through the existing Join operator. The special derivation mentioned above is merely a runtime characteristic which does not alter the semantics of the operation. In other words, a map-side join might be more efficient in certain situations, but the operator would still produce the results if carried out through a conventional (hash or sort based) join implementation. Prerequisites:

  • We'd need to provide support for data sets that can be accessed randomly.
  • We'd need to provide some way of hinting an executor to either do or not to do a map-side join "optimization" when it's possible.
  • The join operation would be effectively only a left or a right join (with zero or exactly one joined-value to be practical) - inner join semantics can be implemented through both of these.

On the other hand, if we are about to join two data sets which are already ordered by the "join key" and one or the other would provide the possibility to seek, we'd be able to optimize such a map-side join even more efficiently by turning the look-up into a re-seek. for a left-join, this would naturally support N join-values, which the random-access approach doesn't. As we can see, this is approach is a super-set of the random-access-approach. The canonical use-case would be to join to distinct databases with the same "primary key" (a typical key/value store delivers items ordered by the "key").

We clearly need more elaboration here, before starting with it. TBD

@je-ik
Copy link
Contributor

je-ik commented Feb 28, 2017

The main issue here is how to ensure that the operation is guaranteed to have the same outcome as if it would be executed in classical reduce-state-by-key manner. There is no problem with this when there is no windowing (i.e. batch windowing), but with some other windowing, it might be tricky, as it would probably require that the random access storage would "understand" the windowing and that it would be able to store multiple windowed values for the same key.

I'd suggest, that we restrict the scope of this issue only to non-windowed (batch windowed) joins. Then it might solve part of the comments in #38.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants