flink-issues mailing list archives

From "Fabian Hueske (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-685) Add support for semi-joins
Date Fri, 17 Jul 2015 20:07:04 GMT

    [ https://issues.apache.org/jira/browse/FLINK-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631824#comment-14631824 ]

Fabian Hueske commented on FLINK-685:

Hi [~pietropinoli],

that's a good question ;-) 
If we decide to implement the feature, there are basically two ways to do it:
1) Light-weight, on top of a CoGroup (and Map). This uses only Flink's existing API
but won't be very efficient.
2) With a custom hash-based execution strategy. This means extending the optimizer and adding
runtime code that operates on binary data, which is a lot of implementation effort. However, it should
perform much better than option 1).
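
As a rough illustration of option 1), here is a minimal sketch in plain Java (not Flink's API) that mimics what a CoGroup function would see: for each key, the group of "filtered" tuples and the group of "filtering" tuples. The class and method names are made up for this example, and the grouping uses an in-memory map only to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Flink: models a semi-join built on CoGroup semantics.
final class CoGroupSemiJoin {

    // The two groups that share one key, like the two iterables a CoGroup function receives.
    static final class Groups<L, R> {
        final List<L> filtered = new ArrayList<>();
        final List<R> filtering = new ArrayList<>();
    }

    static <L, R, K> List<L> semiJoin(List<L> filtered, List<R> filtering,
                                      Function<L, K> leftKey, Function<R, K> rightKey) {
        // Group both inputs by join key (Flink would do this by sorting or hashing).
        Map<K, Groups<L, R>> byKey = new LinkedHashMap<>();
        for (L l : filtered) {
            byKey.computeIfAbsent(leftKey.apply(l), k -> new Groups<>()).filtered.add(l);
        }
        for (R r : filtering) {
            byKey.computeIfAbsent(rightKey.apply(r), k -> new Groups<>()).filtering.add(r);
        }
        // "CoGroup function": emit the "filtered" group once if the "filtering" group is non-empty.
        List<L> out = new ArrayList<>();
        for (Groups<L, R> g : byKey.values()) {
            if (!g.filtering.isEmpty()) {
                out.addAll(g.filtered);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList("a:1", "b:2", "a:3", "c:4");
        List<String> customers = Arrays.asList("a", "a", "c"); // key "a" appears twice
        // prints [a:1, a:3, c:4]
        System.out.println(semiJoin(orders, customers, o -> o.substring(0, 1), c -> c));
    }
}
```

Note that both inputs are fully grouped here, including all tuples of the "filtering" side, which is part of why this approach is not expected to be very efficient.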

I haven't seen anybody ask for this feature, so I would go with option 1) if we want to
add it.
We can add a more efficient execution strategy later if better performance is needed.
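
For comparison, the core idea behind a hash-based strategy can be sketched as follows, again in plain Java with hypothetical names. It only shows the build/probe logic on Java objects, with a hash table that stores just the keys of the "filtering" input as suggested in the issue description quoted below. An actual execution strategy would operate on binary data and be integrated with the optimizer, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of Flink: keys-only hash semi-join on Java objects.
final class HashSemiJoin {

    static <L, R, K> List<L> semiJoin(Iterable<L> filtered, Iterable<R> filtering,
                                      Function<L, K> leftKey, Function<R, K> rightKey) {
        // Build phase: keep only the distinct join keys of the "filtering" input.
        Set<K> keys = new HashSet<>();
        for (R r : filtering) {
            keys.add(rightKey.apply(r));
        }
        // Probe phase: each "filtered" tuple is checked once, so it is emitted at most once,
        // no matter how many "filtering" tuples share its key.
        List<L> out = new ArrayList<>();
        for (L l : filtered) {
            if (keys.contains(leftKey.apply(l))) {
                out.add(l);
            }
        }
        return out;
    }
}
```

Because the set holds keys rather than full tuples, duplicate keys on the "filtering" side cost no extra memory, and the "filtered" input is streamed through without being grouped.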

> Add support for semi-joins
> --------------------------
>                 Key: FLINK-685
>                 URL: https://issues.apache.org/jira/browse/FLINK-685
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: GitHub Import
>            Priority: Minor
>              Labels: github-import
>             Fix For: pre-apache
> A semi-join is basically a join filter. One input is "filtering" and the other one is "filtered".
> A tuple of the "filtered" input is emitted exactly once if the "filtering" input has
one (or more) tuples with matching join keys. That means that the output of a semi-join has
the same type as the "filtered" input and the "filtering" input is completely discarded.
> In order to support a semi-join, we need to add an additional physical execution strategy
that ensures that a tuple of the "filtered" input is emitted only once, even if the "filtering"
input has more than one tuple with matching keys. Furthermore, a couple of optimizations over
standard joins are possible, such as storing only the keys, not the full tuples, of the "filtering"
input in a hash table.
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/685
> Created by: [fhueske|https://github.com/fhueske]
> Labels: enhancement, java api, runtime, 
> Milestone: Release 0.6 (unplanned)
> Created at: Mon Apr 14 12:05:29 CEST 2014
> State: open

This message was sent by Atlassian JIRA
