flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Fabian Hueske <fhue...@gmail.com>
Subject Re: Outer-join operator integration with DataSet API (FLINK-2576)
Date Thu, 03 Sep 2015 13:05:51 GMT
Hi Johann, hi Ricky,

Thanks for reaching out to the mailing list before taking action!

I do also prefer option c.
In principle, all inner join strategies can also be applied for all outer
joins (for some hash strategies, a special HashTable implementation is
required).
I propose to add two methods for each join type, with and without JoinHint.
The JoinHint variant should fail for strategies which are not applicable,
yet (all hash-based).

You are right that there is quite a bit of duplicated code in the API code.
It could really use some refactoring and deduplication.
However, I see this as a separate issue and would also go for the separate
operator approach in the beginning.
Maybe you can implement a common operator base for all outer joins.

If you are interested, you can help refactoring the API as a follow up
issue ;-)

Cheers, Fabian



2015-09-01 18:34 GMT+02:00 Till Rohrmann <trohrmann@apache.org>:

> Hi Johann,
>
> I'd prefer 1.c, because the different join variants are semantically
> different and this should be IMO reflected in the API. Moreover, the
> `JoinHints` are used to give hints for the selection of the underlying
> strategy for the different join variants. For `leftOuterJoin` you could
> either use a sort-merge-join or a hash-join where the left side is the
> probe side. For the `rightOuterJoin` you will have different strategies:
> Either you use a sort-merge-join or a hash-join where the right side is the
> probe side. Thus, the `JoinHints` for both variants cannot be the same.
>
> Since I'm not too familiar with the code base of the join operator
> implementation, I cannot give you a really good advice for question 2. In
> general it would be good to avoid as much redundant code as possible since
> it makes the code base harder to maintain. But if this should entail a
> disproportional larger work effort, then I think it's also fine to follow
> the CoGroup way.
>
> Cheers,
> Till
>
> On Tue, Sep 1, 2015 at 6:02 PM, Johann Kovacs <me@jkovacs.de> wrote:
>
> > Hi all,
> >
> > we (Ricky and I) are currently working on the outer join
> > implementation for Flink (FLINK-687, previous pull requests #907,
> > #1052).
> >
> > I am now looking for advice on 2 issues specifically regarding the
> > integration of the outer join operator with the DataSet API
> > (FLINK-2576).
> >
> > 1. There are several options of exposing the operator to the user via
> > the DataSet API and I'd just like to hear your preferences between the
> > following options (or other suggestions if I missed something):
> >   a. DataSet#outerJoin(DataSet other, OuterJoinType outerJoinType)
> > [i.e. asking the user to pass an enum left-, right-, or full outer
> > join]
> >   b. DataSet#join(DataSet other, JoinType joinType)  [i.e. like option
> > a, but generalized to work for all: inner-, left-, right-, full outer
> > joins]
> >   c. DataSet#left/right/fullOuterJoin(DataSet other)  [i.e. a fully
> > qualified method for each operator]
> >
> > Personally I'm partial towards options a and c, although a does have
> > the advantage of not blowing up the API too much (imagine adding
> > additional optional parameters, such as JoinHint, to each of option
> > c's methods).
> >
> > 2. I would have liked to implement the outer join operator API by
> > reusing as much code & functionality as possible from
> > org.apache.flink.api.java.operators.JoinOperator and JoinOperatorBase
> > (especially all the KeySelector, semantic annotations, and tuple
> > unwrapping stuff...) but I feel like this would bite me sooner or
> > later due to incompatibilities or other minor differences between the
> > behaviour of those operators.
> > I imagine this is the reason why lots of this functionality was
> > duplicated for the CoGroup operator implementation. Which makes me
> > think I should probably go the same route and duplicate the necessary
> > APIs, and then maybe try to refactor later?
> > Any opinions or hints regarding this?
> >
> > Thanks in advance,
> > Johann
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message