flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Martin Junghanns <martin.jungha...@gmx.net>
Subject Re: Containment Join Support
Date Fri, 17 Jul 2015 10:25:29 GMT
Hi Fabian, hi Stephen,

thanks for answering my question. Good hint with the list replication, I 
will benchmark this vs. cross + filter.

Best, Martin

Am 17.07.2015 um 11:17 schrieb Stephan Ewen:
> I would rewrite this to replicate the list into tuples:
>
> "foreach x in list: emit (x, list)"
> Then join on fields 0.
>
> This replicates the lists, but makes the join very efficient.
>
> On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <fhueske@gmail.com 
> <mailto:fhueske@gmail.com>> wrote:
>
>     Hi Martin,
>
>     good to hear that you like Flink :-)
>     AFAIK, there are no plans to add a containment join. The Flink
>     community is currently working on adding support for outer joins.
>     Regarding a containment join, I am not sure about the number of
>     use cases. I would rather try to implement it on top of Flink's
>     batch API instead of adding it as an internal feature/operator to
>     the system because this would touch a lot of things (API,
>     optimizer, operator implementation).
>
>     There might be better ways to implement a containment join than
>     using a cross and a filter.
>     - Do you know a distributed algorithm for containment joins? Maybe
>     it can be implemented with Flink's API.
>     - I guess, you are implementing a generic graph framework, but can
>     you make certain assumptions about the data such as relative sizes
>     of the inputs or avg/max size of the lists, etc.?
>
>     Contributions to Gelly (and Flink in general) are highly welcome.
>
>     Best, Fabian
>
>
>     2015-07-16 9:39 GMT+02:00 Martin Junghanns
>     <martin.junghanns@gmx.net <mailto:martin.junghanns@gmx.net>>:
>
>         Hi everyone,
>
>         at first, thanks for building this great framework! We are
>         using Flink
>         and especially Gelly for building a graph analytics stack
>         (gradoop.com <http://gradoop.com>).
>
>         I was wondering if there is a [planned] support for a
>         containment join
>         operator. Consider the following example:
>
>         DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
>         DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2,
1>,
>         <5, 2>}
>
>         What I want to compute is
>
>         left.join(right).where(list).contains(tuple.f0) :=
>
>         {
>         <[0, 1], <0,1>>, <[0, 1], <1, 0>>,
>         <[2, 3, 4], <2, 1>>,
>         <[5], <5, 2>
>         }
>
>         At the moment, I am solving that using cross and filter, which
>         can be
>         expensive.
>
>         The generalization of that operator would be "set containment
>         join",
>         where you join if the right set is contained in the left set.
>
>         If there is a general need for that operator, I would also like to
>         contribute to its implementation.
>
>         But maybe, there is already another nice solution which I didn't
>         discover yet?
>
>         Any help would be appreciated. Especially since I would also
>         like to
>         contribute some of our graph operators (e.g., graph
>         summarization) back
>         to Flink/Gelly (current WIP state can be found here: [1]).
>
>         Thanks,
>
>         Martin
>
>
>         [1]
>         https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java
>
>
>
>


Mime
View raw message