flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephan Ewen <se...@apache.org>
Subject Re: Containment Join Support
Date Fri, 17 Jul 2015 09:17:07 GMT
I would rewrite this to replicate the list into tuples:

"foreach x in list: emit (x, list)"
Then join on fields 0.

This replicates the lists, but makes the join very efficient.

On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <fhueske@gmail.com> wrote:

> Hi Martin,
>
> good to hear that you like Flink :-)
> AFAIK, there are no plans to add a containment join. The Flink community
> is currently working on adding support for outer joins.
> Regarding a containment join, I am not sure about the number of use cases.
> I would rather try to implement it on top of Flink's batch API instead of
> adding it as an internal feature/operator to the system because this would
> touch a lot of things (API, optimizer, operator implementation).
>
> There might be better ways to implement a containment join than using a
> cross and a filter.
> - Do you know a distributed algorithm for containment joins? Maybe it can
> be implemented with Flink's API.
> - I guess, you are implementing a generic graph framework, but can you
> make certain assumptions about the data such as relative sizes of the
> inputs or avg/max size of the lists, etc.?
>
> Contributions to Gelly (and Flink in general) are highly welcome.
>
> Best, Fabian
>
>
> 2015-07-16 9:39 GMT+02:00 Martin Junghanns <martin.junghanns@gmx.net>:
>
>> Hi everyone,
>>
>> at first, thanks for building this great framework! We are using Flink
>> and especially Gelly for building a graph analytics stack (gradoop.com).
>>
>> I was wondering if there is a [planned] support for a containment join
>> operator. Consider the following example:
>>
>> DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
>> DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2,
1>, <5, 2>}
>>
>> What I want to compute is
>>
>> left.join(right).where(list).contains(tuple.f0) :=
>>
>> {
>> <[0, 1], <0,1>>, <[0, 1], <1, 0>>,
>> <[2, 3, 4], <2, 1>>,
>> <[5], <5, 2>
>> }
>>
>> At the moment, I am solving that using cross and filter, which can be
>> expensive.
>>
>> The generalization of that operator would be "set containment join",
>> where you join if the right set is contained in the left set.
>>
>> If there is a general need for that operator, I would also like to
>> contribute to its implementation.
>>
>> But maybe, there is already another nice solution which I didn't
>> discover yet?
>>
>> Any help would be appreciated. Especially since I would also like to
>> contribute some of our graph operators (e.g., graph summarization) back
>> to Flink/Gelly (current WIP state can be found here: [1]).
>>
>> Thanks,
>>
>> Martin
>>
>>
>> [1]
>>
>> https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java
>>
>>
>>
>

Mime
View raw message