flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vasiliki Kalavri <vasilikikala...@gmail.com>
Subject Re: Containment Join Support
Date Sat, 18 Jul 2015 21:15:57 GMT
Hi Martin,

I'm really glad to see that you've started using Gelly :)

I think that a graph summarization library method would be a great
addition!
Let me know if you need help and if you want to discuss ideas or other
methods.

Cheers,
Vasia.

On 17 July 2015 at 12:25, Martin Junghanns <martin.junghanns@gmx.net> wrote:

>  Hi Fabian, hi Stephen,
>
> thanks for answering my question. Good hint with the list replication, I
> will benchmark this vs. cross + filter.
>
> Best, Martin
>
>
> Am 17.07.2015 um 11:17 schrieb Stephan Ewen:
>
> I would rewrite this to replicate the list into tuples:
>
>  "foreach x in list: emit (x, list)"
> Then join on fields 0.
>
>  This replicates the lists, but makes the join very efficient.
>
> On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <fhueske@gmail.com> wrote:
>
>>    Hi Martin,
>>
>>  good to hear that you like Flink :-)
>>  AFAIK, there are no plans to add a containment join. The Flink community
>> is currently working on adding support for outer joins.
>>  Regarding a containment join, I am not sure about the number of use
>> cases. I would rather try to implement it on top of Flink's batch API
>> instead of adding it as an internal feature/operator to the system because
>> this would touch a lot of things (API, optimizer, operator implementation).
>>
>>  There might be better ways to implement a containment join than using a
>> cross and a filter.
>> - Do you know a distributed algorithm for containment joins? Maybe it can
>> be implemented with Flink's API.
>> - I guess, you are implementing a generic graph framework, but can you
>> make certain assumptions about the data such as relative sizes of the
>> inputs or avg/max size of the lists, etc.?
>>
>>  Contributions to Gelly (and Flink in general) are highly welcome.
>>
>>  Best, Fabian
>>
>>
>> 2015-07-16 9:39 GMT+02:00 Martin Junghanns <martin.junghanns@gmx.net>:
>>
>>> Hi everyone,
>>>
>>> at first, thanks for building this great framework! We are using Flink
>>>  and especially Gelly for building a graph analytics stack (gradoop.com
>>> ).
>>>
>>> I was wondering if there is a [planned] support for a containment join
>>> operator. Consider the following example:
>>>
>>> DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
>>> DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2,
1>, <5, 2>}
>>>
>>> What I want to compute is
>>>
>>> left.join(right).where(list).contains(tuple.f0) :=
>>>
>>> {
>>> <[0, 1], <0,1>>, <[0, 1], <1, 0>>,
>>> <[2, 3, 4], <2, 1>>,
>>> <[5], <5, 2>
>>> }
>>>
>>> At the moment, I am solving that using cross and filter, which can be
>>> expensive.
>>>
>>> The generalization of that operator would be "set containment join",
>>> where you join if the right set is contained in the left set.
>>>
>>> If there is a general need for that operator, I would also like to
>>> contribute to its implementation.
>>>
>>> But maybe, there is already another nice solution which I didn't
>>> discover yet?
>>>
>>> Any help would be appreciated. Especially since I would also like to
>>> contribute some of our graph operators (e.g., graph summarization) back
>>> to Flink/Gelly (current WIP state can be found here: [1]).
>>>
>>> Thanks,
>>>
>>> Martin
>>>
>>>
>>> [1]
>>>
>>> https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java
>>>
>>>
>>>
>>
>
>

Mime
View raw message