flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Union of multiple datasets vs Join
Date Mon, 22 Dec 2014 13:32:36 GMT
OK, thanks Fabian. I'd just like to understand the internals of the union of
multiple datasets (partitioning, distribution among servers, memory/disk usage,
etc.). Do you have any reference for this?

Thanks in advance,
Flavio

On Mon, Dec 22, 2014 at 12:46 PM, Fabian Hueske <fhueske@apache.org> wrote:

> Follow the first approach.
> Joins are expensive, union comes for free.
>
> Best, Fabian
>
> 2014-12-22 11:47 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:
>
>> Hi guys,
>>
>> In my use case I have multiple datasets with the same structure (e.g.
>> Tuple3) and I want to produce an output dataset containing all the Tuple3s
>> grouped by the first field (0).
>> I can obtain the same result either by performing a union of all the
>> datasets followed by a group-by (the simplest implementation) or by joining
>> them pairwise (((A->B)->C)->D ...); I don't know whether there is any other
>> solution. When should I use the first approach and when the second? Could
>> you help me understand the internals of the two approaches? I am always
>> wary of chaining multiple joins when I don't know the exact size of the
>> inputs.
>>
>> Best,
>> Flavio
>>
>
>
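
For reference, here is a minimal sketch of the first approach discussed in this thread (union of all inputs, then group by field 0). It uses plain Java collections to model the logical result only; it is not Flink code and says nothing about how Flink partitions or ships the data. The class `UnionThenGroup`, the `Triple` stand-in for Tuple3, and the method `unionAndGroup` are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class UnionThenGroup {

    /** Hypothetical stand-in for a Tuple3; f0 plays the role of field 0. */
    static final class Triple {
        final String f0;
        final String f1;
        final int f2;

        Triple(String f0, String f1, int f2) {
            this.f0 = f0;
            this.f1 = f1;
            this.f2 = f2;
        }

        @Override
        public String toString() {
            return "(" + f0 + "," + f1 + "," + f2 + ")";
        }
    }

    /**
     * Concatenates all inputs (the "union") and groups the combined
     * records by their first field. Duplicates are kept, as in a bag union.
     */
    @SafeVarargs
    static Map<String, List<Triple>> unionAndGroup(List<Triple>... datasets) {
        return Stream.of(datasets)
                .flatMap(List::stream)                       // union: one stream over every input
                .collect(Collectors.groupingBy(
                        t -> t.f0,                           // group key: field 0
                        TreeMap::new,                        // sorted keys, for stable printing
                        Collectors.toList()));
    }

    public static void main(String[] args) {
        List<Triple> a = Arrays.asList(new Triple("k1", "a", 1), new Triple("k2", "a", 2));
        List<Triple> b = Arrays.asList(new Triple("k1", "b", 3));
        List<Triple> c = Arrays.asList(new Triple("k1", "c", 4), new Triple("k3", "c", 5));

        // Every record from every input ends up in exactly one group.
        unionAndGroup(a, b, c).forEach((key, group) ->
                System.out.println(key + " -> " + group));
    }
}
```

In Flink's DataSet API the corresponding pipeline would be along the lines of `a.union(b).union(c).groupBy(0)` followed by a group reduce. The sketch also shows why no pairwise matching is needed for this use case: each record is touched once and routed by its key, whereas a chain of joins has to match records between inputs at every step.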
