flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Union of multiple datasets vs Join
Date Tue, 17 Mar 2015 16:58:10 GMT
Hi Fabian,
I tried the strategy you suggested with Flink 0.8.1, but it seems that
the union of the datasets cannot be built programmatically: the union
operator appears to name the generated dataset after the calling
function, so only the first dataset is read.
My code looks like this:

private static DataSet<Tuple6<...>> getSourceDs(ExecutionEnvironment env,
        final String outputGraph, List<String> tableNames) {
    DataSet<Tuple6<...>> ret = null;
    for (String tableName : tableNames) {
        DataSet<Tuple6<...>> sourceDs = env.createInput(
                new MyTableInputFormat(tableName))
                ....

        if (ret == null)
            ret = sourceDs;
        else
            ret.union(sourceDs);
    }
    return ret;
}

Is this a bug, or am I doing something wrong?
Thanks in advance,
Flavio
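
One possible explanation, offered as a sketch rather than a confirmed diagnosis: in the Flink DataSet API, `union` returns a new DataSet and leaves the receiver unchanged, so a bare `ret.union(sourceDs);` discards the combined result and `ret` keeps pointing at the first source. The plain-Java example below reproduces that pattern without Flink. `Bag`, `unionDiscarded`, `unionAssigned`, and `sampleSources` are hypothetical stand-ins invented for illustration, not Flink classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UnionLoopSketch {

    /** Hypothetical stand-in for an immutable dataset handle; NOT a Flink class. */
    static final class Bag<T> {
        private final List<T> items;

        Bag(List<T> items) {
            this.items = Collections.unmodifiableList(new ArrayList<T>(items));
        }

        /** Returns a NEW bag holding both inputs; the receiver is left unchanged. */
        Bag<T> union(Bag<T> other) {
            List<T> merged = new ArrayList<T>(items);
            merged.addAll(other.items);
            return new Bag<T>(merged);
        }

        int size() {
            return items.size();
        }
    }

    /** Mirrors the loop in the message: the value returned by union() is dropped. */
    static <T> Bag<T> unionDiscarded(List<Bag<T>> sources) {
        Bag<T> ret = null;
        for (Bag<T> source : sources) {
            if (ret == null) {
                ret = source;
            } else {
                ret.union(source); // result ignored, ret still refers to the first source
            }
        }
        return ret;
    }

    /** Same loop, but the bag returned by union() is assigned back to ret. */
    static <T> Bag<T> unionAssigned(List<Bag<T>> sources) {
        Bag<T> ret = null;
        for (Bag<T> source : sources) {
            ret = (ret == null) ? source : ret.union(source);
        }
        return ret;
    }

    /** Three made-up sources with 2, 1, and 3 records. */
    static List<Bag<String>> sampleSources() {
        List<Bag<String>> sources = new ArrayList<Bag<String>>();
        sources.add(new Bag<String>(Arrays.asList("a1", "a2")));
        sources.add(new Bag<String>(Arrays.asList("b1")));
        sources.add(new Bag<String>(Arrays.asList("c1", "c2", "c3")));
        return sources;
    }

    public static void main(String[] args) {
        System.out.println(unionDiscarded(sampleSources()).size()); // prints 2
        System.out.println(unionAssigned(sampleSources()).size());  // prints 6
    }
}
```

If the same reasoning applies to the loop in the message, the change would be `ret = ret.union(sourceDs);` in the `else` branch.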

On Mon, Dec 22, 2014 at 2:42 PM, <fhueske@gmail.com> wrote:

> Union just combines data from multiple sources into a single dataset.
> That’s it. No memory, no disk involved.
>
> In your case you have
>
> input1.union(input2).groupBy(1).reduce(…)
>
> This will translate into:
>
> input1 -> repartition ->
>                           read-both-inputs -> sort -> reduce
> input2 -> repartition ->
>
> So in your case, not even additional network transfer is involved, because
> both data sets would need to be partitioned for the reduce anyway.
>
> Note that union in Flink has SQL union-all semantics, i.e., there is
> no removal of duplicates.
>
> Cheers, Fabian
>
> *From:* Flavio Pompermaier <pompermaier@okkam.it>
> *Sent:* Monday, 22. December, 2014 14:32
> *To:* user@flink.incubator.apache.org
>
> OK, thanks Fabian. I'd just like to know the internals of the union of
> multiple datasets (partitioning, distribution among servers, memory/disk,
> etc.). Do you have any reference for this?
>
> Thanks in advance,
> Flavio
>
> On Mon, Dec 22, 2014 at 12:46 PM, Fabian Hueske <fhueske@apache.org>
> wrote:
>
>> Follow the first approach.
>> Joins are expensive, union comes for free.
>>
>> Best, Fabian
>>
>> 2014-12-22 11:47 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>
>>> Hi guys,
>>>
>>> In my use case I have multiple Datasets with the same structure (e.g.
>>> Tuple3) and I want to produce an output Dataset containing all Tuple3
>>> grouped by the first field (0).
>>> I can obtain the same results by performing a union of all datasets and
>>> then a group by (the simplest implementation), or by joining all of them
>>> pairwise, e.g. (((A->B)->C)->D)..., or there may be another solution I
>>> don't know of. When should I use the first or the second approach? Could
>>> you help me figure out the internals of the two approaches? I am always
>>> wary of using multiple joins when I don't know their exact size.
>>>
>>> Best,
>>> Flavio
>>>
>>
>>
>
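
The union-then-group plan discussed in the quoted thread can be illustrated outside Flink. This is a minimal plain-Java sketch, assuming records are (key, value) pairs and the reduce is a sum; `unionAll`, `groupAndSum`, `input1`, and `input2` are hypothetical helpers, not Flink operators. It shows the union-all semantics mentioned above: a record present in both inputs is kept twice and contributes twice to the grouped result.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UnionAllGroupSketch {

    /** Concatenates the inputs as-is; duplicates are kept (union-all semantics). */
    static List<Map.Entry<String, Integer>> unionAll(
            List<Map.Entry<String, Integer>> left,
            List<Map.Entry<String, Integer>> right) {
        List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>(left);
        out.addAll(right);
        return out;
    }

    /** Groups records by key and reduces each group by summing its values. */
    static Map<String, Integer> groupAndSum(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> sums = new TreeMap<String, Integer>(); // sorted keys, stable output
        for (Map.Entry<String, Integer> record : records) {
            sums.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return sums;
    }

    /** Made-up first input. */
    static List<Map.Entry<String, Integer>> input1() {
        List<Map.Entry<String, Integer>> in = new ArrayList<Map.Entry<String, Integer>>();
        in.add(new SimpleEntry<String, Integer>("x", 1));
        in.add(new SimpleEntry<String, Integer>("y", 2));
        return in;
    }

    /** Made-up second input; ("x", 1) also appears in input1. */
    static List<Map.Entry<String, Integer>> input2() {
        List<Map.Entry<String, Integer>> in = new ArrayList<Map.Entry<String, Integer>>();
        in.add(new SimpleEntry<String, Integer>("x", 1));
        in.add(new SimpleEntry<String, Integer>("z", 5));
        return in;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> all = unionAll(input1(), input2());
        System.out.println(all.size());       // prints 4: the duplicate ("x", 1) is kept
        System.out.println(groupAndSum(all)); // prints {x=2, y=2, z=5}
    }
}
```

If duplicates should not be counted twice, they would have to be removed explicitly before grouping.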
