flink-user mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: Union/append performance question
Date Mon, 07 Sep 2015 14:58:37 GMT
Hi Flavio,

your example does not contain a union.

Union itself basically comes for free. However, if you have a lot of small
DataSets that you want to union, the plan can become very complex and might
cause overhead due to scheduling many small tasks. For example, it is
usually better to have one data source and input format that reads multiple
small files than to add one data source for each tiny file and apply
union to all of them to get all the data.
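
To make the plan-size argument concrete, here is a minimal toy model in plain
Java. It is not Flink code: Node, source, union, onePerFile and singleSource
are hypothetical names made up for this sketch, and the file names are
placeholders. It only counts how many source nodes each approach puts into the
plan. In Flink itself, the single-source variant would typically be one
file-based source pointed at the directory holding the small files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UnionPlanSketch {

    /** A node in a toy dataflow plan (hypothetical, not a Flink class). */
    static final class Node {
        final String name;
        final List<Node> inputs;

        Node(String name, List<Node> inputs) {
            this.name = name;
            this.inputs = inputs;
        }

        /** Counts the source nodes (nodes without inputs) below this node. */
        int countSources() {
            if (inputs.isEmpty()) {
                return 1;
            }
            int n = 0;
            for (Node in : inputs) {
                n += in.countSources();
            }
            return n;
        }
    }

    static Node source(String name) {
        return new Node(name, Collections.<Node>emptyList());
    }

    static Node union(Node a, Node b) {
        return new Node("union", Arrays.asList(a, b));
    }

    /** One source per file, chained together with union. Expects a non-empty list. */
    static Node onePerFile(List<String> files) {
        Node plan = null;
        for (String f : files) {
            Node s = source(f);
            plan = (plan == null) ? s : union(plan, s);
        }
        return plan;
    }

    /** A single source that is handed the whole list of files. */
    static Node singleSource(List<String> files) {
        return source("files[" + files.size() + "]");
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            files.add("part-" + i); // placeholder file names
        }
        System.out.println("one source per file: " + onePerFile(files).countSources());
        System.out.println("single source:       " + singleSource(files).countSources());
    }
}
```

With 100 placeholder files, the first variant has 100 source nodes to schedule
and the second has one; the data read is the same in both cases.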

TL;DR: if your iteration count is only 3, as your example suggests, you
should be fine. If it exceeds, say, 32, it might be worth rethinking your
program.
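
Stephan's point in the quoted message below (operators are lazy, and unioning
again after an execution re-runs the whole history) can be illustrated with
another toy model in plain Java. Again, this is not Flink code: Lazy, source,
collect and sourceReads are hypothetical names for this sketch, and collect()
merely stands in for executing the plan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class LazyUnionSketch {

    /** Counts how often a toy source is actually read. */
    static int sourceReads = 0;

    /** A toy lazy data set: it holds a recipe and computes nothing until collect(). */
    static final class Lazy {
        private final Supplier<List<Integer>> recipe;

        Lazy(Supplier<List<Integer>> recipe) {
            this.recipe = recipe;
        }

        /** Only builds a bigger recipe; no data is touched here. */
        Lazy union(Lazy other) {
            return new Lazy(() -> {
                List<Integer> out = new ArrayList<>(recipe.get());
                out.addAll(other.recipe.get());
                return out;
            });
        }

        /** Stand-in for executing the plan: runs the whole recipe from the sources. */
        List<Integer> collect() {
            return recipe.get();
        }
    }

    static Lazy source(int value) {
        return new Lazy(() -> {
            sourceReads++; // each evaluation re-reads the source
            List<Integer> data = new ArrayList<>();
            data.add(value);
            return data;
        });
    }

    public static void main(String[] args) {
        Lazy acc = source(0);
        for (int i = 1; i <= 3; i++) {
            acc = acc.union(source(i)); // extends the plan only
        }
        System.out.println("reads before execution: " + sourceReads);       // 0
        acc.collect();
        System.out.println("reads after first execution: " + sourceReads);  // 4
        acc = acc.union(source(4));
        acc.collect(); // sources 0..3 are read again, plus source 4
        System.out.println("reads after second execution: " + sourceReads); // 9
    }
}
```

In this model the second execution re-reads everything because no
intermediate result is kept, which is the situation described for incremental
unions without persisted intermediate data.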

Cheers, Fabian



2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:

> Hi Stephan,
> thanks for the answer. Unfortunately I didn't understand whether there's an
> alternative to union right now.
> My process is basically like this:
>
> DataSet x = ...
> while (loopCnt < 3) {
>    x = x.join(y).where(0).equalTo(0).with(...);
>    accumulated = x.filter(t -> t.f1 == 0);
>    x = x.filter(t -> t.f1 != 0);
>    loopCnt++;
> }
>
> Best,
> Flavio
>
>
> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <sewen@apache.org> wrote:
>
>> Union, like all operators, is lazy. When you call union, it only builds a
>> "union stream" that performs the union when you execute the task. So nothing
>> is added before you call "env.execute()".
>>
>> After you call "env.execute()" and then union again, you will re-execute
>> the entire history of computation to compute the data set that you union
>> with. Hence, for incremental computations, union() is probably not a good
>> choice, unless you persist intermediate data (seamless support for that is
>> WIP).
>>
>> Stephan
>>
>>
>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <pompermaier@okkam.it>
>> wrote:
>>
>>> Hi to all,
>>> I have a job where I have to incrementally add Tuples to a dataset (in a
>>> while loop).
>>> Is union() the best operator for this task or is there a more performant
>>> operator for this task?
>>> Does union affect the read of already existing elements or does it just
>>> append the new ones somewhere?
>>>
>>> Best,
>>> Flavio
>>>
>>>
>>>
>>
>
