flink-user mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: Union/append performance question
Date Mon, 07 Sep 2015 18:00:10 GMT
In that case you should go with union.

2015-09-07 19:06 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:

> 3 or 4 usually..
> On 7 Sep 2015 18:39, "Fabian Hueske" <fhueske@gmail.com> wrote:
>
>> And how many unions would your program use if you followed the
>> union-in-loop approach?
>>
>> 2015-09-07 18:31 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>
>>> In the order of 10 GB..
>>>
>>> On Mon, Sep 7, 2015 at 6:14 PM, Fabian Hueske <fhueske@gmail.com> wrote:
>>>
>>>> Accumulators can be used to collect records, but they are not designed
>>>> to hold large amounts of data.
>>>> It might work up to a certain point (~10MB) and fail beyond that.
>>>>
>>>> How many unions do you plan to use in your program?
>>>>
>>>>
>>>>
>>>> 2015-09-07 17:58 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>>>
>>>>> OK, thanks. Are there any alternatives to that? May I use accumulators
>>>>> for that?
>>>>> On 7 Sep 2015 17:47, "Fabian Hueske" <fhueske@gmail.com> wrote:
>>>>>
>>>>>> If the loop count of 3 is fixed (or not significantly larger), union
>>>>>> should be fine.
>>>>>>
>>>>>> 2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>>>>>
>>>>>>> Sorry, the program has a union at
>>>>>>> accumulated = accumulated.union(x.filter(t.f1 == 0))
>>>>>>>
>>>>>>> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhueske@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Flavio,
>>>>>>>>
>>>>>>>> your example does not contain a union.
>>>>>>>>
>>>>>>>> Union itself basically comes for free. However, if you have a lot
>>>>>>>> of small DataSets that you want to union, the plan can become very
>>>>>>>> complex and might cause overhead due to scheduling many small tasks.
>>>>>>>> For example, it is usually better to have one data source and input
>>>>>>>> format that reads multiple small files instead of adding one data
>>>>>>>> source for each tiny file and applying union to all data sources to
>>>>>>>> get all the data.
>>>>>>>>
>>>>>>>> TL;DR: if your iteration count is only 3, as your example suggests,
>>>>>>>> you should be fine. If it exceeds, say, 32, it might be worth
>>>>>>>> thinking about your program.
>>>>>>>>
>>>>>>>> Cheers, Fabian
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>>>>>>>
>>>>>>>>> Hi Stephan,
>>>>>>>>> thanks for the answer. Unfortunately I didn't understand whether
>>>>>>>>> there's an alternative to union right now.
>>>>>>>>> My process is basically like this:
>>>>>>>>>
>>>>>>>>> DataSet x = ...
>>>>>>>>> while (loopCnt < 3) {
>>>>>>>>>    x = x.join(y).where(0).equalTo(0).with(...);
>>>>>>>>>    accumulated = x.filter(t -> t.f1 == 0);
>>>>>>>>>    x = x.filter(t -> t.f1 != 0);
>>>>>>>>>    loopCnt++;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Flavio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <sewen@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Union, like all operators, is lazy. When you call union, it only
>>>>>>>>>> builds a "union stream" that unions when you execute the task. So
>>>>>>>>>> nothing is added before you call "env.execute()".
>>>>>>>>>>
>>>>>>>>>> After you call "env.execute()" and then union again, you will
>>>>>>>>>> re-execute the entire history of computation to compute the data set
>>>>>>>>>> that you union with. Hence, for incremental computations, union() is
>>>>>>>>>> probably not a good choice, unless you persist intermediate data
>>>>>>>>>> (seamless support for that is WIP).
>>>>>>>>>>
>>>>>>>>>> Stephan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi to all,
>>>>>>>>>>> I have a job where I have to incrementally add Tuples to a
>>>>>>>>>>> dataset (in a while loop).
>>>>>>>>>>> Is union() the best operator for this task, or is there a more
>>>>>>>>>>> performant one?
>>>>>>>>>>> Does union affect the read of already existing elements, or does it
>>>>>>>>>>> just append the new ones somewhere?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Flavio
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
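
The laziness Stephan describes, and the union-in-loop pattern from Flavio's
example, can be modelled in plain Java. This is only an analogy built on
java.util.stream and does not use the Flink API: the class LazyUnionSketch,
the step function (a stand-in for the join), and the int[] {key, f1} record
layout are all assumptions made for illustration. Unlike a Flink job plan,
this model also recomputes shared intermediate results within a single run,
so the call counter shows only when work happens, not how much Flink would do.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyUnionSketch {
    // Counts calls to step() so we can observe when work is actually done.
    static final AtomicInteger stepCalls = new AtomicInteger();

    // Hypothetical stand-in for the join in the thread: lowers field f1 by one.
    static int[] step(int[] t) {
        stepCalls.incrementAndGet();
        return new int[] {t[0], t[1] - 1};
    }

    // Builds a lazy "plan" for the union-in-loop pattern without running it.
    static Supplier<Stream<int[]>> buildPlan(List<int[]> input, int loops) {
        Supplier<Stream<int[]>> x = input::stream;
        Supplier<Stream<int[]>> accumulated = Stream::empty;
        for (int loopCnt = 0; loopCnt < loops; loopCnt++) {
            final Supplier<Stream<int[]>> prev = x;
            final Supplier<Stream<int[]>> acc = accumulated;
            final Supplier<Stream<int[]>> joined =
                    () -> prev.get().map(LazyUnionSketch::step);
            // "Union": append the finished records (f1 == 0) to the accumulator.
            accumulated = () -> Stream.concat(acc.get(), joined.get().filter(t -> t[1] == 0));
            // Carry the unfinished records into the next iteration.
            x = () -> joined.get().filter(t -> t[1] != 0);
        }
        return accumulated;
    }

    public static void main(String[] args) {
        List<int[]> input = List.of(
                new int[] {1, 1}, new int[] {2, 2}, new int[] {3, 3}, new int[] {4, 9});
        Supplier<Stream<int[]>> plan = buildPlan(input, 3);
        System.out.println("step calls after building the plan: " + stepCalls.get());
        List<int[]> result = plan.get().collect(Collectors.toList());
        System.out.println("records accumulated: " + result.size());
        System.out.println("step calls after executing the plan: " + stepCalls.get());
    }
}
```

Building the plan performs no work; only the terminal collect does, and every
further plan.get() starts again from the input. That mirrors the point about
re-executing the whole history after another env.execute().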
