flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Union/append performance question
Date Mon, 07 Sep 2015 16:31:02 GMT
On the order of 10 GB.

On Mon, Sep 7, 2015 at 6:14 PM, Fabian Hueske <fhueske@gmail.com> wrote:

> Accumulators can be used to collect records, but they are not designed to
> hold large amounts of data.
> It might work up to a certain point (~10MB) and fail beyond that.
>
> How many unions do you plan to use in your program?
>
>
>
> 2015-09-07 17:58 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>
>> OK, thanks. Are there any alternatives to that? May I use accumulators for
>> that?
>> On 7 Sep 2015 17:47, "Fabian Hueske" <fhueske@gmail.com> wrote:
>>
>>> If the loop count of 3 is fixed (or not significantly larger), union
>>> should be fine.
>>>
>>> 2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>>
>>>> Sorry, the program has a union at:
>>>>    accumulated = accumulated.union(x.filter(t.f1 == 0))
>>>>
>>>> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhueske@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Flavio,
>>>>>
>>>>> your example does not contain a union.
>>>>>
>>>>> Union itself basically comes for free. However, if you have a lot of
>>>>> small DataSets that you want to union, the plan can become very complex
>>>>> and might cause overhead due to scheduling many small tasks. For example,
>>>>> it is usually better to have one data source and input format that reads
>>>>> multiple small files instead of adding one data source for each tiny file
>>>>> and applying union to all data sources to get all data.
>>>>>
>>>>> TL;DR: if your iteration count is only 3, as your example suggests, you
>>>>> should be fine. If it exceeds, say, 32, it might be worth thinking about
>>>>> your program.
>>>>>
>>>>> Cheers, Fabian
>>>>>
>>>>>
>>>>>
>>>>> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>>>>
>>>>>> Hi Stephan,
>>>>>> thanks for the answer. Unfortunately I didn't understand whether
>>>>>> there's an alternative to union right now.
>>>>>> My process is basically like this:
>>>>>>
>>>>>> DataSet x = ...
>>>>>> while (loopCnt < 3) {
>>>>>>    x = x.join(y).where(0).equalTo(0).with(...);
>>>>>>    accumulated = x.filter(t.f1 == 0);
>>>>>>    x = x.filter(t.f1 != 0);
>>>>>>    loopCnt++;
>>>>>> }
>>>>>>
>>>>>> Best,
>>>>>> Flavio
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <sewen@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Union, like all operators, is lazy. When you call union, it only
>>>>>>> builds a "union stream", that unions when you execute the task. So
>>>>>>> nothing is added before you call "env.execute()".
>>>>>>>
>>>>>>> After you call "env.execute()" and then union again, you will
>>>>>>> re-execute the entire history of computation to compute the data set
>>>>>>> that you union with. Hence, for incremental computations, union() is
>>>>>>> probably not a good choice, unless you persist intermediate data
>>>>>>> (seamless support for that is WIP).
>>>>>>>
>>>>>>> Stephan
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <
>>>>>>> pompermaier@okkam.it> wrote:
>>>>>>>
>>>>>>>> Hi to all,
>>>>>>>> I have a job where I have to incrementally add Tuples to a dataset
>>>>>>>> (in a while loop).
>>>>>>>> Is union() the best operator for this task, or is there a more
>>>>>>>> performant operator?
>>>>>>>> Does union affect the read of already existing elements, or does it
>>>>>>>> just append the new ones somewhere?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Flavio
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>
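
For readers of the archive, the laziness Stephan describes in the quoted
thread can be illustrated with a toy model in plain Java. This is only a
sketch under loud assumptions: LazySet, source(), buildPlan() and the
sourceReads counter are hypothetical names invented for this illustration,
they are not Flink's DataSet API, and the integer records merely stand in for
the tuples in the loop quoted above. The model also recomputes every branch
independently, which a real optimizer need not do within one job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Toy model only: LazySet is a hypothetical stand-in, NOT Flink's DataSet.
class LazyUnionSketch {

    static final class LazySet<T> {
        // Deferred computation; it runs only when execute() is called.
        private final Supplier<List<T>> plan;

        LazySet(Supplier<List<T>> plan) { this.plan = plan; }

        // Records a filter step in the plan; nothing is evaluated here.
        LazySet<T> filter(Predicate<T> keep) {
            return new LazySet<T>(() -> {
                List<T> out = new ArrayList<T>();
                for (T t : plan.get()) {
                    if (keep.test(t)) out.add(t);
                }
                return out;
            });
        }

        // Records a union step; both inputs are evaluated only on execute().
        LazySet<T> union(LazySet<T> other) {
            return new LazySet<T>(() -> {
                List<T> out = new ArrayList<T>(plan.get());
                out.addAll(other.plan.get());
                return out;
            });
        }

        // Runs the whole recorded plan from scratch; no result is cached.
        List<T> execute() { return plan.get(); }
    }

    // Counts how often the (hypothetical) source is actually read.
    static final AtomicInteger sourceReads = new AtomicInteger();

    static LazySet<Integer> source() {
        return new LazySet<Integer>(() -> {
            sourceReads.incrementAndGet();
            return Arrays.asList(0, 1, 2, 3, 4, 5);
        });
    }

    // Mirrors the shape of the loop in the thread: each round unions the
    // matching records into `accumulated` and keeps the rest in `x`.
    static LazySet<Integer> buildPlan(int rounds) {
        LazySet<Integer> x = source();
        LazySet<Integer> accumulated =
                new LazySet<Integer>(() -> new ArrayList<Integer>());
        for (int i = 0; i < rounds; i++) {
            final int round = i;
            accumulated = accumulated.union(x.filter(v -> v == round));
            x = x.filter(v -> v != round);
        }
        return accumulated;
    }

    public static void main(String[] args) {
        LazySet<Integer> accumulated = buildPlan(3);
        System.out.println("reads after building the plan: " + sourceReads.get());
        System.out.println("first execute:  " + accumulated.execute());
        int afterFirst = sourceReads.get();
        System.out.println("second execute: " + accumulated.execute());
        System.out.println("reads: " + afterFirst + " then " + sourceReads.get());
    }
}
```

In this model the read counter stays at zero while the loop only builds the
plan, and a second execute() reads the source again from the start, so the
counter doubles. That is the re-execution cost mentioned in the thread for
programs that execute, union, and execute again without persisting
intermediate data.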
