flink-user mailing list archives

From Pieter Hameete <phame...@gmail.com>
Subject Re: Imbalanced workload between workers
Date Thu, 28 Jan 2016 09:17:33 GMT
Hi Stephan, Till,

I've watched the job again; please see the log of the CoGroup operator:

[image: Inline image 1]

All workers process a fairly even share of the bytes and records, BUT
hadoop-w-0, hadoop-w-2 and hadoop-w-3 don't start working until
hadoop-w-1 is finished. Is this behavior to be expected with a CoGroup,
or could there still be something wrong in the distribution of the
data?
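
To rule out skew on the CoGroup key I can also count the records per key
on each input. A minimal sketch of the check I have in mind (the path and
tuple types below are placeholders, not my actual job):

    import org.apache.flink.api.scala._

    object KeyHistogram {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        // Placeholder input: (key, payload) pairs; the real job reads
        // a different format, this only illustrates the check.
        val input = env.readCsvFile[(String, String)]("hdfs:///input/left")
        // Count records per CoGroup key to spot heavy keys.
        val counts = input.map(t => (t._1, 1L)).groupBy(0).sum(1)
        // Print the first 20 counts; enough for a quick spot check.
        counts.first(20).print()
      }
    }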

Kind regards,

Pieter

2016-01-27 21:48 GMT+01:00 Stephan Ewen <sewen@apache.org>:

> Hi Pieter!
>
> Interesting, but good :-)
>
> I don't think we did much on the hash functions since 0.9.1. I am a bit
> surprised that it made such a difference. Well, as long as it improves with
> the newer version :-)
>
> Greetings,
> Stephan
>
>
> On Wed, Jan 27, 2016 at 9:42 PM, Pieter Hameete <phameete@gmail.com>
> wrote:
>
>> Hi Till,
>>
>> I've upgraded to Flink 0.10.1 and ran the job again, without any changes
>> to the code, to see the bytes input and output of the operators for the
>> different workers. To my surprise it is very well balanced between all
>> workers, and because of this the job completed much faster.
>>
>> Are there any changes/fixes between Flink 0.9.1 and 0.10.1 that could
>> cause this to be better for me now?
>>
>> Thanks,
>>
>> Pieter
>>
>> 2016-01-27 14:10 GMT+01:00 Pieter Hameete <phameete@gmail.com>:
>>
>>>
>>> Cheers for the quick reply Till.
>>>
>>> That would be very useful information to have! I'll upgrade my project
>>> to Flink 0.10.1 tonight and let you know if I can find out whether
>>> there's a skew in the data :-)
>>>
>>> - Pieter
>>>
>>>
>>> 2016-01-27 13:49 GMT+01:00 Till Rohrmann <trohrmann@apache.org>:
>>>
>>>> Could it be that your data is skewed? This could lead to different
>>>> loads on different task managers.
>>>>
>>>> With the latest Flink version, the web interface should show you how
>>>> many bytes each operator has written and received. There you could see if
>>>> one operator receives more elements than the others.
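>>>>
>>>> If the web interface is not an option (e.g. on 0.9.1), you could also
>>>> count records per parallel instance with accumulators. A rough sketch,
>>>> untested and with made-up names:
>>>>
>>>>     import org.apache.flink.api.common.accumulators.LongCounter
>>>>     import org.apache.flink.api.common.functions.RichMapFunction
>>>>     import org.apache.flink.configuration.Configuration
>>>>
>>>>     // Identity map that counts how many records each subtask sees.
>>>>     class CountingMap[T] extends RichMapFunction[T, T] {
>>>>       private val counter = new LongCounter()
>>>>       override def open(parameters: Configuration): Unit = {
>>>>         // Register a distinct accumulator name per subtask so the
>>>>         // counts are not merged and the per-instance split is visible.
>>>>         getRuntimeContext.addAccumulator(
>>>>           "records-subtask-" + getRuntimeContext.getIndexOfThisSubtask,
>>>>           counter)
>>>>       }
>>>>       override def map(value: T): T = { counter.add(1L); value }
>>>>     }
>>>>
>>>> Chain it in front of the operator you want to inspect, e.g.
>>>> input.map(new CountingMap[MyType]); the accumulator values show up in
>>>> the JobExecutionResult returned by env.execute().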
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Wed, Jan 27, 2016 at 1:35 PM, Pieter Hameete <phameete@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> Currently I am running a job in the GCloud in a configuration with 4
>>>>> task managers that each have 4 CPUs (for a total parallelism of 16).
>>>>>
>>>>> However, I noticed my job is running much slower than expected, and
>>>>> after some more investigation I found that one of the workers is doing
>>>>> a majority of the work (its CPU load was at 100% while the others were
>>>>> almost idle).
>>>>>
>>>>> My job execution plan can be found here:
>>>>> http://i.imgur.com/fHKhVFf.png
>>>>>
>>>>> The input is split into multiple files so loading the data is properly
>>>>> distributed over the workers.
>>>>>
>>>>> I am wondering if you can provide me with some tips on how to figure
>>>>> out what is going wrong here:
>>>>>
>>>>>    - Could this imbalance in workload be the result of an imbalance
>>>>>      in the hash partitioning?
>>>>>    - Is there a convenient way to see how many elements each worker
>>>>>      gets to process? Would it work to write the output of the CoGroup
>>>>>      to disk, since each worker writes to its own output file, and then
>>>>>      investigate the differences? (See the sketch below this list.)
>>>>>    - Is there something strange about the execution plan that could
>>>>>      cause this?
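>>>>>
>>>>> For the second point, the sketch I had in mind: tag each CoGroup
>>>>> result with the subtask that produced it and count per subtask
>>>>> (coGrouped is a placeholder for my actual CoGroup output):
>>>>>
>>>>>     import org.apache.flink.api.common.functions.RichMapFunction
>>>>>     import org.apache.flink.api.scala._
>>>>>
>>>>>     // Pair every record with the index of the parallel subtask
>>>>>     // that produced it, then count records per subtask.
>>>>>     class TagSubtask[T] extends RichMapFunction[T, (Int, Long)] {
>>>>>       override def map(value: T): (Int, Long) =
>>>>>         (getRuntimeContext.getIndexOfThisSubtask, 1L)
>>>>>     }
>>>>>
>>>>>     val perSubtask = coGrouped
>>>>>       .map(new TagSubtask[String])
>>>>>       .groupBy(0)
>>>>>       .sum(1)
>>>>>     perSubtask.print() // one (subtaskIndex, count) row per slot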
>>>>>
>>>>> Thanks and kind regards,
>>>>>
>>>>> Pieter
>>>>>
>>>>
>>>>
>>>
>>
>
