flink-user mailing list archives

From Pieter Hameete <phame...@gmail.com>
Subject Re: Imbalanced workload between workers
Date Wed, 27 Jan 2016 20:42:56 GMT
Hi Till,

I've upgraded to Flink 0.10.1 and reran the job without any code changes
to see the bytes read and written by each operator on the different
workers. To my surprise, the load is now very well balanced across all
workers, and because of this the job completed much faster.

Are there any changes/fixes between Flink 0.9.1 and 0.10.1 that could cause
this to be better for me now?



2016-01-27 14:10 GMT+01:00 Pieter Hameete <phameete@gmail.com>:

> Cheers for the quick reply Till.
> That would be very useful information to have! I'll upgrade my project to
> Flink 0.10.1 tonight and let you know whether I can find a skew in
> the data :-)
> - Pieter
> 2016-01-27 13:49 GMT+01:00 Till Rohrmann <trohrmann@apache.org>:
>> Could it be that your data is skewed? This could lead to different loads
>> on different task managers.
>> With the latest Flink version, the web interface should show you how many
>> bytes each operator has written and received. There you could see if one
>> operator receives more elements than the others.
>> Cheers,
>> Till
>> On Wed, Jan 27, 2016 at 1:35 PM, Pieter Hameete <phameete@gmail.com>
>> wrote:
>>> Hi guys,
>>> Currently I am running a job in the GCloud in a configuration with 4
>>> task managers that each have 4 CPUs (for a total parallelism of 16).
>>> However, I noticed my job is running much slower than expected, and after
>>> some more investigation I found that one of the workers is doing the majority
>>> of the work (its CPU load was at 100% while the others were almost idle).
>>> My job execution plan can be found here: http://i.imgur.com/fHKhVFf.png
>>> The input is split into multiple files so loading the data is properly
>>> distributed over the workers.
>>> I am wondering if you can provide me with some tips on how to figure out
>>> what is going wrong here:
>>>    - Could this imbalance in workload be the result of an imbalance in
>>>    the hash partitioning?
>>>    - Is there a convenient way to see how many elements each worker
>>>    gets to process? Would it work to write the output of the CoGroup to
>>>    disk, since each worker writes to its own output file, and investigate
>>>    the differences?
>>>    - Is there something strange about the execution plan that could
>>>    cause this?
>>> Thanks and kind regards,
>>> Pieter
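
One rough way to check the hash-partitioning question raised in the thread, outside of Flink, is to take a sample of the join/CoGroup keys and count how many records a hash-modulo assignment would send to each of the 16 parallel subtasks. The sketch below is plain Java with no Flink dependency. The class name, the method name, and the sample keys are hypothetical, and `hashCode() mod parallelism` is only an approximation: Flink's own partitioner may hash keys differently, so the exact partition numbers will not match, but a single dominant key will still show up as one overloaded partition.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Flink: estimates per-partition record
// counts under a simple hash-modulo assignment of keys to partitions.
final class PartitionSkew {

    // Returns an array where counts[i] is the number of keys assigned to
    // partition i. floorMod keeps the index non-negative even when
    // hashCode() is negative.
    static int[] countPerPartition(List<String> keys, int parallelism) {
        int[] counts = new int[parallelism];
        for (String key : keys) {
            int partition = Math.floorMod(key.hashCode(), parallelism);
            counts[partition]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample: one "hot" key dominates the input.
        List<String> keys = Arrays.asList(
                "hot", "hot", "hot", "hot", "hot", "hot", "a", "b", "c", "d");

        // 4 task managers x 4 slots = parallelism 16, as in the thread.
        int[] counts = countPerPartition(keys, 16);
        System.out.println(Arrays.toString(counts));
    }
}
```

If one entry in the printed array is far larger than the rest, the keys themselves are skewed, and no hash function will spread that key over several workers, because all records with the same key must go to the same subtask.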
