hadoop-common-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: Slow final few reducers
Date Sat, 11 Dec 2010 12:26:02 GMT
On Sat, Dec 11, 2010 at 5:25 PM, Rob Stewart
<robstewart57@googlemail.com> wrote:
> Oh,
>
> I should add: of the Java processes running on the remaining nodes for
> the final wave of reducers, the one taking all the CPU is the "Child"
> process (not the TaskTracker). Logging into the master, the Java
> process taking all the CPU is "Child" there as well.
>
> Is this normal?

Yes, "Child" is the Task JVM.

>
> thanks,
> Rob
>
> On 11 December 2010 11:38, Rob Stewart <robstewart57@googlemail.com> wrote:
>> Hi, many thanks for your response.
>>
>> A few observations:
>> - I know for a fact that my key distribution is radically skewed
>> (some keys with *many* values, most keys with few).
>> - I had overlooked the fact that I need a custom partitioner. I
>> suspect that this will help dramatically.
>>
>> I realize that the number of partitions should equal the number of
>> reducers (e.g. 100).
>>
>> So if these are my <key>,<value> pairs (where the value is a count):
>> <the>,<500>
>> <a>,<1000>
>> <the cat>,<20>
>> <the cat sat on the mat>,<1>
>>
>> and I have 3 reducers, how do I make the following assignment (a
>> sketch follows the list):
>> Reducer-1: <the>
>> Reducer-2: <a>
>> Reducer-3: <the cat> & <the cat sat on the mat>
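
A minimal sketch of such a partitioner (the class name is made up, the
key/value types are assumed to be Text/IntWritable, and it uses the 0.20
org.apache.hadoop.mapreduce API):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative only: pin the two known-heavy keys to their own
    // reducers, hash every remaining key over the reducers left over.
    public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions < 3) {
          // Too few reducers to pin keys; fall back to plain hashing.
          return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
        String k = key.toString();
        if (k.equals("the")) return 0; // Reducer-1
        if (k.equals("a"))   return 1; // Reducer-2
        // Everything else shares the rest (just Reducer-3 when N = 3).
        return 2 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 2);
      }
    }

It would be registered with job.setPartitionerClass(SkewAwarePartitioner.class)
alongside job.setNumReduceTasks(3); in practice the heavy-key list would come
from configuration rather than being hard-coded.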
>>
>>
>> thanks,
>>
>> Rob
>>
>> On 11 December 2010 11:12, Harsh J <qwertymaniac@gmail.com> wrote:
>>> Hi,
>>>
>>> Certain reducers may receive a higher share of data than others
>>> (Depending on your data/key distribution, the partition function,
>>> etc.). Compare the longer reduce tasks' counters with the quicker
>>> ones.
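
One way to make that comparison (a sketch against the 0.20 mapred client
API; the job ID would be taken from the JobTracker UI and passed in as an
argument):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.TaskReport;

    // Dumps state, progress, and counters for every reduce task of a job,
    // so the slow tasks' input volumes can be compared with the fast ones.
    public class ReduceReportDump {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf(new Configuration()));
        for (TaskReport r : client.getReduceTaskReports(JobID.forName(args[0]))) {
          System.out.println(r.getTaskID() + " state=" + r.getState()
              + " progress=" + r.getProgress());
          System.out.println(r.getCounters());
        }
      }
    }

A reduce task whose REDUCE_INPUT_RECORDS counter is far larger than its
peers' is the usual sign of key skew.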
>>>
>>> Are you sure that the reducers that take long are definitely the last
>>> wave, as in with IDs of 180-200 (and not a random bunch of reduce
>>> tasks taking longer)?
>>>
>>> Also take a look at the logs, and the machines that run these
>>> particular reducers -- ensure nothing is wrong on them.
>>>
>>> There's nothing in Hadoop that makes the "last wave" of reduce tasks
>>> inherently slower. Each reducer writes to its own file and is
>>> completely independent.
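
That independence is visible on HDFS: with N reduce tasks the job's output
directory ends up holding N part files, one per reducer (part-00000 and up
with the old mapred API; part-r-00000 with the new mapreduce API). A small
sketch to list them, assuming the output path is passed as an argument:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Lists each per-reducer part file and its size; a file that is far
    // bigger than its siblings points at the reducer that got the skew.
    public class ListReduceOutputs {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
          System.out.println(stat.getPath().getName() + "\t" + stat.getLen());
        }
      }
    }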
>>>
>>> --
>>> Harsh J
>>> www.harshj.com
>>>
>>
>



-- 
Harsh J
www.harshj.com
