hadoop-user mailing list archives

From Abhay Ratnaparkhi <abhay.ratnapar...@gmail.com>
Subject Re: knowing the nodes on which reduce tasks will run
Date Mon, 03 Sep 2012 16:00:54 GMT
All of my map tasks are about to complete, and there is not much processing
to be done in the reducer.
The job has been running for a week, so I don't want it to fail. Any other
suggestions for tackling this are welcome.

~Abhay

On Mon, Sep 3, 2012 at 9:26 PM, Hemanth Yamijala
<yhemanth@thoughtworks.com> wrote:

> Hi,
>
> You are right that a change to mapred.tasktracker.reduce.tasks.maximum
> will require a restart of the tasktrackers. AFAIK, there is no way of
> modifying this property without restarting.
>
> On a different note, could you see if the amount of intermediate data can
> be reduced using a combiner, or some other form of local aggregation?
>
> Thanks
> hemanth
>
>
> On Mon, Sep 3, 2012 at 9:06 PM, Abhay Ratnaparkhi <
> abhay.ratnaparkhi@gmail.com> wrote:
>
>> How can I set 'mapred.tasktracker.reduce.tasks.maximum' to "0" in a
>> running tasktracker?
>> It seems that I need to restart the tasktracker, and in that case I'll lose
>> the map output held by that tasktracker.
>>
>> Can I change 'mapred.tasktracker.reduce.tasks.maximum' to "0" without
>> restarting the tasktracker?
>>
>> ~Abhay
>>
>>
>> On Mon, Sep 3, 2012 at 8:53 PM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:
>>
>>> Hi Abhay
>>>
>>> The TaskTrackers on which the reduce tasks are launched are chosen at
>>> random, based on reduce slot availability. So if you don't want the
>>> reduce tasks to be scheduled on some particular nodes, you need to set
>>> 'mapred.tasktracker.reduce.tasks.maximum' on those nodes to 0. The
>>> catch here is that this property is not a job-level one; you need to
>>> set it at the cluster level.
>>>
>>> A cleaner approach would be to configure each of your nodes with the
>>> right number of map and reduce slots based on the resources available on
>>> each machine.
>>>
>>>
>>> On Mon, Sep 3, 2012 at 7:49 PM, Abhay Ratnaparkhi <
>>> abhay.ratnaparkhi@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> How can one get to know the nodes on which reduce tasks will run?
>>>>
>>>> One of my jobs is running and is about to complete all of its map tasks.
>>>> My map tasks write lots of intermediate data. The intermediate
>>>> directory is getting full on all the nodes.
>>>> If a reduce task is assigned to any of these nodes, it will try to copy
>>>> the data to the same disk and will eventually fail with disk-space-related
>>>> exceptions.
>>>>
>>>> I have added a few more tasktracker nodes to the cluster and now want to
>>>> run the reducers on the new nodes only.
>>>> Is it possible to choose the node on which a reducer will run? What
>>>> algorithm does Hadoop use to pick a node to run a reducer?
>>>>
>>>> Thanks in advance.
>>>>
>>>> Bye
>>>> Abhay
>>>>
>>>
>>>
>>
>
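
Hemanth's suggestion above (a combiner, or some other form of local
aggregation) can be illustrated with a small sketch. This is plain Python,
not Hadoop code: the word-count data, the function names, and the two input
splits are all made up for illustration. It only shows why summing on the map
side shrinks the intermediate data without changing the final result, and it
assumes the reduce function is a sum (associative and commutative), which is
what makes it safe to reuse as a combiner.

```python
from collections import Counter


def map_records(lines):
    """Emit one (word, 1) pair per word, as a mapper without a combiner would."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Sum the values per key on the map side, before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_all(per_mapper_outputs):
    """Sum the values per key across the output of every mapper."""
    totals = Counter()
    for pairs in per_mapper_outputs:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)


# Two hypothetical input splits, one per map task.
splits = [["a b a", "b a"], ["c a", "c c"]]

raw = [map_records(split) for split in splits]   # no combiner
combined = [combine(pairs) for pairs in raw]     # with a combiner

raw_count = sum(len(pairs) for pairs in raw)
combined_count = sum(len(pairs) for pairs in combined)

print("intermediate records without combiner:", raw_count)
print("intermediate records with combiner:", combined_count)
print("same final result:", reduce_all(raw) == reduce_all(combined))
```

In a real job the combiner is registered when the job is configured (for
example with setCombinerClass on the job), so as far as I know this would
apply to a resubmitted job rather than to one that is already running.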
