hadoop-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Force task location on input split location?
Date Sun, 09 Dec 2012 01:49:01 GMT
Ok. Thanks for the clarification. This is for an HBase job, so in my
case locality will be restricted to one node.

JM

2012/12/8, Harsh J <harsh@cloudera.com>:
> In the case of HBase, locality is bound to be restricted to one node
> (the node hosting the requested region). Otherwise, the replication
> factor determines locality (N replicas give N candidate nodes).
>
> On Sat, Dec 8, 2012 at 11:27 PM, Jean-Marc Spaggiari
> <jean-marc@spaggiari.org> wrote:
>> Hi Harsh,
>>
>> Thanks for your help.
>>
>> mapred.fairscheduler.locality.delay seems to be working very well for
>> me. I have set it to 60s, and JobInProgress logged only "Choosing
>> data-local task"... It seems to do the job for my use case. And as you
>> say, if I lose a node while the job is running, the task
>> will still run on another node after 60 seconds.
>>
>> I have not yet looked at CapacityScheduler, but will most probably later.
>>
>> One last thing. I have a replication factor set to 3. Does it mean 3
>> TaskTrackers might be able to take any of the tasks and run them
>> locally? Or only 1?
>>
>> Thanks,
>>
>> JM
>>
>> 2012/12/8, Harsh J <harsh@cloudera.com>:
>>> Answer depends on a couple of features to be present in your version
>>> of Hadoop, and is inline.
>>>
>>> On Fri, Dec 7, 2012 at 11:38 PM, Jean-Marc Spaggiari
>>> <jean-marc@spaggiari.org> wrote:
>>>> Hi,
>>>>
>>>> Is there a way to force the tasks from a MR job to run ONLY on the
>>>> TaskTrackers where the input split location is?
>>>
>>> There is no strictly enforced way to do this, but you can tune the
>>> configuration to make conditions more favorable for data-local
>>> tasks.
>>>
>>>> I mean, on the taskdetails UI, I can see all my tasks (25), and some
>>>> of them have Machine == Input split Location. But some don't.
>>>
>>> It is sometimes normal to see non-data-local tasks among mostly
>>> data-local tasks in MR - this is due to availability of
>>> slots/resources during job scheduling.
>>>
>>>> So I'm wondering if there is a way to force hadoop to run those tasks
>>>> "locally" or else discard them and wait for the local server to be
>>>> able to run them?
>>>
>>> You need a good scheduler that can address your needs.
>>>
>>> For FairScheduler, in 1.x or so, you can utilize
>>> mapred.fairscheduler.locality.delay, set in milliseconds in your
>>> mapred-site.xml, to indicate the maximum period of wait for a task to
>>> get scheduled with demanded locality. Ideally you'd want to set this
>>> to a period slightly greater than the average time between two
>>> heartbeats from a single tasktracker to the jobtracker. The 2.x one
>>> seems to do this automatically.
>>>
>>> For CapacityScheduler, there isn't any form of delay factor in 1.x
>>> releases. In 2.x however, CapacityScheduler has the
>>> yarn.scheduler.capacity.node-locality-delay config property that can
>>> be set for a similar effect.
>>>
>>> Note that MR avoids absolutely strict scheduling for many reasons,
>>> one of them being to cope with failure or unavailability of the
>>> target node for a potentially unbounded period. Most users would not
>>> want their tasks to hang forever in such situations, and a few
>>> non-data-local tasks in the job don't hurt the overall execution
>>> time too much.
>>>
>>> --
>>> Harsh J
>>>
>
>
>
> --
> Harsh J
>
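
For reference, the Fair Scheduler setting discussed in this thread might
look like the fragment below in mapred-site.xml on a 1.x cluster. This is
only a sketch: the value 60000 is an assumption corresponding to the 60
seconds mentioned above, expressed in milliseconds, and it presumes the
Fair Scheduler is already configured as the task scheduler.

```xml
<!-- mapred-site.xml (Hadoop 1.x, Fair Scheduler assumed to be enabled) -->
<property>
  <name>mapred.fairscheduler.locality.delay</name>
  <!-- Maximum wait, in milliseconds, for a slot with the demanded
       locality. 60000 = 60 s; illustrative value, tune for your cluster. -->
  <value>60000</value>
</property>
```

As noted in the thread, a value slightly above the average interval
between two heartbeats from a single tasktracker is the suggested starting
point; a larger value trades scheduling latency for more data-local tasks.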
