hadoop-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Assigning reduce tasks to specific nodes
Date Sat, 08 Dec 2012 13:18:36 GMT
Hi Tsuyoshi,

Which version of Hadoop is that for? I think it's for 0.2x.x, right?
I'm not able to find this class in 1.0.x.

Thanks,

JM

2012/12/8, Tsuyoshi OZAWA <ozawa.tsuyoshi@gmail.com>:
> Hi Hiroyuki,
>
> I've recently been modifying the scheduler to improve Hadoop, so I may be
> able to help you.
>
> RMContainerAllocator#handleEvent decides which MapTasks are assigned to the
> allocated containers. You can implement a semi-strict (best-effort
> allocation) mode by hacking there. Note, however, that containers are
> allocated by the ResourceManager. The MRAppMaster cannot control where
> containers are allocated, only which MapTasks are assigned to them.
>
> If you have any questions, please ask me.
>
> Thanks,
> Tsuyoshi
>
>
> On Sat, Dec 8, 2012 at 4:51 AM, Jean-Marc Spaggiari
> <jean-marc@spaggiari.org> wrote:
>
>> Hi Hiroyuki,
>>
>> Have you made any progress on that?
>>
>> I'm also looking at a way to assign specific Map tasks to specific
>> nodes (I want the Map to run where the data is).
>>
>> JM
>>
>> 2012/12/1, Michael Segel <michael_segel@hotmail.com>:
>> > I haven't thought about reducers but in terms of mappers you need to
>> > override the data locality so that it thinks that the node where you
>> > want to send the data exists.
>> > Again, not really recommended since it will kill performance unless the
>> > compute time is at least an order of magnitude greater than the time it
>> > takes to transfer the data.
>> >
>> > Really, really don't recommend it....
>> >
>> > We did it as a hack, just to see if we could do it and get better
>> > overall performance for a specific job.
>> >
>> >
>> > On Dec 1, 2012, at 6:27 AM, Harsh J <harsh@cloudera.com> wrote:
>> >
>> >> Yes, scheduling is done on a Tasktracker heartbeat basis, so it is
>> >> certainly possible to do absolutely strict scheduling (although be
>> >> aware of the condition of failing/unavailable tasktrackers).
>> >>
>> >> Mohit's suggestion is somewhat like what you desire (delay scheduling
>> >> in fair scheduler config) - but setting it to very high values is bad
>> >> to do (for jobs that don't need this).
>> >>
>> >> On Sat, Dec 1, 2012 at 4:11 PM, Hiroyuki Yamada <mogwaing@gmail.com>
>> >> wrote:
>> >>> Thank you all for the comments.
>> >>>
>> >>>> you ought to make sure your scheduler also does non-strict
>> >>>> scheduling of data local tasks for jobs that don't require such
>> >>>> strictness
>> >>>
>> >>> I just want to make sure of one thing.
>> >>> If I write my own scheduler, is it possible to do "strict" scheduling?
>> >>>
>> >>> Thanks
>> >>>
>> >>> On Thu, Nov 29, 2012 at 1:56 PM, Mohit Anchlia
>> >>> <mohitanchlia@gmail.com> wrote:
>> >>>> Look at the locality delay parameter.
>> >>>>
>> >>>> Sent from my iPhone
>> >>>>
>> >>>> On Nov 28, 2012, at 8:44 PM, Harsh J <harsh@cloudera.com> wrote:
>> >>>>
>> >>>>> None of the current schedulers are "strict" in the sense of "do not
>> >>>>> schedule the task if such a tasktracker is not available". That has
>> >>>>> never been a requirement for Map/Reduce programs, nor should it be.
>> >>>>>
>> >>>>> I feel if you want some code to run individually on all nodes for
>> >>>>> whatever reason, you may as well ssh into each one and start it
>> >>>>> manually with appropriate host-based parameters, etc., and then
>> >>>>> aggregate their results.
>> >>>>>
>> >>>>> Note that even if you get down to writing a scheduler for this
>> >>>>> (which I don't think is a good idea, but anyway), you ought to make
>> >>>>> sure your scheduler also does non-strict scheduling of data local
>> >>>>> tasks for jobs that don't require such strictness, so that they
>> >>>>> complete quickly rather than wait around for scheduling in a fixed
>> >>>>> manner.
>> >>>>>
>> >>>>> On Thu, Nov 29, 2012 at 6:00 AM, Hiroyuki Yamada
>> >>>>> <mogwaing@gmail.com> wrote:
>> >>>>>> Thank you all for the comments and advice.
>> >>>>>>
>> >>>>>> I know it is not recommended to assign mapper locations myself.
>> >>>>>> But in some cases there needs to be one mapper running on each
>> >>>>>> node, so I need a strict way to do it.
>> >>>>>>
>> >>>>>> So, locations are taken care of by the JobTracker (scheduler), but
>> >>>>>> not strictly.
>> >>>>>> And the only way to do it strictly is to write my own scheduler,
>> >>>>>> right?
>> >>>>>>
>> >>>>>> I have checked the source and I am not sure where to modify it to
>> >>>>>> do this.
>> >>>>>> What I understand is that FairScheduler and the others are for
>> >>>>>> scheduling multiple jobs. Is this right?
>> >>>>>> What I want to do is schedule tasks within one job.
>> >>>>>> Can this be achieved with FairScheduler and the others?
>> >>>>>>
>> >>>>>> Regards,
>> >>>>>> Hiroyuki
>> >>>>>>
>> >>>>>> On Thu, Nov 29, 2012 at 12:46 AM, Michael Segel
>> >>>>>> <michael_segel@hotmail.com> wrote:
>> >>>>>>> Mappers? Uhm... yes you can do it.
>> >>>>>>> Yes it is non-trivial.
>> >>>>>>> Yes, it is not recommended.
>> >>>>>>>
>> >>>>>>> I think we talk a bit about this in an InfoQ article written by
>> >>>>>>> Boris Lublinsky.
>> >>>>>>>
>> >>>>>>> It's kind of wild when your entire cluster map goes red in
>> >>>>>>> Ganglia... :-)
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Nov 28, 2012, at 2:41 AM, Harsh J <harsh@cloudera.com> wrote:
>> >>>>>>>
>> >>>>>>> Hi,
>> >>>>>>>
>> >>>>>>> Mapper scheduling is indeed influenced by the results returned by
>> >>>>>>> the InputSplit's getLocations().
>> >>>>>>>
>> >>>>>>> The map task itself does not care about deserializing the location
>> >>>>>>> information, as it is of no use to it. The location information is
>> >>>>>>> vital to the scheduler (or in 0.20.2, the JobTracker), to which it
>> >>>>>>> is sent directly when a job is submitted. The locations are used
>> >>>>>>> pretty well here.
>> >>>>>>>
>> >>>>>>> You should be able to control (or rather, influence) mapper
>> >>>>>>> placement by working with the InputSplits, but not strictly so,
>> >>>>>>> because in the end it's up to your MR scheduler to do data local
>> >>>>>>> or non data local assignments.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Wed, Nov 28, 2012 at 11:39 AM, Hiroyuki Yamada
>> >>>>>>> <mogwaing@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>> Hi Harsh,
>> >>>>>>>>
>> >>>>>>>> Thank you for the information.
>> >>>>>>>> I understand the current circumstances.
>> >>>>>>>>
>> >>>>>>>> How about for mappers?
>> >>>>>>>> As far as I tested, location information in InputSplit is ignored
>> >>>>>>>> in 0.20.2, so there seems to be no easy way to assign mappers to
>> >>>>>>>> specific nodes.
>> >>>>>>>> (I checked the source earlier and noticed that location
>> >>>>>>>> information is not restored when deserializing the InputSplit
>> >>>>>>>> instance.)
>> >>>>>>>>
>> >>>>>>>> Thanks,
>> >>>>>>>> Hiroyuki
>> >>>>>>>>
>> >>>>>>>> On Wed, Nov 28, 2012 at 2:08 PM, Harsh J <harsh@cloudera.com>
>> >>>>>>>> wrote:
>> >>>>>>>>> This is not supported/available currently even in MR2, but take
>> >>>>>>>>> a look at https://issues.apache.org/jira/browse/MAPREDUCE-199.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Wed, Nov 28, 2012 at 9:34 AM, Hiroyuki Yamada
>> >>>>>>>>> <mogwaing@gmail.com>
>> >>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi,
>> >>>>>>>>>>
>> >>>>>>>>>> I am wondering how I can assign reduce tasks to specific nodes.
>> >>>>>>>>>> What I want to do is, for example, assign the reducer which
>> >>>>>>>>>> produces part-00000 to node xxx000, and part-00001 to node
>> >>>>>>>>>> xxx001, and so on.
>> >>>>>>>>>>
>> >>>>>>>>>> I think it's about task assignment scheduling, but I am not sure
>> >>>>>>>>>> where to customize to achieve this.
>> >>>>>>>>>> Is this done by writing some extensions, or is there an easier
>> >>>>>>>>>> way to do this?
>> >>>>>>>>>>
>> >>>>>>>>>> Regards,
>> >>>>>>>>>> Hiroyuki
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Harsh J
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Harsh J
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Harsh J
>> >>
>> >>
>> >>
>> >> --
>> >> Harsh J
>> >>
>> >
>> >
>>
>
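
For readers following the quoted thread, the trade-off Harsh describes (strict placement versus best-effort placement with a locality delay) can be sketched with a toy model. This is only an illustration in plain Python, not Hadoop code: the function `schedule`, its parameters, and the node and task names are all hypothetical, and the skip counter is a simplified stand-in for the fair scheduler's locality delay. Each heartbeat offers one free slot on a host; a task takes a slot on one of its preferred hosts right away, and accepts any other host only after it has been passed over more than `max_skips` times.

```python
def schedule(tasks, heartbeats, max_skips):
    """Toy heartbeat-driven task assignment with a locality delay.

    tasks:      dict of task id -> set of preferred hosts
                (playing the role of InputSplit locations)
    heartbeats: sequence of host names, one free slot per heartbeat
    max_skips:  non-local offers a task declines before it accepts any
                host; None means strict (wait for a preferred host forever)
    Returns a dict of task id -> host for the tasks that were placed.
    """
    pending = list(tasks)                  # keep submission order
    skips = {t: 0 for t in pending}        # non-local offers seen per task
    assigned = {}
    for host in heartbeats:
        # First choice: a pending task that prefers the heartbeating host.
        chosen = next((t for t in pending if host in tasks[t]), None)
        if chosen is None:
            # No local task: every pending task is passed over once more.
            for t in pending:
                skips[t] += 1
            if max_skips is not None:
                # Best effort: a task that has waited long enough gives up
                # on locality and takes this host.
                chosen = next((t for t in pending if skips[t] > max_skips),
                              None)
        if chosen is not None:
            pending.remove(chosen)
            assigned[chosen] = host
    return assigned


if __name__ == "__main__":
    tasks = {"m0": {"node0"}, "m1": {"node1"}}
    # node1 never heartbeats, e.g. its tasktracker is down.
    heartbeats = ["node2", "node0", "node2", "node2"]
    print(schedule(tasks, heartbeats, None))  # strict: m1 is never placed
    print(schedule(tasks, heartbeats, 1))     # m1 falls back to node2
    print(schedule(tasks, heartbeats, 0))     # no delay: locality is lost
```

In the strict run, the task that prefers the unavailable node is never scheduled, which is the failing-tasktracker caveat raised in the thread. A moderate delay keeps the local assignment where one is possible and still lets the job finish, while a delay of zero places tasks on whichever host heartbeats first.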
