hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From jin keyao <keyao...@gmail.com>
Subject Re: question about scheduler rack awareness
Date Tue, 19 Mar 2013 14:13:05 GMT
hi Jens,

thanks for your quick reply.

actually I am afraid I didn't describe my question very well.  Usually for
example, we make 1 Block as an input, such as we are benchmarking Terasort,
which is good for scheduling the map on the node contain the data.

But in a typical scenario I am working on, I want to make the sequential
blocks as an input, if I take the original example, what I want to do is to
group 100 blocks into 10 inputs, each input contain 10 sequential blocks to
be handled by map tasks

each 10 blocks may distribute on several nodes, based on the block
placement policy in Hadoop.  so in this case, will the map task be
scheduled on the node which have a nearest way to access those 10 blocks?

or will hadoop only make sure the first block it's going to access is a
local one?


2013/3/19 Jens Scheidtmann <jens.scheidtmann@gmail.com>

> Dear Jin,
>
> you wrote:
> > my question is : will the map task created on the node which access his
> 10 blocks most fastest ?
>
> hadoop tries hard to run the map tasks on the node, where the data is
> stored. "Hadoop: The Definitive Guide" has some UML Sequence diagrams on
> what happens for creation of map jvms. Sorry, I was not able to relocate
> them on the web, yet (well, safaribooksonline.com ;-).
>
> Depending on the specific data layout (e.g. record lengths), the map tasks
> may need to read other blocks anyway, which may be off-node.
>
> On how many nodes is your 100 blocks file stored? on 10?
>
> If it is on one node, then you're likely running into map slot limits or
> container limits.
>
> Best regards,
>
> Jens
>

Mime
View raw message