hadoop-common-user mailing list archives

From Rohit Kochar <mnit.ro...@gmail.com>
Subject Re: basic question about rack awareness and computation migration
Date Tue, 05 Mar 2013 19:15:21 GMT
Hello,
To be precise, this is hidden from the developer, and you need not write any code for it.
Whenever a file is stored in HDFS, it is split into blocks of the configured block size,
and each block can potentially be stored on a different datanode. The information about which
file contains which blocks resides with the namenode.
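
To make the block splitting concrete, here is a minimal sketch in plain Java. It is not
Hadoop code: the class BlockLayoutSketch and its method are hypothetical names, and the
64 MB block size is only an assumed value for illustration (the real value comes from the
cluster configuration).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of how a file is cut into fixed-size blocks. Hypothetical, not Hadoop code.
class BlockLayoutSketch {

    /** One block of a file: its byte offset within the file and its length in bytes. */
    static final class Block {
        final long offset;
        final long length;

        Block(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return "Block[offset=" + offset + ", length=" + length + "]";
        }
    }

    /** Splits a file of fileLength bytes into blocks of at most blockSize bytes. */
    static List<Block> splitIntoBlocks(long fileLength, long blockSize) {
        if (fileLength < 0) {
            throw new IllegalArgumentException("fileLength must not be negative");
        }
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Block> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // Every block is full-sized except possibly the last one.
            blocks.add(new Block(offset, Math.min(blockSize, fileLength - offset)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 200 MB file with an assumed 64 MB block size.
        for (Block b : splitIntoBlocks(200 * mb, 64 * mb)) {
            System.out.println(b);
        }
    }
}
```

Each of these blocks can live on a different datanode, and the namenode keeps the
bookkeeping for the whole list.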

So essentially, whenever a file is accessed via the DFS client, the client requests the file's
metadata from the namenode and uses it to provide the file to the end user in a streaming fashion.

Since the namenode knows the location of all the blocks of every file, Hadoop can schedule a
task to execute on the same node that holds the data it reads.
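
As a rough illustration of that idea, the sketch below assigns each task to a node that
already holds its block whenever such a node has a free slot, and falls back to any free
node otherwise. This is a simplified model in plain Java, not Hadoop's actual scheduler:
the class LocalitySchedulerSketch, the block ids, and the node names are all hypothetical,
and rack-level preference is left out entirely.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of data-local task placement. Hypothetical, not Hadoop's scheduler.
class LocalitySchedulerSketch {

    /**
     * replicaHosts: block id -> nodes holding a replica (what the namenode knows).
     * freeSlots:    node -> number of free task slots.
     * Returns block id -> chosen node; blocks that find no free slot are left out.
     */
    static Map<String, String> assign(Map<String, List<String>> replicaHosts,
                                      Map<String, Integer> freeSlots) {
        Map<String, Integer> remaining = new LinkedHashMap<>(freeSlots);
        Map<String, String> assignment = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : replicaHosts.entrySet()) {
            // Prefer a node that already stores the block (data-local).
            String chosen = firstWithFreeSlot(entry.getValue(), remaining);
            if (chosen == null) {
                // Otherwise take any node with a free slot (data is read remotely).
                chosen = firstWithFreeSlot(remaining.keySet(), remaining);
            }
            if (chosen != null) {
                remaining.put(chosen, remaining.get(chosen) - 1);
                assignment.put(entry.getKey(), chosen);
            }
        }
        return assignment;
    }

    private static String firstWithFreeSlot(Iterable<String> hosts,
                                            Map<String, Integer> remaining) {
        for (String host : hosts) {
            if (remaining.getOrDefault(host, 0) > 0) {
                return host;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, List<String>> replicaHosts = new LinkedHashMap<>();
        replicaHosts.put("block-1", Arrays.asList("nodeA", "nodeB"));
        replicaHosts.put("block-2", Arrays.asList("nodeC"));

        Map<String, Integer> freeSlots = new LinkedHashMap<>();
        freeSlots.put("nodeA", 1);
        freeSlots.put("nodeB", 1);
        freeSlots.put("nodeC", 0);

        System.out.println(assign(replicaHosts, freeSlots));
    }
}
```

In Hadoop itself the candidate hosts for a task come from its input split, so this kind of
locality applies to the data the job declares as its input.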

Rohit Kochar

On 05-Mar-2013, at 5:19 PM, Julian Bui wrote:

> Hi hadoop users,
> I'm trying to find out if computation migration is something the developer needs to worry
> about or if it's supposed to be hidden.
> I would like to use hadoop to take in a list of image paths in the hdfs and then have
> each task compress these large, raw images into something much smaller - say jpeg files.
> Input: list of paths
> Output: compressed jpeg
> Since I don't really need a reduce task (I'm more using hadoop for its reliability and
> orchestration aspects), my mapper ought to just take the list of image paths and then work
> on them. As I understand it, each image will likely be on multiple data nodes.
> My question is how will each mapper task "migrate the computation" to the data nodes?
> I recall reading that the namenode is supposed to deal with this. Is it hidden from the
> developer? Or as the developer, do I need to discover where the data lies and then migrate
> the task to that node? Since my input is just a list of paths, it seems like the namenode
> couldn't really do this for me.
> Another question: Where can I find out more about this? I've looked up "rack awareness"
> and "computation migration" but haven't really found much code relating to either one - leading
> me to believe I'm not supposed to have to write code to deal with this.
> Anyway, could someone please help me out or set me straight on this?
> Thanks,
> -Julian
