Your concern is correct: If your input is a list of files, rather than
the files themselves, then the tasks would not be data-local - since
the task input would just be the list of files, and the files' data
may reside on any node/rack of the cluster.
However, your job will still run: HDFS falls back to remote reads
transparently, without developer intervention, so everything will still
work as you've written it. If a block happens to be local to the DN
(DataNode) the task runs on, it is read locally as well - all of this
is automatic.
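
To make the local-vs-remote distinction above concrete, here is a toy
model in plain Java. It is only a sketch: it uses no Hadoop classes, all
names in it (ReadLocalitySketch, classify, plan, the node and path
names) are made up for illustration, and it treats each file as a single
block. In a real job the replica hosts would come from HDFS itself (for
example via FileSystem.getFileBlockLocations), and the HDFS client makes
this choice for you on every read.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: no Hadoop classes are used and every name is hypothetical.
public class ReadLocalitySketch {

    enum ReadKind { LOCAL, REMOTE }

    // A read is local only if one of the block's replicas lives on the host
    // where the task happens to run; otherwise the data is fetched over the
    // network from another DataNode.
    static ReadKind classify(String taskHost, List<String> replicaHosts) {
        return replicaHosts.contains(taskHost) ? ReadKind.LOCAL : ReadKind.REMOTE;
    }

    // Classify every path in a task's input list (one block per file assumed).
    static Map<String, ReadKind> plan(String taskHost,
                                      Map<String, List<String>> replicasByPath) {
        Map<String, ReadKind> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : replicasByPath.entrySet()) {
            result.put(e.getKey(), classify(taskHost, e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> replicas = new LinkedHashMap<>();
        replicas.put("/images/a.raw", Arrays.asList("node1", "node2", "node3"));
        replicas.put("/images/b.raw", Arrays.asList("node4", "node5", "node6"));
        // The task was scheduled on node2 because of where the *list* file
        // lives, not because of where the images live.
        System.out.println(plan("node2", replicas));
    }
}
```

With a list-of-paths input, the scheduler only knows where the list
file's blocks are, so whether any given image ends up LOCAL or REMOTE
for a task is a matter of luck. Either way the read succeeds.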
Are your input lists big (for each compressed output)? And is the list
arbitrary or a defined list per goal?
On Tue, Mar 5, 2013 at 5:19 PM, Julian Bui <firstname.lastname@example.org> wrote:
> Hi hadoop users,
> I'm trying to find out if computation migration is something the developer
> needs to worry about or if it's supposed to be hidden.
> I would like to use hadoop to take in a list of image paths in the hdfs and
> then have each task compress these large, raw images into something much
> smaller - say jpeg files.
> Input: list of paths
> Output: compressed jpeg
> Since I don't really need a reduce task (I'm more using hadoop for its
> reliability and orchestration aspects), my mapper ought to just take the
> list of image paths and then work on them. As I understand it, each image
> will likely be on multiple data nodes.
> My question is how will each mapper task "migrate the computation" to the
> data nodes? I recall reading that the namenode is supposed to deal with
> this. Is it hidden from the developer? Or as the developer, do I need to
> discover where the data lies and then migrate the task to that node? Since
> my input is just a list of paths, it seems like the namenode couldn't really
> do this for me.
> Another question: Where can I find out more about this? I've looked up
> "rack awareness" and "computation migration" but haven't really found much
> code relating to either one - leading me to believe I'm not supposed to have
> to write code to deal with this.
> Anyway, could someone please help me out or set me straight on this?