hadoop-common-user mailing list archives

From Atish Kathpal <atish.kath...@gmail.com>
Subject How to list the order in which file splits will be processed by Maps in Hadoop 2.2.0?
Date Wed, 19 Feb 2014 09:24:59 GMT
Hello

I would like to know the order in which the map tasks of a given job
process its input files.

*Example*: I am running Wordcount on an input directory /ebooks/ containing,
say, 10 .txt files.
While this job runs, I would like to know, at any point in time, which
map tasks (map task IDs) on which nodes (IP addresses) are processing
which file splits (actual file, range of offsets).
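
For illustration, here is a rough sketch of the kind of per-split listing
described above. It uses no Hadoop classes: SplitOrderSketch, Split and
computeSplits are hypothetical names, and the file sizes and split size are
made up. It assumes, as far as I understand the default behaviour, that
FileInputFormat cuts each file into fixed-size splits with a 1.1 slop factor
for the tail, and that the job client sorts splits largest-first before
submission. The order in which tasks actually start also depends on the
scheduler and on locality, so this only approximates the submission order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitOrderSketch {
    // Hypothetical stand-in for a file split: file, start offset, length in bytes.
    static final class Split {
        final String path;
        final long start;
        final long length;

        Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return path + " [" + start + ", " + (start + length) + ")";
        }
    }

    // Assumed slop factor: a tail of up to 10% over the split size
    // is kept in one split instead of being cut again.
    static final double SPLIT_SLOP = 1.1;

    // Cuts each file into splits, then orders them largest-first
    // (assumed to mirror what the job client does at submission).
    static List<Split> computeSplits(Map<String, Long> files, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (Map.Entry<String, Long> file : files.entrySet()) {
            long length = file.getValue();
            long remaining = length;
            while ((double) remaining / splitSize > SPLIT_SLOP) {
                splits.add(new Split(file.getKey(), length - remaining, splitSize));
                remaining -= splitSize;
            }
            if (remaining != 0) {
                splits.add(new Split(file.getKey(), length - remaining, remaining));
            }
        }
        // List.sort is stable, so equally sized splits keep their listing order.
        splits.sort(Comparator.comparingLong((Split s) -> s.length).reversed());
        return splits;
    }

    public static void main(String[] args) {
        // Made-up file sizes, in bytes, purely for demonstration.
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("/ebooks/a.txt", 300L);
        files.put("/ebooks/b.txt", 100L);
        for (Split s : computeSplits(files, 128L)) {
            System.out.println(s);
        }
    }
}
```

What this sketch cannot show is the part I am really after: which map task
ID and which node each split ends up on at run time.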

Is it possible to hook into the MR source code to obtain these details? If
so, please point me to the section of the code where I can get them.

After logging and analyzing these details, I may want to add some
pre-fetching to improve map task performance. (I am not using HDFS, but a
different FS whose performance needs improving through pre-fetching or
other techniques.)

TL;DR
I want to know the sequence/order in which the different files will be
accessed by map tasks once a job is submitted to a Hadoop v2 cluster. I
assume some kind of FIFO scheduler module might be able to give me this
information at the file level?

Looking forward to your reply.

Thanks.
