hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "FAQ" by Arun C Murthy
Date Sat, 22 Sep 2007 07:25:14 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by Arun C Murthy:

The comment on the change is:
Added a section on how to create/write-to side-files via map/reduce tasks

  The distributed cache is used to distribute large read-only files that are needed by map/reduce
jobs to the cluster. The framework will copy the necessary files from a url (either hdfs:
or http:) on to the slave node before any tasks for the job are executed on that node. The
files are only copied once per job and so should not be modified by the application.
+ == 9. Can I create/write-to hdfs files directly from my map/reduce tasks? ==
+ Yes. (Clearly, you want this since you need to create/write-to files other than the output-file
written out by the [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/OutputCollector.html OutputCollector].)
+ Caveats:
+ <glossary>
+ ${mapred.output.dir} is the eventual output directory for the job ([http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)
JobConf.setOutputPath] / [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath() JobConf.getOutputPath]).
+ ${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0),
a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).
+ </glossary>
+ With ''speculative-execution'' '''on''', two instances of the same TIP running simultaneously
could try to open/write-to the same file (path) on hdfs. Hence the app-writer has to pick
unique names per task-attempt (e.g. using the complete taskid, i.e. task_200709221812_0001_m_000000_0),
not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to
files directly via reduce tasks.)
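The attempt-id naming scheme above can be sketched in plain Java. This is a hypothetical helper (the names `tipId` and `sideFileName` are illustrative, not part of the Hadoop API): two speculative attempts of the same TIP share the TIP id but differ in the trailing attempt number, so names built from the complete task-attempt id never collide.

```java
// Hypothetical sketch of per-attempt unique naming; not Hadoop API code.
public class SideFileNames {
    // Strip the trailing "_<attempt>" suffix to recover the TIP id.
    static String tipId(String taskAttemptId) {
        int cut = taskAttemptId.lastIndexOf('_');
        return taskAttemptId.substring(0, cut);
    }

    // Unique per task-attempt, hence safe under speculative execution.
    static String sideFileName(String taskAttemptId, String suffix) {
        return taskAttemptId + "." + suffix;
    }

    public static void main(String[] args) {
        String attempt0 = "task_200709221812_0001_m_000000_0";
        String attempt1 = "task_200709221812_0001_m_000000_1";
        // Both attempts belong to the same TIP ...
        System.out.println(tipId(attempt0));
        System.out.println(tipId(attempt1));
        // ... but their side-file names differ.
        System.out.println(sideFileName(attempt0, "log"));
        System.out.println(sideFileName(attempt1, "log"));
    }
}
```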
+ To get around this, the framework helps the application-writer by maintaining a special
'''${mapred.output.dir}/_${taskid}''' sub-dir for each task-attempt on hdfs where the output
of the reduce task-attempt goes. On successful completion of the task-attempt, the files in
${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}.
Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is
completely transparent to the application.
+ The app-writer can take advantage of this by creating any side-files required in ${mapred.output.dir}
during execution of the reduce task, and the framework will move them out similarly - thus
there is no need to pick unique paths per task-attempt.
+ Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt
is actually ${mapred.output.dir}/_${taskid}, not the value set by [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)
JobConf.setOutputPath]. ''So, just create any hdfs files you want in ${mapred.output.dir}
from your reduce task to take advantage of this feature.''
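The promotion scheme described above can be simulated on a local filesystem with the standard library. This is only an illustrative sketch of the semantics (the class and method names are hypothetical, and real Hadoop does this on hdfs inside the framework): each task-attempt writes under a `_${taskid}` sub-dir, and on success its files are moved up into the job output directory and the sub-dir is discarded.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical local-filesystem simulation of the side-file promotion
// the framework performs on hdfs; not Hadoop API code.
public class SideFilePromotion {
    // Move every file from <outputDir>/_<taskid> up into <outputDir>,
    // then discard the now-empty per-attempt sub-dir.
    public static void promote(Path outputDir, String taskid) throws IOException {
        Path attemptDir = outputDir.resolve("_" + taskid);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(attemptDir)) {
            for (Path f : files) {
                Files.move(f, outputDir.resolve(f.getFileName()));
            }
        }
        Files.delete(attemptDir);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("mapred-out");
        String taskid = "task_200709221812_0001_r_000000_0";
        // The task-attempt writes its side-file under _<taskid> ...
        Path attemptDir = Files.createDirectories(out.resolve("_" + taskid));
        Files.writeString(attemptDir.resolve("side-file.txt"), "data");
        // ... and on success the file is promoted to the output dir.
        promote(out, taskid);
        System.out.println(Files.exists(out.resolve("side-file.txt")));
    }
}
```

Unsuccessful attempts would simply have their sub-dir deleted without the move, which is why the scheme is transparent to the application.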
+ The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since
output of the map, in that case, goes directly to hdfs.
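For such map-only jobs the layout is the same, just keyed by the map task-attempt id. A minimal sketch (the `attemptDir` helper and the `/user/alice/output` path are hypothetical, for illustration only):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: with zero reduces, each map attempt writes under
// the same per-attempt sub-directory scheme in the job output dir.
public class MapOnlyLayout {
    static Path attemptDir(Path outputDir, String mapAttemptId) {
        return outputDir.resolve("_" + mapAttemptId);
    }

    public static void main(String[] args) {
        Path out = Paths.get("/user/alice/output"); // hypothetical ${mapred.output.dir}
        System.out.println(attemptDir(out, "task_200709221812_0001_m_000000_0"));
    }
}
```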
