From d...@apache.org
Subject svn commit: r666624 [1/3] - in /hadoop/core/branches/branch-0.18: CHANGES.txt docs/mapred_tutorial.html docs/mapred_tutorial.pdf src/docs/src/documentation/content/xdocs/mapred_tutorial.xml src/docs/src/documentation/content/xdocs/site.xml
Date Wed, 11 Jun 2008 11:39:56 GMT
Author: ddas
Date: Wed Jun 11 04:39:55 2008
New Revision: 666624

URL: http://svn.apache.org/viewvc?rev=666624&view=rev
Merge -r 666619:666620 from trunk onto 0.18 branch. Fixes HADOOP-3096.


Modified: hadoop/core/branches/branch-0.18/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.18/CHANGES.txt?rev=666624&r1=666623&r2=666624&view=diff
--- hadoop/core/branches/branch-0.18/CHANGES.txt (original)
+++ hadoop/core/branches/branch-0.18/CHANGES.txt Wed Jun 11 04:39:55 2008
@@ -282,6 +282,9 @@
     HADOOP-3379. Documents stream.non.zero.exit.status.is.failure for Streaming.
     (Amareshwari Sriramadasu via ddas)
+    HADOOP-3096. Improves documentation about the Task Execution Environment in 
+    the Map-Reduce tutorial. (Amareshwari Sriramadasu via ddas)
     HADOOP-3274. The default constructor of BytesWritable creates empty 

Modified: hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html?rev=666624&r1=666623&r2=666624&view=diff
--- hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html (original)
+++ hadoop/core/branches/branch-0.18/docs/mapred_tutorial.html Wed Jun 11 04:39:55 2008
@@ -301,7 +301,7 @@
 <a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
 <ul class="minitoc">
-<a href="#Source+Code-N10C87">Source Code</a>
+<a href="#Source+Code-N10D77">Source Code</a>
 <a href="#Sample+Runs">Sample Runs</a>
@@ -1542,42 +1542,170 @@
 <p>Users/admins can also specify the maximum virtual memory 
         of the launched child-task using <span class="codefrag">mapred.child.ulimit</span>.</p>
-<p>When the job starts, the localized job directory
-        <span class="codefrag"> ${mapred.local.dir}/taskTracker/jobcache/$jobid/</span>
-        has the following directories: </p>
+<p>The task tracker has a local directory,
+        <span class="codefrag"> ${mapred.local.dir}/taskTracker/</span>, in which it
+        creates the localized cache and localized jobs. Multiple local directories 
+        (spanning multiple disks) can be defined, in which case each filename is
+        assigned to a semi-random local directory. When the job starts, the task tracker 
+        creates a localized job directory relative to the local directory
+        specified in the configuration. Thus the task tracker directory 
+        structure looks like the following: </p>
-<li> A job-specific shared directory, created at location
-        <span class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/work/
-        This directory is exposed to the users through 
-        <span class="codefrag">job.local.dir </span>. The tasks can use this space as scratch
-        space and share files among them. The directory can accessed through 
-        api <a href="api/org/apache/hadoop/mapred/JobConf.html#getJobLocalDir()">
-        JobConf.getJobLocalDir()</a>. It is available as System property also.
-        So,users can call <span class="codefrag">System.getProperty("job.local.dir")</span>;
-        </li>
-<li>A jars directory, which has the job jar file and expanded jar </li>
+<span class="codefrag">${mapred.local.dir}/taskTracker/archive/</span> :
+        The distributed cache. This directory holds the localized distributed
+        cache. The localized distributed cache is thus shared among all
+        tasks and jobs. </li>
-<li>A job.xml file, the generic job configuration </li>
+<span class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/</span>
+        The localized job directory 
+        <ul>
-<li>Each task has directory <span class="codefrag">task-id</span> which again has the 
-        following structure
+<span class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/work/</span>
+        : The job-specific shared directory. The tasks can use this space as 
+        scratch space and share files among them. This directory is exposed
+        to the users through the configuration property  
+        <span class="codefrag">job.local.dir</span>. The directory can be accessed through the
+        api <a href="api/org/apache/hadoop/mapred/JobConf.html#getJobLocalDir()">
+        JobConf.getJobLocalDir()</a>. It is also available as a System property.
+        So, users (streaming etc.) can call 
+        <span class="codefrag">System.getProperty("job.local.dir")</span> to access the 
+        directory.</li>
+<span class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/jars/</span>
+        : The jars directory, which has the job jar file and expanded jar.
+        The <span class="codefrag">job.jar</span> is the application's jar file that is
+        automatically distributed to each machine. It is expanded in the jars
+        directory before the tasks for the job start. The job.jar location
+        is accessible to the application through the api
+        <a href="api/org/apache/hadoop/mapred/JobConf.html#getJar()"> 
+        JobConf.getJar() </a>. To access the unjarred directory, take the
+        parent directory of the path returned by JobConf.getJar().</li>
+<span class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/job.xml</span>
+        : The job.xml file, the generic job configuration, localized for 
+        the job. </li>
+<span class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid</span>
+        : The task directory for each task attempt. Each task directory
+        again has the following structure :
-<li>A job.xml file, task localized job configuration </li>
+<span class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/job.xml</span>
+        : A job.xml file, task localized job configuration. Task localization
+        means that properties have been set that are specific to
+        this particular task within the job. The properties localized for 
+        each task are described below.</li>
+<span class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/output</span>
+        : A directory for intermediate output files. This contains the
+        temporary map reduce data generated by the framework
+        such as map output files etc. </li>
+<span class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/work</span>
+        : The current working directory of the task. </li>
+<span class="codefrag">${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/work/tmp</span>
+        : The temporary directory for the task. 
+        (User can specify the property <span class="codefrag">mapred.child.tmp</span> to set
+        the value of the temporary directory for map and reduce tasks. This 
+        defaults to <span class="codefrag">./tmp</span>. If the value is not an absolute path,
+        the task's working directory is prepended to it. Otherwise, it is
+        directly assigned. The directory will be created if it doesn't exist.
+        Then, the child java tasks are executed with the option
+        <span class="codefrag">-Djava.io.tmpdir='the absolute path of the tmp dir'</span>.
+        Pipes and streaming are set with the environment variable
+        <span class="codefrag">TMPDIR='the absolute path of the tmp dir'</span>). This
+        directory is created if <span class="codefrag">mapred.child.tmp</span> has the value
+        <span class="codefrag">./tmp</span> 
-<li>A directory for intermediate output files</li>
-<li>The working directory of the task. 
-        And work directory has a temporary directory 
-        to create temporary files</li>
+<p>The following properties are localized in the job configuration 
+         for each task's execution: </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+<th colspan="1" rowspan="1">Name</th><th colspan="1" rowspan="1">Type</th><th colspan="1" rowspan="1">Description</th>
+<td colspan="1" rowspan="1">mapred.job.id</td><td colspan="1" rowspan="1">String</td><td colspan="1" rowspan="1">The job id</td>
+<td colspan="1" rowspan="1">mapred.jar</td><td colspan="1" rowspan="1">String</td>
+              <td colspan="1" rowspan="1">job.jar location in job directory</td>
+<td colspan="1" rowspan="1">job.local.dir</td><td colspan="1" rowspan="1"> String</td>
+              <td colspan="1" rowspan="1"> The job specific shared scratch space</td>
+<td colspan="1" rowspan="1">mapred.tip.id</td><td colspan="1" rowspan="1"> String</td>
+              <td colspan="1" rowspan="1"> The task id</td>
+<td colspan="1" rowspan="1">mapred.task.id</td><td colspan="1" rowspan="1"> String</td>
+              <td colspan="1" rowspan="1"> The task attempt id</td>
+<td colspan="1" rowspan="1">mapred.task.is.map</td><td colspan="1" rowspan="1"> boolean </td>
+              <td colspan="1" rowspan="1">Is this a map task</td>
+<td colspan="1" rowspan="1">mapred.task.partition</td><td colspan="1" rowspan="1"> int </td>
+              <td colspan="1" rowspan="1">The id of the task within the job</td>
+<td colspan="1" rowspan="1">map.input.file</td><td colspan="1" rowspan="1"> String</td>
+              <td colspan="1" rowspan="1"> The filename that the map is reading from</td>
+<td colspan="1" rowspan="1">map.input.start</td><td colspan="1" rowspan="1"> long</td>
+              <td colspan="1" rowspan="1"> The offset of the start of the map input split</td>
+<td colspan="1" rowspan="1">map.input.length </td><td colspan="1" rowspan="1">long</td>
+              <td colspan="1" rowspan="1">The number of bytes in the map input split</td>
+<td colspan="1" rowspan="1">mapred.work.output.dir</td><td colspan="1" rowspan="1"> String </td>
+              <td colspan="1" rowspan="1">The task's temporary output directory</td>
+<p>The standard output (stdout) and error (stderr) streams of the task 
+        are read by the TaskTracker and logged to 
+        <span class="codefrag">${HADOOP_LOG_DIR}/userlogs</span>
 <p>The <a href="#DistributedCache">DistributedCache</a> can also be used
         as a rudimentary software distribution mechanism for use in the map 
         and/or reduce tasks. It can be used to distribute both jars and 
@@ -1597,7 +1725,7 @@
         loaded via <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#loadLibrary(java.lang.String)">
         System.loadLibrary</a> or <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html#load(java.lang.String)">
-<a name="N108FB"></a><a name="Job+Submission+and+Monitoring"></a>
+<a name="N109EB"></a><a name="Job+Submission+and+Monitoring"></a>
 <h3 class="h4">Job Submission and Monitoring</h3>
 <a href="api/org/apache/hadoop/mapred/JobClient.html">
@@ -1658,7 +1786,7 @@
 <p>Normally the user creates the application, describes various facets 
         of the job via <span class="codefrag">JobConf</span>, and then uses the
         <span class="codefrag">JobClient</span> to submit the job and monitor its progress.</p>
-<a name="N1095B"></a><a name="Job+Control"></a>
+<a name="N10A4B"></a><a name="Job+Control"></a>
 <h4>Job Control</h4>
 <p>Users may need to chain map-reduce jobs to accomplish complex
           tasks which cannot be done via a single map-reduce job. This is fairly
@@ -1694,7 +1822,7 @@
-<a name="N10985"></a><a name="Job+Input"></a>
+<a name="N10A75"></a><a name="Job+Input"></a>
 <h3 class="h4">Job Input</h3>
 <a href="api/org/apache/hadoop/mapred/InputFormat.html">
@@ -1742,7 +1870,7 @@
         appropriate <span class="codefrag">CompressionCodec</span>. However, it must be noted that
         compressed files with the above extensions cannot be <em>split</em> and
         each compressed file is processed in its entirety by a single mapper.</p>
-<a name="N109EF"></a><a name="InputSplit"></a>
+<a name="N10ADF"></a><a name="InputSplit"></a>
 <a href="api/org/apache/hadoop/mapred/InputSplit.html">
@@ -1756,7 +1884,7 @@
           FileSplit</a> is the default <span class="codefrag">InputSplit</span>. It sets 
           <span class="codefrag">map.input.file</span> to the path of the input file for the
           logical split.</p>
-<a name="N10A14"></a><a name="RecordReader"></a>
+<a name="N10B04"></a><a name="RecordReader"></a>
 <a href="api/org/apache/hadoop/mapred/RecordReader.html">
@@ -1768,7 +1896,7 @@
           for processing. <span class="codefrag">RecordReader</span> thus assumes
           responsibility of processing record boundaries and presents the tasks 
           with keys and values.</p>
-<a name="N10A37"></a><a name="Job+Output"></a>
+<a name="N10B27"></a><a name="Job+Output"></a>
 <h3 class="h4">Job Output</h3>
 <a href="api/org/apache/hadoop/mapred/OutputFormat.html">
@@ -1793,7 +1921,7 @@
 <span class="codefrag">TextOutputFormat</span> is the default 
         <span class="codefrag">OutputFormat</span>.</p>
-<a name="N10A60"></a><a name="Task+Side-Effect+Files"></a>
+<a name="N10B50"></a><a name="Task+Side-Effect+Files"></a>
 <h4>Task Side-Effect Files</h4>
 <p>In some applications, component tasks need to create and/or write to
           side-files, which differ from the actual job-output files.</p>
@@ -1832,7 +1960,7 @@
 <p>The entire discussion holds true for maps of jobs with 
            reducer=NONE (i.e. 0 reduces) since output of the map, in that case, 
            goes directly to HDFS.</p>
-<a name="N10AA8"></a><a name="RecordWriter"></a>
+<a name="N10B98"></a><a name="RecordWriter"></a>
 <a href="api/org/apache/hadoop/mapred/RecordWriter.html">
@@ -1840,9 +1968,9 @@
           pairs to an output file.</p>
 <p>RecordWriter implementations write the job outputs to the 
           <span class="codefrag">FileSystem</span>.</p>
-<a name="N10ABF"></a><a name="Other+Useful+Features"></a>
+<a name="N10BAF"></a><a name="Other+Useful+Features"></a>
 <h3 class="h4">Other Useful Features</h3>
-<a name="N10AC5"></a><a name="Counters"></a>
+<a name="N10BB5"></a><a name="Counters"></a>
 <span class="codefrag">Counters</span> represent global counters, defined either
@@ -1856,7 +1984,7 @@
           Reporter.incrCounter(Enum, long)</a> in the <span class="codefrag">map</span>
           <span class="codefrag">reduce</span> methods. These counters are then
           aggregated by the framework.</p>
-<a name="N10AF0"></a><a name="DistributedCache"></a>
+<a name="N10BE0"></a><a name="DistributedCache"></a>
 <a href="api/org/apache/hadoop/filecache/DistributedCache.html">
@@ -1890,7 +2018,7 @@
           <a href="api/org/apache/hadoop/filecache/DistributedCache.html#createSymlink(org.apache.hadoop.conf.Configuration)">
           DistributedCache.createSymlink(Configuration)</a> api. Files 
           have <em>execution permissions</em> set.</p>
-<a name="N10B2E"></a><a name="Tool"></a>
+<a name="N10C1E"></a><a name="Tool"></a>
 <p>The <a href="api/org/apache/hadoop/util/Tool.html">Tool</a> 
           interface supports the handling of generic Hadoop command-line options.
@@ -1930,7 +2058,7 @@
-<a name="N10B60"></a><a name="IsolationRunner"></a>
+<a name="N10C50"></a><a name="IsolationRunner"></a>
 <a href="api/org/apache/hadoop/mapred/IsolationRunner.html">
@@ -1954,7 +2082,7 @@
 <span class="codefrag">IsolationRunner</span> will run the failed task in a single
           jvm, which can be in the debugger, over precisely the same input.</p>
-<a name="N10B93"></a><a name="Debugging"></a>
+<a name="N10C83"></a><a name="Debugging"></a>
 <p>Map/Reduce framework provides a facility to run user-provided 
           scripts for debugging. When map/reduce task fails, user can run 
@@ -1965,7 +2093,7 @@
 <p> In the following sections we discuss how to submit debug script
           along with the job. For submitting debug script, first it has to
           be distributed. Then the script has to be supplied in Configuration. </p>
-<a name="N10B9F"></a><a name="How+to+distribute+script+file%3A"></a>
+<a name="N10C8F"></a><a name="How+to+distribute+script+file%3A"></a>
 <h5> How to distribute script file: </h5>
           To distribute  the debug script file, first copy the file to the dfs.
@@ -1988,7 +2116,7 @@
           <a href="api/org/apache/hadoop/filecache/DistributedCache.html#createSymlink(org.apache.hadoop.conf.Configuration)">
           DistributedCache.createSymlink(Configuration) </a> api.
-<a name="N10BB8"></a><a name="How+to+submit+script%3A"></a>
+<a name="N10CA8"></a><a name="How+to+submit+script%3A"></a>
 <h5> How to submit script: </h5>
 <p> A quick way to submit debug script is to set values for the 
           properties "mapred.map.task.debug.script" and 
@@ -2012,17 +2140,17 @@
 <span class="codefrag">$script $stdout $stderr $syslog $jobconf $program </span>
-<a name="N10BDA"></a><a name="Default+Behavior%3A"></a>
+<a name="N10CCA"></a><a name="Default+Behavior%3A"></a>
 <h5> Default Behavior: </h5>
 <p> For pipes, a default script is run to process core dumps under
           gdb, prints stack trace and gives info about running threads. </p>
-<a name="N10BE5"></a><a name="JobControl"></a>
+<a name="N10CD5"></a><a name="JobControl"></a>
 <a href="api/org/apache/hadoop/mapred/jobcontrol/package-summary.html">
           JobControl</a> is a utility which encapsulates a set of Map-Reduce jobs
           and their dependencies.</p>
-<a name="N10BF2"></a><a name="Data+Compression"></a>
+<a name="N10CE2"></a><a name="Data+Compression"></a>
 <h4>Data Compression</h4>
 <p>Hadoop Map-Reduce provides facilities for the application-writer to
           specify compression for both intermediate map-outputs and the
@@ -2036,7 +2164,7 @@
           codecs for reasons of both performance (zlib) and non-availability of
           Java libraries (lzo). More details on their usage and availability are
           available <a href="native_libraries.html">here</a>.</p>
-<a name="N10C12"></a><a name="Intermediate+Outputs"></a>
+<a name="N10D02"></a><a name="Intermediate+Outputs"></a>
 <h5>Intermediate Outputs</h5>
 <p>Applications can control compression of intermediate map-outputs
             via the 
@@ -2057,7 +2185,7 @@
             <a href="api/org/apache/hadoop/mapred/JobConf.html#setMapOutputCompressionType(org.apache.hadoop.io.SequenceFile.CompressionType)">
-<a name="N10C3E"></a><a name="Job+Outputs"></a>
+<a name="N10D2E"></a><a name="Job+Outputs"></a>
 <h5>Job Outputs</h5>
 <p>Applications can control compression of job-outputs via the
             <a href="api/org/apache/hadoop/mapred/OutputFormatBase.html#setCompressOutput(org.apache.hadoop.mapred.JobConf,%20boolean)">
@@ -2077,7 +2205,7 @@
-<a name="N10C6D"></a><a name="Example%3A+WordCount+v2.0"></a>
+<a name="N10D5D"></a><a name="Example%3A+WordCount+v2.0"></a>
 <h2 class="h3">Example: WordCount v2.0</h2>
 <div class="section">
 <p>Here is a more complete <span class="codefrag">WordCount</span> which uses many of the
@@ -2087,7 +2215,7 @@
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a>
       Hadoop installation.</p>
-<a name="N10C87"></a><a name="Source+Code-N10C87"></a>
+<a name="N10D77"></a><a name="Source+Code-N10D77"></a>
 <h3 class="h4">Source Code</h3>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
@@ -3297,7 +3425,7 @@
-<a name="N113E9"></a><a name="Sample+Runs"></a>
+<a name="N114D9"></a><a name="Sample+Runs"></a>
 <h3 class="h4">Sample Runs</h3>
 <p>Sample text-files as input:</p>
@@ -3465,7 +3593,7 @@
-<a name="N114BD"></a><a name="Highlights"></a>
+<a name="N115AD"></a><a name="Highlights"></a>
 <h3 class="h4">Highlights</h3>
 <p>The second version of <span class="codefrag">WordCount</span> improves upon the 
         previous one by using some features offered by the Map-Reduce framework:
