hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From d...@apache.org
Subject svn commit: r692409 [1/2] - in /hadoop/core/trunk/docs: changes.html hadoop-default.html mapred_tutorial.html mapred_tutorial.pdf
Date Fri, 05 Sep 2008 10:56:47 GMT
Author: ddas
Date: Fri Sep  5 03:56:45 2008
New Revision: 692409

URL: http://svn.apache.org/viewvc?rev=692409&view=rev
Log:
HADOOP-3150. Adding the generated docs.

Modified:
    hadoop/core/trunk/docs/changes.html
    hadoop/core/trunk/docs/hadoop-default.html
    hadoop/core/trunk/docs/mapred_tutorial.html
    hadoop/core/trunk/docs/mapred_tutorial.pdf

Modified: hadoop/core/trunk/docs/changes.html
URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/changes.html?rev=692409&r1=692408&r2=692409&view=diff
==============================================================================
--- hadoop/core/trunk/docs/changes.html (original)
+++ hadoop/core/trunk/docs/changes.html Fri Sep  5 03:56:45 2008
@@ -56,7 +56,7 @@
 </a></h2>
 <ul id="trunk_(unreleased_changes)_">
   <li><a href="javascript:toggleList('trunk_(unreleased_changes)_._incompatible_changes_')">  INCOMPATIBLE CHANGES
-</a>&nbsp;&nbsp;&nbsp;(10)
+</a>&nbsp;&nbsp;&nbsp;(12)
     <ol id="trunk_(unreleased_changes)_._incompatible_changes_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3595">HADOOP-3595</a>. Remove deprecated methods for mapred.combine.once
 functionality, which was necessary to providing backwards
@@ -85,10 +85,16 @@
 which is no longer needed.<br />(tomwhite via omalley)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3549">HADOOP-3549</a>. Give more meaningful errno's in libhdfs. In particular,
 EACCES is returned for permission problems.<br />(Ben Slusky via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4036">HADOOP-4036</a>. ResourceStatus was added to TaskTrackerStatus by <a href="http://issues.apache.org/jira/browse/HADOOP-3759">HADOOP-3759</a>,
+so increment the InterTrackerProtocol version.<br />(Hemanth Yamijala via
+omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3150">HADOOP-3150</a>. Moves task promotion to tasks. Defines a new interface for
+committing output files. Moves job setup to jobclient, and moves jobcleanup
+to a separate task.<br />(Amareshwari Sriramadasu via ddas)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('trunk_(unreleased_changes)_._new_features_')">  NEW FEATURES
-</a>&nbsp;&nbsp;&nbsp;(14)
+</a>&nbsp;&nbsp;&nbsp;(23)
     <ol id="trunk_(unreleased_changes)_._new_features_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3341">HADOOP-3341</a>. Allow streaming jobs to specify the field separator for map
 and reduce input and output. The new configuration values are:
@@ -127,11 +133,26 @@
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3854">HADOOP-3854</a>. Add support for pluggable servlet filters in the HttpServers.
 (Tsz Wo (Nicholas) Sze via omalley)
 </li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3759">HADOOP-3759</a>. Provides ability to run memory intensive jobs without
+affecting other running tasks on the nodes.<br />(Hemanth Yamijala via ddas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3746">HADOOP-3746</a>. Add a fair share scheduler.<br />(Matei Zaharia via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3754">HADOOP-3754</a>. Add a thrift interface to access HDFS.<br />(dhruba via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3828">HADOOP-3828</a>. Provides a way to write skipped records to DFS.<br />(Sharad Agarwal via ddas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3948">HADOOP-3948</a>. Separate name-node edits and fsimage directories.<br />(Lohit Vijayarenu via shv)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3939">HADOOP-3939</a>. Add an option to DistCp to delete files at the destination
+not present at the source. (Tsz Wo (Nicholas) Sze via cdouglas)
+</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3601">HADOOP-3601</a>. Add a new contrib module for Hive, which is a sql-like
+query processing tool that uses map/reduce.<br />(Ashish Thusoo via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3866">HADOOP-3866</a>. Added sort and multi-job updates in the JobTracker web ui.<br />(Craig Weisenfluh via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3698">HADOOP-3698</a>. Add access control to control who is allowed to submit or
+modify jobs in the JobTracker.<br />(Hemanth Yamijala via omalley)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('trunk_(unreleased_changes)_._improvements_')">  IMPROVEMENTS
-</a>&nbsp;&nbsp;&nbsp;(40)
+</a>&nbsp;&nbsp;&nbsp;(48)
     <ol id="trunk_(unreleased_changes)_._improvements_">
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3908">HADOOP-3908</a>. Fuse-dfs: better error message if llibhdfs.so doesn't exist.<br />(Pete Wyckoff through zshao)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3732">HADOOP-3732</a>. Delay intialization of datanode block verification till
 the verification thread is started.<br />(rangadi)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-1627">HADOOP-1627</a>. Various small improvements to 'dfsadmin -report' output.<br />(rangadi)</li>
@@ -205,10 +226,22 @@
 generative javadoc for developers.<br />(Sanjay Radia via omalley)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3944">HADOOP-3944</a>. Improve documentation for public TupleWritable class in
 join package.<br />(Chris Douglas via enis)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-2330">HADOOP-2330</a>. Preallocate HDFS transaction log to improve performance.<br />(dhruba and hairong)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3965">HADOOP-3965</a>. Convert DataBlockScanner into a package private class.<br />(shv)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3488">HADOOP-3488</a>. Prevent hadoop-daemon from rsync'ing log files<br />(Stefan
+Groshupf and Craig Macdonald via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3342">HADOOP-3342</a>. Change the kill task actions to require http post instead of
+get to prevent accidental crawls from triggering it.<br />(enis via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3937">HADOOP-3937</a>. Limit the job name in the job history filename to 50
+characters.<br />(Matei Zaharia via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3943">HADOOP-3943</a>. Remove unnecessary synchronization in
+NetworkTopology.pseudoSortByDistance.<br />(hairong via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3498">HADOOP-3498</a>. File globbing alternation should be able to span path
+components.<br />(tomwhite)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('trunk_(unreleased_changes)_._optimizations_')">  OPTIMIZATIONS
-</a>&nbsp;&nbsp;&nbsp;(6)
+</a>&nbsp;&nbsp;&nbsp;(7)
     <ol id="trunk_(unreleased_changes)_._optimizations_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3556">HADOOP-3556</a>. Removed lock contention in MD5Hash by changing the
 singleton MessageDigester by an instance per Thread using
@@ -223,10 +256,12 @@
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3816">HADOOP-3816</a>. Faster directory listing in KFS.<br />(Sriram Rao via omalley)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-2130">HADOOP-2130</a>. Pipes submit job should have both blocking and non-blocking
 versions.<br />(acmurthy via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3769">HADOOP-3769</a>. Make the SampleMapper and SampleReducer from
+GenericMRLoadGenerator public, so they can be used in other contexts.<br />(Lingyun Yang via omalley)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('trunk_(unreleased_changes)_._bug_fixes_')">  BUG FIXES
-</a>&nbsp;&nbsp;&nbsp;(44)
+</a>&nbsp;&nbsp;&nbsp;(55)
     <ol id="trunk_(unreleased_changes)_._bug_fixes_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3563">HADOOP-3563</a>.  Refactor the distributed upgrade code so that it is
 easier to identify datanode and namenode related code.<br />(dhruba)</li>
@@ -314,6 +349,30 @@
 cdouglas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3705">HADOOP-3705</a>. Fix mapred.join parser to accept InputFormats named with
 underscore and static, inner classes.<br />(cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4023">HADOOP-4023</a>. Fix javadoc warnings introduced when the HDFS javadoc was
+made private.<br />(omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4030">HADOOP-4030</a>. Remove lzop from the default list of codecs.<br />(Arun Murthy via
+cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3961">HADOOP-3961</a>. Fix task disk space requirement estimates for virtual
+input jobs. Delays limiting task placement until after 10% of the maps
+have finished.<br />(Ari Rabkin via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-2168">HADOOP-2168</a>. Fix problem with C++ record reader's progress not being
+reported to framework.<br />(acmurthy via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3966">HADOOP-3966</a>. Copy findbugs generated output files to PATCH_DIR while
+running test-patch.<br />(Ramya R via lohit)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4037">HADOOP-4037</a>. Fix the eclipse plugin for versions of kfs and log4j.<br />(nigel
+via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3950">HADOOP-3950</a>. Cause the Mini MR cluster to wait for task trackers to
+register before continuing.<br />(enis via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3910">HADOOP-3910</a>. Remove unused ClusterTestDFSNamespaceLogging and
+ClusterTestDFS. (Tsz Wo (Nicholas), SZE via cdouglas)
+</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3954">HADOOP-3954</a>. Disable record skipping by default.<br />(Sharad Agarwal via
+cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4050">HADOOP-4050</a>. Fix TestFailScheduler to use absolute paths for the work
+directory.<br />(Matei Zaharia via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4069">HADOOP-4069</a>. Keep temporary test files from TestKosmosFileSystem under
+test.build.data instead of /tmp.<br />(lohit via omalley)</li>
     </ol>
   </li>
 </ul>
@@ -321,10 +380,17 @@
 </a></h2>
 <ul id="release_0.18.1_-_unreleased_">
   <li><a href="javascript:toggleList('release_0.18.1_-_unreleased_._bug_fixes_')">  BUG FIXES
-</a>&nbsp;&nbsp;&nbsp;(1)
+</a>&nbsp;&nbsp;&nbsp;(4)
     <ol id="release_0.18.1_-_unreleased_._bug_fixes_">
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3995">HADOOP-3995</a>. In case of quota failure on HDFS, rename does not restore
+source filename.<br />(rangadi)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3821">HADOOP-3821</a>. Prevent SequenceFile and IFile from duplicating codecs in
 CodecPool when closed more than once.<br />(Arun Murthy via cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4040">HADOOP-4040</a>. Remove coded default of the IPC idle connection timeout
+from the TaskTracker, which was causing HDFS client connections to not be
+collected.<br />(ddas via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4046">HADOOP-4046</a>. Made WritableComparable's constructor protected instead of
+private to re-enable class derivation.<br />(cdouglas via omalley)</li>
     </ol>
   </li>
 </ul>
@@ -471,7 +537,7 @@
     <ol id="release_0.18.0_-_2008-08-19_._improvements_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3677">HADOOP-3677</a>. Simplify generation stamp upgrade by making is a
 local upgrade on datandodes. Deleted distributed upgrade.<br />(rangadi)</li>
-      <li><a href="http://issues.apache.org/jira/browse/HADOOP-2928">HADOOP-2928</a>. Remove deprecated FileSystem.getContentLength().<br />(Lohit Vjayarenu via rangadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-2928">HADOOP-2928</a>. Remove deprecated FileSystem.getContentLength().<br />(Lohit Vijayarenu via rangadi)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3130">HADOOP-3130</a>. Make the connect timeout smaller for getFile.<br />(Amar Ramesh Kamat via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3160">HADOOP-3160</a>. Remove deprecated exists() from ClientProtocol and
 FSNamesystem<br />(Lohit Vjayarenu via rangadi)</li>

Modified: hadoop/core/trunk/docs/hadoop-default.html
URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/hadoop-default.html?rev=692409&r1=692408&r2=692409&view=diff
==============================================================================
--- hadoop/core/trunk/docs/hadoop-default.html (original)
+++ hadoop/core/trunk/docs/hadoop-default.html Fri Sep  5 03:56:45 2008
@@ -84,7 +84,7 @@
   facilitate opening large map files using less memory.</td>
 </tr>
 <tr>
-<td><a name="io.compression.codecs">io.compression.codecs</a></td><td>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</td><td>A list of the compression codec classes that can be used 
+<td><a name="io.compression.codecs">io.compression.codecs</a></td><td>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</td><td>A list of the compression codec classes that can be used 
                for compression/decompression.</td>
 </tr>
 <tr>
@@ -138,12 +138,20 @@
 </tr>
 <tr>
 <td><a name="fs.checkpoint.dir">fs.checkpoint.dir</a></td><td>${hadoop.tmp.dir}/dfs/namesecondary</td><td>Determines where on the local filesystem the DFS secondary
-      name node should store the temporary images and edits to merge.
+      name node should store the temporary images to merge.
       If this is a comma-delimited list of directories then the image is
       replicated in all of the directories for redundancy.
   </td>
 </tr>
 <tr>
+<td><a name="fs.checkpoint.edits.dir">fs.checkpoint.edits.dir</a></td><td>${fs.checkpoint.dir}</td><td>Determines where on the local filesystem the DFS secondary
+      name node should store the temporary edits to merge.
+      If this is a comma-delimited list of directoires then teh edits is
+      replicated in all of the directoires for redundancy.
+      Default value is same as fs.checkpoint.dir
+  </td>
+</tr>
+<tr>
 <td><a name="fs.checkpoint.period">fs.checkpoint.period</a></td><td>3600</td><td>The number of seconds between two periodic checkpoints.
   </td>
 </tr>
@@ -225,11 +233,18 @@
 </tr>
 <tr>
 <td><a name="dfs.name.dir">dfs.name.dir</a></td><td>${hadoop.tmp.dir}/dfs/name</td><td>Determines where on the local filesystem the DFS name node
-      should store the name table.  If this is a comma-delimited list
+      should store the name table(fsimage).  If this is a comma-delimited list
       of directories then the name table is replicated in all of the
       directories, for redundancy. </td>
 </tr>
 <tr>
+<td><a name="dfs.name.edits.dir">dfs.name.edits.dir</a></td><td>${dfs.name.dir}</td><td>Determines where on the local filesystem the DFS name node
+      should store the transaction (edits) file. If this is a comma-delimited list
+      of directories then the transaction file is replicated in all of the 
+      directories, for redundancy. Default value is same as dfs.name.dir
+  </td>
+</tr>
+<tr>
 <td><a name="dfs.web.ugi">dfs.web.ugi</a></td><td>webuser,webgroup</td><td>The user account used by the web interface.
     Syntax: USERNAME,GROUP1,GROUP2, ...
   </td>
@@ -686,6 +701,39 @@
     </td>
 </tr>
 <tr>
+<td><a name="mapred.skip.mode.enabled">mapred.skip.mode.enabled</a></td><td>false</td><td> Indicates whether skipping of bad records is enabled or not.
+    If enabled the framework will try to find bad records and skip  
+    them on further attempts.
+    </td>
+</tr>
+<tr>
+<td><a name="mapred.skip.attempts.to.start.skipping">mapred.skip.attempts.to.start.skipping</a></td><td>2</td><td> The number of Task attempts AFTER which skip mode 
+    will be kicked off. When skip mode is kicked off, the 
+    tasks reports the range of records which it will process 
+    next, to the TaskTracker. So that on failures, TT knows which 
+    ones are possibly the bad records. On further executions, 
+    those are skipped.
+    </td>
+</tr>
+<tr>
+<td><a name="mapred.skip.map.auto.incr.proc.count">mapred.skip.map.auto.incr.proc.count</a></td><td>true</td><td> The flag which if set to true, 
+    Counters.Application.MAP_PROCESSED_RECORDS is incremented 
+    by MapRunner after invoking the map function. This value must be set to 
+    false for applications which process the records asynchronously 
+    or buffer the input records. For example streaming. 
+    In such cases applications should increment this counter on their own.
+    </td>
+</tr>
+<tr>
+<td><a name="mapred.skip.reduce.auto.incr.proc.count">mapred.skip.reduce.auto.incr.proc.count</a></td><td>true</td><td> The flag which if set to true, 
+    Counters.Application.REDUCE_PROCESSED_RECORDS is incremented 
+    by framework after invoking the reduce function. This value must be set to 
+    false for applications which process the records asynchronously 
+    or buffer the input records. For example streaming. 
+    In such cases applications should increment this counter on their own.
+    </td>
+</tr>
+<tr>
 <td><a name="ipc.client.idlethreshold">ipc.client.idlethreshold</a></td><td>4000</td><td>Defines the threshold number of connections after which
                connections will be inspected for idleness.
   </td>
@@ -807,6 +855,52 @@
     killed even though these limits are not reached.
   </td>
 </tr>
+<tr>
+<td><a name="mapred.queue.names">mapred.queue.names</a></td><td>default</td><td> Comma separated list of queues configured for this jobtracker.
+    Jobs are added to queues and schedulers can configure different 
+    scheduling properties for the various queues. To configure a property 
+    for a queue, the name of the queue must match the name specified in this 
+    value. Queue properties that are common to all schedulers are configured 
+    here with the naming convention, mapred.queue.$QUEUE-NAME.$PROPERTY-NAME,
+    for e.g. mapred.queue.default.submit-job-acl.
+    The number of queues configured in this parameter could depend on the
+    type of scheduler being used, as specified in 
+    mapred.jobtracker.taskScheduler. For example, the JobQueueTaskScheduler
+    supports only a single queue, which is the default configured here.
+    Before adding more queues, ensure that the scheduler you've configured
+    supports multiple queues.
+  </td>
+</tr>
+<tr>
+<td><a name="mapred.acls.enabled">mapred.acls.enabled</a></td><td>false</td><td> Specifies whether ACLs are enabled, and should be checked
+    for various operations.
+  </td>
+</tr>
+<tr>
+<td><a name="mapred.queue.default.acl-submit-job">mapred.queue.default.acl-submit-job</a></td><td>*</td><td> Comma separated list of user and group names that are allowed
+    to submit jobs to the 'default' queue. The user list and the group list
+    are separated by a blank. For e.g. alice,bob group1,group2. 
+    If set to the special value '*', it means all users are allowed to 
+    submit jobs. 
+  </td>
+</tr>
+<tr>
+<td><a name="mapred.queue.default.acl-administer-jobs">mapred.queue.default.acl-administer-jobs</a></td><td>*</td><td> Comma separated list of user and group names that are allowed
+    to delete jobs or modify job's priority for jobs not owned by the current
+    user in the 'default' queue. The user list and the group list
+    are separated by a blank. For e.g. alice,bob group1,group2. 
+    If set to the special value '*', it means all users are allowed to do 
+    this operation.
+  </td>
+</tr>
+<tr>
+<td><a name="queue.name">queue.name</a></td><td>default</td><td> Queue to which a job is submitted. This must match one of the
+    queues defined in mapred.queue.names for the system. Also, the ACL setup
+    for the queue must allow the current user to submit a job to the queue.
+    Before specifying a queue, ensure that the system is configured with 
+    the queue, and access is allowed for submitting jobs to the queue.
+  </td>
+</tr>
 </table>
 </body>
 </html>

Modified: hadoop/core/trunk/docs/mapred_tutorial.html
URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/mapred_tutorial.html?rev=692409&r1=692408&r2=692409&view=diff
==============================================================================
--- hadoop/core/trunk/docs/mapred_tutorial.html (original)
+++ hadoop/core/trunk/docs/mapred_tutorial.html Fri Sep  5 03:56:45 2008
@@ -268,6 +268,9 @@
 <a href="#Job+Output">Job Output</a>
 <ul class="minitoc">
 <li>
+<a href="#OutputCommitter">OutputCommitter</a>
+</li>
+<li>
 <a href="#Task+Side-Effect+Files">Task Side-Effect Files</a>
 </li>
 <li>
@@ -310,7 +313,7 @@
 <a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
 <ul class="minitoc">
 <li>
-<a href="#Source+Code-N10E0A">Source Code</a>
+<a href="#Source+Code-N10E46">Source Code</a>
 </li>
 <li>
 <a href="#Sample+Runs">Sample Runs</a>
@@ -1258,16 +1261,17 @@
 <p>We will then discuss other core interfaces including 
       <span class="codefrag">JobConf</span>, <span class="codefrag">JobClient</span>, <span class="codefrag">Partitioner</span>, 
       <span class="codefrag">OutputCollector</span>, <span class="codefrag">Reporter</span>, 
-      <span class="codefrag">InputFormat</span>, <span class="codefrag">OutputFormat</span> and others.</p>
+      <span class="codefrag">InputFormat</span>, <span class="codefrag">OutputFormat</span>,
+      <span class="codefrag">OutputCommitter</span> and others.</p>
 <p>Finally, we will wrap up by discussing some useful features of the
       framework such as the <span class="codefrag">DistributedCache</span>, 
       <span class="codefrag">IsolationRunner</span> etc.</p>
-<a name="N105FC"></a><a name="Payload"></a>
+<a name="N105FF"></a><a name="Payload"></a>
 <h3 class="h4">Payload</h3>
 <p>Applications typically implement the <span class="codefrag">Mapper</span> and 
         <span class="codefrag">Reducer</span> interfaces to provide the <span class="codefrag">map</span> and 
         <span class="codefrag">reduce</span> methods. These form the core of the job.</p>
-<a name="N10611"></a><a name="Mapper"></a>
+<a name="N10614"></a><a name="Mapper"></a>
 <h4>Mapper</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Mapper.html">
@@ -1323,7 +1327,7 @@
           <a href="api/org/apache/hadoop/io/compress/CompressionCodec.html">
           CompressionCodec</a> to be used via the <span class="codefrag">JobConf</span>.
           </p>
-<a name="N10687"></a><a name="How+Many+Maps%3F"></a>
+<a name="N1068A"></a><a name="How+Many+Maps%3F"></a>
 <h5>How Many Maps?</h5>
 <p>The number of maps is usually driven by the total size of the 
             inputs, that is, the total number of blocks of the input files.</p>
@@ -1336,7 +1340,7 @@
             <a href="api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)">
             setNumMapTasks(int)</a> (which only provides a hint to the framework) 
             is used to set it even higher.</p>
-<a name="N1069F"></a><a name="Reducer"></a>
+<a name="N106A2"></a><a name="Reducer"></a>
 <h4>Reducer</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Reducer.html">
@@ -1359,18 +1363,18 @@
 <p>
 <span class="codefrag">Reducer</span> has 3 primary phases: shuffle, sort and reduce.
           </p>
-<a name="N106CF"></a><a name="Shuffle"></a>
+<a name="N106D2"></a><a name="Shuffle"></a>
 <h5>Shuffle</h5>
 <p>Input to the <span class="codefrag">Reducer</span> is the sorted output of the
             mappers. In this phase the framework fetches the relevant partition 
             of the output of all the mappers, via HTTP.</p>
-<a name="N106DC"></a><a name="Sort"></a>
+<a name="N106DF"></a><a name="Sort"></a>
 <h5>Sort</h5>
 <p>The framework groups <span class="codefrag">Reducer</span> inputs by keys (since 
             different mappers may have output the same key) in this stage.</p>
 <p>The shuffle and sort phases occur simultaneously; while 
             map-outputs are being fetched they are merged.</p>
-<a name="N106EB"></a><a name="Secondary+Sort"></a>
+<a name="N106EE"></a><a name="Secondary+Sort"></a>
 <h5>Secondary Sort</h5>
 <p>If equivalence rules for grouping the intermediate keys are 
               required to be different from those for grouping keys before 
@@ -1381,7 +1385,7 @@
               JobConf.setOutputKeyComparatorClass(Class)</a> can be used to 
               control how intermediate keys are grouped, these can be used in 
               conjunction to simulate <em>secondary sort on values</em>.</p>
-<a name="N10704"></a><a name="Reduce"></a>
+<a name="N10707"></a><a name="Reduce"></a>
 <h5>Reduce</h5>
 <p>In this phase the 
             <a href="api/org/apache/hadoop/mapred/Reducer.html#reduce(K2, java.util.Iterator, org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter)">
@@ -1397,7 +1401,7 @@
             progress, set application-level status messages and update 
             <span class="codefrag">Counters</span>, or just indicate that they are alive.</p>
 <p>The output of the <span class="codefrag">Reducer</span> is <em>not sorted</em>.</p>
-<a name="N10732"></a><a name="How+Many+Reduces%3F"></a>
+<a name="N10735"></a><a name="How+Many+Reduces%3F"></a>
 <h5>How Many Reduces?</h5>
 <p>The right number of reduces seems to be <span class="codefrag">0.95</span> or 
             <span class="codefrag">1.75</span> multiplied by (&lt;<em>no. of nodes</em>&gt; * 
@@ -1412,7 +1416,7 @@
 <p>The scaling factors above are slightly less than whole numbers to 
             reserve a few reduce slots in the framework for speculative-tasks and
             failed tasks.</p>
-<a name="N10757"></a><a name="Reducer+NONE"></a>
+<a name="N1075A"></a><a name="Reducer+NONE"></a>
 <h5>Reducer NONE</h5>
 <p>It is legal to set the number of reduce-tasks to <em>zero</em> if 
             no reduction is desired.</p>
@@ -1422,7 +1426,7 @@
             setOutputPath(Path)</a>. The framework does not sort the 
             map-outputs before writing them out to the <span class="codefrag">FileSystem</span>.
             </p>
-<a name="N10772"></a><a name="Partitioner"></a>
+<a name="N10775"></a><a name="Partitioner"></a>
 <h4>Partitioner</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Partitioner.html">
@@ -1436,7 +1440,7 @@
 <p>
 <a href="api/org/apache/hadoop/mapred/lib/HashPartitioner.html">
           HashPartitioner</a> is the default <span class="codefrag">Partitioner</span>.</p>
-<a name="N10791"></a><a name="Reporter"></a>
+<a name="N10794"></a><a name="Reporter"></a>
 <h4>Reporter</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/Reporter.html">
@@ -1455,7 +1459,7 @@
           </p>
 <p>Applications can also update <span class="codefrag">Counters</span> using the 
           <span class="codefrag">Reporter</span>.</p>
-<a name="N107BB"></a><a name="OutputCollector"></a>
+<a name="N107BE"></a><a name="OutputCollector"></a>
 <h4>OutputCollector</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/OutputCollector.html">
@@ -1466,7 +1470,7 @@
 <p>Hadoop Map/Reduce comes bundled with a 
         <a href="api/org/apache/hadoop/mapred/lib/package-summary.html">
         library</a> of generally useful mappers, reducers, and partitioners.</p>
-<a name="N107D6"></a><a name="Job+Configuration"></a>
+<a name="N107D9"></a><a name="Job+Configuration"></a>
 <h3 class="h4">Job Configuration</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/JobConf.html">
@@ -1498,8 +1502,9 @@
 <p>
 <span class="codefrag">JobConf</span> is typically used to specify the 
         <span class="codefrag">Mapper</span>, combiner (if any), <span class="codefrag">Partitioner</span>, 
-        <span class="codefrag">Reducer</span>, <span class="codefrag">InputFormat</span> and 
-        <span class="codefrag">OutputFormat</span> implementations. <span class="codefrag">JobConf</span> also 
+        <span class="codefrag">Reducer</span>, <span class="codefrag">InputFormat</span>, 
+        <span class="codefrag">OutputFormat</span> and <span class="codefrag">OutputCommitter</span> 
+        implementations. <span class="codefrag">JobConf</span> also 
         indicates the set of input files 
         (<a href="api/org/apache/hadoop/mapred/FileInputFormat.html#setInputPaths(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path[])">setInputPaths(JobConf, Path...)</a>
         /<a href="api/org/apache/hadoop/mapred/FileInputFormat.html#addInputPath(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path)">addInputPath(JobConf, Path)</a>)
@@ -1524,7 +1529,7 @@
         <a href="api/org/apache/hadoop/conf/Configuration.html#set(java.lang.String, java.lang.String)">set(String, String)</a>/<a href="api/org/apache/hadoop/conf/Configuration.html#get(java.lang.String, java.lang.String)">get(String, String)</a>
         to set/get arbitrary parameters needed by applications. However, use the 
         <span class="codefrag">DistributedCache</span> for large amounts of (read-only) data.</p>
-<a name="N10868"></a><a name="Task+Execution+%26+Environment"></a>
+<a name="N1086E"></a><a name="Task+Execution+%26+Environment"></a>
 <h3 class="h4">Task Execution &amp; Environment</h3>
 <p>The <span class="codefrag">TaskTracker</span> executes the <span class="codefrag">Mapper</span>/ 
         <span class="codefrag">Reducer</span>  <em>task</em> as a child process in a separate jvm.
@@ -1778,7 +1783,7 @@
         <a href="native_libraries.html#Loading+native+libraries+through+DistributedCache">
         native_libraries.html</a>
 </p>
-<a name="N10A1D"></a><a name="Job+Submission+and+Monitoring"></a>
+<a name="N10A23"></a><a name="Job+Submission+and+Monitoring"></a>
 <h3 class="h4">Job Submission and Monitoring</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/JobClient.html">
@@ -1839,7 +1844,7 @@
 <p>Normally the user creates the application, describes various facets 
         of the job via <span class="codefrag">JobConf</span>, and then uses the 
         <span class="codefrag">JobClient</span> to submit the job and monitor its progress.</p>
-<a name="N10A7D"></a><a name="Job+Control"></a>
+<a name="N10A83"></a><a name="Job+Control"></a>
 <h4>Job Control</h4>
 <p>Users may need to chain Map/Reduce jobs to accomplish complex
           tasks which cannot be done via a single Map/Reduce job. This is fairly
@@ -1875,7 +1880,7 @@
             </li>
           
 </ul>
-<a name="N10AA7"></a><a name="Job+Input"></a>
+<a name="N10AAD"></a><a name="Job+Input"></a>
 <h3 class="h4">Job Input</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/InputFormat.html">
@@ -1923,7 +1928,7 @@
         appropriate <span class="codefrag">CompressionCodec</span>. However, it must be noted that
         compressed files with the above extensions cannot be <em>split</em> and 
         each compressed file is processed in its entirety by a single mapper.</p>
-<a name="N10B11"></a><a name="InputSplit"></a>
+<a name="N10B17"></a><a name="InputSplit"></a>
 <h4>InputSplit</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/InputSplit.html">
@@ -1937,7 +1942,7 @@
           FileSplit</a> is the default <span class="codefrag">InputSplit</span>. It sets 
           <span class="codefrag">map.input.file</span> to the path of the input file for the
           logical split.</p>
-<a name="N10B36"></a><a name="RecordReader"></a>
+<a name="N10B3C"></a><a name="RecordReader"></a>
 <h4>RecordReader</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/RecordReader.html">
@@ -1949,7 +1954,7 @@
           for processing. <span class="codefrag">RecordReader</span> thus assumes the 
           responsibility of processing record boundaries and presents the tasks 
           with keys and values.</p>
-<a name="N10B59"></a><a name="Job+Output"></a>
+<a name="N10B5F"></a><a name="Job+Output"></a>
 <h3 class="h4">Job Output</h3>
 <p>
 <a href="api/org/apache/hadoop/mapred/OutputFormat.html">
@@ -1974,7 +1979,51 @@
 <p>
 <span class="codefrag">TextOutputFormat</span> is the default 
         <span class="codefrag">OutputFormat</span>.</p>
-<a name="N10B82"></a><a name="Task+Side-Effect+Files"></a>
+<a name="N10B88"></a><a name="OutputCommitter"></a>
+<h4>OutputCommitter</h4>
+<p>
+<a href="api/org/apache/hadoop/mapred/OutputCommitter.html">
+        OutputCommitter</a> describes the commit of task output for a 
+        Map/Reduce job.</p>
+<p>The Map/Reduce framework relies on the <span class="codefrag">OutputCommitter</span>
+        of the job to:</p>
+<ol>
+          
+<li>
+            Setup the job during initialization. For example, create
+            the temporary output directory for the job during the
+            initialization of the job. The job client does the setup
+            for the job.
+          </li>
+          
+<li>
+            Cleanup the job after the job completion. For example, remove the
+            temporary output directory after the job completion. A separate 
+            task does the cleanupJob at the end of the job.
+          </li>
+          
+<li>
+            Setup the task temporary output.
+          </li> 
+          
+<li>
+            Check whether a task needs a commit. This is to avoid the commit
+            procedure if a task does not need commit.
+          </li>
+          
+<li>
+            Commit of the task output. 
+          </li> 
+          
+<li>
+            Discard the task commit.
+          </li>
+        
+</ol>
+<p>
+<span class="codefrag">FileOutputCommitter</span> is the default 
+        <span class="codefrag">OutputCommitter</span>.</p>
+<a name="N10BB8"></a><a name="Task+Side-Effect+Files"></a>
 <h4>Task Side-Effect Files</h4>
 <p>In some applications, component tasks need to create and/or write to
           side-files, which differ from the actual job-output files.</p>
@@ -1985,7 +2034,9 @@
           application-writer will have to pick unique names per task-attempt 
           (using the attemptid, say <span class="codefrag">attempt_200709221812_0001_m_000000_0</span>), 
           not just per task.</p>
-<p>To avoid these issues the Map/Reduce framework maintains a special 
+<p>To avoid these issues the Map/Reduce framework, when the 
+          <span class="codefrag">OutputCommitter</span> is <span class="codefrag">FileOutputCommitter</span>, 
+          maintains a special 
           <span class="codefrag">${mapred.output.dir}/_temporary/_${taskid}</span> sub-directory
           accessible via <span class="codefrag">${mapred.work.output.dir}</span>
           for each task-attempt on the <span class="codefrag">FileSystem</span> where the output
@@ -2013,7 +2064,7 @@
 <p>The entire discussion holds true for maps of jobs with 
            reducer=NONE (i.e. 0 reduces) since output of the map, in that case, 
            goes directly to HDFS.</p>
-<a name="N10BCA"></a><a name="RecordWriter"></a>
+<a name="N10C06"></a><a name="RecordWriter"></a>
 <h4>RecordWriter</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/RecordWriter.html">
@@ -2021,9 +2072,9 @@
           pairs to an output file.</p>
 <p>RecordWriter implementations write the job outputs to the 
           <span class="codefrag">FileSystem</span>.</p>
-<a name="N10BE1"></a><a name="Other+Useful+Features"></a>
+<a name="N10C1D"></a><a name="Other+Useful+Features"></a>
 <h3 class="h4">Other Useful Features</h3>
-<a name="N10BE7"></a><a name="Counters"></a>
+<a name="N10C23"></a><a name="Counters"></a>
 <h4>Counters</h4>
 <p>
 <span class="codefrag">Counters</span> represent global counters, defined either by 
@@ -2040,7 +2091,7 @@
           in the <span class="codefrag">map</span> and/or 
           <span class="codefrag">reduce</span> methods. These counters are then globally 
           aggregated by the framework.</p>
-<a name="N10C16"></a><a name="DistributedCache"></a>
+<a name="N10C52"></a><a name="DistributedCache"></a>
 <h4>DistributedCache</h4>
 <p>
 <a href="api/org/apache/hadoop/filecache/DistributedCache.html">
@@ -2111,7 +2162,7 @@
           <span class="codefrag">mapred.job.classpath.{files|archives}</span>. Similarly the
           cached files that are symlinked into the working directory of the
           task can be used to distribute native libraries and load them.</p>
-<a name="N10C99"></a><a name="Tool"></a>
+<a name="N10CD5"></a><a name="Tool"></a>
 <h4>Tool</h4>
 <p>The <a href="api/org/apache/hadoop/util/Tool.html">Tool</a> 
           interface supports the handling of generic Hadoop command-line options.
@@ -2151,7 +2202,7 @@
             </span>
           
 </p>
-<a name="N10CCB"></a><a name="IsolationRunner"></a>
+<a name="N10D07"></a><a name="IsolationRunner"></a>
 <h4>IsolationRunner</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/IsolationRunner.html">
@@ -2175,7 +2226,7 @@
 <p>
 <span class="codefrag">IsolationRunner</span> will run the failed task in a single 
           jvm, which can be in the debugger, over precisely the same input.</p>
-<a name="N10CFE"></a><a name="Profiling"></a>
+<a name="N10D3A"></a><a name="Profiling"></a>
 <h4>Profiling</h4>
 <p>Profiling is a utility to get a representative (2 or 3) sample
           of built-in java profiler for a sample of maps and reduces. </p>
@@ -2208,7 +2259,7 @@
           <span class="codefrag">-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s</span>
           
 </p>
-<a name="N10D32"></a><a name="Debugging"></a>
+<a name="N10D6E"></a><a name="Debugging"></a>
 <h4>Debugging</h4>
 <p>Map/Reduce framework provides a facility to run user-provided 
           scripts for debugging. When map/reduce task fails, user can run 
@@ -2219,14 +2270,14 @@
 <p> In the following sections we discuss how to submit debug script
           along with the job. For submitting debug script, first it has to
           distributed. Then the script has to supplied in Configuration. </p>
-<a name="N10D3E"></a><a name="How+to+distribute+script+file%3A"></a>
+<a name="N10D7A"></a><a name="How+to+distribute+script+file%3A"></a>
 <h5> How to distribute script file: </h5>
 <p>
           The user has to use 
           <a href="mapred_tutorial.html#DistributedCache">DistributedCache</a>
           mechanism to <em>distribute</em> and <em>symlink</em> the
           debug script file.</p>
-<a name="N10D52"></a><a name="How+to+submit+script%3A"></a>
+<a name="N10D8E"></a><a name="How+to+submit+script%3A"></a>
 <h5> How to submit script: </h5>
 <p> A quick way to submit debug script is to set values for the 
           properties "mapred.map.task.debug.script" and 
@@ -2250,17 +2301,17 @@
 <span class="codefrag">$script $stdout $stderr $syslog $jobconf $program </span>  
           
 </p>
-<a name="N10D74"></a><a name="Default+Behavior%3A"></a>
+<a name="N10DB0"></a><a name="Default+Behavior%3A"></a>
 <h5> Default Behavior: </h5>
 <p> For pipes, a default script is run to process core dumps under
           gdb, prints stack trace and gives info about running threads. </p>
-<a name="N10D7F"></a><a name="JobControl"></a>
+<a name="N10DBB"></a><a name="JobControl"></a>
 <h4>JobControl</h4>
 <p>
 <a href="api/org/apache/hadoop/mapred/jobcontrol/package-summary.html">
           JobControl</a> is a utility which encapsulates a set of Map/Reduce jobs
           and their dependencies.</p>
-<a name="N10D8C"></a><a name="Data+Compression"></a>
+<a name="N10DC8"></a><a name="Data+Compression"></a>
 <h4>Data Compression</h4>
 <p>Hadoop Map/Reduce provides facilities for the application-writer to
           specify compression for both intermediate map-outputs and the
@@ -2274,7 +2325,7 @@
           codecs for reasons of both performance (zlib) and non-availability of
           Java libraries (lzo). More details on their usage and availability are
           available <a href="native_libraries.html">here</a>.</p>
-<a name="N10DAC"></a><a name="Intermediate+Outputs"></a>
+<a name="N10DE8"></a><a name="Intermediate+Outputs"></a>
 <h5>Intermediate Outputs</h5>
 <p>Applications can control compression of intermediate map-outputs
             via the 
@@ -2283,7 +2334,7 @@
             <span class="codefrag">CompressionCodec</span> to be used via the
             <a href="api/org/apache/hadoop/mapred/JobConf.html#setMapOutputCompressorClass(java.lang.Class)">
             JobConf.setMapOutputCompressorClass(Class)</a> api.</p>
-<a name="N10DC1"></a><a name="Job+Outputs"></a>
+<a name="N10DFD"></a><a name="Job+Outputs"></a>
 <h5>Job Outputs</h5>
 <p>Applications can control compression of job-outputs via the
             <a href="api/org/apache/hadoop/mapred/FileOutputFormat.html#setCompressOutput(org.apache.hadoop.mapred.JobConf,%20boolean)">
@@ -2303,7 +2354,7 @@
 </div>
 
     
-<a name="N10DF0"></a><a name="Example%3A+WordCount+v2.0"></a>
+<a name="N10E2C"></a><a name="Example%3A+WordCount+v2.0"></a>
 <h2 class="h3">Example: WordCount v2.0</h2>
 <div class="section">
 <p>Here is a more complete <span class="codefrag">WordCount</span> which uses many of the
@@ -2313,7 +2364,7 @@
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a> 
       Hadoop installation.</p>
-<a name="N10E0A"></a><a name="Source+Code-N10E0A"></a>
+<a name="N10E46"></a><a name="Source+Code-N10E46"></a>
 <h3 class="h4">Source Code</h3>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
           
@@ -3523,7 +3574,7 @@
 </tr>
         
 </table>
-<a name="N1156C"></a><a name="Sample+Runs"></a>
+<a name="N115A8"></a><a name="Sample+Runs"></a>
 <h3 class="h4">Sample Runs</h3>
 <p>Sample text-files as input:</p>
 <p>
@@ -3691,7 +3742,7 @@
 <br>
         
 </p>
-<a name="N11640"></a><a name="Highlights"></a>
+<a name="N1167C"></a><a name="Highlights"></a>
 <h3 class="h4">Highlights</h3>
 <p>The second version of <span class="codefrag">WordCount</span> improves upon the 
         previous one by using some features offered by the Map/Reduce framework:



Mime
View raw message