hadoop-common-commits mailing list archives

From d...@apache.org
Subject svn commit: r702166 [1/3] - in /hadoop/core/branches/branch-0.19: ./ docs/ src/docs/src/documentation/content/xdocs/
Date Mon, 06 Oct 2008 14:38:38 GMT
Author: ddas
Date: Mon Oct  6 07:38:38 2008
New Revision: 702166

URL: http://svn.apache.org/viewvc?rev=702166&view=rev
Log:
Merge -r 702163:702164 from trunk onto 0.19 branch. Fixes HADOOP-4301.

Modified:
    hadoop/core/branches/branch-0.19/CHANGES.txt
    hadoop/core/branches/branch-0.19/docs/changes.html
    hadoop/core/branches/branch-0.19/docs/hadoop-default.html
    hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html
    hadoop/core/branches/branch-0.19/docs/mapred_tutorial.pdf
    hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml
    hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/site.xml

Modified: hadoop/core/branches/branch-0.19/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/CHANGES.txt?rev=702166&r1=702165&r2=702166&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/CHANGES.txt (original)
+++ hadoop/core/branches/branch-0.19/CHANGES.txt Mon Oct  6 07:38:38 2008
@@ -431,6 +431,9 @@
     incrementing the task attempt numbers by 1000 when the job restarts.
     (Amar Kamat via omalley)
 
+    HADOOP-4301. Adds forrest doc for the skip bad records feature.
+    (Sharad Agarwal via ddas)
+
   OPTIMIZATIONS
 
     HADOOP-3556. Removed lock contention in MD5Hash by changing the 

Modified: hadoop/core/branches/branch-0.19/docs/changes.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/docs/changes.html?rev=702166&r1=702165&r2=702166&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/docs/changes.html (original)
+++ hadoop/core/branches/branch-0.19/docs/changes.html Mon Oct  6 07:38:38 2008
@@ -36,7 +36,7 @@
     function collapse() {
       for (var i = 0; i < document.getElementsByTagName("ul").length; i++) {
         var list = document.getElementsByTagName("ul")[i];
-        if (list.id != 'release_0.19.0_-_unreleased_' && list.id != 'release_0.18.1_-_2008-09-17_')
{
+        if (list.id != 'release_0.19.0_-_unreleased_' && list.id != 'release_0.18.2_-_unreleased_')
{
           list.style.display = "none";
         }
       }
@@ -56,7 +56,7 @@
 </a></h2>
 <ul id="release_0.19.0_-_unreleased_">
   <li><a href="javascript:toggleList('release_0.19.0_-_unreleased_._incompatible_changes_')">
 INCOMPATIBLE CHANGES
-</a>&nbsp;&nbsp;&nbsp;(18)
+</a>&nbsp;&nbsp;&nbsp;(20)
     <ol id="release_0.19.0_-_unreleased_._incompatible_changes_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3595">HADOOP-3595</a>.
Remove deprecated methods for mapred.combine.once
 functionality, which was necessary to providing backwards
@@ -110,10 +110,15 @@
 DFS Used%: DFS used space/Present Capacity<br />(Suresh Srinivas via hairong)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3938">HADOOP-3938</a>.
Disk space quotas for HDFS. This is similar to namespace
 quotas in 0.18.<br />(rangadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4293">HADOOP-4293</a>.
Make Configuration Writable and remove unreleased
+WritableJobConf. Configuration.write is renamed to writeXml.<br />(omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4281">HADOOP-4281</a>.
Change dfsadmin to report available disk space in a format
+consistent with the web interface as defined in <a href="http://issues.apache.org/jira/browse/HADOOP-2816">HADOOP-2816</a>.<br
/>(Suresh
+Srinivas via cdouglas)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('release_0.19.0_-_unreleased_._new_features_')">
 NEW FEATURES
-</a>&nbsp;&nbsp;&nbsp;(39)
+</a>&nbsp;&nbsp;&nbsp;(40)
     <ol id="release_0.19.0_-_unreleased_._new_features_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3341">HADOOP-3341</a>.
Allow streaming jobs to specify the field separator for map
 and reduce input and output. The new configuration values are:
@@ -195,13 +200,16 @@
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3019">HADOOP-3019</a>.
A new library to support total order partitions.<br />(cdouglas via omalley)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3924">HADOOP-3924</a>.
Added a 'KILLED' job status.<br />(Subramaniam Krishnan via
 acmurthy)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-2421">HADOOP-2421</a>.
 Add jdiff output to documentation, listing all API
+changes from the prior release.<br />(cutting)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('release_0.19.0_-_unreleased_._improvements_')">
 IMPROVEMENTS
-</a>&nbsp;&nbsp;&nbsp;(68)
+</a>&nbsp;&nbsp;&nbsp;(71)
     <ol id="release_0.19.0_-_unreleased_._improvements_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-4205">HADOOP-4205</a>.
hive: metastore and ql to use the refactored SerDe library.<br />(zshao)</li>
-      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4106">HADOOP-4106</a>.
libhdfs: add time, permission and user attribute support (part 2).<br />(Pete Wyckoff
through zshao)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4106">HADOOP-4106</a>.
libhdfs: add time, permission and user attribute support
+(part 2).<br />(Pete Wyckoff through zshao)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-4104">HADOOP-4104</a>.
libhdfs: add time, permission and user attribute support.<br />(Pete Wyckoff through
zshao)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3908">HADOOP-3908</a>.
libhdfs: better error message if llibhdfs.so doesn't exist.<br />(Pete Wyckoff through
zshao)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3732">HADOOP-3732</a>.
Delay intialization of datanode block verification till
@@ -230,8 +238,6 @@
 it pluggable.<br />(Tom White and Brice Arnould via omalley)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3756">HADOOP-3756</a>.
Minor. Remove unused dfs.client.buffer.dir from
 hadoop-default.xml.<br />(rangadi)</li>
-      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3327">HADOOP-3327</a>.
Treats connection and read timeouts differently in the
-shuffle and the backoff logic is dependent on the type of timeout.<br />(Jothi Padmanabhan
via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3747">HADOOP-3747</a>.
Adds counter suport for MultipleOutputs.<br />(Alejandro Abdelnur via ddas)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3169">HADOOP-3169</a>.
LeaseChecker daemon should not be started in DFSClient
 constructor. (TszWo (Nicholas), SZE via hairong)
@@ -321,6 +327,13 @@
 connection is closed and also remove an undesirable exception when
 a client is stoped while there is no pending RPC request.<br />(hairong)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-4227">HADOOP-4227</a>.
Remove the deprecated class org.apache.hadoop.fs.ShellCommand.<br />(szetszwo)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4006">HADOOP-4006</a>.
Clean up FSConstants and move some of the constants to
+better places.<br />(Sanjay Radia via rangadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4279">HADOOP-4279</a>.
Trace the seeds of random sequences in append unit tests to
+make intermittent failures reproducible.<br />(szetszwo via cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4209">HADOOP-4209</a>.
Remove the change to the format of task attempt id by
+incrementing the task attempt numbers by 1000 when the job restarts.<br />(Amar Kamat
via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4301">HADOOP-4301</a>.
Adds forrest doc for the skip bad records feature.<br />(Sharad Agarwal via ddas)</li>
     </ol>
   </li>
   <li><a href="javascript:toggleList('release_0.19.0_-_unreleased_._optimizations_')">
 OPTIMIZATIONS
@@ -347,7 +360,7 @@
     </ol>
   </li>
   <li><a href="javascript:toggleList('release_0.19.0_-_unreleased_._bug_fixes_')">
 BUG FIXES
-</a>&nbsp;&nbsp;&nbsp;(88)
+</a>&nbsp;&nbsp;&nbsp;(108)
     <ol id="release_0.19.0_-_unreleased_._bug_fixes_">
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-3563">HADOOP-3563</a>.
 Refactor the distributed upgrade code so that it is
 easier to identify datanode and namenode related code.<br />(dhruba)</li>
@@ -511,11 +524,71 @@
 query.<br />(Raghotham Murthy via dhruba)</li>
       <li><a href="http://issues.apache.org/jira/browse/HADOOP-4090">HADOOP-4090</a>.
The hive scripts pick up hadoop from HADOOP_HOME
 and then the path.<br />(Raghotham Murthy via dhruba)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4242">HADOOP-4242</a>.
Remove extra ";" in FSDirectory that blocks compilation
+in some IDE's.<br />(szetszwo via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4249">HADOOP-4249</a>.
Fix eclipse path to include the hsqldb.jar.<br />(szetszwo via
+omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4247">HADOOP-4247</a>.
Move InputSampler into org.apache.hadoop.mapred.lib, so that
+examples.jar doesn't depend on tools.jar.<br />(omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4269">HADOOP-4269</a>.
Fix the deprecation of LineReader by extending the new class
+into the old name and deprecating it. Also update the tests to test the
+new class.<br />(cdouglas via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4280">HADOOP-4280</a>.
Fix conversions between seconds in C and milliseconds in
+Java for access times for files.<br />(Pete Wyckoff via rangadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4254">HADOOP-4254</a>.
-setSpaceQuota command does not convert "TB" extension to
+terabytes properly. Implementation now uses StringUtils for parsing this.<br />(Raghu
Angadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4259">HADOOP-4259</a>.
Findbugs should run over tools.jar also.<br />(cdouglas via
+omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4275">HADOOP-4275</a>.
Move public method isJobValidName from JobID to a private
+method in JobTracker.<br />(omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4173">HADOOP-4173</a>.
fix failures in TestProcfsBasedProcessTree and
+TestTaskTrackerMemoryManager tests. ProcfsBasedProcessTree and
+memory management in TaskTracker are disabled on Windows.<br />(Vinod K V via rangadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4189">HADOOP-4189</a>.
Fixes the history blocksize &amp; intertracker protocol version
+issues introduced as part of <a href="http://issues.apache.org/jira/browse/HADOOP-3245">HADOOP-3245</a>.<br
/>(Amar Kamat via ddas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4190">HADOOP-4190</a>.
Fixes the backward compatibility issue with Job History
+introduced by <a href="http://issues.apache.org/jira/browse/HADOOP-3245">HADOOP-3245</a>
and <a href="http://issues.apache.org/jira/browse/HADOOP-2403">HADOOP-2403</a>.<br
/>(Amar Kamat via ddas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4237">HADOOP-4237</a>.
Fixes the TestStreamingBadRecords.testNarrowDown testcase.<br />(Sharad Agarwal via
ddas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4274">HADOOP-4274</a>.
Capacity scheduler accidentally modifies the underlying
+data structures when browsing the job lists.<br />(Hemanth Yamijala via omalley)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4309">HADOOP-4309</a>.
Fix eclipse-plugin compilation.<br />(cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4232">HADOOP-4232</a>.
Fix race condition in JVM reuse when multiple slots become
+free.<br />(ddas via acmurthy)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4302">HADOOP-4302</a>.
Fix a race condition in TestReduceFetch that can yield false
+negatives.<br />(cdouglas)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3942">HADOOP-3942</a>.
Update distcp documentation to include features introduced in
+<a href="http://issues.apache.org/jira/browse/HADOOP-3873">HADOOP-3873</a>, <a
href="http://issues.apache.org/jira/browse/HADOOP-3939">HADOOP-3939</a>. (Tsz Wo
(Nicholas), SZE via cdouglas)
+</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4257">HADOOP-4257</a>.
The DFS client should pick only one datanode as the candidate
+to initiate lease recovery.  (Tsz Wo (Nicholas), SZE via dhruba)
+</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4319">HADOOP-4319</a>.
fuse-dfs dfs_read function returns as many bytes as it is
+told to read unless end-of-file is reached.<br />(Pete Wyckoff via dhruba)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4246">HADOOP-4246</a>.
Ensure we have the correct lower bound on the number of
+retries for fetching map-outputs; also fixed the case where the reducer
+incorrectly killed itself when too many unique map-outputs could not be fetched
+for small jobs.<br />(Amareshwari Sri Ramadasu via acmurthy)</li>
     </ol>
   </li>
 </ul>
-<h2><a href="javascript:toggleList('release_0.18.1_-_2008-09-17_')">Release 0.18.1
- 2008-09-17
+<h2><a href="javascript:toggleList('release_0.18.2_-_unreleased_')">Release 0.18.2
- Unreleased
 </a></h2>
+<ul id="release_0.18.2_-_unreleased_">
+  <li><a href="javascript:toggleList('release_0.18.2_-_unreleased_._bug_fixes_')">
 BUG FIXES
+</a>&nbsp;&nbsp;&nbsp;(3)
+    <ol id="release_0.18.2_-_unreleased_._bug_fixes_">
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4116">HADOOP-4116</a>.
Balancer should provide better resource management.<br />(hairong)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-3614">HADOOP-3614</a>.
Fix a bug that Datanode may use an old GenerationStamp to get
+meta file.<br />(szetszwo)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4314">HADOOP-4314</a>.
Simulated datanodes should not include blocks that are still
+being written in their block report.<br />(Raghu Angadi)</li>
+    </ol>
+  </li>
+</ul>
+<h2><a href="javascript:toggleList('older')">Older Releases</a></h2>
+<ul id="older">
+<h3><a href="javascript:toggleList('release_0.18.1_-_2008-09-17_')">Release 0.18.1
- 2008-09-17
+</a></h3>
 <ul id="release_0.18.1_-_2008-09-17_">
   <li><a href="javascript:toggleList('release_0.18.1_-_2008-09-17_._improvements_')">
 IMPROVEMENTS
 </a>&nbsp;&nbsp;&nbsp;(1)
@@ -540,8 +613,6 @@
     </ol>
   </li>
 </ul>
-<h2><a href="javascript:toggleList('older')">Older Releases</a></h2>
-<ul id="older">
 <h3><a href="javascript:toggleList('release_0.18.0_-_2008-08-19_')">Release 0.18.0
- 2008-08-19
 </a></h3>
 <ul id="release_0.18.0_-_2008-08-19_">
@@ -1085,6 +1156,21 @@
     </ol>
   </li>
 </ul>
+<h3><a href="javascript:toggleList('release_0.17.3_-_unreleased_')">Release 0.17.3
- Unreleased
+</a></h3>
+<ul id="release_0.17.3_-_unreleased_">
+  <li><a href="javascript:toggleList('release_0.17.3_-_unreleased_._bug_fixes_')">
 BUG FIXES
+</a>&nbsp;&nbsp;&nbsp;(4)
+    <ol id="release_0.17.3_-_unreleased_._bug_fixes_">
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4277">HADOOP-4277</a>.
Checksum verification was mistakenly disabled for
+LocalFileSystem.<br />(Raghu Angadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4271">HADOOP-4271</a>.
Checksum input stream can sometimes return invalid
+data to the user.<br />(Ning Li via rangadi)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4318">HADOOP-4318</a>.
DistCp should use absolute paths for cleanup.<br />(szetszwo)</li>
+      <li><a href="http://issues.apache.org/jira/browse/HADOOP-4326">HADOOP-4326</a>.
ChecksumFileSystem does not override create(...) correctly.<br />(szetszwo)</li>
+    </ol>
+  </li>
+</ul>
 <h3><a href="javascript:toggleList('release_0.17.2_-_2008-08-11_')">Release 0.17.2
- 2008-08-11
 </a></h3>
 <ul id="release_0.17.2_-_2008-08-11_">

Modified: hadoop/core/branches/branch-0.19/docs/hadoop-default.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/docs/hadoop-default.html?rev=702166&r1=702165&r2=702166&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/docs/hadoop-default.html (original)
+++ hadoop/core/branches/branch-0.19/docs/hadoop-default.html Mon Oct  6 07:38:38 2008
@@ -442,12 +442,15 @@
 </tr>
 <tr>
 <td><a name="mapred.tasktracker.taskmemorymanager.monitoring-interval">mapred.tasktracker.taskmemorymanager.monitoring-interval</a></td><td>5000</td><td>The
interval, in milliseconds, for which the tasktracker waits
-   between two cycles of monitoring its tasks' memory usage.</td>
+   between two cycles of monitoring its tasks' memory usage. Used only if
+   tasks' memory management is enabled via mapred.tasktracker.tasks.maxmemory.
+   </td>
 </tr>
 <tr>
 <td><a name="mapred.tasktracker.procfsbasedprocesstree.sleeptime-before-sigkill">mapred.tasktracker.procfsbasedprocesstree.sleeptime-before-sigkill</a></td><td>5000</td><td>The
time, in milliseconds, the tasktracker waits for sending a
   SIGKILL to a process that has overrun memory limits, after it has been sent
-  a SIGTERM.</td>
+  a SIGTERM. Used only if tasks' memory management is enabled via
+  mapred.tasktracker.tasks.maxmemory.</td>
 </tr>
 <tr>
 <td><a name="mapred.map.tasks">mapred.map.tasks</a></td><td>2</td><td>The
default number of map tasks per job.  Typically set
@@ -467,15 +470,10 @@
   </td>
 </tr>
 <tr>
-<td><a name="mapred.jobtracker.job.history.block.size">mapred.jobtracker.job.history.block.size</a></td><td>0</td><td>The
block size of the job history file. Since the job recovery
+<td><a name="mapred.jobtracker.job.history.block.size">mapred.jobtracker.job.history.block.size</a></td><td>3145728</td><td>The
block size of the job history file. Since the job recovery
                uses job history, its important to dump job history to disk as 
-               soon as possible.
-  </td>
-</tr>
-<tr>
-<td><a name="mapred.jobtracker.job.history.buffer.size">mapred.jobtracker.job.history.buffer.size</a></td><td>4096</td><td>The
buffer size for the job history file. Since the job 
-               recovery uses job history, its important to frequently flush the 
-               job history to disk. This will minimize the loss in recovery.
+               soon as possible. Note that this is an expert-level parameter.
+               The default value is set to 3 MB.
   </td>
 </tr>
 <tr>
@@ -914,7 +912,9 @@
   	tasks. Any task scheduled on this tasktracker is guaranteed and constrained
   	 to use a share of this amount. Any task exceeding its share will be 
   	killed. If set to -1, this functionality is disabled, and 
-  	mapred.task.maxmemory is ignored.
+  	mapred.task.maxmemory is ignored. Further, it will be enabled only on the
+  	systems where org.apache.hadoop.util.ProcfsBasedProcessTree is available,
+  	i.e., at present only on Linux.
   </td>
 </tr>
 <tr>

Modified: hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html?rev=702166&r1=702165&r2=702166&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html (original)
+++ hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html Mon Oct  6 07:38:38 2008
@@ -319,6 +319,9 @@
 <li>
 <a href="#Data+Compression">Data Compression</a>
 </li>
+<li>
+<a href="#Skipping+Bad+Records">Skipping Bad Records</a>
+</li>
 </ul>
 </li>
 </ul>
@@ -327,7 +330,7 @@
 <a href="#Example%3A+WordCount+v2.0">Example: WordCount v2.0</a>
 <ul class="minitoc">
 <li>
-<a href="#Source+Code-N10F30">Source Code</a>
+<a href="#Source+Code-N10F78">Source Code</a>
 </li>
 <li>
 <a href="#Sample+Runs">Sample Runs</a>
@@ -2542,10 +2545,81 @@
             <a href="api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html#setOutputCompressionType(org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.io.SequenceFile.CompressionType)">
             SequenceFileOutputFormat.setOutputCompressionType(JobConf, 
             SequenceFile.CompressionType)</a> api.</p>
+<a name="N10F14"></a><a name="Skipping+Bad+Records"></a>
+<h4>Skipping Bad Records</h4>
+<p>Hadoop provides an optional mode of execution in which bad 
+          records are detected and skipped in further attempts. 
+          Applications can control various settings via 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html">
+          SkipBadRecords</a>.</p>
+<p>This feature can be used when map/reduce tasks crash 
+          deterministically on certain input, typically because of bugs in the 
+          map/reduce function. The usual course is to fix these bugs, but 
+          sometimes that is not possible; the bug may be in third-party 
+          libraries whose source code is not available. In such cases the 
+          task never runs to completion even after multiple attempts, and 
+          the complete data for that task is lost.</p>
+<p>With this feature enabled, only a small portion of data surrounding 
+          the bad record is lost. This may be acceptable for some 
+          applications, for example those doing statistical analysis on 
+          very large data sets. This feature is disabled by default. To turn 
+          it on, refer to <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setMapperMaxSkipRecords(org.apache.hadoop.conf.Configuration,
long)">
+          SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setReducerMaxSkipGroups(org.apache.hadoop.conf.Configuration,
long)">
+          SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>.
+          </p>
+<p>Skipping mode is triggered after a certain number of task failures; 
+          see <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setAttemptsToStartSkipping(org.apache.hadoop.conf.Configuration,
int)">
+          SkipBadRecords.setAttemptsToStartSkipping(Configuration, int)</a>.
+          </p>
+<p>In skipping mode, the map/reduce task maintains the record 
+          range currently being processed. To maintain this 
+          range, the framework relies on the processed record 
+          counters; see <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#COUNTER_MAP_PROCESSED_RECORDS">
+          SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS</a> and 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#COUNTER_REDUCE_PROCESSED_GROUPS">
+          SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS</a>. 
+          From these counters, the framework knows how 
+          many records have been processed successfully by the mapper/reducer.
+          Before giving the 
+          input to the map/reduce function, the task reports this record range 
+          to the TaskTracker. If the task crashes, the TaskTracker knows the 
+          last reported range, and on further attempts that range is skipped.
+          </p>
+<p>The number of records skipped for a single bad record depends on 
+          how frequently the application increments the processed-record 
+          counter. It is recommended to increment the counter after processing 
+          every single record. However, in some applications this may be 
+          difficult, as they batch up their processing. In that case the 
+          framework may skip more records surrounding the bad record. Users 
+          can reduce the number of records skipped by specifying an 
+          acceptable value using 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setMapperMaxSkipRecords(org.apache.hadoop.conf.Configuration,
long)">
+          SkipBadRecords.setMapperMaxSkipRecords(Configuration, long)</a> and 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setReducerMaxSkipGroups(org.apache.hadoop.conf.Configuration,
long)">
+          SkipBadRecords.setReducerMaxSkipGroups(Configuration, long)</a>. 
+          The framework tries to narrow down the skipped range with a 
+          binary-search-like algorithm during task re-executions: the skipped
+          range is divided into two halves and only one half is executed. 
+          From the subsequent failure, the framework figures out which half 
+          contains the bad record. This re-execution repeats until the 
+          acceptable skipped value is met or all task attempts are exhausted.
+          To increase the number of task attempts, use
+          <a href="api/org/apache/hadoop/mapred/JobConf.html#setMaxMapAttempts(int)">
+          JobConf.setMaxMapAttempts(int)</a> and 
+          <a href="api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceAttempts(int)">
+          JobConf.setMaxReduceAttempts(int)</a>.
+          </p>
+<p>The skipped records are written to HDFS in sequence file 
+          format, so they can be analyzed later. The output path for 
+          skipped records can be changed via 
+          <a href="api/org/apache/hadoop/mapred/SkipBadRecords.html#setSkipOutputPath(org.apache.hadoop.mapred.JobConf,
org.apache.hadoop.fs.Path)">
+          SkipBadRecords.setSkipOutputPath(JobConf, Path)</a>.
+          </p>
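The remaining knobs mentioned in this section can be sketched similarly; the attempt counts and output path below are illustrative assumptions, not recommended values.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipOutputSketch {
    public static void configure(JobConf conf) {
        // Give the framework more re-executions to narrow the skipped range
        // (counts are illustrative).
        conf.setMaxMapAttempts(8);
        conf.setMaxReduceAttempts(8);

        // Redirect the sequence files of skipped records for later analysis
        // (the path is a hypothetical example).
        SkipBadRecords.setSkipOutputPath(conf, new Path("/user/hadoop/skip-output"));
    }
}
```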
 </div>
 
     
-<a name="N10F16"></a><a name="Example%3A+WordCount+v2.0"></a>
+<a name="N10F5E"></a><a name="Example%3A+WordCount+v2.0"></a>
 <h2 class="h3">Example: WordCount v2.0</h2>
 <div class="section">
 <p>Here is a more complete <span class="codefrag">WordCount</span> which
uses many of the
@@ -2555,7 +2629,7 @@
       <a href="quickstart.html#SingleNodeSetup">pseudo-distributed</a> or
       <a href="quickstart.html#Fully-Distributed+Operation">fully-distributed</a>

       Hadoop installation.</p>
-<a name="N10F30"></a><a name="Source+Code-N10F30"></a>
+<a name="N10F78"></a><a name="Source+Code-N10F78"></a>
 <h3 class="h4">Source Code</h3>
 <table class="ForrestTable" cellspacing="1" cellpadding="4">
           
@@ -3765,7 +3839,7 @@
 </tr>
         
 </table>
-<a name="N11692"></a><a name="Sample+Runs"></a>
+<a name="N116DA"></a><a name="Sample+Runs"></a>
 <h3 class="h4">Sample Runs</h3>
 <p>Sample text-files as input:</p>
 <p>
@@ -3933,7 +4007,7 @@
 <br>
         
 </p>
-<a name="N11766"></a><a name="Highlights"></a>
+<a name="N117AE"></a><a name="Highlights"></a>
 <h3 class="h4">Highlights</h3>
 <p>The second version of <span class="codefrag">WordCount</span> improves
upon the 
         previous one by using some features offered by the Map/Reduce framework:


