hadoop-common-commits mailing list archives

From omal...@apache.org
Subject svn commit: r1159713 - in /hadoop/common/branches/branch-0.20-security: ./ CHANGES.txt src/docs/releasenotes.html src/docs/relnotes.py
Date Fri, 19 Aug 2011 17:51:46 GMT
Author: omalley
Date: Fri Aug 19 17:51:45 2011
New Revision: 1159713

URL: http://svn.apache.org/viewvc?rev=1159713&view=rev
Synchronize the relnotes and CHANGES.txt from 204.

MR-2187, 2324, and 2705 were in the 204 list incorrectly.

Also added the first pass of the release note script.

Added:
    hadoop/common/branches/branch-0.20-security/src/docs/relnotes.py
      - copied unchanged from r1154413, hadoop/common/branches/branch-0.20-security-204/src/docs/relnotes.py
Modified:
    hadoop/common/branches/branch-0.20-security/   (props changed)
    hadoop/common/branches/branch-0.20-security/CHANGES.txt   (contents, props changed)
    hadoop/common/branches/branch-0.20-security/src/docs/releasenotes.html

Propchange: hadoop/common/branches/branch-0.20-security/
--- svn:mergeinfo (original)
+++ svn:mergeinfo Fri Aug 19 17:51:45 2011
@@ -1,5 +1,5 @@

Modified: hadoop/common/branches/branch-0.20-security/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-security/CHANGES.txt?rev=1159713&r1=1159712&r2=1159713&view=diff
--- hadoop/common/branches/branch-0.20-security/CHANGES.txt (original)
+++ hadoop/common/branches/branch-0.20-security/CHANGES.txt Fri Aug 19 17:51:45 2011
@@ -9,6 +9,10 @@ Release - unreleased
+    MAPREDUCE-2324. Removed usage of broken
+    ResourceEstimator.getEstimatedReduceInputSize to check against usable
+    disk-space on TaskTracker. (Robert Evans via acmurthy) 
     MAPREDUCE-2729. Ensure jobs with reduces which can't be launched due to
     slow-start do not count for user-limits. (Sherry Chen via acmurthy) 
@@ -40,6 +44,12 @@ Release - unreleased
+    MAPREDUCE-2187. Reporter sends progress during sort/merge. (Anupam Seth via
+    acmurthy) 
+    MAPREDUCE-2705. Implements launch of multiple tasks concurrently.
+    (Thomas Graves via ddas)
     HADOOP-7343. Make the number of warnings accepted by test-patch
     configurable to limit false positives. (Thomas Graves via cdouglas)
@@ -52,8 +62,8 @@ Release - unreleased
     HADOOP-7314. Add support for throwing UnknownHostException when a host
     doesn't resolve. Needed for MAPREDUCE-2489. (Jeffrey Naisbitt via mattf)
-    MAPREDUCE-2494. Make the distributed cache delete entries using LRU priority
-    (Robert Joseph Evans via mahadev)
+    MAPREDUCE-2494. Make the distributed cache delete entries using LRU 
+    priority (Robert Joseph Evans via mahadev)
     HADOOP-6889. Make RPC to have an option to timeout - backport to 
     0.20-security. (John George and Ravi Prakash via mattf)
@@ -82,12 +92,8 @@ Release - unreleased
-    MAPREDUCE-2324. Removed usage of broken
-    ResourceEstimator.getEstimatedReduceInputSize to check against usable
-    disk-space on TaskTracker. (Robert Evans via acmurthy) 
-    MAPREDUCE-2621. TestCapacityScheduler fails with "Queue "q1" does not exist".
-    (Sherry Chen via mahadev)
+    MAPREDUCE-2621. TestCapacityScheduler fails with "Queue "q1" does not 
+    exist". (Sherry Chen via mahadev)
     HADOOP-7475. Fix hadoop-setup-single-node.sh to reflect new layout. (eyang
     via omalley)
@@ -213,10 +219,10 @@ Release - unreleased
     HDFS-2057. Wait time to terminate the threads causes unit tests to
     take longer time. (Bharath Mundlapudi via suresh)
+    HDFS-2218. Disable TestHdfsProxy.testHdfsProxyInterface in automated test 
+    suite for 0.20-security-204 release. (Matt Foley)
-    MAPREDUCE-2187. Reporter sends progress during sort/merge. (Anupam Seth via
-    acmurthy) 
     HADOOP-7144. Expose JMX metrics via JSON servlet. (Robert Joseph Evans via
@@ -250,11 +256,6 @@ Release - unreleased
     HADOOP-7459. Remove jdk-1.6.0 dependency check from rpm. (omalley)
-    MAPREDUCE-2705. Implements launch of multiple tasks concurrently.
-    (Thomas Graves via ddas)
-Release - Unreleased
     HADOOP-7330. Fix MetricsSourceAdapter to use the value instead of the 
     object. (Luke Lu via omalley)
@@ -280,8 +281,6 @@ Release - 2011-5-11
     HADOOP-7243. Fix contrib unit tests missing dependencies. (omalley)
-    MAPREDUCE-2355. Add a dampner to out-of-band heartbeats. (acmurthy) 
     HADOOP-7190. Add metrics v1 back for backwards compatibility. (omalley)
     MAPREDUCE-2360. Remove stripping of scheme, authority from submit dir in 

Propchange: hadoop/common/branches/branch-0.20-security/CHANGES.txt
--- svn:mergeinfo (original)
+++ svn:mergeinfo Fri Aug 19 17:51:45 2011
@@ -1,6 +1,6 @@

Modified: hadoop/common/branches/branch-0.20-security/src/docs/releasenotes.html
URL: http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-security/src/docs/releasenotes.html?rev=1159713&r1=1159712&r2=1159713&view=diff
--- hadoop/common/branches/branch-0.20-security/src/docs/releasenotes.html (original)
+++ hadoop/common/branches/branch-0.20-security/src/docs/releasenotes.html Fri Aug 19 17:51:45 2011
@@ -20,6 +20,253 @@
 <h4>        Sub-task
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2621">MAPREDUCE-2621</a>.
+     Minor bug reported by sherri_chen and fixed by sherri_chen <br>
+     <b>TestCapacityScheduler fails with &quot;Queue &quot;q1&quot; does not exist&quot;</b><br>
+     <blockquote>{quote}<br>Error Message<br><br>Queue &quot;q1&quot; does not exist<br><br>Stacktrace<br><br>java.io.IOException: Queue &quot;q1&quot; does not exist<br>	at org.apache.hadoop.mapred.JobInProgress.&lt;init&gt;(JobInProgress.java:354)<br>	at org.apache.hadoop.mapred.TestCapacityScheduler$FakeJobInProgress.&lt;init&gt;(TestCapacityScheduler.java:172)<br>	at org.apache.hadoop.mapred.TestCapacityScheduler.submitJob(TestCapacityScheduler.java:794)<br>	at org.apache.hadoop.mapred.TestCapacityScheduler.submitJob(TestCapacityScheduler.java:818)<br>	at org.apache.hadoop.mapred.TestCapacityScheduler.submitJobAndInit(TestCapacityScheduler.java:825)<br>	at org.apache.hadoop.mapred.TestCapacityScheduler.testMultiTaskAssignmentInMultipleQueues(TestCapacityScheduler.java:1109)<br>{quote}<br><br>When queue name is invalid, an exception is thrown now. <br><br></blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2558">MAPREDUCE-2558</a>.
+     Major new feature reported by naisbitt and fixed by naisbitt (jobtracker)<br>
+     <b>Add queue-level metrics 0.20-security branch</b><br>
+     <blockquote>We would like to record and present the jobtracker metrics on a per-queue basis.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2529">MAPREDUCE-2529</a>.
+     Major bug reported by tgraves and fixed by tgraves (tasktracker)<br>
+     <b>Recognize Jetty bug 1342 and handle it</b><br>
+     <blockquote>We are seeing many instances of the Jetty-1342 (http://jira.codehaus.org/browse/JETTY-1342). The bug doesn&apos;t cause Jetty to stop responding altogether, some fetches go through but a lot of them throw exceptions and eventually fail. The only way we have found to get the TT out of this state is to restart the TT.  This jira is to catch this particular exception (or perhaps a configurable regex) and handle it in an automated way to either blacklist or shutdown the TT after seeing it a configurable number of them.<br></blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2524">MAPREDUCE-2524</a>.
+     Minor improvement reported by tgraves and fixed by tgraves (tasktracker)<br>
+     <b>Backport trunk heuristics for failing maps when we get fetch failures retrieving map output during shuffle</b><br>
+     <blockquote>The heuristics for failing maps when we get map output fetch failures during the shuffle is pretty conservative in 20. Backport the heuristics from trunk which are more aggressive, simpler, and configurable.<br><br></blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2514">MAPREDUCE-2514</a>.
+     Trivial bug reported by jeagles and fixed by jeagles (tasktracker)<br>
+     <b>ReinitTrackerAction class name misspelled RenitTrackerAction in task tracker log</b><br>
+     <blockquote></blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2495">MAPREDUCE-2495</a>.
+     Minor improvement reported by revans2 and fixed by revans2 (distributed-cache)<br>
+     <b>The distributed cache cleanup thread has no monitoring to check to see if it has died for some reason</b><br>
+     <blockquote>The cleanup thread in the distributed cache handles IOExceptions and the like correctly, but just to be a bit more defensive it would be good to monitor the thread, and check that it is still alive regularly, so that the distributed cache does not fill up the entire disk on the node. </blockquote></li>
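[Editorial sketch] The monitoring described in MAPREDUCE-2495 amounts to a watchdog that periodically checks the cleanup thread's liveness and restarts it if it died. A minimal Python sketch of that pattern; the `CleanupWatchdog` name and structure are illustrative, not the actual TaskTracker code:

```python
import threading

class CleanupWatchdog:
    """Keep a worker thread alive: restart it whenever it has died."""

    def __init__(self, target):
        self.target = target      # the cleanup function to keep running
        self.thread = None
        self.restarts = 0

    def check(self):
        # If the worker was never started or has died, (re)start it.
        if self.thread is None or not self.thread.is_alive():
            self.thread = threading.Thread(target=self.target, daemon=True)
            self.thread.start()
            self.restarts += 1
```

In practice `check()` would run on a timer; a dead cleanup thread is then replaced instead of silently letting the cache fill the disk.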
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2490">MAPREDUCE-2490</a>.
+     Trivial improvement reported by jeagles and fixed by jeagles (jobtracker)<br>
+     <b>Log blacklist debug count</b><br>
+     <blockquote>Gain some insight into blacklist increments/decrements by enhancing the debug logging</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2479">MAPREDUCE-2479</a>.
+     Major improvement reported by revans2 and fixed by revans2 (tasktracker)<br>
+     <b>Backport MAPREDUCE-1568 to hadoop security branch</b><br>
+     <blockquote></blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2456">MAPREDUCE-2456</a>.
+     Trivial improvement reported by naisbitt and fixed by naisbitt (jobtracker)<br>
+     <b>Show the reducer taskid and map/reduce tasktrackers for &quot;Failed fetch notification #_ for task attempt...&quot; log messages</b><br>
+     <blockquote>This jira is to provide more useful log information for debugging the &quot;Too many fetch-failures&quot; error.<br><br>Looking at the JobTracker node, we see messages like this:<br>&quot;2010-12-14 00:00:06,911 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #8 for task<br>attempt_201011300729_189729_m_007458_0&quot;.<br><br>It would be useful to see which reducer is reporting the error here.<br><br>So, I propose we add the following to these log messages:<br>  1. reduce task ID<br>  2. TaskTracker nodenames for both the mapper and the reducer<br></blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2451">MAPREDUCE-2451</a>.
+     Trivial bug reported by tgraves and fixed by tgraves (jobtracker)<br>
+     <b>Log the reason string of healthcheck script</b><br>
+     <blockquote>The information on why a specific TaskTracker got blacklisted is not stored anywhere. The jobtracker web ui will show the detailed reason string until the TT gets unblacklisted.  After that it is lost.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2447">MAPREDUCE-2447</a>.
+     Minor bug reported by sseth and fixed by sseth <br>
+     <b>Set JvmContext sooner for a task - MR2429</b><br>
+     <blockquote>TaskTracker.validateJVM() is throwing NPE when setupWorkDir() throws IOException. This is because<br>taskFinal.setJvmContext() is not executed yet</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2443">MAPREDUCE-2443</a>.
+     Minor bug reported by sseth and fixed by sseth (test)<br>
+     <b>Fix FI build - broken after MR-2429</b><br>
+     <blockquote>src/test/system/aop/org/apache/hadoop/mapred/TaskAspect.aj:72 [warning] advice defined in org.apache.hadoop.mapred.TaskAspect has not been applied [Xlint:adviceDidNotMatch]<br><br>After the fix in MR-2429, the call to ping in TaskAspect needs to be fixed.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2415">MAPREDUCE-2415</a>.
+     Major improvement reported by bharathm and fixed by bharathm (task-controller, tasktracker)<br>
+     <b>Distribute TaskTracker userlogs onto multiple disks</b><br>
+     <blockquote>Currently, userlogs directory in TaskTracker is placed under hadoop.log.dir like &lt;hadoop.log.dir&gt;/userlogs. I am proposing to spread these userlogs onto multiple configured mapred.local.dirs to strengthen TaskTracker reliability w.r.t disk failures.  </blockquote></li>
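[Editorial sketch] Spreading userlogs over the configured mapred.local.dirs needs a stable mapping from attempt to directory. The hash-based selection below is an assumption for illustration only; the real MAPREDUCE-2415 allocation also considers disk health, which this sketch ignores:

```python
def pick_userlog_dir(local_dirs, attempt_id):
    """Map a task attempt to one of the configured local dirs.

    Deterministic, so the same attempt always resolves to the same disk.
    """
    if not local_dirs:
        raise ValueError("no mapred.local.dirs configured")
    index = sum(ord(c) for c in attempt_id) % len(local_dirs)
    return local_dirs[index] + "/userlogs/" + attempt_id
```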
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2413">MAPREDUCE-2413</a>.
+     Major improvement reported by bharathm and fixed by ravidotg (task-controller, tasktracker)<br>
+     <b>TaskTracker should handle disk failures at both startup and runtime</b><br>
+     <blockquote>At present, TaskTracker doesn&apos;t handle disk failures properly both at startup and runtime.<br><br>(1) Currently TaskTracker doesn&apos;t come up if any of the mapred-local-dirs is on a bad disk. TaskTracker should ignore that particular mapred-local-dir and start up and use only the remaining good mapred-local-dirs.<br>(2) If a disk goes bad while TaskTracker is running, currently TaskTracker doesn&apos;t do anything special. This results in either<br>   (a) TaskTracker continues to &quot;try to use that bad disk&quot; and this results in lots of task failures and possibly job failures(because of multiple TTs having bad disks) and eventually these TTs getting graylisted for all jobs. And this needs manual restart of TT with modified configuration of mapred-local-dirs avoiding the bad disk. OR<br>   (b) Health check script identifying the disk as bad and the TT gets blacklisted. And this also needs manual restart of TT with modified configuration of mapred-local-dirs avoiding the bad disk.<br><br>This JIRA is to make TaskTracker more fault-tolerant to disk failures solving (1) and (2). i.e. TT should start even if at least one of the mapred-local-dirs is on a good disk and TT should adjust its in-memory list of mapred-local-dirs and avoid using bad mapred-local-dirs.<br></blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2411">MAPREDUCE-2411</a>.
+     Minor bug reported by dking and fixed by dking <br>
+     <b>When you submit a job to a queue with no ACLs you get an inscrutible NPE</b><br>
+     <blockquote>With this patch we&apos;ll check for that, and print a message in the logs.  Then at submission time you find out about it.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-2409">MAPREDUCE-2409</a>.
+     Major bug reported by sseth and fixed by sseth (distributed-cache)<br>
+     <b>Distributed Cache does not differentiate between file /archive for files with the same path</b><br>
+     <blockquote>If a &apos;global&apos; file is specified as a &apos;file&apos; by one job - subsequent jobs cannot override this source file to be an &apos;archive&apos; (until the TT cleans up it&apos;s cache or a TT restart).<br>The other way around as well -&gt; &apos;archive&apos; to &apos;file&apos;<br><br>In case of an accidental submission using the wrong type - some of the tasks for the second job will end up seeing the source file as an archive, others as a file.</blockquote></li>
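[Editorial sketch] The root cause in MAPREDUCE-2409 is a cache index keyed only by source path, so a later job requesting the same path as an archive is handed the file localization. A toy Python model of the keying fix (simplified; not the actual TaskTracker cache code):

```python
class LocalCache:
    """Toy distributed-cache index keyed by (uri, kind), not uri alone."""

    def __init__(self):
        self.entries = {}

    def localize(self, uri, kind):
        # kind is "file" or "archive"; putting it in the key keeps a
        # file localization from being reused as an archive one.
        key = (uri, kind)
        if key not in self.entries:
            self.entries[key] = "{0}.localized.{1}".format(uri, kind)
        return self.entries[key]
```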
+<li> <a href="https://issues.apache.org/jira/browse/MAPREDUCE-118">MAPREDUCE-118</a>.
+     Blocker bug reported by amar_kamat and fixed by amareshwari (client)<br>
+     <b>Job.getJobID() will always return null</b><br>
+     <blockquote>JobContext is used for a read-only view of job&apos;s info. Hence all the readonly fields in JobContext are set in the constructor. Job extends JobContext. When a Job is created, jobid is not known and hence there is no way to set JobID once Job is created. JobID is obtained only when the JobClient queries the jobTracker for a job-id., which happens later i.e upon job submission.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-2218">HDFS-2218</a>.
+     Blocker test reported by mattf and fixed by mattf (contrib/hdfsproxy, test)<br>
+     <b>Disable TestHdfsProxy.testHdfsProxyInterface in automated test suite for 0.20-security-204 release</b><br>
+     <blockquote>To enable release of 0.20-security-204, despite the existence of unsolved bug HDFS-2217, remove this test case for 204.  This is acceptable because HDFS-2217 is believed to be a bug in the test case and/or its interaction with the Hudson environment, not the HdfsProxy functionality.<br><br>To be fixed and restored for the next release.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-2057">HDFS-2057</a>.
+     Major bug reported by bharathm and fixed by bharathm (data-node)<br>
+     <b>Wait time to terminate the threads causing unit tests to take longer time</b><br>
+     <blockquote>As a part of datanode process hang, this part of code was introduced in 0.20.204 to clean up all the waiting threads.<br><br>-      try {<br>-          readPool.awaitTermination(10, TimeUnit.SECONDS);<br>-      } catch (InterruptedException e) {<br>-       LOG.info(&quot;Exception occured in doStop:&quot; + e.getMessage());<br>-      }<br>-      readPool.shutdownNow();<br><br>This was clearly meant for production, but all the unit tests uses minidfscluster and minimrcluster for shutdown which waits on this part of the code. Due to this, we saw increase in unit test run times. So removing this code. <br></blockquote></li>
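[Editorial sketch] The deleted code made every MiniDFSCluster teardown wait up to 10 seconds for the pool to drain before calling shutdownNow. A rough Python analogue of the fix, shutting a pool down without blocking on queued work (requires Python 3.9+ for `cancel_futures`; the timings are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def stop_pool_fast(pool):
    """Shut down without waiting for queued work, mirroring the change
    from awaitTermination(10s)-then-shutdownNow to an immediate stop."""
    pool.shutdown(wait=False, cancel_futures=True)

pool = ThreadPoolExecutor(max_workers=1)
pool.submit(time.sleep, 0.2)        # in-flight task
for _ in range(50):
    pool.submit(time.sleep, 0.2)    # queued work that would delay teardown
start = time.monotonic()
stop_pool_fast(pool)                # returns immediately; queue is dropped
elapsed = time.monotonic() - start
```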
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-2044">HDFS-2044</a>.
+     Major test reported by mattf and fixed by mattf (test)<br>
+     <b>TestQueueProcessingStatistics failing automatic test due to timing issues</b><br>
+     <blockquote>The test makes assumptions about timing issues that hold true in workstation environments but not in Hudson auto-test.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-2023">HDFS-2023</a>.
+     Major bug reported by bharathm and fixed by bharathm (data-node)<br>
+     <b>Backport of NPE for File.list and File.listFiles</b><br>
+     <blockquote>Since we have multiple Jira&apos;s in trunk for common and hdfs, I am creating another jira for this issue. <br><br>This patch addresses the following:<br><br>1. Provides FileUtil API for list and listFiles which throws IOException for null cases. <br>2. Replaces most of the code where JDK file API with FileUtil API. </blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1878">HDFS-1878</a>.
+     Minor bug reported by mattf and fixed by mattf (name-node)<br>
+     <b>TestHDFSServerPorts unit test failure - race condition in FSNamesystem.close() causes NullPointerException without serious consequence</b><br>
+     <blockquote>In 20.204, TestHDFSServerPorts was observed to intermittently throw a NullPointerException.  This only happens when FSNamesystem.close() is called, which means system termination for the Namenode, so this is not a serious bug for .204.  TestHDFSServerPorts is more likely than normal execution to stimulate the race, because it runs two Namenodes in the same JVM, causing more interleaving and more potential to see a race condition.<br><br>The race is in FSNamesystem.close(), line 566, we have:<br>      if (replthread != null) replthread.interrupt();<br>      if (replmon != null) replmon = null;<br><br>Since the interrupted replthread is not waited on, there is a potential race condition with replmon being nulled before replthread is dead, but replthread references replmon in computeDatanodeWork() where the NullPointerException occurs.<br><br>The solution is either to wait on replthread or just don&apos;t null replmon.  The latter is preferred, since none of the sibling Namenode processing threads are waited on in close().<br><br>I&apos;ll attach a patch for .205.<br></blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1836">HDFS-1836</a>.
+     Major bug reported by hkdennis2k and fixed by bharathm (hdfs client)<br>
+     <b>Thousand of CLOSE_WAIT socket </b><br>
+     <blockquote>$ /usr/sbin/lsof -i TCP:50010 | grep -c CLOSE_WAIT<br>4471<br><br>It is better if everything runs normal. <br>However, from time to time there are some &quot;DataStreamer Exception: java.net.SocketTimeoutException&quot; and &quot;DFSClient.processDatanodeError(2507) | Error Recovery for&quot; can be found from log file and the number of CLOSE_WAIT socket just keep increasing<br><br>The CLOSE_WAIT handles may remain for hours and days; then &quot;Too many open file&quot; some day.<br></blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1822">HDFS-1822</a>.
+     Blocker bug reported by sureshms and fixed by sureshms (name-node)<br>
+     <b>Editlog opcodes overlap between 20 security and later releases</b><br>
+     <blockquote>Same opcode are used for different operations between 0.20.security, 0.22 and 0.23. This results in failure to load editlogs on later release, especially during upgrades.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1773">HDFS-1773</a>.
+     Minor improvement reported by tanping and fixed by tanping (name-node)<br>
+     <b>Remove a datanode from cluster if include list is not empty and this datanode is removed from both include and exclude lists</b><br>
+     <blockquote>Our service engineering team who operates the clusters on a daily basis founds it is confusing that after a data node is decommissioned, there is no way to make the cluster forget about this data node and it always remains in the dead node list.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1767">HDFS-1767</a>.
+     Major sub-task reported by mattf and fixed by mattf (data-node)<br>
+     <b>Namenode should ignore non-initial block reports from datanodes when in safemode during startup</b><br>
+     <blockquote>Consider a large cluster that takes 40 minutes to start up.  The datanodes compete to register and send their Initial Block Reports (IBRs) as fast as they can after startup (subject to a small sub-two-minute random delay, which isn&apos;t relevant to this discussion).  <br><br>As each datanode succeeds in sending its IBR, it schedules the starting time for its regular cycle of reports, every hour (or other configured value of dfs.blockreport.intervalMsec). In order to spread the reports evenly across the block report interval, each datanode picks a random fraction of that interval, for the starting point of its regular report cycle.  For example, if a particular datanode ends up randomly selecting 18 minutes after the hour, then that datanode will send a Block Report at 18 minutes after the hour every hour as long as it remains up.  Other datanodes will start their cycles at other randomly selected times.  This code is in DataNode.blockReport() and DataNode.scheduleBlockReport().<br><br>The &quot;second Block Report&quot; (2BR), is the start of these hourly reports.  The problem is that some of these 2BRs get scheduled sooner rather than later, and actually occur within the startup period.  For example, if the cluster takes 40 minutes (2/3 of an hour) to start up, then out of the datanodes that succeed in sending their IBRs during the first 10 minutes, between 1/2 and 2/3 of them will send their 2BR before the 40-minute startup time has completed!<br><br>2BRs sent within the startup time actually compete with the remaining IBRs, and thereby slow down the overall startup process.  This can be seen in the following data, which shows the startup process for a 3700-node cluster that took about 17 minutes to finish startup:<br><br>{noformat}<br>      time    starts  sum   regs  sum   IBR  sum  2nd_BR sum total_BRs/min<br>0   1299799498  3042  3042  1969  1969  151   151          0  151<br>1   1299799558   665  3707  1470  3439  248   399          0  248<br>2   1299799618        3707   224  3663  270   669          0  270<br>3   1299799678        3707    14  3677  261   930    3     3  264<br>4   1299799738        3707    23  3700  288  1218    1     4  289<br>5   1299799798        3707     7  3707  258  1476    3     7  261<br>6   1299799858        3707        3707  317  1793    4    11  321<br>7   1299799918        3707        3707  292  2085    6    17  298<br>8   1299799978        3707        3707  292  2377    8    25  300<br>9   1299800038        3707        3707  272  2649         25  272<br>10  1299800098        3707        3707  280  2929   15    40  295<br>11  1299800158        3707        3707  223  3152   14    54  237<br>12  1299800218        3707        3707  143  3295         54  143<br>13  1299800278        3707        3707  141  3436   20    74  161<br>14  1299800338        3707        3707  195  3631   78   152  273<br>15  1299800398        3707        3707   51  3682  209   361  260<br>16  1299800458        3707        3707   25  3707  369   730  394<br>17  1299800518        3707        3707       3707  166   896  166<br>18  1299800578        3707        3707       3707   72   968   72<br>19  1299800638        3707        3707       3707   67  1035   67<br>20  1299800698        3707        3707       3707   75  1110   75<br>21  1299800758        3707        3707       3707   71  1181   71<br>22  1299800818        3707        3707       3707   67  1248   67<br>23  1299800878        3707        3707       3707   62  1310   62<br>24  1299800938        3707        3707       3707   56  1366   56<br>25  1299800998        3707        3707       3707   60  1426   60<br>{noformat}<br><br>This data was harvested from the startup logs of all the datanodes, and correlated into one-minute buckets.  Each row of the table represents the progress during one elapsed minute of clock time.  It seems that every cluster startup is different, but this one showed the effect fairly well.<br><br>The &quot;starts&quot; column shows that all the nodes started up within the first 2 minutes, and the &quot;regs&quot; column shows that all succeeded in registering by minute 6.  The IBR column shows a sustained rate of Initial Block Report processing of 250-300/minute for the first 10 minutes.<br><br>The question is why, during minutes 11 through 16, the rate of IBR processing slowed down.  Why didn&apos;t the startup just finish?  In the &quot;2nd_BR&quot; column, we see the rate of 2BRs ramping up as more datanodes complete their IBRs.  As the rate increases, they become more effective at competing with the IBRs, and slow down the IBR processing even more.  After the IBRs finally finish in minute 16, the rate of 2BRs settles down to a steady ~60-70/minute.<br><br>In order to decrease competition for locks and other resources, to speed up IBR processing during startup, we propose to delay 2BRs until later into the cycle.</blockquote></li>
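[Editorial sketch] The "between 1/2 and 2/3" figure above follows from the uniform scheduling the report describes: a datanode finishing its IBR at minute t picks its second report uniformly within the next hour, so the chance it lands inside a 40-minute startup window is (40 - t)/60. A quick Python check of that arithmetic:

```python
def prob_2br_in_startup(ibr_minute, startup_minutes=40.0, interval_minutes=60.0):
    """P(second block report lands before startup ends), given the 2BR
    is scheduled uniformly within one report interval after the IBR."""
    remaining = startup_minutes - ibr_minute
    return max(0.0, min(1.0, remaining / interval_minutes))

# IBRs sent in the first 10 minutes: probability ranges from 2/3 down to 1/2,
# matching the "between 1/2 and 2/3" figure quoted above.
bounds = (prob_2br_in_startup(0), prob_2br_in_startup(10))
```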
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1758">HDFS-1758</a>.
+     Minor bug reported by tanping and fixed by tanping (tools)<br>
+     <b>Web UI JSP pages thread safety issue</b><br>
+     <blockquote>The set of JSP pages that web UI uses are not thread safe.  We have observed some problems when requesting Live/Dead/Decommissioning pages from the web UI, incorrect page is displayed.  To be more specific, requesting Dead node list page, sometimes, Live node page is returned.  Requesting decommissioning page, sometimes, dead page is returned.<br><br>The root cause of this problem is that JSP page is not thread safe by default.  When multiple requests come in,  each request is assigned to a different thread, multiple threads access the same instance of the servlet class resulted from a JSP page.  A class variable is shared by multiple threads.  The JSP code in 20 branche, for example, dfsnodelist.jsp has<br>{code}<br>&lt;!%<br>  int rowNum = 0;<br>  int colNum = 0;<br>  String sorterField = null;<br>  String sorterOrder = null;<br>  String whatNodes = &quot;LIVE&quot;;<br>  ...<br>%&gt;<br>{code}<br><br>declared as  class variables.  ( These set of variables are declared within &lt;%! code %&gt; directives which made them class members. )  Multiple threads share the same set of class member variables, one request would step on anther&apos;s toe. <br><br>However, due to the JSP code refactor, HADOOP-5857, all of these class member variables are moved to become function local variables.  So this bug does not appear in Apache trunk.  Hence, we have proposed to take a simple fix for this bug on 20 branch alone, to be more specific, branch-0.20-security.<br><br>The simple fix is to add jsp ThreadSafe=&quot;false&quot; directive into the related JSP pages, dfshealth.jsp and dfsnodelist.jsp to make them thread safe, i.e. only on request is processed at each time. <br><br>We did evaluate the thread safety issue for other JSP pages on trunk, we noticed a potential problem is that when we retrieving some statistics from namenode, for example, we make the call to <br>{code}<br>NamenodeJspHelper.getInodeLimitText(fsn);<br>{code}<br>in dfshealth.jsp, which eventuality is <br><br>{code}<br>  static String getInodeLimitText(FSNamesystem fsn) {<br>    long inodes = fsn.dir.totalInodes();<br>    long blocks = fsn.getBlocksTotal();<br>    long maxobjects = fsn.getMaxObjects();<br>    ....<br>{code}<br><br>some of the function calls are already guarded by readwritelock, e.g. dir.totalInodes, but others are not.  As a result of this, the web ui results are not 100% thread safe.  But after evaluating the prons and cons of adding a giant lock into the JSP pages, we decided not to issue FSNamesystem ReadWrite locks into JSPs.<br><br></blockquote></li>
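[Editorial sketch] The bug pattern quoted above, per-request state held in members shared by every request thread, is language-independent. A deterministic Python illustration of the flaw and the fix side by side (purely illustrative; the actual 0.20 fix was the JSP thread-safety directive, and trunk moved the state into function-local variables):

```python
class SharedStatePage:
    """Mimics the JSP bug: the request parameter is instance state
    shared by all request threads."""

    def __init__(self):
        self.what_nodes = "LIVE"

    def set_params(self, what_nodes):
        self.what_nodes = what_nodes   # another request can overwrite this

    def render(self):
        return "showing {0} nodes".format(self.what_nodes)

class LocalStatePage:
    """The fix: request parameters stay in function-local variables."""

    def render(self, what_nodes):
        return "showing {0} nodes".format(what_nodes)
```

Interleaving two requests on the shared-state version returns the wrong page for the first caller; the local-state version cannot be corrupted that way.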
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1750">HDFS-1750</a>.
+     Major bug reported by szetszwo and fixed by szetszwo <br>
+     <b>fs -ls hftp://file not working</b><br>
+     <blockquote>{noformat}<br>hadoop dfs -touchz /tmp/file1 # create file. OK<br>hadoop dfs -ls /tmp/file1  # OK<br>hadoop dfs -ls hftp://namenode:50070/tmp/file1 # FAILED: not seeing the file<br>{noformat}</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1692">HDFS-1692</a>.
+     Major bug reported by bharathm and fixed by bharathm (data-node)<br>
+     <b>In secure mode, Datanode process doesn&apos;t exit when disks fail.</b><br>
+     <blockquote>In secure mode, when disks fail more than volumes tolerated, datanode process doesn&apos;t exit properly and it just hangs even though shutdown method is called. <br><br></blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1592">HDFS-1592</a>.
+     Major bug reported by bharathm and fixed by bharathm <br>
+     <b>Datanode startup doesn&apos;t honor volumes.tolerated </b><br>
+     <blockquote>Datanode startup doesn&apos;t honor volumes.tolerated for hadoop 20 version.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1541">HDFS-1541</a>.
+     Major sub-task reported by hairong and fixed by hairong (name-node)<br>
+     <b>Not marking datanodes dead When namenode in safemode</b><br>
+     <blockquote>In a big cluster, when namenode starts up,  it takes a long time for namenode to process block reports from all datanodes. Because heartbeats processing get delayed, some datanodes are erroneously marked as dead, then later on they have to register again, thus wasting time.<br><br>It would speed up starting time if the checking of dead nodes is disabled when namenode in safemode.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1445">HDFS-1445</a>.
+     Major sub-task reported by mattf and fixed by mattf (data-node)<br>
+     <b>Batch the calls in DataStorage to FileUtil.createHardLink(), so we call it once per directory instead of once per file</b><br>
+     <blockquote>It was a bit of a puzzle why we can do a full scan of a disk in about 30 seconds during FSDir() or getVolumeMap(), but the same disk took 11 minutes to do Upgrade replication via hardlinks.  It turns out that the org.apache.hadoop.fs.FileUtil.createHardLink() method does an outcall to Runtime.getRuntime().exec(), to utilize native filesystem hardlink capability.  So it is forking a full-weight external process, and we call it on each individual file to be replicated.<br><br>As a simple check on the possible cost of this approach, I built a Perl test script (under Linux on a production-class datanode).  Perl also uses a compiled and optimized p-code engine, and it has both native support for hardlinks and the ability to do &quot;exec&quot;.  <br>-  A simple script to create 256,000 files in a directory tree organized like the Datanode, took 10 seconds to run.<br>-  Replicating that directory tree using hardlinks, the same way as the Datanode, took 12 seconds using native hardlink support.<br>-  The same replication using outcalls to exec, one per file, took 256 seconds!<br>-  Batching the calls, and doing &apos;exec&apos; once per directory instead of once per file, took 16 seconds.<br><br>Obviously, your mileage will vary based on the number of blocks per volume.  A volume with less than about 4000 blocks will have only 65 directories.  A volume with more than 4K and less than about 250K blocks will have 4200 directories (more or less).  And there are two files per block (the data file and the .meta file).  So the average number of files per directory may vary from 2:1 to 500:1.  A node with 50K blocks and four volumes will have 25K files per volume, or an average of about 6:1.  So this change may be expected to take it down from, say, 12 minutes per volume to 2.<br></blockquote></li>
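The batching win described above comes from paying the process-fork overhead once per directory instead of once per file. A minimal cost-model sketch (hypothetical class name and assumed per-call costs, not the JIRA's measured numbers) shows why the per-file approach scales so badly:

```java
// Illustrative cost model for the batching idea (assumed numbers, not the
// actual benchmarks): forking an external process per hardlink pays a large
// fixed exec() overhead per file, while batching pays it once per directory.
public class HardLinkBatchingSketch {

    static final double EXEC_OVERHEAD_MS = 1.0;   // assumed cost of one fork/exec
    static final double PER_LINK_MS = 0.05;       // assumed cost of creating one hardlink

    /** One exec() call per file: overhead scales with the file count. */
    static double perFileMs(int files) {
        return files * (EXEC_OVERHEAD_MS + PER_LINK_MS);
    }

    /** One exec() call per directory: overhead scales with the directory count. */
    static double perDirectoryMs(int dirs, int files) {
        return dirs * EXEC_OVERHEAD_MS + files * PER_LINK_MS;
    }

    public static void main(String[] args) {
        int files = 256_000, dirs = 65;   // figures from the description above
        System.out.printf("per-file: %.0f ms, per-directory: %.0f ms%n",
                perFileMs(files), perDirectoryMs(dirs, files));
    }
}
```

With these assumed costs the per-file total is dominated entirely by fork overhead, which matches the shape of the Perl experiment (256 seconds vs 16 seconds).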
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1377">HDFS-1377</a>.
+     Blocker bug reported by eli and fixed by eli (name-node)<br>
+     <b>Quota bug for partial blocks allows quotas to be violated </b><br>
+     <blockquote>There&apos;s a bug in the quota code that causes quotas not to be respected when a file is not an exact multiple of the block size. Here&apos;s an example:<br><br>{code}<br>$ hadoop fs -mkdir /test<br>$ hadoop dfsadmin -setSpaceQuota 384M /test<br>$ ls dir/ | wc -l   # dir contains 101 files<br>101<br>$ du -ms dir        # each is 3mb<br>304	dir<br>$ hadoop fs -put dir /test<br>$ hadoop fs -count -q /test<br>        none             inf       402653184      -550502400            2          101          317718528 hdfs://haus01.sf.cloudera.com:10020/test<br>$ hadoop fs -stat &quot;%o %r&quot; /test/dir/f30<br>134217728 3    # three 128mb blocks<br>{code}<br><br>INodeDirectoryWithQuota caches the number of bytes consumed by its children in {{diskspace}}. The quota adjustment code has a bug that causes {{diskspace}} to get updated incorrectly when a file is not an exact multiple of the block size (the value ends up being negative).<br><br>This causes the quota checking code to think that the files in the directory consume less space than they actually do, so verifyQuota does not throw a QuotaExceededException even when the directory is over quota. However the bug isn&apos;t visible to users because {{fs count -q}} reports the numbers generated by INode#getContentSummary, which adds up the sizes of the blocks rather than using the cached INodeDirectoryWithQuota#diskspace value.<br><br>In FSDirectory#addBlock the disk space consumed is set conservatively to the full block size * the number of replicas:<br><br>{code}<br>updateCount(inodes, inodes.length-1, 0,<br>    fileNode.getPreferredBlockSize()*fileNode.getReplication(), true);<br>{code}<br><br>In FSNameSystem#addStoredBlock we adjust for this conservative estimate by subtracting out the difference between the conservative estimate and the number of bytes actually stored:<br><br>{code}<br>//Updated space consumed if required.<br>INodeFile file = (storedBlock != null) ? storedBlock.getINode() : null;<br>long diff = (file == null) ? 0 :<br>    (file.getPreferredBlockSize() - storedBlock.getNumBytes());<br><br>if (diff &gt; 0 &amp;&amp; file.isUnderConstruction() &amp;&amp;<br>    cursize &lt; storedBlock.getNumBytes()) {<br>...<br>    dir.updateSpaceConsumed(path, 0, -diff*file.getReplication());<br>{code}<br><br>We do the same in FSDirectory#replaceNode when completing the file, but at a file granularity (I believe the intent here is to correct for the cases when there&apos;s a failure replicating blocks and recovery). Since oldnode is under construction, INodeFile#diskspaceConsumed will use the preferred block size (vs Block#getNumBytes used by newnode), so we will again subtract out the difference between the full block size and the number of bytes actually stored:<br><br>{code}<br>long dsOld = oldnode.diskspaceConsumed();<br>...<br>//check if disk space needs to be updated.<br>long dsNew = 0;<br>if (updateDiskspace &amp;&amp; (dsNew = newnode.diskspaceConsumed()) != dsOld) {<br>  try {<br>    updateSpaceConsumed(path, 0, dsNew-dsOld);<br>...<br>{code}<br><br>So in the above example we started with diskspace at 384mb (3 * 128mb) and then we subtract 375mb (to reflect that only 9mb raw was actually used) twice, so for each file the diskspace for the directory is -366mb (384mb minus 2 * 375mb). This is why the quota goes negative and yet we can still write more files.<br><br>So a directory with lots of single-block files (with multiple blocks, only the final partial block ends up subtracting from the diskspace used) ends up having a quota that&apos;s way off.<br><br>I think the fix is, in FSDirectory#replaceNode, to not have the diskspaceConsumed calculations differ when the old and new INode have the same blocks. I&apos;ll work on a patch which also adds a quota test for blocks that are not multiples of the block size and warns in INodeDirectory#computeContentSummary if the computed size does not reflect the cached value.</blockquote></li>
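The double adjustment described above can be checked with a few lines of arithmetic. This is a simplified sketch with hypothetical method names, not the real FSDirectory/FSNamesystem code; it reproduces the -366mb-per-file figure from the example:

```java
// Arithmetic sketch of the double-adjustment bug (simplified, hypothetical
// method names): the over-estimate given back once in addStoredBlock is
// subtracted a second time in replaceNode, driving the cached total negative.
public class QuotaAdjustmentSketch {

    static final long MB = 1L << 20;

    /** Buggy path: the over-estimate is subtracted twice. */
    static long buggyDeltaBytes(long blockSize, long actualBytes, int repl) {
        long charged = blockSize * repl;                // conservative charge at addBlock
        long diff = (blockSize - actualBytes) * repl;   // over-estimate to give back
        charged -= diff;                                // adjusted in addStoredBlock
        charged -= diff;                                // adjusted again in replaceNode (the bug)
        return charged;
    }

    /** Correct accounting: charge exactly what was stored. */
    static long correctDeltaBytes(long actualBytes, int repl) {
        return actualBytes * repl;
    }

    public static void main(String[] args) {
        // 3mb file, 128mb preferred block size, replication 3 (the example above)
        System.out.println(buggyDeltaBytes(128 * MB, 3 * MB, 3) / MB);   // prints -366
        System.out.println(correctDeltaBytes(3 * MB, 3) / MB);           // prints 9
    }
}
```

Each 3mb file should add 9mb of raw usage but instead subtracts 366mb, which is why 101 such files leave the quota at the large negative value shown in the `fs -count -q` output.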
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1258">HDFS-1258</a>.
+     Blocker bug reported by atm and fixed by atm (name-node)<br>
+     <b>Clearing namespace quota on &quot;/&quot; corrupts FS image</b><br>
+     <blockquote>The HDFS root directory starts out with a default namespace quota of Integer.MAX_VALUE. If you clear this quota (using &quot;hadoop dfsadmin -clrQuota /&quot;), the fsimage gets corrupted immediately. Subsequent 2NN rolls will fail, and the NN will not come back up from a restart.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HDFS-1189">HDFS-1189</a>.
+     Major bug reported by xiaokang and fixed by johnvijoe (name-node)<br>
+     <b>Quota counts missed between clear quota and set quota</b><br>
+     <blockquote>HDFS quota counts can be missed between a clear-quota operation and a set-quota operation.<br><br>When setting a quota for a dir, the INodeDirectory is replaced by an INodeDirectoryWithQuota and dir.isQuotaSet() becomes true. When the INodeDirectoryWithQuota is newly created, quota counting is performed. However, when clearing the quota, the quota conf is set to -1 and dir.isQuotaSet() becomes false, while the INodeDirectoryWithQuota is NOT replaced back with an INodeDirectory.<br><br>FSDirectory.updateCount only updates the quota count for inodes where isQuotaSet() is true. So after clearing the quota for a dir, its quota counts are no longer updated, which is reasonable. But when re-setting the quota for this dir, quota counting is not performed again, so some counts are missed.</blockquote></li>
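The clear/set sequence above can be modeled as a small state machine. This is a toy sketch with a hypothetical class, not the real INodeDirectoryWithQuota: once converted, the node never recounts on a later set, so children added while the quota was cleared are missed:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the clear/set quota sequence (hypothetical class): updates
// made while the quota is cleared are never counted, and a later set-quota
// does not recount, so the cached count goes permanently stale.
public class QuotaRecountSketch {

    long nsQuota = -1;                       // -1 means no quota set
    long cachedCount = 0;                    // usage cached while quota is set
    final List<String> children = new ArrayList<>();

    boolean isQuotaSet() { return nsQuota >= 0; }

    void addChild(String name) {
        children.add(name);
        if (isQuotaSet()) cachedCount++;     // updateCount only touches quota dirs
    }

    /** firstConversion models INodeDirectory -> INodeDirectoryWithQuota, which recounts. */
    void setQuota(long quota, boolean firstConversion) {
        nsQuota = quota;
        if (firstConversion) cachedCount = children.size();  // only recounted here
    }

    void clearQuota() { nsQuota = -1; }      // node is NOT converted back

    public static void main(String[] args) {
        QuotaRecountSketch dir = new QuotaRecountSketch();
        dir.setQuota(100, true);             // recount: 0 children
        dir.clearQuota();
        dir.addChild("a");
        dir.addChild("b");                   // missed: quota not set
        dir.setQuota(100, false);            // re-set without recount
        System.out.println(dir.cachedCount + " cached vs " + dir.children.size() + " actual");
    }
}
```

The obvious fix shape, as the description implies, is to recount (or convert back to a plain directory on clear) so the cached count is rebuilt whenever a quota is set.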
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7475">HADOOP-7475</a>.
+     Blocker bug reported by eyang and fixed by eyang <br>
+     <b>hadoop-setup-single-node.sh is broken</b><br>
+     <blockquote>When running hadoop-setup-single-node.sh, the system can not find the templates configuration directory:<br><br>{noformat}<br>cat: /usr/libexec/../templates/conf/core-site.xml: No such file or directory<br>cat: /usr/libexec/../templates/conf/hdfs-site.xml: No such file or directory<br>cat: /usr/libexec/../templates/conf/mapred-site.xml: No such file or directory<br>cat: /usr/libexec/../templates/conf/hadoop-env.sh: No such file or directory<br>chown: cannot access `hadoop-env.sh&apos;: No such file or directory<br>chmod: cannot access `hadoop-env.sh&apos;: No such file or directory<br>cp: cannot stat `*.xml&apos;: No such file or directory<br>cp: cannot stat `hadoop-env.sh&apos;: No such file or directory<br>{noformat}</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7398">HADOOP-7398</a>.
+     Major new feature reported by owen.omalley and fixed by owen.omalley <br>
+     <b>create a mechanism to suppress the HADOOP_HOME deprecated warning</b><br>
+     <blockquote>Create a new mechanism to suppress the warning about HADOOP_HOME deprecation.<br><br>I&apos;ll create a HADOOP_HOME_WARN_SUPPRESS environment variable that suppresses the warning.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7373">HADOOP-7373</a>.
+     Major bug reported by owen.omalley and fixed by owen.omalley <br>
+     <b>Tarball deployment doesn&apos;t work with {start,stop}-{dfs,mapred}</b><br>
+     <blockquote>The hadoop-config.sh overrides the variable &quot;bin&quot;, which makes the scripts use libexec for hadoop-daemon(s).</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7364">HADOOP-7364</a>.
+     Major bug reported by tgraves and fixed by tgraves (test)<br>
+     <b>TestMiniMRDFSCaching fails if test.build.dir is set to something other than build/test</b><br>
+     <blockquote>TestMiniMRDFSCaching fails if test.build.dir is set to something other than build/test. </blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7356">HADOOP-7356</a>.
+     Blocker bug reported by eyang and fixed by eyang <br>
+     <b>RPM packages broke bin/hadoop script for hadoop 0.20.205</b><br>
+     <blockquote>hadoop-config.sh has been moved to libexec for the binary package, but developers prefer to have hadoop-config.sh in bin. Hadoop shell scripts should be modified to support both scenarios.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7330">HADOOP-7330</a>.
+     Major bug reported by vicaya and fixed by vicaya (metrics)<br>
+     <b>The metrics source mbean implementation should return the attribute value instead of the object</b><br>
+     <blockquote>The MetricsSourceAdapter#getAttribute in 0.20.203 is returning the attribute object instead of the value.</blockquote></li>
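The bug class described above is easy to illustrate. This is a hypothetical, simplified sketch, not the actual MetricsSourceAdapter code: a JMX getAttribute implementation must return the attribute's value, not the javax.management.Attribute wrapper:

```java
import javax.management.Attribute;

// Minimal illustration of the attribute-vs-value bug (hypothetical class):
// returning the Attribute wrapper makes JMX clients see an opaque object
// where they expect the numeric value.
public class AttributeValueSketch {

    private final Attribute cached = new Attribute("CacheHits", 42L);

    Object getAttributeBuggy(String name) {
        return cached;                  // wrong: callers receive the wrapper object
    }

    Object getAttributeFixed(String name) {
        return cached.getValue();       // right: callers receive the Long value
    }

    public static void main(String[] args) {
        AttributeValueSketch m = new AttributeValueSketch();
        System.out.println(m.getAttributeBuggy("CacheHits").getClass().getSimpleName()); // prints Attribute
        System.out.println(m.getAttributeFixed("CacheHits"));                            // prints 42
    }
}
```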
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7324">HADOOP-7324</a>.
+     Blocker bug reported by vicaya and fixed by priyomustafi (metrics)<br>
+     <b>Ganglia plugins for metrics v2</b><br>
+     <blockquote>Although all metrics in metrics v2 are exposed via the standard JMX mechanisms, most users use Ganglia to collect metrics.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7277">HADOOP-7277</a>.
+     Minor improvement reported by naisbitt and fixed by naisbitt (build)<br>
+     <b>Add Eclipse launch tasks for the 0.20-security branch</b><br>
+     <blockquote>This is to add the eclipse launchers from HADOOP-5911 to the 0.20 security branch.<br><br>Eclipse has a notion of &quot;run configuration&quot;, which encapsulates what&apos;s needed to run or debug an application. I use this quite a bit to start various Hadoop daemons in debug mode, with breakpoints set, to inspect state and what not.<br><br>This is simply configuration, so no tests are provided. After running &quot;ant eclipse&quot; and refreshing your project, you should see entries in the Run Configurations and Debug Configurations for launching the various hadoop daemons from within eclipse. There&apos;s a template for testing a specific test, and also templates to run all the tests, the job tracker, and a task tracker. It&apos;s likely that some parameters need to be further tweaked to have the same behavior as &quot;ant test&quot;, but for most tests, this works.<br><br>This also requires a small change to build.xml for the eclipse classpath.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7274">HADOOP-7274</a>.
+     Minor bug reported by jeagles and fixed by jeagles (util)<br>
+     <b>CLONE - IOUtils.readFully and IOUtils.skipFully have typo in exception creation&apos;s message</b><br>
+     <blockquote>Same fix as for HADOOP-7057 for the Hadoop security branch<br><br>{noformat}<br>        throw new IOException( &quot;Premeture EOF from inputStream&quot;);<br>{noformat}</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7248">HADOOP-7248</a>.
+     Minor improvement reported by cos and fixed by tgraves (build)<br>
+     <b>Have a way to automatically update Eclipse .classpath file when new libs are added to the classpath through Ivy for 0.20-* based sources</b><br>
+     <blockquote>Backport HADOOP-6407 into 0.20 based source trees</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7232">HADOOP-7232</a>.
+     Blocker bug reported by owen.omalley and fixed by owen.omalley (documentation)<br>
+     <b>Fix javadoc warnings</b><br>
+     <blockquote>The javadoc is currently generating 31 warnings.</blockquote></li>
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-7144">HADOOP-7144</a>.
+     Major new feature reported by vicaya and fixed by revans2 <br>
+     <b>Expose JMX with something like JMXProxyServlet </b><br>
+     <blockquote>Much of the Hadoop metrics and status info is available via JMX, especially since 0.20.100, and 0.22+ (HDFS-1318, HADOOP-6728 etc.) For operations staff not familiar with JMX setup, especially JMX with SSL and firewall tunnelling, the usage can be daunting. Using a JMXProxyServlet (a la Tomcat) to translate JMX attributes into JSON output would make a lot of non-Java admins happy.<br><br>We could probably use Tomcat&apos;s JMXProxyServlet code directly, if it already output some standard format (JSON or XML etc.) The code is simple enough to port over and can probably integrate with the common HttpServer as one of the default servlets (maybe /jmx) for the pluggable security.</blockquote></li>
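The core of the idea above is small. This is a rough, simplified sketch of the JMX-to-JSON translation (assumed shape, not the servlet that was actually committed): read an MBean's attributes from the platform MBeanServer and render them as a JSON-like string:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch of the JMX-to-JSON idea (assumed, simplified): enumerate one
// MBean's readable attributes and emit them as a JSON-like string. A real
// servlet would also escape values and take the MBean name from the query.
public class JmxJsonSketch {

    static String toJson(String mbeanName) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(mbeanName);
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
            Object value;
            try {
                value = server.getAttribute(name, attr.getName());
            } catch (Exception unreadable) {
                continue;  // some attributes are unsupported or throw; skip them
            }
            if (!first) sb.append(",");
            sb.append("\"").append(attr.getName()).append("\":\"").append(value).append("\"");
            first = false;
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) throws Exception {
        // e.g. the JVM's own Runtime MBean, which is always registered
        System.out.println(toJson("java.lang:type=Runtime"));
    }
}
```

The same loop pointed at Hadoop's registered MBeans is essentially what a /jmx servlet would write to the HTTP response.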
+<li> <a href="https://issues.apache.org/jira/browse/HADOOP-6255">HADOOP-6255</a>.
+     Major new feature reported by owen.omalley and fixed by eyang <br>
+     <b>Create an rpm integration project</b><br>
+     <blockquote>We should be able to create RPMs for Hadoop releases.</blockquote></li>
+<h2>Changes Since Hadoop 0.20.2</h2>
 <li>[<a href='https://issues.apache.org/jira/browse/HADOOP-6213'>HADOOP-6213</a>] -         Remove commons dependency on commons-cli2
