Date: Thu, 12 Jan 2012 03:43:40 +0000 (UTC)
From: "Zhihong Yu (Updated) (JIRA)"
To: issues@hbase.apache.org
Message-ID: <395995876.33165.1326339820216.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <106156915.23830.1326156225580.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

     [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------
    Affects Version/s:     (was: 0.94.0)
                           0.92.0
        Fix Version/s: 0.94.0
                       0.92.0
         Hadoop Flags: Reviewed

> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5163
>                 URL: https://issues.apache.org/jira/browse/HBASE-5163
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>         Environment: all
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5163-92.txt, 5163.patch
>
>
> The stack is typically:
> {noformat}
> java.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked.
>  at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
>  at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
>  at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
>  at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
>  at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
>  at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
>  at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
>  at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
>  at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
>  at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
>  at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
>  // ...
> {noformat}
> It can be reproduced without parallelization and without executing the other tests in the class. It seems to fail about 5% of the time.
> This comes from the naming policy for the directories in MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* in the cluster and does not take previous starts/stops into account:
> {noformat}
> for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
>   if (manageDfsDirs) {
>     File dir1 = new File(data_dir, "data" + (2 * i + 1));
>     File dir2 = new File(data_dir, "data" + (2 * i + 2));
>     dir1.mkdirs();
>     dir2.mkdirs();
>     // [...]
> {noformat}
> This means that if we want to stop and restart a datanode, we must always stop the last one; otherwise the directory names will conflict. This test exhibits the behavior:
> {noformat}
> @Test
> public void testMiniDFSCluster_startDataNode() throws Exception {
>   assertTrue(dfsCluster.getDataNodes().size() == 2);
>
>   // Works: we kill the last datanode, so we can start a new one
>   dfsCluster.stopDataNode(1);
>   dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>
>   // Fails: it's not the last datanode, so the directory conflicts on creation
>   dfsCluster.stopDataNode(0);
>   try {
>     dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>     fail("There should be an exception because the directory already exists");
>   } catch (IOException e) {
>     assertTrue(e.getMessage().contains("The directory is already locked."));
>     LOG.info("Expected (!) exception caught " + e.getMessage());
>   }
>
>   // Works: we kill the last datanode, so we can now restart 2 datanodes.
>   // This brings us back to 2 nodes.
>   dfsCluster.stopDataNode(0);
>   dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
> }
> {noformat}
> This behavior is then randomly triggered in testLogRollOnDatanodeDeath, because when we do
> {noformat}
> DatanodeInfo[] pipeline = getPipeline(log);
> assertTrue(pipeline.length == fs.getDefaultReplication());
> {noformat}
> and then kill the datanodes in the pipeline, we will have:
> - most of the time: pipeline = datanodes 1 & 2, so after killing 1 & 2 we can start a new datanode that reuses the now-free directories of datanode 2.
> - sometimes: pipeline = datanodes 1 & 3. In this case, when we try to launch the new datanode, it fails because it wants to use the same directories as the still-alive datanode 2.
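> To make the collision arithmetic concrete, here is a small standalone simulation of the naming rule (plain Java; the class DirNamingDemo is made up for illustration and is not the HDFS code itself):
> {noformat}
> import java.util.ArrayList;
> import java.util.List;
>
> public class DirNamingDemo {
>   // Each live "datanode" is modeled as its pair of data directory names.
>   static final List<String[]> live = new ArrayList<String[]>();
>
>   // Mirrors MiniDFSCluster#startDataNodes: directory names depend only
>   // on the number of *currently* live nodes, not on past starts/stops.
>   static void startDataNodes(int numDataNodes) {
>     int curDatanodesNum = live.size();
>     for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
>       String d1 = "data" + (2 * i + 1);
>       String d2 = "data" + (2 * i + 2);
>       boolean locked = false;
>       for (String[] dirs : live) {
>         locked |= dirs[0].equals(d1) || dirs[1].equals(d2);
>       }
>       if (locked) {
>         System.out.println("CONFLICT: " + d1 + "/" + d2 + " already locked");
>       } else {
>         live.add(new String[] { d1, d2 });
>         System.out.println("started datanode on " + d1 + "/" + d2);
>       }
>     }
>   }
>
>   public static void main(String[] args) {
>     startDataNodes(2); // node A -> data1/data2, node B -> data3/data4
>     live.remove(0);    // stop node A: NOT the last one
>     startDataNodes(1); // curDatanodesNum == 1 -> i == 1 -> data3/data4: CONFLICT
>   }
> }
> {noformat}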
> There are two ways of fixing the test:
> 1) Fix the naming rule in MiniDFSCluster#startDataNode, for example to ensure that directory names are never reused. But I wonder whether some test case somewhere (maybe not in HBase) depends on the current behavior.
> 2) Explicitly kill the first and second datanodes, without consulting the pipeline, so that the names cannot conflict (a sketch follows below).
> Feedback welcome on the choice to make here...
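> For illustration, option 2 could look something like the sketch below, reusing the dfsCluster and TEST_UTIL fixtures from the test above. This is only a hedged sketch of the proposal, not a committed patch, and it assumes a 3-datanode cluster:
> {noformat}
> // Sketch of option 2 (illustrative): kill the first and second datanodes
> // by index instead of following the WAL pipeline. Stop index 1 before
> // index 0 so the indices do not shift under us (stopDataNode removes the
> // entry from MiniDFSCluster's datanode list).
> dfsCluster.stopDataNode(1);
> dfsCluster.stopDataNode(0);
> // Only the third datanode (dirs data5/data6) survives. curDatanodesNum is
> // now 1, so the replacement is assigned i == 1, i.e. data3/data4, which the
> // stopped second datanode no longer locks.
> dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
> {noformat}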