hbase-issues mailing list archives

From "Zhihong Yu (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
Date Thu, 12 Jan 2012 03:41:39 GMT

     [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Zhihong Yu updated HBASE-5163:
------------------------------

    Attachment: 5163-92.txt

Patch I would integrate into 0.92
                
> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The
directory is already locked.")
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5163
>                 URL: https://issues.apache.org/jira/browse/HBASE-5163
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.94.0
>         Environment: all
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5163-92.txt, 5163.patch
>
>
> The stack is typically:
> {noformat}
>     <error message="Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
The directory is already locked." type="java.io.IOException">java.io.IOException: Cannot
lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
The directory is already locked.
> 	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
> 	at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
> 	at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
> 	at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
> 	at org.apache.hadoop.hdfs.server.datanode.DataNode.&lt;init&gt;(DataNode.java:290)
> 	at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
> 	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
> 	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
> 	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
> 	at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
>         // ...
> {noformat}
> It can be reproduced without parallelization and without executing the other tests in the class. It fails roughly 5% of the time.
> This comes from the naming policy for the directories in MiniDFSCluster#startDataNode.
It depends on the number of nodes *currently* in the cluster, and does not take into account
previous starts/stops:
> {noformat}
>    for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
>       if (manageDfsDirs) {
>         File dir1 = new File(data_dir, "data"+(2*i+1));
>         File dir2 = new File(data_dir, "data"+(2*i+2));
>         dir1.mkdirs();
>         dir2.mkdirs();
>       // [...]
> {noformat}
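As a standalone illustration (my own sketch, not Hadoop code and not part of the patch), the naming arithmetic above can be replayed to show exactly which directory names a restart computes:

```java
// Standalone sketch (not part of Hadoop or of the patch): replays the
// "data" + (2*i+1) / "data" + (2*i+2) naming rule from the snippet above.
import java.util.ArrayList;
import java.util.List;

public class DirNaming {
    // Directory names assigned when numDataNodes nodes are started while
    // curDatanodesNum nodes are currently live.
    static List<String> dirsFor(int curDatanodesNum, int numDataNodes) {
        List<String> dirs = new ArrayList<>();
        for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
            dirs.add("data" + (2 * i + 1));
            dirs.add("data" + (2 * i + 2));
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Initial start of 2 nodes: node0 -> data1,data2; node1 -> data3,data4.
        System.out.println(dirsFor(0, 2)); // [data1, data2, data3, data4]
        // After stopping node0 (not the last one), 1 node is live, so the
        // next start computes i = 1 and hands out data3/data4 again --
        // exactly the directories still locked by the live node1.
        System.out.println(dirsFor(1, 1)); // [data3, data4]
    }
}
```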
> This means that if we want to stop/start a datanode, we must always stop the last one; otherwise the names will conflict. This test exhibits the behavior:
> {noformat}
>   @Test
>   public void testMiniDFSCluster_startDataNode() throws Exception {
>     assertTrue( dfsCluster.getDataNodes().size() == 2 );
>     // Works: as we killed the last datanode, we can now start a new one
>     dfsCluster.stopDataNode(1);
>     dfsCluster
>       .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>     // Fails: as it's not the last datanode, the directory will conflict on
>     //  creation
>     dfsCluster.stopDataNode(0);
>     try {
>       dfsCluster
>         .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>       fail("There should be an exception because the directory already exists");
>     } catch (IOException e) {
>       assertTrue( e.getMessage().contains("The directory is already locked."));
>       LOG.info("Expected (!) exception caught " + e.getMessage());
>     }
>     // Works: as we kill the last datanode, we can now restart 2 datanodes,
>     // bringing us back to 2 nodes
>     dfsCluster.stopDataNode(0);
>     dfsCluster
>       .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
>   }
> {noformat}
> This behavior is then triggered randomly in testLogRollOnDatanodeDeath, because when we do
> {noformat}
> DatanodeInfo[] pipeline = getPipeline(log);
> assertTrue(pipeline.length == fs.getDefaultReplication());
> {noformat}
> and then kill the datanodes in the pipeline, we will have:
>  - most of the time: pipeline = 1 & 2, so after killing 1 & 2 we can start a new datanode that reuses the now-free directory of '2'.
>  - sometimes: pipeline = 1 & 3. In this case, when we try to launch the new datanode, it fails because it wants to use the same directory as the still-alive '2'.
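To make the two pipeline cases concrete, here is a toy model (my own sketch, not MiniDFSCluster's actual API) that tracks locked directory names with the same start/stop bookkeeping:

```java
// Toy model (hypothetical, standalone): mimics MiniDFSCluster's directory
// bookkeeping to replay the two pipeline cases described above.
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ClusterModel {
    final Set<String> locked = new LinkedHashSet<>(); // dirs held by live nodes
    final List<String[]> nodes = new ArrayList<>();   // live nodes, start order
    int curDatanodesNum = 0;

    // Same naming rule as MiniDFSCluster#startDataNodes.
    void startDataNodes(int n) {
        for (int i = curDatanodesNum; i < curDatanodesNum + n; i++) {
            String d1 = "data" + (2 * i + 1), d2 = "data" + (2 * i + 2);
            if (!locked.add(d1) || !locked.add(d2)) {
                throw new IllegalStateException("The directory is already locked: " + d1);
            }
            nodes.add(new String[] {d1, d2});
        }
        curDatanodesNum += n;
    }

    void stopDataNode(int idx) {
        String[] dirs = nodes.remove(idx);
        locked.remove(dirs[0]);
        locked.remove(dirs[1]);
        curDatanodesNum--;
    }

    public static void main(String[] args) {
        ClusterModel c = new ClusterModel();
        c.startDataNodes(3); // '1'->data1/2, '2'->data3/4, '3'->data5/6
        // Pipeline = 1 & 3: kill the first and the third datanode.
        c.stopDataNode(0);   // '1' gone, frees data1/data2
        c.stopDataNode(1);   // '3' gone (now at index 1), frees data5/data6
        try {
            // One node is live, so i starts at 1 -> data3/data4,
            // which the still-alive '2' has locked.
            c.startDataNodes(1);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // The directory is already locked: data3
        }
    }
}
```

In the "pipeline = 1 & 2" case the same model succeeds, because the replacement node reuses the directories freed by '2'.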
> There are two ways of fixing the test:
> 1) Fix the naming rule in MiniDFSCluster#startDataNode, for example to ensure that directory names are never reused. But I wonder whether there is a test case somewhere (maybe not in HBase) that depends on this behavior.
> 2) Explicitly kill the first and second datanodes, without relying on the pipeline, so that the names cannot conflict.
> Feedback welcome on the choice to make here...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
