Date: Thu, 12 Jan 2012 03:43:40 +0000 (UTC)
From: "Zhihong Yu (Updated) (JIRA)"
To: issues@hbase.apache.org
Message-ID: <395995876.33165.1326339820216.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <106156915.23830.1326156225580.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

     [ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5163:
------------------------------
    Affects Version/s:     (was: 0.94.0)
                           0.92.0
        Fix Version/s: 0.94.0
                       0.92.0
         Hadoop Flags: Reviewed

> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5163
>                 URL: https://issues.apache.org/jira/browse/HBASE-5163
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.92.0
>         Environment: all
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.92.0, 0.94.0
>
>         Attachments: 5163-92.txt, 5163.patch
>
>
> The stack is typically:
> {noformat}
> java.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked.
>  at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
>  at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
>  at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
>  at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
>  at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
>  at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
>  at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
>  at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
>  at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
>  at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
>  at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
>  // ...
> {noformat}
> It can be reproduced without parallelization and without executing the other tests in the class. It seems to fail about 5% of the time.
> This comes from the naming policy for the directories in MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* in the cluster and does not take previous starts/stops into account:
> {noformat}
> for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
>   if (manageDfsDirs) {
>     File dir1 = new File(data_dir, "data" + (2 * i + 1));
>     File dir2 = new File(data_dir, "data" + (2 * i + 2));
>     dir1.mkdirs();
>     dir2.mkdirs();
>     // [...]
> {noformat}
> This means that if we want to stop and restart a datanode, we must always stop the last one; otherwise the directory names will conflict. This test exhibits the behavior:
> {noformat}
> @Test
> public void testMiniDFSCluster_startDataNode() throws Exception {
>   assertTrue(dfsCluster.getDataNodes().size() == 2);
>
>   // Works: we kill the last datanode, so we can start a new one
>   dfsCluster.stopDataNode(1);
>   dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>
>   // Fails: it's not the last datanode, so the directory conflicts on creation
>   dfsCluster.stopDataNode(0);
>   try {
>     dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>     fail("There should be an exception because the directory already exists");
>   } catch (IOException e) {
>     assertTrue(e.getMessage().contains("The directory is already locked."));
>     LOG.info("Expected (!) exception caught " + e.getMessage());
>   }
>
>   // Works: we kill the last datanode, so we can now restart 2 datanodes.
>   // This brings us back to 2 nodes.
>   dfsCluster.stopDataNode(0);
>   dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
> }
> {noformat}
> This behavior is then randomly triggered in testLogRollOnDatanodeDeath, because when we do
> {noformat}
> DatanodeInfo[] pipeline = getPipeline(log);
> assertTrue(pipeline.length == fs.getDefaultReplication());
> {noformat}
> and then kill the datanodes in the pipeline, we will have:
> - most of the time: pipeline = datanodes 1 & 2, so after killing 1 & 2 we can start a new datanode that reuses the now-free directories of datanode 2.
> - sometimes: pipeline = datanodes 1 & 3. In this case, when we try to launch the new datanode, it fails because it wants to use the same directories as the still-alive datanode 2.
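> To make the collision arithmetic concrete, here is a small standalone simulation of the naming rule (plain Java; the class DirNamingDemo is made up for illustration and is not the HDFS code itself):
> {noformat}
> import java.util.ArrayList;
> import java.util.List;
>
> public class DirNamingDemo {
>   // Each live "datanode" is modeled as its pair of data directory names.
>   static final List<String[]> live = new ArrayList<String[]>();
>
>   // Mirrors MiniDFSCluster#startDataNodes: directory names depend only
>   // on the number of *currently* live nodes, not on past starts/stops.
>   static void startDataNodes(int numDataNodes) {
>     int curDatanodesNum = live.size();
>     for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
>       String d1 = "data" + (2 * i + 1);
>       String d2 = "data" + (2 * i + 2);
>       boolean locked = false;
>       for (String[] dirs : live) {
>         locked |= dirs[0].equals(d1) || dirs[1].equals(d2);
>       }
>       if (locked) {
>         System.out.println("CONFLICT: " + d1 + "/" + d2 + " already locked");
>       } else {
>         live.add(new String[] { d1, d2 });
>         System.out.println("started datanode on " + d1 + "/" + d2);
>       }
>     }
>   }
>
>   public static void main(String[] args) {
>     startDataNodes(2); // node A -> data1/data2, node B -> data3/data4
>     live.remove(0);    // stop node A: NOT the last one
>     startDataNodes(1); // curDatanodesNum == 1 -> i == 1 -> data3/data4: CONFLICT
>   }
> }
> {noformat}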
> There are two ways of fixing the test:
> 1) Fix the naming rule in MiniDFSCluster#startDataNode, for example to ensure that directory names are never reused. But I wonder whether some test case somewhere (maybe not in HBase) depends on the current behavior.
> 2) Explicitly kill the first and second datanodes, without consulting the pipeline, so that the names cannot conflict (a sketch follows below).
> Feedback welcome on the choice to make here...
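> For illustration, option 2 could look something like the sketch below, reusing the dfsCluster and TEST_UTIL fixtures from the test above. This is only a hedged sketch of the proposal, not a committed patch, and it assumes a 3-datanode cluster:
> {noformat}
> // Sketch of option 2 (illustrative): kill the first and second datanodes
> // by index instead of following the WAL pipeline. Stop index 1 before
> // index 0 so the indices do not shift under us (stopDataNode removes the
> // entry from MiniDFSCluster's datanode list).
> dfsCluster.stopDataNode(1);
> dfsCluster.stopDataNode(0);
> // Only the third datanode (dirs data5/data6) survives. curDatanodesNum is
> // now 1, so the replacement is assigned i == 1, i.e. data3/data4, which the
> // stopped second datanode no longer locks.
> dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
> {noformat}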