hbase-issues mailing list archives

From "Mike Drob (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17922) TestRegionServerHostname always fails against hadoop 3.0.0-alpha2
Date Wed, 12 Jul 2017 06:06:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083491#comment-16083491 ]

Mike Drob commented on HBASE-17922:
-----------------------------------

Chatted with [~appy] about this offline a bit...

It looks like the problem here is that when TestUtil fails to start a region server, it leaves
JVM-wide state broken for everything that runs afterwards. His concern was that even if this is
a bug in TestUtil, we might still be uncovering a real issue with Hadoop 3 integration, and
changing the test might just go back to masking the problem.

This took me way too long to figure out because I had to wire up a bunch of reflection to
start examining HDFS internals, but I think I finally caught the root cause here.

Here is the minimal test case that fails with the same error as we're seeing here:

{noformat}
  @Test (timeout=15000)
  public void testStartStopStart() throws Exception {
    TEST_UTIL.startMiniDFSCluster(1);
    TEST_UTIL.shutdownMiniDFSCluster();
    TEST_UTIL.startMiniCluster(1, 1);
  }
{noformat}

What happens is that the first time we start up a DFS cluster, the file system caches get
populated here (line numbers likely off because of the previously mentioned reflection hacks):
{noformat}
	at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:210)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3318)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3275)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225)
	at org.apache.hadoop.hbase.fs.HFileSystem.<init>(HFileSystem.java:88)
	at org.apache.hadoop.hbase.fs.HFileSystem.get(HFileSystem.java:472)
	at org.apache.hadoop.hbase.HBaseTestingUtility.getTestFileSystem(HBaseTestingUtility.java:3072)
	at org.apache.hadoop.hbase.HBaseTestingUtility.getNewDataTestDirOnTestFS(HBaseTestingUtility.java:576)
	at org.apache.hadoop.hbase.HBaseTestingUtility.setupDataTestDirOnTestFS(HBaseTestingUtility.java:565)
	at org.apache.hadoop.hbase.HBaseTestingUtility.getDataTestDirOnTestFS(HBaseTestingUtility.java:538)
	at org.apache.hadoop.hbase.HBaseTestingUtility.getDataTestDirOnTestFS(HBaseTestingUtility.java:552)
	at org.apache.hadoop.hbase.HBaseTestingUtility.createDirsAndSetProperties(HBaseTestingUtility.java:786)
	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniDFSCluster(HBaseTestingUtility.java:655)
{noformat}
That is also where the client finalizer shutdown hook is added, which region servers attempt
to suppress.

In normal operation, only a single region server starts per JVM, so we can suppress that hook
and everything is good. In our tests, we can start and stop multiple mini clusters, so we
make the suppression repeatable by checking whether we have already suppressed the hook. If we
have, it is still registered in our own ShutdownHookManager, so we don't need to suppress it
again and can just increment a refcount.
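
To make the refcounting idea concrete, here is a minimal sketch of it. This is a toy model, not the HBase or Hadoop code: the class name, the {{registry}} set (standing in for the shared hook registry), and the {{suppress}}/{{release}} methods are all made up for illustration, and running the hook on the last release is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of refcounted hook suppression; not the real HBase class.
class RefCountedSuppression {
    // Stand-in for the shared registry the hook is originally added to.
    final Set<Runnable> registry = new HashSet<>();
    // Hooks we have already taken out of the registry, with a use count.
    final Map<Runnable, Integer> suppressed = new HashMap<>();

    // Called once per region server start.
    void suppress(Runnable hook) {
        Integer refs = suppressed.get(hook);
        if (refs != null) {
            // An earlier server already suppressed it: only bump the count.
            suppressed.put(hook, refs + 1);
            return;
        }
        if (!registry.remove(hook)) {
            // Nothing to suppress and no saved copy: this is the failure case.
            throw new IllegalStateException("hook is not registered, cannot suppress it");
        }
        suppressed.put(hook, 1);
    }

    // Called once per region server stop; the last user runs the hook (assumption).
    void release(Runnable hook) {
        Integer refs = suppressed.get(hook);
        if (refs == null) {
            throw new IllegalStateException("hook was never suppressed");
        }
        if (refs == 1) {
            suppressed.remove(hook);
            hook.run();
        } else {
            suppressed.put(hook, refs - 1);
        }
    }
}
```

The point of the sketch is the second branch of {{suppress}}: if the hook is neither in the registry nor in our saved map, there is no way to recover.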

However, if we start and stop a DFS cluster, then that hook gets cleared on DFS cluster shutdown.

{noformat}
	at org.apache.hadoop.util.ShutdownHookManager.clearShutdownHooks(ShutdownHookManager.java:275)
	at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1975)
	at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1944)
	at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1937)
	at org.apache.hadoop.hbase.HBaseTestingUtility.shutdownMiniDFSCluster(HBaseTestingUtility.java:849)
{noformat}
The second time we start DFS, this hook doesn't get added. I haven't been able to figure out
what exactly gets reused, but the effect is that the hook isn't there, and we don't have a
copy of it that we've saved off, so the whole thing goes boom.
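
The sequence can be modelled with a few lines of plain Java. This is only a sketch under a stated assumption: it guesses that some cached state survives the DFS shutdown and skips re-registering the hook, which I have not confirmed. All names here ({{StartStopStartModel}}, {{cachePopulated}}, the "clientFinalizer" string) are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical toy model of the start/stop/start sequence; not Hadoop or HBase code.
class StartStopStartModel {
    // Stand-in for the process-wide shutdown hook registry.
    final Set<String> hooks = new LinkedHashSet<>();
    // Assumption: something cached survives DFS shutdown and prevents re-registration.
    boolean cachePopulated = false;

    void startDfs() {
        if (!cachePopulated) {
            hooks.add("clientFinalizer");
            cachePopulated = true;
        }
    }

    void shutdownDfs() {
        // Mirrors the clearShutdownHooks call in the trace; the cache flag is untouched.
        hooks.clear();
    }

    // A region server can only start if it finds the hook to suppress.
    boolean startRegionServer() {
        return hooks.remove("clientFinalizer");
    }

    public static void main(String[] args) {
        StartStopStartModel m = new StartStopStartModel();
        m.startDfs();
        m.shutdownDfs();
        m.startDfs();
        // Prints "region server started: false" in this model.
        System.out.println("region server started: " + m.startRegionServer());
    }
}
```

Without the intermediate {{shutdownDfs()}} call the same model reports success, which matches what we see when the mini cluster is started only once.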

This particular test was triggering the failure because the aborting RegionServer would fail
before the suppression could happen. The hook would get cleaned up by DFS instead of by us,
and later attempts to start the mini cluster wouldn't have the hook available and their RegionServers
would also fail.

I assume that HDFS changed in version 3 to do shutdown hook cleanup in the mini cluster and
wasn't doing this before, but I haven't verified that.

> TestRegionServerHostname always fails against hadoop 3.0.0-alpha2
> -----------------------------------------------------------------
>
>                 Key: HBASE-17922
>                 URL: https://issues.apache.org/jira/browse/HBASE-17922
>             Project: HBase
>          Issue Type: Sub-task
>          Components: hadoop3
>    Affects Versions: 2.0.0
>            Reporter: Jonathan Hsieh
>            Assignee: Mike Drob
>             Fix For: 2.0.0-alpha-2
>
>         Attachments: HBASE-17922.patch
>
>
> {code}
> Running org.apache.hadoop.hbase.regionserver.TestRegionServerHostname
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 126.363 sec <<< FAILURE! - in org.apache.hadoop.hbase.regionserver.TestRegionServerHostname
> testRegionServerHostname(org.apache.hadoop.hbase.regionserver.TestRegionServerHostname)  Time elapsed: 120.029 sec  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 milliseconds
> 	at java.lang.Thread.sleep(Native Method)
> 	at org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:221)
> 	at org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:405)
> 	at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:225)
> 	at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:94)
> 	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:1123)
> 	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1077)
> 	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:948)
> 	at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:942)
> 	at org.apache.hadoop.hbase.regionserver.TestRegionServerHostname.testRegionServerHostname(TestRegionServerHostname.java:88)
> Results :
> Tests in error: 
>   TestRegionServerHostname.testRegionServerHostname:88 » TestTimedOut test timed...
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
