hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7520) org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted fails when I cd hbase-it and mvn verify
Date Wed, 23 Jan 2013 02:52:14 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560343#comment-13560343
] 

Sergey Shelukhin commented on HBASE-7520:
-----------------------------------------

There are 2 separate issues there (at least in the failure I am currently looking at).
1) Retries are not smart. Gradual fallback doesn't make a lot of sense when we are going to
different servers. E.g. if we just had 8 retries from server FOO, and now read that region
is on BAR in meta, no sense to wait 30 seconds before going to BAR. I'll file a JIRA.
2) There's some sort of bug in assignment that necessitates the timeoutmonitor to kick in.
Here's the analysis.

Cruft removed from logs for convenience.
Note that "Sending ... " log is BEFORE the delay.
Servers are indicated by ports in the text.

Here the request starts. Some normal test errors we are supposed to recover from happen.
Region is closing (being moved) on 51231, then we get RegionMoved after retry to 51231, then
we get error from destination 50661 as region is still opening.
{code}
2013-01-22 17:59:00,537  Retrying due to errors: Error #0 from [10.11.2.92:51231] for [6b03eab4f067cf8739e85bd408f05d94-27118]org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
is closing
2013-01-22 17:59:00,571  Sending requests to [10.11.2.92:51231] with delay of [1001] for rows
[6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:01,577  Retrying due to errors: Error #0 from [10.11.2.92:51231] for [6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.RegionMovedException: Region moved to: hostname=10.11.2.92 port=50661.
As of locationSeqNum=129492.
2013-01-22 17:59:01,577  Sending requests to [10.11.2.92:50661] with delay of [1009] for rows
[6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:02,591  Retrying due to errors: Error #0 from [10.11.2.92:50661] for [6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException:
Region is not online: 0200b366bc37c5afd1185f7d487c7dfb
2013-01-22 17:59:02,591  Sending requests to [10.11.2.92:51231] with delay of [2002] for rows
[6b03eab4f067cf8739e85bd408f05d94-27118;]
{code}

ChaosMonkey kills the server which the region is moving to (50661). 
We, meanwhile, cycle back to 51231 based on meta, get redirect again, and go to 50661 again.
But it's now dead.
{code}
2013-01-22 17:59:03,121 INFO  [Thread-506] util.ChaosMonkey$Action(179): Killing region server:10.11.2.92,50661,1358906192942
2013-01-22 17:59:04,594  Retrying due to errors: Error #0 from [10.11.2.92:51231] for [6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.RegionMovedException: Region moved to: hostname=10.11.2.92 port=50661.
As of locationSeqNum=129492.
2013-01-22 17:59:04,595  Sending requests to [10.11.2.92:50661] with delay of [2014] for rows
[6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:06,609  Retrying due to errors: Error #0 from [10.11.2.92:50661] for [6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed
servers list: /10.11.2.92:50661
2013-01-22 17:59:06,616  Sending requests to [10.11.2.92:50661] with delay of [4004] for rows
[6b03eab4f067cf8739e85bd408f05d94-27118;]
{code}

At this point we keep retrying against 50661, which is in META.
{code}
2013-01-22 17:59:08,631 Current INFO from scan results = {NAME =&gt; &apos;IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.&apos;,
STARTKEY =&gt; &apos;66666660&apos;, ENDKEY =&gt; &apos;7333332c&apos;,
ENCODED =&gt; 0200b366bc37c5afd1185f7d487c7dfb,}
2013-01-22 17:59:08,631 Cached location for IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
is 10.11.2.92:50661
2013-01-22 17:59:10,621  Retrying due to errors: Error #0 from [10.11.2.92:50661] for [6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed
servers list: /10.11.2.92:50661
2013-01-22 17:59:10,626  Sending requests to [10.11.2.92:50661] with delay of [4025] for rows
[6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:14,653  Retrying due to errors: Error #0 from [10.11.2.92:50661] for [6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed
servers list: /10.11.2.92:50661
2013-01-22 17:59:14,662  Sending requests to [10.11.2.92:50661] with delay of [8026] for rows
[6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:22,689  Retrying due to errors: Error #0 from [10.11.2.92:50661] for [6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed
servers list: /10.11.2.92:50661
2013-01-22 17:59:22,699  Sending requests to [10.11.2.92:50661] with delay of [16151] for
rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 17:59:28,748 Current INFO from scan results = {NAME =&gt; &apos;IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.&apos;,
STARTKEY =&gt; &apos;66666660&apos;, ENDKEY =&gt; &apos;7333332c&apos;,
ENCODED =&gt; 0200b366bc37c5afd1185f7d487c7dfb,}
2013-01-22 17:59:28,748 Cached location for IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
is 10.11.2.92:50661
{code}

Master thinks assignment timed out, despite region being in meta!
{code}
2013-01-22 17:59:34,011 INFO  [10.11.2.92,50658,1358906192673.timeoutMonitor] master.AssignmentManager$TimeoutMonitor(2560):
Regions in transition timed out:  {IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.
state=OPENING, ts=0, server=10.11.2.92,50661,1358906192942}
{code}

Region is opened on 52058 thanks to timeout monitor, and we finally get correct location.

However, by this time the combination of the above delay and growing request delay exceed
ChaosMonkey period, so it moves the region again and we finally run out of retries.
{code}
2013-01-22 17:59:38,852  Retrying due to errors: Error #0 from [10.11.2.92:50661] for [6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
java.net.ConnectException: Connection refused
2013-01-22 17:59:38,860  Sending requests to [10.11.2.92:52068] with delay of [32199] for
rows [6b03eab4f067cf8739e85bd408f05d94-27118;]
2013-01-22 18:00:05,481 DEBUG [Thread-506] util.ChaosMonkey$Action(203): Removing 2 regions
from 10.11.2.92,52068,1358906350270
2013-01-22 18:00:11,070  Failed to insert: 27118; region information: cached: region=IntegrationTestRebalanceAndKillServersTargeted,66666660,1358906196709.0200b366bc37c5afd1185f7d487c7dfb.,
hostname=10.11.2.92, port=51015, seqNum=148999; cache is up to date; errors: Error #0 from
[10.11.2.92:52068] for [6b03eab4f067cf8739e85bd408f05d94-27118]java.util.concurrent.ExecutionException:
org.apache.hadoop.hbase.RegionMovedException: Region moved to: hostname=10.11.2.92 port=51015.
As of locationSeqNum=148999.
{code}


                
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted fails when I cd
hbase-it and mvn verify
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7520
>                 URL: https://issues.apache.org/jira/browse/HBASE-7520
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>         Environment: macosx trunk
>            Reporter: stack
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>
> Trying to make up something to hand off to bigtop project, running the hbase it tests,
this one fails.
> {code}
> durruti:failsafe-reports stack$ more org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted.txt

> -------------------------------------------------------------------------------
> Test set: org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted
> -------------------------------------------------------------------------------
> Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 206.538 sec <<<
FAILURE!
> testDataIngest(org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted)
 Time elapsed: 206.395 sec  <<< FAILURE!
> junit.framework.AssertionFailedError: Load failed with error code 1
>         at junit.framework.Assert.fail(Assert.java:50)
>         at org.apache.hadoop.hbase.IngestIntegrationTestBase.runIngestTest(IngestIntegrationTestBase.java:98)
>         at org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted.testDataIngest(IntegrationTestRebalanceAndKillServersTargeted.java:121)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ...
> {code}
> org.apache.hadoop.hbase.IntegrationTestRebalanceAndKillServersTargeted-output.txt  has
nothing in it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message