hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-9387) Region could get lost during assignment
Date Fri, 30 Aug 2013 21:26:52 GMT

    [ https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755153#comment-13755153 ]

stack commented on HBASE-9387:
------------------------------

You should probably set stopped state too if you call abort in your MockRegionServer since
abort always calls stop.
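
A minimal sketch of what I mean, assuming the mock tracks its state with two flags. The class name MockRegionServerSketch and its fields are hypothetical and not taken from the patch; the point is only that abort() should leave the mock both aborted and stopped.

```java
// Hypothetical stand-in for the test's MockRegionServer; names are illustrative only.
public class MockRegionServerSketch {
    private volatile boolean aborted = false;
    private volatile boolean stopped = false;

    // Mirrors the Stoppable contract: stop() flips the stopped flag.
    public void stop(String why) {
        this.stopped = true;
    }

    // Mirrors the Abortable contract: abort() records the abort and then
    // stops, so isStopped() is also true afterwards.
    public void abort(String why, Throwable cause) {
        this.aborted = true;
        stop(why);
    }

    public boolean isAborted() {
        return aborted;
    }

    public boolean isStopped() {
        return stopped;
    }
}
```

With this shape, a test that checks isStopped() after an abort sees the same state it would see on a server whose abort path goes through stop.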

This change is now gratuitous, right?

@@ -438,9 +439,10 @@
           EventType.M_ZK_REGION_OFFLINE,
           EventType.RS_ZK_REGION_FAILED_OPEN,
           versionOfOfflineNode) == -1) {
-        LOG.warn("Unable to mark region " + hri + " as FAILED_OPEN. " +
+        String warnMsg = "Unable to mark region " + hri + " as FAILED_OPEN. " +
             "It's likely that the master already timed out this open " +
-            "attempt, and thus another RS already has the region.");
+            "attempt, and thus another RS already has the region.";
+        LOG.warn(warnMsg);
       } else {
         result = true;
       }

On the test change, how do I know it replicates what we saw here?  I started to dig but it was
taking too long.  I would expect a comment explaining why we expect the RS to abort, and another
explaining why yanking the znode is not the same as the master removing it on a successful open.

                
> Region could get lost during assignment
> ---------------------------------------
>
>                 Key: HBASE-9387
>                 URL: https://issues.apache.org/jira/browse/HBASE-9387
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.95.2
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Critical
>         Attachments: 9387-v1.txt, 9387-v3.txt, 9387-v4.2.txt, 9387-v4.3.txt, 9387-v4.txt, 9387-v5.txt, hbase-9387.patch, org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt
>
>
> I observed test timeout running against hadoop 2.1.0 with distributed log replay turned on.
> Looks like region state for 1588230740 became inconsistent between master and the surviving region server:
> {code}
> 2013-08-29 22:15:34,180 INFO  [AM.ZK.Worker-pool2-t4] master.RegionStates(299): Onlined 1588230740 on kiyo.gq1.ygridcore.net,57016,1377814510039
> ...
> 2013-08-29 22:15:34,587 DEBUG [Thread-221] client.HConnectionManager$HConnectionImplementation(1269): locateRegionInMeta parentTable=hbase:meta, metaLocation={region=hbase:meta,,1.1588230740, hostname=kiyo.gq1.ygridcore.net,57016,1377814510039, seqNum=0}, attempt=2 of 35 failed; retrying after sleep of 302 because: org.apache.hadoop.hbase.exceptions.RegionOpeningException: Region is being opened: 1588230740
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2574)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3949)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2733)
>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26965)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2063)
>         at org.apache.hadoop.hbase.ipc.RpcServer$CallRunner.run(RpcServer.java:1800)
>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:165)
>         at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:41)
> {code}

