hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4511) There is data loss when master failovers
Date Mon, 07 Nov 2011 05:04:51 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145227#comment-13145227 ]

ramkrishna.s.vasudevan commented on HBASE-4511:
-----------------------------------------------

@Stack
Patch looks fine. I have one suggestion:
{code}
+      if (isExpiring(expiredServer, currentMetaServer) ||
+          expireIfOnline(currentMetaServer)) {
+        // We are expiring the server that is carrying meta because unreachable
+        // The expiration processing will take care of reassigning meta.
+      }
{code}

As you clearly explained, if we are already expiring a server while assigning meta, then we
will not expire it a second time.
So can we rename isExpiring to isAlreadyExpiring()?  Also, can we split the condition, since
the if block is currently empty?
We could call isAlreadyExpiring() first and only then decide whether to go on to expireIfOnline.
Just a thought.  You can decide, Stack. :)
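
To make the suggestion concrete, here is a minimal sketch of how the split condition might look. It is not the patch itself: MetaExpirySketch, its two sets, the String server names and metaHandledByExpiration() are hypothetical stand-ins, and the bodies of isAlreadyExpiring() and expireIfOnline() are only my assumption about what the patch's methods do.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the master-side logic; not the real HBase classes.
class MetaExpirySketch {
    private final Set<String> onlineServers = new HashSet<>();
    private final Set<String> expiredServers = new HashSet<>();

    void registerOnline(String serverName) {
        onlineServers.add(serverName);
    }

    // Assumed meaning: the server carrying meta is the one we are already expiring.
    static boolean isAlreadyExpiring(String expiredServer, String currentMetaServer) {
        return expiredServer != null && expiredServer.equals(currentMetaServer);
    }

    // Assumed meaning: expire the server only if it is still registered as online.
    boolean expireIfOnline(String serverName) {
        if (serverName == null || !onlineServers.remove(serverName)) {
            return false;
        }
        expiredServers.add(serverName);
        return true;
    }

    // Same short-circuit behaviour as "isExpiring(...) || expireIfOnline(...)",
    // but without an empty if block.
    boolean metaHandledByExpiration(String expiredServer, String currentMetaServer) {
        if (isAlreadyExpiring(expiredServer, currentMetaServer)) {
            // Expiration is already in progress and will reassign meta;
            // do not expire the same server a second time.
            return true;
        }
        // Not being expired yet: expire it now if it is still online.
        return expireIfOnline(currentMetaServer);
    }

    public static void main(String[] args) {
        MetaExpirySketch master = new MetaExpirySketch();
        master.registerOnline("rs-a");
        master.registerOnline("rs-b");
        System.out.println(master.metaHandledByExpiration("rs-a", "rs-a"));
        System.out.println(master.metaHandledByExpiration("rs-a", "rs-b"));
        System.out.println(master.metaHandledByExpiration(null, "rs-c"));
    }
}
```

The caller would then assign meta itself only when metaHandledByExpiration() returns false, which keeps the two cases readable on their own.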
                
> There is data loss when master failovers
> ----------------------------------------
>
>                 Key: HBASE-4511
>                 URL: https://issues.apache.org/jira/browse/HBASE-4511
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.92.0
>            Reporter: gaojinchao
>            Assignee: stack
>            Priority: Minor
>             Fix For: 0.92.0
>
>         Attachments: 4511-v2.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt
>
>
> It goes like this:
> The master crashes while, at the same time, the RS carrying meta is crashing but has not exited.
> The master starts up again and finds all living RSs.
> The master's verification of meta fails because this RS is crashing.
> The master reassigns meta, but it doesn't split the HLog.
> So some meta data is lost.
> Below are the logs from a failing failover test case.
> // The log says that we want to kill an RS.
> 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): STOPPED:
Killing for unit test
> 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
> RS 192.168.2.102,54385,1317264874629 killed 
> // The RS didn't crash.
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458):
Registering server found up in zk: 192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232):
Registering server=192.168.2.102,54385,1317264874629
> 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491):
master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because
node does not exist (not an error)
> 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003):
master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server
and set watcher; 192.168.2.102,54383,131726487...
> // Meta verification failed and the master reassigned meta, so all the regions in meta are lost.
> 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476):
Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException:
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629
not running, aborting
> 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316):
new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003):
master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server
and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476):
Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException:
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629
not running, aborting
> 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316):
new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003):
master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server
and set watcher; 192.168.2.102,54383,131726487...
> 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476):
Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException:
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629
not running, aborting
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316):
new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
> 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264):
master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with
OFFLINE state
> 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233):
master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected,
path=/hbase/unassigned/1028785192
> // The log says that the master treats this as a clean cluster startup.
> 2011-09-28 19:54:52,889 INFO  [Master:0;192.168.2.102,54557,1317264885720] master.AssignmentManager(383):
Clean cluster startup. Assigning userregions
> 2011-09-28 19:54:52,889 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(494):
master:54557-0x132b31adbb30005 Deleting any existing unassigned nodes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
