hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11036) Online schema change with region merge may cause data loss
Date Wed, 21 May 2014 02:03:38 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004221#comment-14004221
] 

Enis Soztutar commented on HBASE-11036:
---------------------------------------

bq. Though about META loss, but I could not find anything that indicates that.
Upon further inspection, this seems a data loss on the meta region. I have created HBASE-11217
to track that because the logs in this issue is not that relevant for that root cause. 

> Online schema change with region merge may cause data loss 
> -----------------------------------------------------------
>
>                 Key: HBASE-11036
>                 URL: https://issues.apache.org/jira/browse/HBASE-11036
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>             Fix For: 0.99.0
>
>
> We have found out that online schema change and region merges may still cause issues
about merged regions coming back online, and thus causing data loss. 
> Recently ITBLL failed reporting 800K missing rows out of 720M. We've been running this
test for some extended period of time, and this is the first we are seeing it, meaning that
it is more rare. But it is still concerning. 
> From master's log:
> The merge has happened:
> {code}
> 2014-04-16 18:26:37,247 INFO  [AM.ZK.Worker-pool2-t73] master.AssignmentManager: Handled
MERGED event; merged=IntegrationTestBigLinkedList,\xB2\xFE\x03s,1397672795119.80159738a0167e20a2e29fb2c46702f2.,
	
> 	region_a=IntegrationTestBigLinkedList,\xB2\xFE\x03s,1397670978959.0ac116e4d7da87922d3a8f218ca21079.,

> 	region_b=IntegrationTestBigLinkedList,\xB8\x03\x94\x15,1397672587456.1265d06478082ced65dd9a0c1c2b63c2.,
on hor13n03.gq1.ygridcore.net,60020,1397672668647
> {code}
> The region server shows the merge is complete: 
> {code}
> 2014-04-16 18:26:37,352 INFO  [regionserver60020-merges-1397672794972] regionserver.RegionMergeRequest:
Regions merged, hbase:meta updated, and report to master. region_a=IntegrationTestBigLinkedList,\xB2\xFE\x03s,1397670978959.0ac116e4d7da87922d3a8f218ca21079.,
region_b=IntegrationTestBigLinkedList,\xB8\x03\x94\x15,1397672587456.1265d06478082ced65dd9a0c1c2b63c2..
Region merge took 2sec
> {code}
> The new region was online on the region server for some time: 
> {code}
> 2014-04-16 18:31:22,858 DEBUG [RS_OPEN_REGION-hor13n03:60020-1] handler.OpenRegionHandler:
Opened IntegrationTestBigLinkedList,\xB2\xFE\x03s,1397672795119.80159738a0167e20a2e29fb2c46702f2.
on hor13n03.gq1.ygridcore.net,60020,1397672668647
> {code}
> Then the region server was killed around 2014-04-16 18:31:26,254. 
> The master started log splitting etc for the dead RS:
> {code}
> 2014-04-16 18:31:28,942 INFO  [MASTER_SERVER_OPERATIONS-hor13n02:60000-3] handler.ServerShutdownHandler:
Splitting logs for hor13n03.gq1.ygridcore.net,60020,1397672668647 before assignment.
> ..
> 2014-04-16 18:31:40,887 INFO  [MASTER_SERVER_OPERATIONS-hor13n02:60000-3] master.RegionStates:
Transitioned {80159738a0167e20a2e29fb2c46702f2 state=OPEN, ts=1397673082874, server=hor13n03.gq1.ygridcore.net,60020,1397672668647}
to {80159738a0167e20a2e29fb
> 2014-04-16 18:31:40,887 INFO  [MASTER_SERVER_OPERATIONS-hor13n02:60000-3] master.RegionStates:
Offlined 80159738a0167e20a2e29fb2c46702f2 from hor13n03.gq1.ygridcore.net,60020,1397672668647
> {code}
> But this region was not assigned again at all. Instead, the master restarted shortly
after, and reassigned the regions that were merged already: 
> {code}
> 2014-04-16 18:34:02,569 INFO  [master:hor13n02:60000] master.ActiveMasterManager: Registered
Active Master=hor13n02.gq1.ygridcore.net,60000,1397673241215
> ...
> 2014-04-16 18:34:10,412 INFO  [master:hor13n02:60000] master.AssignmentManager: Found
regions out on cluster or in RIT; presuming failover
> 2014-04-16 18:34:10,412 WARN  [master:hor13n02:60000] master.ServerManager: Expiration
of hor13n03.gq1.ygridcore.net,60020,1397671753021 but server not online
> ..
> 2014-04-16 18:34:10,880 INFO  [MASTER_SERVER_OPERATIONS-hor13n02:60000-3] master.AssignmentManager:
Bulk assigning 28 region(s) across 4 server(s), round-robin=true
> ..
> 2014-04-16 18:34:10,882 DEBUG [hor13n02.gq1.ygridcore.net,60000,1397673241215-GeneralBulkAssigner-1]
zookeeper.ZKAssign: master:60000-0x2456a4863640255, quorum=hor13n04.gq1.ygridcore.net:2181,hor13n03.gq1.ygridcore.net:2181,hor13n20.gq1.ygridcore.net:2181,
baseZNode=/hbase Async create of unassigned node 0ac116e4d7da8792..
> ..
> 2014-04-16 18:34:13,077 INFO  [AM.ZK.Worker-pool2-t7] master.RegionStates: Onlined 0ac116e4d7da87922d3a8f218ca21079
on hor13n04.gq1.ygridcore.net,60020,1397672685370
> {code}
> Before the master went down, there were other logs that indicate something funky: 
> {code}
> 2014-04-16 18:32:42,113 DEBUG [hor13n02.gq1.ygridcore.net,60000,1397672191204-org.apache.hadoop.hbase.master.BulkReOpen-0]
master.AssignmentManager: Starting unassign of IntegrationTestBigLinkedList,\xB2\xFE\x03s,1397670978959.0ac116e4d7da87922d3a8f218ca
> 2014-04-16 18:32:42,113 INFO  [hor13n02.gq1.ygridcore.net,60000,1397672191204-org.apache.hadoop.hbase.master.BulkReOpen-0]
master.AssignmentManager: Attempting to unassign {0ac116e4d7da87922d3a8f218ca21079 state=MERGED,
ts=1397672797223, server=hor13n03.gq1.ygridcore.net,60020,1397672668647}, ignored
> {code}
> This above log is due to bulk reopening because of online schema change: 
> {code}
> 2014-04-16 18:32:32,081 DEBUG [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Ignoring table not disabled exception for supporting online schema changes.
> 2014-04-16 18:32:32,082 INFO  [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Handling table operation C_M_MODIFY_TABLE on table IntegrationTestBigLinkedList
> 2014-04-16 18:32:32,150 DEBUG [FifoRpcScheduler.handler1-thread-27] util.FSTableDescriptors:
Wrote descriptor into: hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/.tabledesc/.tableinfo.0000000051
> 2014-04-16 18:32:32,159 DEBUG [FifoRpcScheduler.handler1-thread-27] util.FSTableDescriptors:
Deleted table descriptor at hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/.tabledesc/.tableinfo.0000000050
> 2014-04-16 18:32:32,159 INFO  [FifoRpcScheduler.handler1-thread-27] util.FSTableDescriptors:
Updated tableinfo=hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/.tabledesc/.tableinfo.0000000051
> 2014-04-16 18:32:32,160 INFO  [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Bucketing regions by region server...
> 2014-04-16 18:32:32,191 INFO  [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Skip {ENCODED => e81668b7b64a9c92dd3689db96a3adb0, NAME => 'IntegrationTestBigLinkedList,\xC0\x03O,1397669732121.e81668b7b64a9c92dd3689db96a3adb0.',
STARTKEY =
> 2014-04-16 18:32:32,191 INFO  [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Skip {ENCODED => a51376af5b2aaef5bde980d48b05d75d, NAME => 'IntegrationTestBigLinkedList,\xE8\x01\x99,,1397671676090.a51376af5b2aaef5bde980d48b05d75d.',
STARTK
> 2014-04-16 18:32:32,191 INFO  [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Reopening 51 regions on 5 region servers.
> {code}
> My theory is that this may have left the regions in RIT state, and when the master was
restarted it saw those in RIT and assigned. 
> Also notice that this is not an issue about a concurrent merge and online schema change
that can be prevented with table locks. The merge happened minutes ago, however the online
schema change still wanted to reassign the merged regions. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message