hbase-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-11036) Online schema change with region merge may cause data loss
Date Fri, 18 Apr 2014 23:22:15 GMT
Enis Soztutar created HBASE-11036:
-------------------------------------

             Summary: Online schema change with region merge may cause data loss 
                 Key: HBASE-11036
                 URL: https://issues.apache.org/jira/browse/HBASE-11036
             Project: HBase
          Issue Type: Bug
            Reporter: Enis Soztutar
             Fix For: 0.99.0, 0.98.2


We have found that online schema change combined with region merges can still cause already-merged regions to come back online, resulting in data loss.

Recently ITBLL failed, reporting 800K missing rows out of 720M. We have been running this test for an extended period of time, and this is the first time we have seen the failure, so it is rare. But it is still concerning.

From the master's log, the merge has happened:
{code}
2014-04-16 18:26:37,247 INFO  [AM.ZK.Worker-pool2-t73] master.AssignmentManager: Handled MERGED
event; merged=IntegrationTestBigLinkedList,\xB2\xFE\x03s,1397672795119.80159738a0167e20a2e29fb2c46702f2.,
region_a=IntegrationTestBigLinkedList,\xB2\xFE\x03s,1397670978959.0ac116e4d7da87922d3a8f218ca21079.,
region_b=IntegrationTestBigLinkedList,\xB8\x03\x94\x15,1397672587456.1265d06478082ced65dd9a0c1c2b63c2.,
on hor13n03.gq1.ygridcore.net,60020,1397672668647
{code}

The region server shows the merge is complete: 
{code}
2014-04-16 18:26:37,352 INFO  [regionserver60020-merges-1397672794972] regionserver.RegionMergeRequest:
Regions merged, hbase:meta updated, and report to master. region_a=IntegrationTestBigLinkedList,\xB2\xFE\x03s,1397670978959.0ac116e4d7da87922d3a8f218ca21079.,
region_b=IntegrationTestBigLinkedList,\xB8\x03\x94\x15,1397672587456.1265d06478082ced65dd9a0c1c2b63c2..
Region merge took 2sec
{code}

The new region was online on the region server for some time: 
{code}
2014-04-16 18:31:22,858 DEBUG [RS_OPEN_REGION-hor13n03:60020-1] handler.OpenRegionHandler:
Opened IntegrationTestBigLinkedList,\xB2\xFE\x03s,1397672795119.80159738a0167e20a2e29fb2c46702f2.
on hor13n03.gq1.ygridcore.net,60020,1397672668647
{code}
Then the region server was killed around 2014-04-16 18:31:26,254. 

The master started log splitting etc for the dead RS:
{code}
2014-04-16 18:31:28,942 INFO  [MASTER_SERVER_OPERATIONS-hor13n02:60000-3] handler.ServerShutdownHandler:
Splitting logs for hor13n03.gq1.ygridcore.net,60020,1397672668647 before assignment.
..
2014-04-16 18:31:40,887 INFO  [MASTER_SERVER_OPERATIONS-hor13n02:60000-3] master.RegionStates:
Transitioned {80159738a0167e20a2e29fb2c46702f2 state=OPEN, ts=1397673082874, server=hor13n03.gq1.ygridcore.net,60020,1397672668647}
to {80159738a0167e20a2e29fb
2014-04-16 18:31:40,887 INFO  [MASTER_SERVER_OPERATIONS-hor13n02:60000-3] master.RegionStates:
Offlined 80159738a0167e20a2e29fb2c46702f2 from hor13n03.gq1.ygridcore.net,60020,1397672668647
{code}

But this region was never assigned again. Instead, the master restarted shortly after and
reassigned the regions that had already been merged:
{code}
2014-04-16 18:34:02,569 INFO  [master:hor13n02:60000] master.ActiveMasterManager: Registered
Active Master=hor13n02.gq1.ygridcore.net,60000,1397673241215
...
2014-04-16 18:34:10,412 INFO  [master:hor13n02:60000] master.AssignmentManager: Found regions
out on cluster or in RIT; presuming failover
2014-04-16 18:34:10,412 WARN  [master:hor13n02:60000] master.ServerManager: Expiration of
hor13n03.gq1.ygridcore.net,60020,1397671753021 but server not online
..
2014-04-16 18:34:10,880 INFO  [MASTER_SERVER_OPERATIONS-hor13n02:60000-3] master.AssignmentManager:
Bulk assigning 28 region(s) across 4 server(s), round-robin=true
..
2014-04-16 18:34:10,882 DEBUG [hor13n02.gq1.ygridcore.net,60000,1397673241215-GeneralBulkAssigner-1]
zookeeper.ZKAssign: master:60000-0x2456a4863640255, quorum=hor13n04.gq1.ygridcore.net:2181,hor13n03.gq1.ygridcore.net:2181,hor13n20.gq1.ygridcore.net:2181,
baseZNode=/hbase Async create of unassigned node 0ac116e4d7da8792..
..
2014-04-16 18:34:13,077 INFO  [AM.ZK.Worker-pool2-t7] master.RegionStates: Onlined 0ac116e4d7da87922d3a8f218ca21079
on hor13n04.gq1.ygridcore.net,60020,1397672685370
{code}


Before the master went down, there were other log entries indicating something funky:
{code}
2014-04-16 18:32:42,113 DEBUG [hor13n02.gq1.ygridcore.net,60000,1397672191204-org.apache.hadoop.hbase.master.BulkReOpen-0]
master.AssignmentManager: Starting unassign of IntegrationTestBigLinkedList,\xB2\xFE\x03s,1397670978959.0ac116e4d7da87922d3a8f218ca
2014-04-16 18:32:42,113 INFO  [hor13n02.gq1.ygridcore.net,60000,1397672191204-org.apache.hadoop.hbase.master.BulkReOpen-0]
master.AssignmentManager: Attempting to unassign {0ac116e4d7da87922d3a8f218ca21079 state=MERGED,
ts=1397672797223, server=hor13n03.gq1.ygridcore.net,60020,1397672668647}, ignored
{code}

The log above comes from the bulk reopen triggered by the online schema change:
{code}
2014-04-16 18:32:32,081 DEBUG [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Ignoring table not disabled exception for supporting online schema changes.
2014-04-16 18:32:32,082 INFO  [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Handling table operation C_M_MODIFY_TABLE on table IntegrationTestBigLinkedList
2014-04-16 18:32:32,150 DEBUG [FifoRpcScheduler.handler1-thread-27] util.FSTableDescriptors:
Wrote descriptor into: hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/.tabledesc/.tableinfo.0000000051
2014-04-16 18:32:32,159 DEBUG [FifoRpcScheduler.handler1-thread-27] util.FSTableDescriptors:
Deleted table descriptor at hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/.tabledesc/.tableinfo.0000000050
2014-04-16 18:32:32,159 INFO  [FifoRpcScheduler.handler1-thread-27] util.FSTableDescriptors:
Updated tableinfo=hdfs://hor13n01.gq1.ygridcore.net:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/.tabledesc/.tableinfo.0000000051
2014-04-16 18:32:32,160 INFO  [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Bucketing regions by region server...
2014-04-16 18:32:32,191 INFO  [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Skip {ENCODED => e81668b7b64a9c92dd3689db96a3adb0, NAME => 'IntegrationTestBigLinkedList,\xC0\x03O,1397669732121.e81668b7b64a9c92dd3689db96a3adb0.',
STARTKEY =
2014-04-16 18:32:32,191 INFO  [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Skip {ENCODED => a51376af5b2aaef5bde980d48b05d75d, NAME => 'IntegrationTestBigLinkedList,\xE8\x01\x99,,1397671676090.a51376af5b2aaef5bde980d48b05d75d.',
STARTK
2014-04-16 18:32:32,191 INFO  [FifoRpcScheduler.handler1-thread-27] handler.TableEventHandler:
Reopening 51 regions on 5 region servers.
{code}

My theory is that the bulk reopen may have left the merged regions in RIT state, and when the
master was restarted it saw them in RIT and assigned them.
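The theory can be sketched as a minimal, self-contained model; the class and method names below are illustrative, not the real HBase AssignmentManager API:

```java
import java.util.*;

public class MergedRegionRace {
    enum State { OPEN, MERGED, OFFLINE }

    final Map<String, State> states = new HashMap<>();
    final Set<String> regionsInTransition = new HashSet<>();

    // Bulk reopen (online schema change): unassign every region of the table.
    // A MERGED region is "ignored", but -- per the theory above -- may still
    // end up recorded as a region in transition.
    void bulkReopenUnassign(String region) {
        if (states.get(region) == State.MERGED) {
            regionsInTransition.add(region); // suspected bug: ignored yet left in RIT
            return;
        }
        states.put(region, State.OFFLINE);
        regionsInTransition.add(region);
    }

    // Master failover: every region found in RIT is presumed live and
    // reassigned -- which brings a merged parent back online.
    List<String> failoverAssign() {
        List<String> assigned = new ArrayList<>(regionsInTransition);
        for (String region : assigned) {
            states.put(region, State.OPEN);
        }
        regionsInTransition.clear();
        return assigned;
    }
}
```

Under this model the merged region 0ac116e4d7da87922d3a8f218ca21079 would flow through bulkReopenUnassign into the RIT set, then be reassigned by failoverAssign after the master restart, matching the logs above.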

Also note that this is not an issue of a concurrent merge and online schema change, which could
be prevented with table locks. The merge had happened minutes earlier, yet the online schema
change still tried to reassign the merged regions.
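Since the merge completes well before the schema change starts, a table lock cannot help; any fix has to live in the assignment path itself. One possible shape of such a guard, as a standalone sketch (hypothetical helper, not the actual HBase code): at failover, filter out regions already recorded as merged away before bulk assigning whatever was found in RIT.

```java
import java.util.*;

public class FailoverAssignFilter {
    // Hypothetical helper: given the regions found in transition at master
    // startup and the set of regions recorded as merged away (e.g. derived
    // from hbase:meta), keep only the regions that are safe to assign.
    static List<String> assignable(Collection<String> regionsInTransition,
                                   Set<String> mergedAwayRegions) {
        List<String> result = new ArrayList<>();
        for (String region : regionsInTransition) {
            if (!mergedAwayRegions.contains(region)) {
                result.add(region);
            }
        }
        return result;
    }
}
```

With such a filter, region_a (0ac116e4d7da87922d3a8f218ca21079) would be dropped from the bulk assignment while the surviving merged region would still be assigned.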



--
This message was sent by Atlassian JIRA
(v6.2#6252)
