hbase-issues mailing list archives

From "Matteo Bertozzi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10136) Alter table conflicts with concurrent snapshot attempt on that table
Date Thu, 12 Dec 2013 12:43:08 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13846277#comment-13846277
] 

Matteo Bertozzi commented on HBASE-10136:
-----------------------------------------

This is a more generic problem than just alter table vs. snapshots.
Both snapshots and alter table seem to acquire/release the table lock properly,
but the BulkReOpen operation invoked by TableEventHandler.reOpenAllRegions() is asynchronous:


{code}
[hbase/master/BulkReOpen.java]
/** Reopen the regions asynchronously, so always returns true immediately. */
protected boolean waitUntilDone(long timeout) {
  return true;
}
{code}

This means that every table operation going through TableEventHandler.reOpenAllRegions(), in
this case the alter table operation, releases the lock before the operation is actually completed.

If you just call admin.addColumn() and take a look at the logs, you can easily see that
the operation is performed asynchronously:
{code}
// 1. client.addColumn() call from client...

// 2. The operation is now on the master
2013-12-12 12:09:57,613 DEBUG [MASTER] lock.ZKInterProcessLockBase: Acquired a lock for /hbase/table-lock/TestTable/write-master:452010000000001
2013-12-12 12:09:57,640 INFO  [MASTER] handler.TableEventHandler: Handling table operation
C_M_ADD_FAMILY on table TestTable
2013-12-12 12:09:57,685 INFO  [MASTER] master.MasterFileSystem: AddColumn. Table = TestTable
HCD = {NAME => 'x-1386850197327', DATA_BLOCK_ENCODING => 'NONE',$
2013-12-12 12:09:57,693 INFO  [MASTER] handler.TableEventHandler: Bucketing regions by region
server...
...
2013-12-12 12:09:57,771 INFO  [MASTER] handler.TableEventHandler: Completed table operation
C_M_ADD_FAMILY on table TestTable
2013-12-12 12:09:57,771 DEBUG [MASTER] master.AssignmentManager: Starting unassign of TestTable,,1386849056038.854b280$
2013-12-12 12:09:57,772 DEBUG [MASTER] lock.ZKInterProcessLockBase: Released /hbase/table-lock/TestTable/write-master:452010000000001

// 3. The Table*Handler operation is now completed, and the client notified with "I'm done!"

// 4. Now the BulkReopen is starting doing the reopen
2013-12-12 12:09:57,772 INFO  [MASTER] master.RegionStates: Transitioned {854b280006aec464083778a5cb5f5456
state=OPEN,$
2013-12-12 12:09:57,772 INFO  [Priority.RpcServer.handler=5,port=39384] regionserver.HRegionServer:
Close 854b280006aec464083778a5cb5f5456, via zk=yes, znode version=0, on null
2013-12-12 12:09:57,773 DEBUG [RS_CLOSE_REGION] handler.CloseRegionHandler: Processing close
of TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.
2013-12-12 12:09:57,773 DEBUG [MASTER] master.AssignmentManager: Sent CLOSE to localhost,39384,1386848453374
for regio$
2013-12-12 12:09:57,773 DEBUG [RS_CLOSE_REGION] regionserver.HRegion: Closing TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.:
disabling compactions & flush$
2013-12-12 12:09:57,773 DEBUG [RS_CLOSE_REGION] regionserver.HRegion: Updates disabled for
region TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.
2013-12-12 12:09:57,774 INFO  [StoreCloserThread-TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.-1]
regionserver.HStore: Closed info
2013-12-12 12:09:57,774 INFO  [RS_CLOSE_REGION] regionserver.HRegion: Closed TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.
2013-12-12 12:09:57,774 DEBUG [RS_CLOSE_REGION] zookeeper.ZKAssign: regionserver:39384-0x142e69bfea20001,
quorum=localhost:2181, baseZNode=/hbase Transitioning 854b280$
2013-12-12 12:09:57,775 DEBUG [RS_CLOSE_REGION] zookeeper.ZKAssign: regionserver:39384-0x142e69bfea20001,
quorum=localhost:2181, baseZNode=/hbase Transitioned node 854$
2013-12-12 12:09:57,775 DEBUG [RS_CLOSE_REGION] handler.CloseRegionHandler: Set closed state
in zk for TestTable,,1386849056038.854b280006aec464083778a5cb5f5456. on lo$
2013-12-12 12:09:57,775 DEBUG [RS_CLOSE_REGION] handler.CloseRegionHandler: Closed TestTable,,1386849056038.854b280006aec464083778a5cb5f5456.
2013-12-12 12:09:57,775 INFO  [ProcessThread(sid:0 cport:-1):] server.PrepRequestProcessor:
Processed session termination for sessionid: 0x142e69bfea20023
2013-12-12 12:09:57,775 DEBUG [AM.ZK.Worker-pool2-t70] master.AssignmentManager: Handling
RS_ZK_REGION_CLOSED, server=localhost,39384,1386848453374, region=854b280006aec464083778a5cb5f5$
{code}
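The interleaving above can be reduced to a self-contained sketch (plain Java, NOT HBase code; all names here are illustrative): the "alter" side releases the table lock right after merely *scheduling* the reopen, so a concurrent "snapshot" can take the lock while regions are still closing.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the race (hypothetical names, not the HBase classes): the lock
// is held only while the async work is scheduled, mirroring
// BulkReOpen.waitUntilDone() returning true immediately.
public class AsyncLockRace {

    static boolean snapshotSeesReopenInFlight() throws InterruptedException {
        ReentrantLock tableLock = new ReentrantLock();
        CountDownLatch reopenDone = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // "alter table": take the lock, schedule the reopen, release.
        tableLock.lock();
        try {
            pool.submit(() -> {
                try {
                    Thread.sleep(500);        // regions actually reopen later...
                } catch (InterruptedException ignored) {
                }
                reopenDone.countDown();
            });
        } finally {
            tableLock.unlock();               // released before the reopen completes
        }

        // "snapshot": acquires the lock immediately, yet the reopen is
        // still running -- nothing stops it from proceeding.
        tableLock.lock();
        try {
            return reopenDone.getCount() > 0;
        } finally {
            tableLock.unlock();
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("snapshot saw reopen still in flight: "
                + snapshotSeesReopenInFlight());
    }
}
```

Run as-is, the "snapshot" step reliably observes the reopen still in flight, which is exactly the window the logs above show.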

So this is another case where the Admin operation is not synchronous, and the table lock
is not really able to do its job, since the end of handler.process() does not coincide
with "operation completed".

We can probably do a dirty fix for each case as it comes up (see also HBASE-6992),
but this is a general problem with the current Master/Client design: we don't really know
when an operation is completed. Looking at the new master design (HBASE-5487), it looks like
this will be resolved, but the new master seems to be in the distant future...
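One shape such a per-case fix could take is making waitUntilDone() actually block until every region has been reopened, so the handler only releases the table lock afterwards. A minimal sketch of that pattern (plain Java with hypothetical names like pendingRegions/regionReopened, not the actual BulkReOpen API):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of a blocking waitUntilDone: one latch count per region to reopen;
// the handler would only release the table lock once the latch drains
// (or the timeout fires), instead of returning true immediately.
public class BlockingReopen {
    private final CountDownLatch pendingRegions;

    public BlockingReopen(int regionCount) {
        this.pendingRegions = new CountDownLatch(regionCount);
    }

    /** Called as each region finishes reopening (hypothetical callback). */
    public void regionReopened() {
        pendingRegions.countDown();
    }

    /** Blocks until all reopens complete, or the timeout elapses;
     *  returns true only if every region was reopened in time. */
    public boolean waitUntilDone(long timeoutMillis) throws InterruptedException {
        return pendingRegions.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

With something like this, TableEventHandler.reOpenAllRegions() could hold the table lock until waitUntilDone() returns, closing the window the snapshot currently falls into, at the cost of the handler (and the client call) blocking for the duration of the reopen.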

> Alter table conflicts with concurrent snapshot attempt on that table
> --------------------------------------------------------------------
>
>                 Key: HBASE-10136
>                 URL: https://issues.apache.org/jira/browse/HBASE-10136
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 0.96.0, 0.98.1, 0.99.0
>            Reporter: Aleksandr Shulman
>            Assignee: Matteo Bertozzi
>              Labels: online_schema_change
>
> Expected behavior:
> A user can issue a request for a snapshot of a table while that table is undergoing an
online schema change and expect that snapshot request to complete correctly. The same
is true if a user issues an online schema change request while a snapshot attempt is ongoing.
> Observed behavior:
> Snapshot attempts time out when there is an ongoing online schema change because the
region is closed and opened during the snapshot. 
> As a side note, I would expect the attempt to fail fast rather than time out.
> Further, what I have seen is that subsequent attempts to snapshot the table fail because
of some state/cleanup issues. This is also concerning.
> Immediate error:
> {code}type=FLUSH }' is still in progress!
> 2013-12-11 15:58:32,883 DEBUG [Thread-385] client.HBaseAdmin(2696): (#11) Sleeping: 10000ms
while waiting for snapshot completion.
> 2013-12-11 15:58:42,884 DEBUG [Thread-385] client.HBaseAdmin(2704): Getting current status
of snapshot from master...
> 2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] master.HMaster(2891):
Checking to see if snapshot from request:{ ss=snapshot0 table=changeSchemaDuringSnapshot1386806258640
type=FLUSH } is done
> 2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] snapshot.SnapshotManager(374):
Snapshoting '{ ss=snapshot0 table=changeSchemaDuringSnapshot1386806258640 type=FLUSH }' is
still in progress!
> Snapshot failure occurred
> org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 'snapshot0' wasn't
completed in expectedTime:60000 ms
> 	at org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2713)
> 	at org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2638)
> 	at org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2602)
> 	at org.apache.hadoop.hbase.client.TestAdmin$BackgroundSnapshotThread.run(TestAdmin.java:1974){code}
> Likely root cause of error:
> {code}Exception in SnapshotSubprocedurePool
> java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.NotServingRegionException:
changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3.
is closing
> 	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:83)
> 	at org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:314)
> 	at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
> 	at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
> 	at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
> 	at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:1)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> 	at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.hbase.NotServingRegionException: changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3.
is closing
> 	at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5327)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5289)
> 	at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:79)
> 	at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:1)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> 	... 5 more{code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
