hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matteo Bertozzi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-10136) the table-lock of TableEventHandler is released too early because reOpenAllRegions() is asynchronous
Date Tue, 17 Dec 2013 10:06:08 GMT

     [ https://issues.apache.org/jira/browse/HBASE-10136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Matteo Bertozzi updated HBASE-10136:
------------------------------------

    Description: 
Expected behavior:
With the introduction of the table-lock, a user can issue a request for a snapshot of a table
while that table is undergoing an online schema change and expect that snapshot request to
complete correctly. Also, the same is true if a user issues a online schema change request
while a snapshot attempt is ongoing.

Observed behavior:
Snapshot attempts time out when there is an ongoing online schema change because the table
lock is not acquired by anyone else and the regions are closed and opened during the snapshot.


TableEventHandler trace
{code}
// 1. client.addColumn() call from client...

// 2. The operation is now on the master
2013-12-12 12:09:57,613 DEBUG [MASTER] lock.ZKInterProcessLockBase: Acquired a lock for /hbase/table-lock/TestTable/write-master:452010000000001
2013-12-12 12:09:57,640 INFO  [MASTER] handler.TableEventHandler: Handling table operation
C_M_ADD_FAMILY on table TestTable
2013-12-12 12:09:57,685 INFO  [MASTER] master.MasterFileSystem: AddColumn. Table = TestTable
HCD = {NAME => 'x-1386850197327', DATA_BLOCK_ENCODING => 'NONE',$
2013-12-12 12:09:57,693 INFO  [MASTER] handler.TableEventHandler: Bucketing regions by region
server...
...
2013-12-12 12:09:57,771 INFO  [MASTER] handler.TableEventHandler: Completed table operation
C_M_ADD_FAMILY on table TestTable
2013-12-12 12:09:57,771 DEBUG [MASTER] master.AssignmentManager: Starting unassign of TestTable,,1386849056038.854b280$
2013-12-12 12:09:57,772 DEBUG [MASTER] lock.ZKInterProcessLockBase: Released /hbase/table-lock/TestTable/write-master:452010000000001

// 3. The Table*Handler operation is now completed, and the client notified with "I'm done!"

// 4. Now the BulkReopen is starting doing the reopen
2013-12-12 12:09:57,772 INFO  [MASTER] master.RegionStates: Transitioned {854b280006aec464083778a5cb5f5456
state=OPEN,$
{code}

  was:
Expected behavior:
A user can issue a request for a snapshot of a table while that table is undergoing an online
schema change and expect that snapshot request to complete correctly. Also, the same is true
if a user issues a online schema change request while a snapshot attempt is ongoing.

Observed behavior:
Snapshot attempts time out when there is an ongoing online schema change because the region
is closed and opened during the snapshot. 

As a side-note, I would expect that the attempt should fail quickly as opposed to timing out.


Further, what I have seen is that subsequent attempts to snapshot the table fail because of
some state/cleanup issues. This is also concerning.

Immediate error:
{code}type=FLUSH }' is still in progress!
2013-12-11 15:58:32,883 DEBUG [Thread-385] client.HBaseAdmin(2696): (#11) Sleeping: 10000ms
while waiting for snapshot completion.
2013-12-11 15:58:42,884 DEBUG [Thread-385] client.HBaseAdmin(2704): Getting current status
of snapshot from master...
2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] master.HMaster(2891): Checking
to see if snapshot from request:{ ss=snapshot0 table=changeSchemaDuringSnapshot1386806258640
type=FLUSH } is done
2013-12-11 15:58:42,887 DEBUG [FifoRpcScheduler.handler1-thread-3] snapshot.SnapshotManager(374):
Snapshoting '{ ss=snapshot0 table=changeSchemaDuringSnapshot1386806258640 type=FLUSH }' is
still in progress!
Snapshot failure occurred
org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Snapshot 'snapshot0' wasn't completed
in expectedTime:60000 ms
	at org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2713)
	at org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2638)
	at org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2602)
	at org.apache.hadoop.hbase.client.TestAdmin$BackgroundSnapshotThread.run(TestAdmin.java:1974){code}

Likely root cause of error:
{code}Exception in SnapshotSubprocedurePool
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.NotServingRegionException:
changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3.
is closing
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
	at java.util.concurrent.FutureTask.get(FutureTask.java:83)
	at org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:314)
	at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:118)
	at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:137)
	at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:181)
	at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:1)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.NotServingRegionException: changeSchemaDuringSnapshot1386806258640,77777777,1386806258720.ea776db51749e39c956d771a7d17a0f3.
is closing
	at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5327)
	at org.apache.hadoop.hbase.regionserver.HRegion.startRegionOperation(HRegion.java:5289)
	at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:79)
	at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure$RegionSnapshotTask.call(FlushSnapshotSubprocedure.java:1)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	... 5 more{code}


> the table-lock of TableEventHandler is released too early because reOpenAllRegions()
is asynchronous
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10136
>                 URL: https://issues.apache.org/jira/browse/HBASE-10136
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.96.0, 0.98.1, 0.99.0
>            Reporter: Aleksandr Shulman
>            Assignee: Matteo Bertozzi
>              Labels: online_schema_change
>         Attachments: HBASE-10136-trunk.patch
>
>
> Expected behavior:
> With the introduction of the table-lock, a user can issue a request for a snapshot of
a table while that table is undergoing an online schema change and expect that snapshot request
to complete correctly. Also, the same is true if a user issues a online schema change request
while a snapshot attempt is ongoing.
> Observed behavior:
> Snapshot attempts time out when there is an ongoing online schema change because the
table lock is not acquired by anyone else and the regions are closed and opened during the
snapshot. 
> TableEventHandler trace
> {code}
> // 1. client.addColumn() call from client...
> // 2. The operation is now on the master
> 2013-12-12 12:09:57,613 DEBUG [MASTER] lock.ZKInterProcessLockBase: Acquired a lock for
/hbase/table-lock/TestTable/write-master:452010000000001
> 2013-12-12 12:09:57,640 INFO  [MASTER] handler.TableEventHandler: Handling table operation
C_M_ADD_FAMILY on table TestTable
> 2013-12-12 12:09:57,685 INFO  [MASTER] master.MasterFileSystem: AddColumn. Table = TestTable
HCD = {NAME => 'x-1386850197327', DATA_BLOCK_ENCODING => 'NONE',$
> 2013-12-12 12:09:57,693 INFO  [MASTER] handler.TableEventHandler: Bucketing regions by
region server...
> ...
> 2013-12-12 12:09:57,771 INFO  [MASTER] handler.TableEventHandler: Completed table operation
C_M_ADD_FAMILY on table TestTable
> 2013-12-12 12:09:57,771 DEBUG [MASTER] master.AssignmentManager: Starting unassign of
TestTable,,1386849056038.854b280$
> 2013-12-12 12:09:57,772 DEBUG [MASTER] lock.ZKInterProcessLockBase: Released /hbase/table-lock/TestTable/write-master:452010000000001
> // 3. The Table*Handler operation is now completed, and the client notified with "I'm
done!"
> // 4. Now the BulkReopen is starting doing the reopen
> 2013-12-12 12:09:57,772 INFO  [MASTER] master.RegionStates: Transitioned {854b280006aec464083778a5cb5f5456
state=OPEN,$
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

Mime
View raw message