hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vladimir Rodionov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-12190) Split thread pool larger than one results in fatal failure during region splits
Date Tue, 07 Oct 2014 18:14:34 GMT
Vladimir Rodionov created HBASE-12190:
-----------------------------------------

             Summary: Split thread pool larger than one results in fatal failure during region
splits
                 Key: HBASE-12190
                 URL: https://issues.apache.org/jira/browse/HBASE-12190
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 2.0.0
            Reporter: Vladimir Rodionov


I have been playing with different sizes for long/short compaction and split thread poll sizes.
It is definitely clear, that any size of split thread pool large than 1 (one) results in a
fatal errors during region splits.

Here is the log file snippet of a local test case :
{code}
2014-10-07 11:04:48,441 [INFO|org.apache.hadoop.hbase.regionserver.HRegionFileSystem|HRegionFileSystem]
The hdfs://localhost:50215/user/vrodionov/hbase/data/default/TABLE_A/0ace9f564c05043a45cb44b2867432f8/.splits
directory exists.  Hence deleting it to recreate it
put: 230000
2014-10-07 11:04:48,697 [INFO|BlockStateChange|BlockManager] BLOCK* addStoredBlock: blockMap
updated: 127.0.0.1:50216 is added to blk_1073741862_1038{blockUCState=COMMITTED, primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]}
size 22106983
2014-10-07 11:04:49,100 [INFO|org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher|DefaultStoreFlusher]
Flushed, sequenceid=495, memsize=63.2 M, hasBloomFilter=true, into tmp file hdfs://localhost:50215/user/vrodionov/hbase/data/default/TABLE_A/0ace9f564c05043a45cb44b2867432f8/.tmp/8d746dd585df4669a505fe2ba0ddafe1
2014-10-07 11:04:49,110 [INFO|org.apache.hadoop.hbase.regionserver.HStore|HStore] Added hdfs://localhost:50215/user/vrodionov/hbase/data/default/TABLE_A/0ace9f564c05043a45cb44b2867432f8/fam_a/8d746dd585df4669a505fe2ba0ddafe1,
entries=360000, sequenceid=495, filesize=21.1 M
2014-10-07 11:04:49,111 [INFO|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Finished
memstore flush of ~63.17 MB/66240000, currentsize=42.11 MB/44160000 for region TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8.
in 674ms, sequenceid=495, compaction requested=true
2014-10-07 11:04:49,111 [INFO|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Started
memstore flush for TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8.,
current region memstore size 42.11 MB
2014-10-07 11:04:49,267 [INFO|BlockStateChange|BlockManager] BLOCK* addStoredBlock: blockMap
updated: 127.0.0.1:50216 is added to blk_1073741863_1039{blockUCState=COMMITTED, primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[[DISK]DS-414e0e79-1c18-4469-9458-6842a0b96131:NORMAL|RBW]]}
size 14745108
2014-10-07 11:04:49,669 [INFO|org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher|DefaultStoreFlusher]
Flushed, sequenceid=514, memsize=42.1 M, hasBloomFilter=true, into tmp file hdfs://localhost:50215/user/vrodionov/hbase/data/default/TABLE_A/0ace9f564c05043a45cb44b2867432f8/.tmp/785a7048b1b849509e3e622cdaceb033
2014-10-07 11:04:49,682 [INFO|org.apache.hadoop.hbase.regionserver.HStore|HStore] Added hdfs://localhost:50215/user/vrodionov/hbase/data/default/TABLE_A/0ace9f564c05043a45cb44b2867432f8/fam_a/785a7048b1b849509e3e622cdaceb033,
entries=240000, sequenceid=514, filesize=14.1 M
2014-10-07 11:04:49,683 [INFO|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Finished
memstore flush of ~42.11 MB/44160000, currentsize=0 B/0 for region TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8.
in 572ms, sequenceid=514, compaction requested=true
2014-10-07 11:04:49,686 [INFO|org.apache.hadoop.hbase.regionserver.HStore|HStore] Closed fam_a
2014-10-07 11:04:49,687 [INFO|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Closed
TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8.
2014-10-07 11:04:49,687 [WARN|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Region
TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8. already closed
2014-10-07 11:04:49,688 [INFO|org.apache.hadoop.hbase.regionserver.SplitRequest|SplitRequest]
Running rollback/cleanup of failed split of TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8.;
Failed to close region: already closed by another thread
java.io.IOException: Failed to close region: already closed by another thread
	at org.apache.hadoop.hbase.regionserver.SplitTransaction.<clinit>(SplitTransaction.java:190)
	at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:65)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
2014-10-07 11:04:49,690 [ERROR|org.apache.hadoop.hbase.master.AssignmentManager|AssignmentManager]
Failed to transition region from {0ace9f564c05043a45cb44b2867432f8 state=SPLITTING, ts=1412705087544,
server=10.1.2.105,50226,1412705045018} on SPLIT_REVERTED by 10.1.2.105,50226,1412705045018:
Some daughter is not known to be splitting on 10.1.2.105,50226,1412705045018, a=null, b=null
2014-10-07 11:04:49,690 [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer]
Failed to transition {ENCODED => 0ace9f564c05043a45cb44b2867432f8, NAME => 'TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8.',
STARTKEY => 'row000000037607', ENDKEY => ''} to SPLIT_REVERTED: Some daughter is not
known to be splitting on 10.1.2.105,50226,1412705045018, a=null, b=null
2014-10-07 11:04:49,692 [FATAL|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer]
ABORTING region server 10.1.2.105,50226,1412705045018: Abort; we got an error after point-of-no-return
2014-10-07 11:04:49,692 [FATAL|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer]
RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
2014-10-07 11:04:49,699 [INFO|BlockStateChange|BlockManager] BLOCK* addStoredBlock: blockMap
updated: 127.0.0.1:50216 is added to blk_1073741864_1040{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]}
size 0
2014-10-07 11:04:49,700 [ERROR|org.apache.hadoop.hbase.master.MasterRpcServices|MasterRpcServices]
Region server 10.1.2.105,50226,1412705045018 reported a fatal error:
ABORTING region server 10.1.2.105,50226,1412705045018: Abort; we got an error after point-of-no-return
2014-10-07 11:04:49,702 [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer]
STOPPED: Abort; we got an error after point-of-no-return
2014-10-07 11:04:49,702 [INFO|org.apache.hadoop.hbase.regionserver.SplitLogWorker|SplitLogWorker]
Sending interrupt to stop the worker thread
2014-10-07 11:04:49,702 [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer]
Stopping infoServer
2014-10-07 11:04:49,702 [INFO|org.apache.hadoop.hbase.regionserver.SplitLogWorker|SplitLogWorker]
SplitLogWorker interrupted. Exiting. 
2014-10-07 11:04:49,702 [INFO|org.apache.hadoop.hbase.regionserver.SplitLogWorker|SplitLogWorker]
SplitLogWorker 10.1.2.105,50226,1412705045018 exiting
2014-10-07 11:04:49,704 [INFO|org.mortbay.log|Slf4jLog] Stopped SelectChannelConnector@0.0.0.0:0
2014-10-07 11:04:49,706 [INFO|BlockStateChange|BlockManager] BLOCK* addStoredBlock: blockMap
updated: 127.0.0.1:50216 is added to blk_1073741865_1041{blockUCState=COMMITTED, primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[[DISK]DS-414e0e79-1c18-4469-9458-6842a0b96131:NORMAL|RBW]]}
size 35
2014-10-07 11:04:49,807 [INFO|org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager|RegionServerSnapshotManager]
Stopping RegionServerSnapshotManager abruptly.
2014-10-07 11:04:49,808 [INFO|org.apache.hadoop.hbase.regionserver.ServerNonceManager$1|Chore]
regionserver//10.1.2.105:0.nonceCleaner exiting
2014-10-07 11:04:49,808 [INFO|org.apache.hadoop.hbase.procedure.flush.RegionServerFlushTableProcedureManager|RegionServerFlushTableProcedureManager]
Stopping region server flush procedure manager abruptly.
2014-10-07 11:04:49,808 [INFO|org.apache.hadoop.hbase.regionserver.MemStoreFlusher|MemStoreFlusher$FlushHandler]
MemStoreFlusher.1 exiting
2014-10-07 11:04:49,808 [INFO|org.apache.hadoop.hbase.regionserver.MemStoreFlusher|MemStoreFlusher$FlushHandler]
MemStoreFlusher.0 exiting
2014-10-07 11:04:49,807 [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer$CompactionChecker|Chore]
regionserver//10.1.2.105:0.compactionChecker exiting
2014-10-07 11:04:49,809 [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer]
aborting server 10.1.2.105,50226,1412705045018
2014-10-07 11:04:49,809 [INFO|org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation|ConnectionManager$HConnectionImplementation]
Closing zookeeper sessionid=0x148ebc846910005
2014-10-07 11:04:49,812 [INFO|org.apache.hadoop.hbase.regionserver.HStore|HStore] Closed fam_a
2014-10-07 11:04:49,813 [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer]
Waiting on 1 regions to close
2014-10-07 11:04:49,813 [INFO|org.apache.hadoop.hbase.regionserver.HRegion|HRegion] Closed
TABLE_A,,1412705075738.3b2edac70735ec721cd1ccba05211e33.
2014-10-07 11:04:50,014 [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer|HRegionServer]
stopping server 10.1.2.105,50226,1412705045018; all regions closed.
2014-10-07 11:04:50,019 [INFO|BlockStateChange|BlockManager] BLOCK* addStoredBlock: blockMap
updated: 127.0.0.1:50216 is added to blk_1073741860_1036{blockUCState=COMMITTED, primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]}
size 37803365
2014-10-07 11:04:50,118 [INFO|BlockStateChange|BlockManager] BLOCK* addStoredBlock: blockMap
updated: 127.0.0.1:50216 is added to blk_1073741866_1042{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]}
size 0
2014-10-07 11:04:50,130 [INFO|BlockStateChange|BlockManager] BLOCK* addStoredBlock: blockMap
updated: 127.0.0.1:50216 is added to blk_1073741867_1043{blockUCState=COMMITTED, primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[[DISK]DS-414e0e79-1c18-4469-9458-6842a0b96131:NORMAL|RBW]]}
size 35
2014-10-07 11:04:50,522 [INFO|org.apache.hadoop.hbase.regionserver.Leases|Leases] RS:0;10.1.2.105:50226
closing leases
2014-10-07 11:04:50,522 [INFO|org.apache.hadoop.hbase.regionserver.Leases|Leases] RS:0;10.1.2.105:50226
closed leases
2014-10-07 11:04:50,546 [INFO|BlockStateChange|BlockManager] BLOCK* addStoredBlock: blockMap
updated: 127.0.0.1:50216 is added to blk_1073741868_1044{blockUCState=COMMITTED, primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]}
size 35
2014-10-07 11:04:50,960 [INFO|BlockStateChange|BlockManager] BLOCK* addStoredBlock: blockMap
updated: 127.0.0.1:50216 is added to blk_1073741869_1045{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-414e0e79-1c18-4469-9458-6842a0b96131:NORMAL|RBW]]}
size 0
2014-10-07 11:04:50,966 [INFO|BlockStateChange|BlockManager] BLOCK* addStoredBlock: blockMap
updated: 127.0.0.1:50216 is added to blk_1073741870_1046{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-0b95bff8-1354-42cc-bd16-a5eafcb5bb27:NORMAL|RBW]]}
size 0
2014-10-07 11:04:50,974 [INFO|BlockStateChange|BlockManager] BLOCK* addStoredBlock: blockMap
updated: 127.0.0.1:50216 is added to blk_1073741871_1047{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-414e0e79-1c18-4469-9458-6842a0b96131:NORMAL|RBW]]}
size 0
2014-10-07 11:04:50,975 [INFO|org.apache.hadoop.hbase.regionserver.SplitRequest|SplitRequest]
Skip rollback/cleanup of failed split of TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8.
because server is stopped
java.io.IOException: Failed to notify master that split passed PONR: TABLE_A,row000000037607,1412705075738.0ace9f564c05043a45cb44b2867432f8.
	at org.apache.hadoop.hbase.regionserver.SplitTransaction.createDaughters(SplitTransaction.java:274)
	at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:407)
	at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:82)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
2014-10-07 11:04:55,340 [INFO|org.apache.hadoop.hbase.regionserver.HRegionServer$PeriodicMemstoreFlusher|Chore]
regionserver//10.1.2.105:0.periodicFlusher exiting
2014-10-07 11:04:55,340 [INFO|org.apache.hadoop.hbase.regionserver.Leases|Leases] regionserver//10.1.2.105:0.leaseChecker
closing leases
2014-10-07 11:04:55,341 [INFO|org.apache.hadoop.hbase.regionserver.Leases|Leases] regionserver//10.1.2.105:0.leaseChecker
closed leases
2014-10-07 11:04:57,509 [INFO|org.apache.hadoop.hbase.regionserver.LogRoller|LogRoller] LogRoller
exiting.
2014-10-07 11:04:57,509 [INFO|org.apache.hadoop.hbase.regionserver.CompactSplitThread|CompactSplitThread]
Waiting for Split Thread to finish...
2014-10-07 11:04:57,509 [INFO|org.apache.hadoop.hbase.regionserver.CompactSplitThread|CompactSplitThread]
Waiting for Merge Thread to finish...
2014-10-07 11:04:57,509 [INFO|org.apache.hadoop.hbase.regionserver.CompactSplitThread|CompactSplitThread]
Waiting for Large Compaction Thread to finish...
2014-10-07 11:04:57,509 [INFO|org.apache.hadoop.hbase.regionserver.CompactSplitThread|CompactSplitThread]
Waiting for Small Compaction Thread to finish...
2014-10-07 11:04:57,512 [INFO|org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation|ConnectionManager$HConnectionImplementation]
Closing zookeeper sessionid=0x148ebc846910006
2014-10-07 11:04:57,512 [INFO|org.apache.hadoop.hbase.ipc.RpcServer|RpcServer] Stopping server
on 50226
2014-10-07 11:04:57,513 [INFO|org.apache.hadoop.hbase.ipc.RpcServer|RpcServer$Listener] RpcServer.listener,port=50226:
stopping
2014-10-07 11:04:57,513 [INFO|org.apache.hadoop.hbase.ipc.RpcServer|RpcServer$Responder] RpcServer.responder:
stopped
2014-10-07 11:04:57,513 [INFO|org.apache.hadoop.hbase.ipc.RpcServer|RpcServer$Responder] RpcServer.responder:
stopping
{code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message