Date: Wed, 16 Aug 2017 09:19:00 +0000 (UTC)
From: "ramkrishna.s.vasudevan (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-17406) Occasional failed splits on branch-1.3

[ https://issues.apache.org/jira/browse/HBASE-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128532#comment-16128532 ]

ramkrishna.s.vasudevan commented on HBASE-17406:
------------------------------------------------

After spending a considerable amount of time on this, I have tried different cases here.

First case
=======
A region undergoes compaction. There are no active scans. It goes for a split, and in parallel the CompactedFileDischarger also runs. Say we had 6 files initially and 3 of them got compacted away. So now the StoreFileManager (SFM) holds 4 active files, and the compacted-files list holds the 3 compacted files. When the CompactedFileDischarger runs first, it archives those 3 compacted files. When the region is then closed as part of split(), the close waits until the CompactedFileDischarger completes (due to the archiveLock). So when the split parent region closes, it only has those 4 files, it creates the references over them, and only those files are opened by the daughter regions.
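To make the locking dependence explicit, here is a minimal sketch of the serialization the first case relies on (my own simplification with placeholder names and bodies, not the actual HStore code): both paths contend on the same archiveLock, so whichever runs second waits for the other to finish.

{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Simplified model of the archiveLock handshake described above;
// the class name and method bodies are placeholders, not HBase source.
class StoreSketch {
    private final ReentrantLock archiveLock = new ReentrantLock();

    // Path taken by the CompactedFileDischarger chore.
    void closeAndArchiveCompactedFiles() {
        archiveLock.lock();
        try {
            // Move the 3 compacted-away files to the archive directory
            // and drop them from the compacted-files list.
        } finally {
            archiveLock.unlock();
        }
    }

    // Path taken when the parent region is closed for the split.
    void close() {
        archiveLock.lock(); // waits for any in-flight archiving to finish
        try {
            // Only the 4 files still in the StoreFileManager's active list
            // remain visible here; the daughters' reference files are
            // created against these.
        } finally {
            archiveLock.unlock();
        }
    }
}
{code}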
Second case
=========
Same as above, but the close() of the parent region happens first. The close completes and again creates references for the 4 files, and the 3 compacted files are then moved to the archive (the CompactedFileDischarger waits due to the archiveLock). So the new daughter regions work on the references created over the 4 files.

Third case
=======
Consider the case where, as above, the region is getting closed while a scanner is still active on those 3 compacted files. If the CompactedFileDischarger tries to archive those 3 files, the archiving won't happen. When the parent close() happens, the scanner may or may not have been closed, so the 3 compacted files may still be present in the parent region directory (as they were not archived). But HStore#close() only splits the store files that are in the SFM's active store file list when creating references. So the split happens only over the 4 active files, and the daughter regions seem to work fine, as there is no way they can learn about the compacted files in the parent.

In all the above cases there is no clear way I get a split failure due to a file not found. However, there is an extension to the above cases where the scan on the parent region runs so long that, after the daughter regions are opened, even the compaction of the daughter regions completes; all the references to the parent region are then removed, and the CatalogJanitor sees that the parent has no references, so it goes ahead and removes the region itself. (I have not reproduced this yet; just saying theoretically.) But in this case I feel it could have happened even before HBASE-13082. Will dig in more and keep updating here. Will be back.
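To make that theoretical extension concrete, here is a rough model of the cleanup decision described above (illustrative names only, not the actual CatalogJanitor code): once neither daughter holds reference files into the parent, the parent region is removed, taking any not-yet-archived compacted files with it.

{code:java}
// Illustrative-only model of the parent-cleanup decision described above.
class CatalogJanitorSketch {

    // Placeholder: true if the daughter still has reference files
    // pointing into the parent region's directory.
    boolean hasReferences(String daughterEncodedName) {
        return false; // e.g. scan the daughter's family dirs for references
    }

    void maybeCleanParent(String parent, String daughterA, String daughterB) {
        // Once both daughters have compacted away their references, the
        // parent directory is deleted -- a long-running scanner still
        // reading the parent's compacted files would then hit
        // FileNotFoundException.
        if (!hasReferences(daughterA) && !hasReferences(daughterB)) {
            deleteParentRegion(parent);
        }
    }

    void deleteParentRegion(String parent) {
        // Placeholder: remove the region dir from HDFS and its row
        // from hbase:meta.
    }
}
{code}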
> Occasional failed splits on branch-1.3
> --------------------------------------
>
>                 Key: HBASE-17406
>                 URL: https://issues.apache.org/jira/browse/HBASE-17406
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Compaction, regionserver
>    Affects Versions: 1.3.0
>            Reporter: Mikhail Antonov
>             Fix For: 1.3.2
>
>
> We observed occasional (rare) failed splits on branch-1.3 builds, which might be another echo of HBASE-13082.
> Essentially, here's what seems to be happening:
> First, the RS hosting the to-be-split parent sees some FileNotFoundExceptions in the logs. It could be a simple file-not-found on some scanner path:
> {quote}
> 16/11/21 07:19:28 WARN hdfs.DFSClient: DFS Read
> java.io.FileNotFoundException: File does not exist:
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> ....
> at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:733)
> at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1461)
> at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1715)
> at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1560)
> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:454)
> at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:271)
> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:651)
> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:631)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:292)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:201)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:412)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.requestSeek(StoreFileScanner.java:375)
> at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:310)
> at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
> at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:889)
> at org.apache.hadoop.hbase.regionserver.StoreScanner.seekToNextRow(StoreScanner.java:867)
> at org.apache.hadoop.hbase.regionserver.Stor
> {quote}
> Or it could be a warning from HFileArchiver:
> {quote}
> 16/11/21 07:20:44 WARN backup.HFileArchiver: Failed to archive class org.apache.hadoop.hbase.backup.HFileArchiver$FileableStoreFile, because it does not exist! Skipping and continuing on.
> java.io.FileNotFoundException: File/Directory does not exist.
> at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setTimes(FSDirAttrOp.java:121)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setTimes(FSNamesystem.java:1910)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setTimes(NameNodeRpcServer.java:1223)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setTimes(ClientNamenodeProtocolServerSideTranslatorPB.java:915)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at or
> {quote}
> Then on the RS hosting the parent I'm seeing:
> {quote}
> 16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the region=
> 16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the
> 16/11/21 18:03:17 WARN ipc.Client: interrupted waiting to send rpc request to server
> java.lang.InterruptedException
> at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
> at java.util.concurrent.FutureTask.get(FutureTask.java:191)
> at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1060)
> at org.apache.hadoop.ipc.Client.call(Client.java:1455)
> at org.apache.hadoop.ipc.Client.call(Client.java:1413)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
> at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:302)
> at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2112)
> at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
> at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
> at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
> at org.apache.hadoop.hbase.regionserver.HRegionFileSystem.createStoreDir(HRegionFileSystem.java:171)
> at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:224)
> at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:5185)
> at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:926)
> at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:923)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> ...
> 16/11/21 18:03:17 FATAL regionserver.HRegionServer: ABORTING region server : Abort; we got an error after point-of-no-return
> {quote}
> So we've got past the PONR and abort; then on the RSs where the daughters are to be opened I'm seeing:
> {quote}
> 16/11/21 18:03:43 ERROR handler.OpenRegionHandler: Failed open of region= , starting to roll back the global memstore size.
> java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: < HFile name>
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
> at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:952)
> at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:827)
> at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:802)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6708)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6669)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6640)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6596)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6547)
> at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
> at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> And the regions remain offline. There's no data loss here, as the daughters never open up, and the failed split could be recovered manually using the following procedure (see the command-level sketch after this quote):
> - manually remove the daughters from hbase:meta
> - move the daughter region HDFS directories out of the way
> - delete the parent region from hbase:meta
> - hbck -fixMeta to add the parent region back
> - failover the active master
> - hbck -fixAssignments after master startup
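For what it's worth, a hedged command-level sketch of the recovery procedure above; table names, region names, and paths are placeholders, only the hbck flags come from the procedure itself, and every step should be verified against your own deployment before running anything.

{code}
# In 'hbase shell': drop the daughter rows (and the parent row) from hbase:meta.
#   deleteall 'hbase:meta', '<daughter-1-region-name>'
#   deleteall 'hbase:meta', '<daughter-2-region-name>'
#   deleteall 'hbase:meta', '<parent-region-name>'

# Move the daughter region directories out of the way on HDFS.
hdfs dfs -mv /hbase/data/default/<table>/<daughter-encoded-name> /hbase/split-recovery-backup/

# Re-create the parent's hbase:meta row from its directory on HDFS.
hbase hbck -fixMeta

# Failover the active master, then re-assign after it comes back up.
hbase hbck -fixAssignments
{code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)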