hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17406) Occasional failed splits on branch-1.3
Date Wed, 16 Aug 2017 09:19:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128532#comment-16128532 ]

ramkrishna.s.vasudevan commented on HBASE-17406:
------------------------------------------------

After spending a considerable amount of time on this, I have tried different cases here.
First case
=======
A region undergoes compaction while there are no active scans. It then goes for a split,
and in parallel the CompactedFileDischarger also runs.
Say we had 6 files initially and 3 files got compacted. So now the StoreFileManager (SFM)
will have 4 active files, and the compacted-files list will have the 3 compacted files.
If the CompactedFileDischarger runs first, it archives the 3 compacted files. When this
region is getting closed as part of split(), the close waits till the CompactedFileDischarger
completes (due to the archiveLock). So when the split parent region closes, it will only
have those 4 files; the references are created on them, and only those files get opened
by the daughter regions.

Second case
=========
Same as above, but the close() of the parent region happens first. The close completes,
creating references for the 4 files, and only then are the 3 compacted files moved to the
archive (the CompactedFileDischarger waits on the archiveLock). So the new daughter
regions work on the references created over the 4 files.
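
To make the first two cases concrete, here is a minimal sketch of the archiveLock
interplay. This is not the actual HStore code; the class, fields and method bodies below
are illustrative stand-ins. The point is that whichever of close() and the discharger
acquires the lock first, the references are created only over the 4 active files in the SFM.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative stand-in for the store; not the real HStore.
class StoreSketch {
  private final ReentrantLock archiveLock = new ReentrantLock();
  // The 4 active files tracked by the SFM.
  private final List<String> activeFiles =
      new ArrayList<>(Arrays.asList("hf1", "hf2", "hf3", "hf4"));
  // The 3 compacted-away files awaiting archival.
  private final List<String> compactedFiles =
      new ArrayList<>(Arrays.asList("hf5", "hf6", "hf7"));

  // Runs from the CompactedFileDischarger chore.
  void closeAndArchiveCompactedFiles() {
    archiveLock.lock();
    try {
      compactedFiles.clear(); // the 3 compacted files are moved to the archive
    } finally {
      archiveLock.unlock();
    }
  }

  // Runs when the parent region is closed for the split. It either waits for
  // the discharger (first case) or makes the discharger wait (second case).
  List<String> close() {
    archiveLock.lock();
    try {
      // Either ordering: references are created only over the active files.
      return new ArrayList<>(activeFiles);
    } finally {
      archiveLock.unlock();
    }
  }
}
{code}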

Third case
=======
Now consider, as in the above cases, that while the region was getting closed there was a
scanner still active on those 3 compacted files. If the CompactedFileDischarger tries to
archive those 3 files, the archival won't happen. By the time the parent's close() runs,
the scanner may or may not have been closed, so the 3 compacted files may still be sitting
in the parent region directory (as they were not archived).
But HStore#close() is such that it will only split the store files that are in the SFM's
active store file list when creating references. So the split happens only on the 4 active
files, and the daughter regions seem to work fine, as there is no way they will know about
the compacted files left in the parent.
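
Roughly, the third case behaves like the sketch below. Again this is illustrative only
(the reader-count field is a hypothetical stand-in, not the real StoreFile API): compacted
files still held open by a scanner are skipped by the archival, while close() creates
references only over the SFM's active list.
{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical file handle with a reader count; not the real StoreFile API.
class FileSketch {
  final String name;
  int readerCount; // > 0 while a scanner still holds the file open

  FileSketch(String name, int readerCount) {
    this.name = name;
    this.readerCount = readerCount;
  }
}

class ThirdCaseSketch {
  // Discharger side: compacted files with active readers are NOT archived,
  // so they stay behind in the parent region directory.
  static List<FileSketch> archiveCompacted(List<FileSketch> compacted) {
    List<FileSketch> leftBehind = new ArrayList<>();
    for (FileSketch f : compacted) {
      if (f.readerCount > 0) {
        leftBehind.add(f); // skipped: still referenced by a scanner
      }
      // else: the file is moved to the archive directory
    }
    return leftBehind;
  }

  // close() side: references are created only over the SFM's active list, so
  // the daughters never learn about the leftover compacted files.
  static List<String> referencesFor(List<FileSketch> sfmActiveFiles) {
    List<String> refs = new ArrayList<>();
    for (FileSketch f : sfmActiveFiles) {
      refs.add(f.name + ".ref");
    }
    return refs;
  }
}
{code}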

In none of the above cases is there a clear way for me to get a split failure due to a
file not being found.

However, there is an extension to the above cases: the scan on the parent region runs so
long that, after the daughter regions open, even the compaction of the daughter regions
completes, so all references to the parent region are removed. The CatalogJanitor then
sees that the parent no longer has any references and goes ahead and removes the region
itself. (I have not reproduced this yet, just stating it theoretically.) But in this case
I feel it could have happened even before HBASE-13082.
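
For reference, the CatalogJanitor decision described above boils down to a check like the
sketch below (simplified, not the real implementation): the split parent is cleaned up
only once neither daughter holds reference files into it, which in this theoretical
scenario can become true while a scanner is still reading the parent's files.
{code}
// Simplified sketch of the parent-cleanup decision; names are illustrative.
class JanitorSketch {
  // Stand-in for "does this daughter still hold reference files into the parent?"
  // In the scenario above this turns false for both daughters once their
  // compactions complete, even while a scanner may still read the parent's files.
  static boolean hasReferences(String daughter, String parent) {
    return false; // post-compaction state in the scenario above
  }

  static boolean canCleanParent(String parent, String daughterA, String daughterB) {
    // true => the parent region's entry and directory get removed
    return !hasReferences(daughterA, parent) && !hasReferences(daughterB, parent);
  }
}
{code}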

Will dig in more and keep updating here. Will be back.


> Occasional failed splits on branch-1.3
> --------------------------------------
>
>                 Key: HBASE-17406
>                 URL: https://issues.apache.org/jira/browse/HBASE-17406
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Compaction, regionserver
>    Affects Versions: 1.3.0
>            Reporter: Mikhail Antonov
>             Fix For: 1.3.2
>
>
> We observed occasional (rare) failed splits on branch-1.3 builds, which might be another echo of HBASE-13082.
> Essentially here's what seems to be happening:
> First, the RS hosting the to-be-split parent sees some FileNotFoundExceptions in the logs. It could be a simple file-not-found on some scanner path:
> {quote}
> 16/11/21 07:19:28 WARN hdfs.DFSClient: DFS Read
> java.io.FileNotFoundException: File does not exist: <path to HFile>
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> ....
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:733)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1461)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1715)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1560)
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:454)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:271)
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:651)
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:631)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:292)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:201)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:412)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.requestSeek(StoreFileScanner.java:375)
> 	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:310)
> 	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
> 	at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:889)
> 	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekToNextRow(StoreScanner.java:867)
> 	at org.apache.hadoop.hbase.regionserver.Stor
> {quote}
> Or it could be warning from HFileArchiver:
> {quote}
> 16/11/21 07:20:44 WARN backup.HFileArchiver: Failed to archive class org.apache.hadoop.hbase.backup.HFileArchiver$FileableStoreFile, <HFile path> because it does not exist! Skipping and continuing on.
> java.io.FileNotFoundException: File/Directory <HFile Path> does not exist.
> 	at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setTimes(FSDirAttrOp.java:121)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setTimes(FSNamesystem.java:1910)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setTimes(NameNodeRpcServer.java:1223)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setTimes(ClientNamenodeProtocolServerSideTranslatorPB.java:915)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> 	at or
> {quote}
> Then on the RS hosting parent I'm seeing:
> {quote}16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the region=<region name>
> 16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the <region name>
> 16/11/21 18:03:17 WARN ipc.Client: interrupted waiting to send rpc request to server
> java.lang.InterruptedException
> 	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:191)
> 	at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1060)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1455)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1413)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> 	at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
> 	at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> 	at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
> 	at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:302)
> 	at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
> 	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2112)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
> 	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
> 	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
> 	at org.apache.hadoop.hbase.regionserver.HRegionFileSystem.createStoreDir(HRegionFileSystem.java:171)
> 	at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:224)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:5185)
> 	at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:926)
> 	at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:923)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> ...
> 16/11/21 18:03:17 FATAL regionserver.HRegionServer: ABORTING region server <server name>: Abort; we got an error after point-of-no-return
> {quote}
> So we've got past PONR and abort; then on the RSs where daughters are to be opened I'm seeing:
> {quote}
> 16/11/21 18:03:43 ERROR handler.OpenRegionHandler: Failed open of region= <region name>, starting to roll back the global memstore size.
> java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: < HFile name>
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:952)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:827)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:802)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6708)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6669)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6640)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6596)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6547)
> 	at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
> 	at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {quote}
> And the regions remain offline. There's no data loss here, as the daughters never open up, and the failed split can be recovered manually using the following procedure:
>  - manually remove daughters from hbase:meta
>  - move daughter region HDFS directories out of the way
>  - delete the parent region from hbase:meta
>  - hbck -fixMeta to add the parent region back
>  - failover the active master
>  - hbck -fixAssignments after master startup



