hbase-issues mailing list archives

From "Ashu Pachauri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17406) Occasional failed splits on branch-1.3
Date Wed, 07 Jun 2017 19:55:18 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041507#comment-16041507 ]

Ashu Pachauri commented on HBASE-17406:
---------------------------------------

Linking HBASE-18186, as the two seem related: we see failed splits under similar conditions on clusters that hit the exceptions outlined in HBASE-18186.

> Occasional failed splits on branch-1.3
> --------------------------------------
>
>                 Key: HBASE-17406
>                 URL: https://issues.apache.org/jira/browse/HBASE-17406
>             Project: HBase
>          Issue Type: Bug
>          Components: Compaction, regionserver
>    Affects Versions: 1.3.0
>            Reporter: Mikhail Antonov
>             Fix For: 1.3.2
>
>
> We observed occasional (rare) failed splits on branch-1.3 builds; this might be another echo of HBASE-13082.
> Essentially, here's what seems to be happening:
> First, the RS hosting the to-be-split parent sees some FileNotFoundExceptions in the logs. It could be a simple file-not-found on some scanner path:
> {quote}
> 16/11/21 07:19:28 WARN hdfs.DFSClient: DFS Read
> java.io.FileNotFoundException: File does not exist: <path to HFile>
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> ....
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:733)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1461)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1715)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1560)
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:454)
> 	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:271)
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:651)
> 	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:631)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:292)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:201)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:412)
> 	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.requestSeek(StoreFileScanner.java:375)
> 	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:310)
> 	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
> 	at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:889)
> 	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekToNextRow(StoreScanner.java:867)
> 	at org.apache.hadoop.hbase.regionserver.Stor
> {quote}
> Or it could be a warning from HFileArchiver:
> {quote}
> 16/11/21 07:20:44 WARN backup.HFileArchiver: Failed to archive class org.apache.hadoop.hbase.backup.HFileArchiver$FileableStoreFile, <HFile path> because it does not exist! Skipping and continuing on.
> java.io.FileNotFoundException: File/Directory <HFile Path> does not exist.
> 	at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setTimes(FSDirAttrOp.java:121)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setTimes(FSNamesystem.java:1910)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setTimes(NameNodeRpcServer.java:1223)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setTimes(ClientNamenodeProtocolServerSideTranslatorPB.java:915)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> 	at or
> {quote}
> Then on the RS hosting the parent I'm seeing:
> {quote}16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the region=<region name>
> 16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the <region name>
> 16/11/21 18:03:17 WARN ipc.Client: interrupted waiting to send rpc request to server
> java.lang.InterruptedException
> 	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:191)
> 	at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1060)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1455)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1413)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> 	at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
> 	at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
> 	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> 	at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
> 	at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:302)
> 	at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
> 	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2112)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
> 	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
> 	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
> 	at org.apache.hadoop.hbase.regionserver.HRegionFileSystem.createStoreDir(HRegionFileSystem.java:171)
> 	at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:224)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:5185)
> 	at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:926)
> 	at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:923)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> ...
> 16/11/21 18:03:17 FATAL regionserver.HRegionServer: ABORTING region server <server name>: Abort; we got an error after point-of-no-return
> {quote}
> So we've gotten past the PONR (point of no return) and abort; then on the RSs where the daughters are to be opened I'm seeing:
> {quote}
> 16/11/21 18:03:43 ERROR handler.OpenRegionHandler: Failed open of region=<region name>, starting to roll back the global memstore size.
> java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: <HFile name>
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
> 	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> 	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:952)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:827)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:802)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6708)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6669)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6640)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6596)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6547)
> 	at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
> 	at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {quote}
> And the regions remain offline. There's no data loss here, as the daughters never open up, and the failed split can be recovered manually using the following procedure (an illustrative shell sketch follows the list):
>  - manually remove daughters from hbase:meta
>  - move daughter region HDFS directories out of the way
>  - delete the parent region from hbase:meta
>  - hbck -fixMeta to add the parent region back
>  - failover the active master
>  - hbck -fixAssignments after master startup
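> To make the recovery concrete, here's an illustrative shell sketch of the procedure above. All region row keys, encoded names, and paths are placeholders, and it assumes the HBase 1.x hbck, which still supports -fixMeta / -fixAssignments:
> {code}
> # Sketch only -- region row keys, encoded region names, namespace, table,
> # and paths below are all placeholders for cluster-specific values.
>
> # 1) Remove the daughter rows from hbase:meta (run inside 'hbase shell'):
> #      deleteall 'hbase:meta', '<daughter A region row key>'
> #      deleteall 'hbase:meta', '<daughter B region row key>'
>
> # 2) Move the daughter region directories out of the way on HDFS:
> hdfs dfs -mv /hbase/data/<namespace>/<table>/<daughter A encoded name> /hbase/.tmp/
> hdfs dfs -mv /hbase/data/<namespace>/<table>/<daughter B encoded name> /hbase/.tmp/
>
> # 3) Delete the parent region row from hbase:meta (again from 'hbase shell'):
> #      deleteall 'hbase:meta', '<parent region row key>'
>
> # 4) Re-add the parent region to meta from its .regioninfo on HDFS:
> hbase hbck -fixMeta
>
> # 5) Fail over the active master so it rebuilds its in-memory region states,
> #    e.g. on the active master host:
> bin/hbase-daemon.sh restart master
>
> # 6) Once the new master is up, repair assignments:
> hbase hbck -fixAssignments
> {code}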



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
