Date: Wed, 16 Aug 2017 09:19:00 +0000 (UTC)
From: "ramkrishna.s.vasudevan (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-17406) Occasional failed splits on branch-1.3

[ https://issues.apache.org/jira/browse/HBASE-17406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128532#comment-16128532 ]

ramkrishna.s.vasudevan commented on HBASE-17406:
------------------------------------------------

After spending a considerable amount of time on this, I have tried different cases here.

First case
=======
A region undergoes compaction. There are no active scans. It goes for a split, and in parallel the CompactedFileDischarger also runs. Say we had 6 files initially and 3 of them got compacted away. So now the StoreFileManager (SFM) holds 4 active files, and the compacted-files list holds the 3 compacted files. When the CompactedFileDischarger runs first, it archives those 3 compacted files. When the region is then closed as part of split(), the close waits until the CompactedFileDischarger completes (due to the archiveLock). So when the split parent region closes, it only has those 4 files, it creates the references over them, and only those files are opened by the daughter regions.
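To make the locking dependence explicit, here is a minimal sketch of the serialization the first case relies on (my own simplification with placeholder names and bodies, not the actual HStore code): both paths contend on the same archiveLock, so whichever runs second waits for the other to finish.

{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Simplified model of the archiveLock handshake described above;
// the class name and method bodies are placeholders, not HBase source.
class StoreSketch {
    private final ReentrantLock archiveLock = new ReentrantLock();

    // Path taken by the CompactedFileDischarger chore.
    void closeAndArchiveCompactedFiles() {
        archiveLock.lock();
        try {
            // Move the 3 compacted-away files to the archive directory
            // and drop them from the compacted-files list.
        } finally {
            archiveLock.unlock();
        }
    }

    // Path taken when the parent region is closed for the split.
    void close() {
        archiveLock.lock(); // waits for any in-flight archiving to finish
        try {
            // Only the 4 files still in the StoreFileManager's active list
            // remain visible here; the daughters' reference files are
            // created against these.
        } finally {
            archiveLock.unlock();
        }
    }
}
{code}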
Second case
=========
Same as above, but the close() of the parent region happens first. The close completes and again creates references for the 4 files, and the 3 compacted files are then moved to the archive (the CompactedFileDischarger waits due to the archiveLock). So the new daughter regions work on the references created over the 4 files.

Third case
=======
Consider the case where, as above, the region is getting closed while a scanner is still active on those 3 compacted files. If the CompactedFileDischarger tries to archive those 3 files, the archiving won't happen. When the parent close() happens, the scanner may or may not have been closed, so the 3 compacted files may still be present in the parent region directory (as they were not archived). But HStore#close() only splits the store files that are in the SFM's active store file list when creating references. So the split happens only over the 4 active files, and the daughter regions seem to work fine, as there is no way they can learn about the compacted files in the parent.

In all the above cases there is no clear way I get a split failure due to a file not found. However, there is an extension to the above cases where the scan on the parent region runs so long that, after the daughter regions are opened, even the compaction of the daughter regions completes; all the references to the parent region are then removed, and the CatalogJanitor sees that the parent has no references, so it goes ahead and removes the region itself. (I have not reproduced this yet; just saying theoretically.) But in this case I feel it could have happened even before HBASE-13082. Will dig in more and keep updating here. Will be back.
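To make that theoretical extension concrete, here is a rough model of the cleanup decision described above (illustrative names only, not the actual CatalogJanitor code): once neither daughter holds reference files into the parent, the parent region is removed, taking any not-yet-archived compacted files with it.

{code:java}
// Illustrative-only model of the parent-cleanup decision described above.
class CatalogJanitorSketch {

    // Placeholder: true if the daughter still has reference files
    // pointing into the parent region's directory.
    boolean hasReferences(String daughterEncodedName) {
        return false; // e.g. scan the daughter's family dirs for references
    }

    void maybeCleanParent(String parent, String daughterA, String daughterB) {
        // Once both daughters have compacted away their references, the
        // parent directory is deleted -- a long-running scanner still
        // reading the parent's compacted files would then hit
        // FileNotFoundException.
        if (!hasReferences(daughterA) && !hasReferences(daughterB)) {
            deleteParentRegion(parent);
        }
    }

    void deleteParentRegion(String parent) {
        // Placeholder: remove the region dir from HDFS and its row
        // from hbase:meta.
    }
}
{code}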
> Occasional failed splits on branch-1.3
> --------------------------------------
>
>                 Key: HBASE-17406
>                 URL: https://issues.apache.org/jira/browse/HBASE-17406
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Compaction, regionserver
>    Affects Versions: 1.3.0
>            Reporter: Mikhail Antonov
>             Fix For: 1.3.2
>
>
> We observed occasional (rare) failed splits on branch-1.3 builds, which might be another echo of HBASE-13082.
> Essentially, here's what seems to be happening:
> First, the RS hosting the to-be-split parent sees some FileNotFoundExceptions in the logs. It could be a simple file-not-found on some scanner path:
> {quote}
> 16/11/21 07:19:28 WARN hdfs.DFSClient: DFS Read
> java.io.FileNotFoundException: File does not exist:
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> ....
> at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:733)
> at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1461)
> at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1715)
> at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1560)
> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:454)
> at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:271)
> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:651)
> at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:631)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:292)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:201)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:412)
> at org.apache.hadoop.hbase.regionserver.StoreFileScanner.requestSeek(StoreFileScanner.java:375)
> at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:310)
> at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
> at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:889)
> at org.apache.hadoop.hbase.regionserver.StoreScanner.seekToNextRow(StoreScanner.java:867)
> at org.apache.hadoop.hbase.regionserver.Stor
> {quote}
> Or it could be a warning from HFileArchiver:
> {quote}
> 16/11/21 07:20:44 WARN backup.HFileArchiver: Failed to archive class org.apache.hadoop.hbase.backup.HFileArchiver$FileableStoreFile, because it does not exist! Skipping and continuing on.
> java.io.FileNotFoundException: File/Directory does not exist.
> at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setTimes(FSDirAttrOp.java:121)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setTimes(FSNamesystem.java:1910)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setTimes(NameNodeRpcServer.java:1223)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setTimes(ClientNamenodeProtocolServerSideTranslatorPB.java:915)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at or
> {quote}
> Then on the RS hosting the parent I'm seeing:
> {quote}
> 16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the region=
> 16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the
> 16/11/21 18:03:17 WARN ipc.Client: interrupted waiting to send rpc request to server
> java.lang.InterruptedException
> at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
> at java.util.concurrent.FutureTask.get(FutureTask.java:191)
> at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1060)
> at org.apache.hadoop.ipc.Client.call(Client.java:1455)
> at org.apache.hadoop.ipc.Client.call(Client.java:1413)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
> at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:302)
> at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2112)
> at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
> at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
> at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
> at org.apache.hadoop.hbase.regionserver.HRegionFileSystem.createStoreDir(HRegionFileSystem.java:171)
> at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:224)
> at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:5185)
> at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:926)
> at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:923)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> ...
> 16/11/21 18:03:17 FATAL regionserver.HRegionServer: ABORTING region server : Abort; we got an error after point-of-no-return
> {quote}
> So we've got past the PONR and abort; then on the RSs where the daughters are to be opened I'm seeing:
> {quote}
> 16/11/21 18:03:43 ERROR handler.OpenRegionHandler: Failed open of region= , starting to roll back the global memstore size.
> java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: < HFile name>
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)
> at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:952)
> at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:827)
> at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:802)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6708)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6669)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6640)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6596)
> at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6547)
> at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
> at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
> at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> And the regions remain offline. There's no data loss here, as the daughters never open up, and the failed split could be recovered manually using the following procedure (see the command-level sketch after this quote):
> - manually remove the daughters from hbase:meta
> - move the daughter region HDFS directories out of the way
> - delete the parent region from hbase:meta
> - hbck -fixMeta to add the parent region back
> - failover the active master
> - hbck -fixAssignments after master startup
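For what it's worth, a hedged command-level sketch of the recovery procedure above; table names, region names, and paths are placeholders, only the hbck flags come from the procedure itself, and every step should be verified against your own deployment before running anything.

{code}
# In 'hbase shell': drop the daughter rows (and the parent row) from hbase:meta.
#   deleteall 'hbase:meta', '<daughter-1-region-name>'
#   deleteall 'hbase:meta', '<daughter-2-region-name>'
#   deleteall 'hbase:meta', '<parent-region-name>'

# Move the daughter region directories out of the way on HDFS.
hdfs dfs -mv /hbase/data/default/<table>/<daughter-encoded-name> /hbase/split-recovery-backup/

# Re-create the parent's hbase:meta row from its directory on HDFS.
hbase hbck -fixMeta

# Failover the active master, then re-assign after it comes back up.
hbase hbck -fixAssignments
{code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)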