Date: Tue, 23 Aug 2016 13:46:22 +0000 (UTC)
From: "Matteo Bertozzi (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-16464) archive folder grows bigger and bigger due to corrupt snapshot under tmp dir

    [ https://issues.apache.org/jira/browse/HBASE-16464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432807#comment-15432807 ]

Matteo Bertozzi commented on HBASE-16464:
-----------------------------------------

v1 looks ok. I was hoping for an in-memory "lock" in SnapshotManager instead of a file on disk, but I guess it is more work to pass the SnapshotManager around.
+1, we can always optimize stuff later.
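For context, a minimal sketch of the in-memory alternative mentioned above (class and method names are hypothetical, not the actual patch): instead of writing a marker file on disk, a SnapshotManager-like component could keep the set of in-flight snapshot names in memory, and the cleaner could ask it whether a .tmp snapshot is still being written. The cost noted above is that the cleaner then needs a reference to that component.

{code}
// Hypothetical sketch only: an in-memory "lock" registry that a
// SnapshotManager-like component could expose, so SnapshotHFileCleaner could
// ask "is this .tmp snapshot still being written?" without touching disk.
// Class and method names are illustrative, not the ones in the actual patch.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class InProgressSnapshotRegistry {
  // Snapshot names that currently have an in-flight snapshot operation.
  private final Set<String> inProgress = ConcurrentHashMap.newKeySet();

  /** Called when a snapshot starts writing under .hbase-snapshot/.tmp. */
  public void markInProgress(String snapshotName) {
    inProgress.add(snapshotName);
  }

  /** Called once the snapshot is completed (or aborted and cleaned up). */
  public void markFinished(String snapshotName) {
    inProgress.remove(snapshotName);
  }

  /** The cleaner can keep (not delete) files of snapshots still running. */
  public boolean isInProgress(String snapshotName) {
    return inProgress.contains(snapshotName);
  }
}
{code}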
> archive folder grows bigger and bigger due to corrupt snapshot under tmp dir
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-16464
>                 URL: https://issues.apache.org/jira/browse/HBASE-16464
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.1.1
>            Reporter: Heng Chen
>            Assignee: Heng Chen
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.1.6, 1.2.3
>
>         Attachments: HBASE-16464-branch-1.1.patch, HBASE-16464.patch, HBASE-16464.v1.patch
>
>
> We hit this problem on a real production cluster. We needed to clean up some data in HBase and noticed that the archive folder was much larger than the others, so we deleted all snapshots of all tables, but the archive folder still kept growing.
> After checking the HMaster log, we noticed the exception below:
> {code}
> 2016-08-22 15:34:33,089 ERROR [f04,16000,1471240833208_ChoreService_1] snapshot.SnapshotHFileCleaner: Exception while checking if files were valid, keeping them just in case.
> org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Couldn't read snapshot info from:hdfs://f04/hbase/.hbase-snapshot/.tmp/frog_stastic_2016-08-17/.snapshotinfo
>   at org.apache.hadoop.hbase.snapshot.SnapshotDescriptionUtils.readSnapshotInfo(SnapshotDescriptionUtils.java:295)
>   at org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.getHFileNames(SnapshotReferenceUtil.java:328)
>   at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner$1.filesUnderSnapshot(SnapshotHFileCleaner.java:85)
>   at org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getSnapshotsInProgress(SnapshotFileCache.java:303)
>   at org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getUnreferencedFiles(SnapshotFileCache.java:194)
>   at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner.getDeletableFiles(SnapshotHFileCleaner.java:62)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
>   at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: File does not exist: /hbase/.hbase-snapshot/.tmp/frog_stastic_2016-08-17/.snapshotinfo
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:587)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> {code}
> It means that when SnapshotHFileCleaner begins to clean up the archive folder, it reads the snapshot dirs to check whether any links to hfiles still exist. But when it reads the file /.hbase-snapshot/.tmp/frog_stastic_2016-08-17/.snapshotinfo, a CorruptedSnapshotException is thrown (not sure why the file is not found), and the cleanup fails.
> When I checked /.hbase-snapshot/.tmp/frog_stastic_2016-08-17, I saw that only one file exists, /hbase/.hbase-snapshot/.tmp/frog_stastic_2016-08-17/region-manifest.8e3179c388e10770eba7d35e30f2777f; /hbase/.hbase-snapshot/.tmp/frog_stastic_2016-08-17/.snapshotinfo is missing.
> I think we should catch the exception and delete the file so that the cleanup can go on.
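A minimal sketch of the handling proposed above, using plain Hadoop FileSystem calls; the class and helper names are hypothetical and this is not the committed patch. The point is to catch the failure per snapshot directory so a single corrupt .tmp snapshot cannot abort the whole cleaner chore. Whether the corrupt dir should be deleted on the spot ties back to the in-memory "lock" discussion above, since a snapshot that is still being written must not be mistaken for a corrupt one.

{code}
// Sketch only: the shape of the proposed handling, using plain Hadoop
// FileSystem calls. The class name, collectSnapshotReferences(), and where
// this logic would live are assumptions, not the committed patch.
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TmpSnapshotScanSketch {

  /**
   * Scan the in-progress snapshot dirs under .hbase-snapshot/.tmp and collect
   * the hfile names they reference. A corrupt snapshot dir (e.g. one missing
   * its .snapshotinfo) is logged and removed instead of failing the whole
   * chore, so the cleaner can keep deleting unreferenced archive files.
   */
  public static Set<String> scanInProgressSnapshots(FileSystem fs, Path tmpDir)
      throws IOException {
    Set<String> referencedFiles = new HashSet<>();
    if (!fs.exists(tmpDir)) {
      return referencedFiles;
    }
    for (FileStatus snapshotDir : fs.listStatus(tmpDir)) {
      try {
        // In the real code this is roughly where SnapshotDescriptionUtils /
        // SnapshotReferenceUtil read .snapshotinfo and the region manifests.
        referencedFiles.addAll(collectSnapshotReferences(fs, snapshotDir.getPath()));
      } catch (IOException e) {
        // Corrupt or half-written snapshot: don't let it block the cleaner.
        // Whether to delete it here or leave it for the snapshot operation to
        // clean up is a policy choice; the report above suggests deleting it.
        System.err.println("Skipping corrupt snapshot dir " + snapshotDir.getPath() + ": " + e);
        fs.delete(snapshotDir.getPath(), true);
      }
    }
    return referencedFiles;
  }

  // Placeholder for the real reference-collection logic.
  private static Set<String> collectSnapshotReferences(FileSystem fs, Path snapshotDir)
      throws IOException {
    return new HashSet<>();
  }
}
{code}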