hbase-issues mailing list archives

From "Heng Chen (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-16464) archive folder grows bigger and bigger due to corrupt snapshot under tmp dir
Date Mon, 22 Aug 2016 15:34:20 GMT

    [ https://issues.apache.org/jira/browse/HBASE-16464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15431008#comment-15431008 ]

Heng Chen edited comment on HBASE-16464 at 8/22/16 3:33 PM:
------------------------------------------------------------

I lost the relevant hmaster log from when the snapshot was created (it was deleted by some program). That's very bad.

But I suspect there may be a race condition between the cleaner and TakeSnapshotHandler: when TakeSnapshotHandler tries to delete the relevant files after the snapshot creation failed, the SnapshotCleaner reads the same files at the same time, so the delete in TakeSnapshotHandler fails?

Let me write a testcase to verify it.
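
Before the real HBase testcase, here is a minimal, self-contained sketch of the interleaving I have in mind, using plain java.nio.file on a local filesystem rather than HDFS; the thread names are only stand-ins for TakeSnapshotHandler and the cleaner, not the actual classes:

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;

/**
 * Toy illustration of the suspected interleaving: one thread (standing in for
 * TakeSnapshotHandler cleaning up a failed snapshot) deletes the tmp snapshot
 * files while another thread (standing in for the snapshot cleaner) tries to
 * read .snapshotinfo. Depending on timing, the reader hits NoSuchFileException,
 * mirroring the FileNotFoundException in the master log.
 */
public class SnapshotTmpRaceSketch {
  public static void main(String[] args) throws Exception {
    Path tmpSnapshot = Files.createTempDirectory("snapshot-tmp-");
    Path snapshotInfo = tmpSnapshot.resolve(".snapshotinfo");
    Files.write(snapshotInfo, "dummy snapshot description".getBytes());

    CountDownLatch start = new CountDownLatch(1);

    Thread handler = new Thread(() -> {
      awaitQuietly(start);
      try {
        Files.deleteIfExists(snapshotInfo);   // cleanup after the "failed" snapshot
        Files.deleteIfExists(tmpSnapshot);
        System.out.println("handler: cleanup done");
      } catch (IOException e) {
        System.out.println("handler: cleanup failed: " + e);
      }
    }, "take-snapshot-handler");

    Thread cleaner = new Thread(() -> {
      awaitQuietly(start);
      try {
        byte[] info = Files.readAllBytes(snapshotInfo);  // cleaner reads snapshot info
        System.out.println("cleaner: read " + info.length + " bytes");
      } catch (IOException e) {
        System.out.println("cleaner: read failed: " + e);
      }
    }, "snapshot-cleaner");

    handler.start();
    cleaner.start();
    start.countDown();   // release both threads at roughly the same time
    handler.join();
    cleaner.join();
  }

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}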


was (Author: chenheng):
I missed the relevant hmaster log from when the snapshot was created. That's very bad.

But I suspect there may be a race condition between the cleaner and TakeSnapshotHandler: when TakeSnapshotHandler tries to delete the relevant files after the snapshot creation failed, the SnapshotCleaner reads the same files at the same time, so the delete in TakeSnapshotHandler fails?

Let me write a testcase to test it.

> archive folder grows bigger and bigger due to corrupt snapshot under tmp dir
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-16464
>                 URL: https://issues.apache.org/jira/browse/HBASE-16464
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.1.1
>            Reporter: Heng Chen
>            Assignee: Heng Chen
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.1.6, 1.2.3
>
>         Attachments: HBASE-16464-branch-1.1.patch, HBASE-16464.patch
>
>
> We hit this problem on our real production cluster. We needed to clean up some data in HBase and noticed that the archive folder was much larger than the others, so we deleted all snapshots of all tables, but the archive folder still kept growing.
> After checking the hmaster log, we noticed the exception below:
> {code}
> 2016-08-22 15:34:33,089 ERROR [f04,16000,1471240833208_ChoreService_1] snapshot.SnapshotHFileCleaner: Exception while checking if files were valid, keeping them just in case.
> org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Couldn't read snapshot info from:hdfs://f04/hbase/.hbase-snapshot/.tmp/frog_stastic_2016-08-17/.snapshotinfo
>         at org.apache.hadoop.hbase.snapshot.SnapshotDescriptionUtils.readSnapshotInfo(SnapshotDescriptionUtils.java:295)
>         at org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.getHFileNames(SnapshotReferenceUtil.java:328)
>         at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner$1.filesUnderSnapshot(SnapshotHFileCleaner.java:85)
>         at org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getSnapshotsInProgress(SnapshotFileCache.java:303)
>         at org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getUnreferencedFiles(SnapshotFileCache.java:194)
>         at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner.getDeletableFiles(SnapshotHFileCleaner.java:62)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
>         at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
>         at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: File does not exist: /hbase/.hbase-snapshot/.tmp/frog_stastic_2016-08-17/.snapshotinfo
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:587)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> {code}
> It means that when SnapshotHFileCleaner starts to clean up the archive folder, it reads the snapshot dirs to check whether any links to hfiles still exist, but when it reads the file /.hbase-snapshot/.tmp/frog_stastic_2016-08-17/.snapshotinfo, a CorruptedSnapshotException is thrown (not sure why the file is missing), and the cleanup fails.
> When I check /.hbase-snapshot/.tmp/frog_stastic_2016-08-17, I can see only one file there, /hbase/.hbase-snapshot/.tmp/frog_stastic_2016-08-17/region-manifest.8e3179c388e10770eba7d35e30f2777f; /hbase/.hbase-snapshot/.tmp/frog_stastic_2016-08-17/.snapshotinfo is missing.
> I think we should catch the exception and delete the file so that the cleanup can go on.
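
A rough, hypothetical sketch of that catch-and-delete idea follows. This is not the attached patch: the class, method name, and surrounding usage are assumptions, and the readSnapshotInfo signature and SnapshotDescription type are recalled from the 1.x code rather than quoted from it.

{code}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.protobuf.generated.HBaseProtos.SnapshotDescription;
import org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException;
import org.apache.hadoop.hbase.snapshot.SnapshotDescriptionUtils;

// Hypothetical illustration only; not the actual HBASE-16464 patch.
public class CorruptTmpSnapshotCleanup {
  /**
   * Try to read the snapshot info of an in-progress (tmp) snapshot, e.g.
   * /hbase/.hbase-snapshot/.tmp/<snapshot-name>. If the snapshot is corrupt
   * (the .snapshotinfo file is missing), delete the tmp dir instead of failing
   * the whole cleaner chore, so later cleanups can go on.
   */
  static SnapshotDescription readOrDiscard(FileSystem fs, Path tmpSnapshotDir)
      throws IOException {
    try {
      return SnapshotDescriptionUtils.readSnapshotInfo(fs, tmpSnapshotDir);
    } catch (CorruptedSnapshotException e) {
      // The snapshot can never complete without its .snapshotinfo, so drop it.
      if (fs.delete(tmpSnapshotDir, true)) {
        return null; // caller skips this snapshot and keeps cleaning
      }
      throw e; // could not delete either; surface the original problem
    }
  }
}
{code}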



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
