hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9696) Garbage snapshot records lingering forever
Date Fri, 12 Aug 2016 14:48:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418942#comment-15418942 ]

Kihwal Lee commented on HDFS-9696:

It turns out that HDFS-9406 is not related to this issue.

The garbage snapshot filediffs with snapshotId=-1 were being generated by a bug fixed in HDFS-7056
by [~zero45].

   /** Is this inode in the latest snapshot? */
   public final boolean isInLatestSnapshot(final int latestSnapshotId) {
-    if (latestSnapshotId == Snapshot.CURRENT_STATE_ID) {
+    if (latestSnapshotId == Snapshot.CURRENT_STATE_ID ||
+        latestSnapshotId == Snapshot.NO_SNAPSHOT_ID) {
       return false;

[~shv] explained,
(7) Plamen says this is because Snapshot.findLatestSnapshot() may return NO_SNAPSHOT_ID, which
breaks recordModification() if you don't have that additional check. We see it when commitBlockSynchronization()
is called for truncated block.
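
The effect of the extra check can be illustrated with a simplified model. This is only a sketch, not the actual INode code: the constant values, the class name {{LatestSnapshotGuard}}, and the two helper methods are assumptions made for illustration, and the further checks that the real method performs after the guard are omitted.

```java
public class LatestSnapshotGuard {
    // Assumed values, standing in for the Snapshot constants named in the patch.
    static final int CURRENT_STATE_ID = Integer.MAX_VALUE - 1;
    static final int NO_SNAPSHOT_ID = -1;

    /** Pre-patch guard (hypothetical model): only CURRENT_STATE_ID short-circuits. */
    static boolean isInLatestSnapshotOld(int latestSnapshotId) {
        if (latestSnapshotId == CURRENT_STATE_ID) {
            return false;
        }
        // Simplification: the real method performs further checks here.
        return true;
    }

    /** Post-patch guard (hypothetical model): NO_SNAPSHOT_ID also short-circuits. */
    static boolean isInLatestSnapshotFixed(int latestSnapshotId) {
        if (latestSnapshotId == CURRENT_STATE_ID
                || latestSnapshotId == NO_SNAPSHOT_ID) {
            return false;
        }
        // Simplification: the real method performs further checks here.
        return true;
    }

    public static void main(String[] args) {
        // Models the case where the latest-snapshot lookup yields NO_SNAPSHOT_ID.
        int latest = NO_SNAPSHOT_ID;
        System.out.println("old guard treats inode as in a snapshot: "
                + isInLatestSnapshotOld(latest));
        System.out.println("fixed guard treats inode as in a snapshot: "
                + isInLatestSnapshotFixed(latest));
    }
}
```

In this model the old guard lets snapshotId=-1 through, which is the condition under which a modification would be recorded against a snapshot that does not exist.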

We have actually traced the generation of these filediff entries to {{commitBlockSynchronization()}}
activities when the NN was running 2.5. This stops in 2.7 thanks to HDFS-7056.  However, the
garbage lives on until those files are deleted.  Can we have a sanity check during snapshot
diff loading so that these entries can be discarded?
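
One possible shape for such a sanity check, as a rough sketch only. {{FileDiffRecord}} and {{discardGarbage()}} are hypothetical names, not existing HDFS classes or methods, and the value -1 for {{NO_SNAPSHOT_ID}} is assumed from the snapshotId=-1 entries described above. The actual loader works on the fsimage format and would need the equivalent filter applied there.

```java
import java.util.ArrayList;
import java.util.List;

public class FileDiffSanityCheck {
    // Assumed to match the snapshotId=-1 carried by the garbage entries.
    static final int NO_SNAPSHOT_ID = -1;

    /** Hypothetical stand-in for one persisted filediff entry. */
    static final class FileDiffRecord {
        final long inodeId;
        final int snapshotId;

        FileDiffRecord(long inodeId, int snapshotId) {
            this.inodeId = inodeId;
            this.snapshotId = snapshotId;
        }
    }

    /** Returns the loaded entries minus those that refer to no snapshot. */
    static List<FileDiffRecord> discardGarbage(List<FileDiffRecord> loaded) {
        List<FileDiffRecord> kept = new ArrayList<>();
        for (FileDiffRecord record : loaded) {
            if (record.snapshotId == NO_SNAPSHOT_ID) {
                // Garbage entry: it cannot belong to any real snapshot.
                continue;
            }
            kept.add(record);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<FileDiffRecord> loaded = new ArrayList<>();
        loaded.add(new FileDiffRecord(1001L, NO_SNAPSHOT_ID)); // garbage
        loaded.add(new FileDiffRecord(1002L, 4));              // valid
        List<FileDiffRecord> kept = discardGarbage(loaded);
        System.out.println("kept " + kept.size() + " of " + loaded.size());
    }
}
```

Dropping the entries at load time means the next checkpoint would no longer carry them forward, so they would not have to wait for the affected files to be deleted.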

> Garbage snapshot records lingering forever
> ------------------------------------------
>                 Key: HDFS-9696
>                 URL: https://issues.apache.org/jira/browse/HDFS-9696
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>            Reporter: Kihwal Lee
>            Priority: Critical
> We have a cluster where the snapshot feature might have been tested years ago. The
> HDFS does not have any snapshots now, but I see filediff records persisted in its fsimage. Since
> it has been restarted many times and checkpointed over 100 times since then, the records must have
> been persisted and carried over the whole time.
