hadoop-hdfs-issues mailing list archives

From "Daniel Pol (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4015) Safemode should count and report orphaned blocks
Date Fri, 17 Mar 2017 00:35:42 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929237#comment-15929237 ]

Daniel Pol commented on HDFS-4015:
----------------------------------

RE: "In this patch we track blocks with a generation stamp greater than the current highest generation stamp that is known to the NN. I have made the assumption that if a DN comes back online and reports blocks for files that have been deleted, the generation IDs for those blocks will be less than the current generation stamp of the NN. Please let me know if you think this assumption is not valid or breaks down in special cases. Could this happen with V1 vs V2 generation stamps?"

I'm hitting the case with the same generation ID quite often during testing. The test scenario: run Teragen, and for various reasons (mostly Hadoop settings, but think power failures too) the datanode service on some nodes dies abruptly. While the bad nodes are down, you delete the Teragen output folder (to free up space on the remaining good nodes, which are now trying to maintain the replication factor with fewer nodes). Once all nodes are back up and running, the bad nodes have orphaned blocks with the same generation IDs. Right now it's pretty painful to get rid of those manually.
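For illustration, the generation-stamp heuristic quoted above, and the gap Daniel describes, can be sketched as follows. This is a simplified, hypothetical model (the class, field, and method names are illustrative, not the actual NameNode code):

```java
// Hypothetical sketch of the "future generation stamp" heuristic from the
// quoted comment. Names and values are illustrative only.
public class OrphanCheckSketch {
    // Highest generation stamp the NameNode has ever issued (illustrative).
    static final long NN_HIGHEST_GEN_STAMP = 1000L;

    // A reported block is flagged as "in the future" only if its generation
    // stamp exceeds the highest stamp known to the NameNode.
    static boolean isFutureBlock(long blockGenStamp) {
        return blockGenStamp > NN_HIGHEST_GEN_STAMP;
    }

    public static void main(String[] args) {
        // Block from a datanode that was down while its file was deleted:
        // its stamp equals one the NameNode already issued, so the heuristic
        // does NOT flag it -- the equal-generation-ID case described above.
        long orphanedSameStamp = 1000L;
        // Block whose stamp is ahead of what a rolled-back NameNode knows:
        // this one IS flagged.
        long futureStamp = 1005L;

        System.out.println(isFutureBlock(orphanedSameStamp)); // false
        System.out.println(isFutureBlock(futureStamp));       // true
    }
}
```

The point of the sketch is that a strict greater-than comparison cannot see orphaned blocks whose generation stamps fall at or below the NameNode's current maximum, which is exactly the deleted-while-down scenario.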

> Safemode should count and report orphaned blocks
> ------------------------------------------------
>
>                 Key: HDFS-4015
>                 URL: https://issues.apache.org/jira/browse/HDFS-4015
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Todd Lipcon
>            Assignee: Anu Engineer
>             Fix For: 2.8.0, 3.0.0-alpha1
>
>         Attachments: HDFS-4015.001.patch, HDFS-4015.002.patch, HDFS-4015.003.patch, HDFS-4015.004.patch,
> HDFS-4015.005.patch, HDFS-4015.006.patch, HDFS-4015.007.patch
>
>
> The safemode status currently reports the number of unique reported blocks compared to
> the total number of blocks referenced by the namespace. However, it does not report the inverse:
> blocks which are reported by datanodes but not referenced by the namespace.
> In the case that an admin accidentally starts up from an old image, this can be confusing:
> safemode and fsck will show "corrupt files", which are the files which actually have been
> deleted but got resurrected by restarting from the old image. This will convince them that
> they can safely force leave safemode and remove these files -- after all, they know that those
> files should really have been deleted. However, they're not aware that leaving safemode will
> also unrecoverably delete a bunch of other block files which have been orphaned due to the
> namespace rollback.
> I'd like to consider reporting something like: "900000 of expected 1000000 blocks have
> been reported. Additionally, 10000 blocks have been reported which do not correspond to any
> file in the namespace. Forcing exit of safemode will unrecoverably remove those data blocks."
> Whether this statistic is also used for some kind of "inverse safe mode" is the logical
> next step, but just reporting it as a warning seems easy enough to accomplish and worth doing.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

