hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-2692) HA: Bugs related to failover from/into safe-mode
Date Wed, 21 Dec 2011 02:05:30 GMT

     [ https://issues.apache.org/jira/browse/HDFS-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-2692:

    Attachment: hdfs-2692.txt

Attached patch fixes the issue described.

The fix turned out to be reasonably simple. The root of the issue is that we were calling
{{incrementSafeBlockCount}} after receiving edits, but without calling {{setBlockTotal}} in
between to update the safe mode state. As a fix, I delayed all of the {{notifyGenStampUpdate}}
calls in {{FSEditLogLoader}} until after the edits have all been processed, and now call
{{setBlockTotal}} just before making them.
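
To illustrate the ordering problem, here is a rough toy model. It is not the patch itself: {{SafeModeState}} and {{EditLoader}} are hypothetical stand-ins, and the sketch assumes that each deferred notification ends up as a single {{incrementSafeBlockCount}} call. It only shows why refreshing the block total before delivering the queued notifications keeps the safe count within bounds.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical stand-ins, not the HDFS code.
final class DeferredNotifySketch {

    /** Minimal safe-mode bookkeeping: the safe count must never exceed the total. */
    static final class SafeModeState {
        private int blockTotal;
        private int blockSafe;

        void setBlockTotal(int total) {
            blockTotal = total;
        }

        void incrementSafeBlockCount() {
            blockSafe++;
            if (blockSafe > blockTotal) {
                throw new IllegalStateException(
                    "blockSafe=" + blockSafe + " exceeds blockTotal=" + blockTotal);
            }
        }

        int getBlockSafe() {
            return blockSafe;
        }
    }

    /** Applies edits that each add one block, then counts that block as safe. */
    static final class EditLoader {
        private final SafeModeState safeMode = new SafeModeState();
        private int blocksInNamespace;

        // Old ordering (assumed): the safe count is bumped while the total is stale.
        void loadNotifyingImmediately(int newBlocks) {
            for (int i = 0; i < newBlocks; i++) {
                blocksInNamespace++;
                safeMode.incrementSafeBlockCount();
            }
        }

        // New ordering: queue notifications, refresh the total, then deliver them.
        void loadDeferringNotifications(int newBlocks) {
            List<Runnable> pending = new ArrayList<>();
            for (int i = 0; i < newBlocks; i++) {
                blocksInNamespace++;
                pending.add(safeMode::incrementSafeBlockCount);
            }
            safeMode.setBlockTotal(blocksInNamespace);
            for (Runnable notification : pending) {
                notification.run();
            }
        }

        int safeBlocks() {
            return safeMode.getBlockSafe();
        }
    }

    public static void main(String[] args) {
        EditLoader deferred = new EditLoader();
        deferred.loadDeferringNotifications(3);
        System.out.println("deferred: safe blocks = " + deferred.safeBlocks());

        try {
            new EditLoader().loadNotifyingImmediately(3);
        } catch (IllegalStateException e) {
            System.out.println("immediate: " + e.getMessage());
        }
    }
}
```

In this model the immediate path fails on the first new block because the total was never updated, while the deferred path counts every block safely.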

The only other notable code change was to remove the optimization in {{removeBlock}} that
keeps the DNs from acking blocks removed due to file deletions. We should think about how
important that optimization is and whether it's actually "safe" - it was breaking one of the
new unit tests but it may be just fine in "real life".

Lastly, I cleaned up the code that threw the original assertion error so that the assertion
message now includes some actionable details.

> HA: Bugs related to failover from/into safe-mode
> ------------------------------------------------
>                 Key: HDFS-2692
>                 URL: https://issues.apache.org/jira/browse/HDFS-2692
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hdfs-2692.txt
> In testing I saw an AssertionError come up several times when I was trying to do failover
> between two NNs where one or the other was in safe-mode. Need to write some unit tests to
> try to trigger this -- hunch is it has something to do with the treatment of "safe block count"
> while tailing edits in safemode.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

