hadoop-common-dev mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Assigned: (HADOOP-1076) Periodic checkpointing cannot resume if the secondary name-node fails.
Date Tue, 28 Aug 2007 09:01:30 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

dhruba borthakur reassigned HADOOP-1076:

    Assignee: dhruba borthakur

> Periodic checkpointing cannot resume if the secondary name-node fails.
> ----------------------------------------------------------------------
>                 Key: HADOOP-1076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1076
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Konstantin Shvachko
>            Assignee: dhruba borthakur
>             Fix For: 0.15.0
>         Attachments: secondaryRestart.patch
> If the secondary name-node fails during checkpointing, the primary node will be left with 2 edits files:
> "edits" - the one the current checkpoint is to be based upon.
> "edits.new" - where new name space edits are currently logged.
> The problem is that the primary node cannot do checkpointing as long as the "edits.new" file is in place.
> That is, even if the secondary name-node is restarted, periodic checkpointing will not resume.
> In fact the primary node will throw an exception complaining about the existing "edits.new" file.
> There is only one way to get rid of the edits.new file: restart the primary name-node.
> So, in effect, if the secondary name-node fails you have to restart the whole cluster.
> Here is a rather simple modification to the current approach, which we discussed with
> When the secondary node requests rollEditLog(), the primary node should roll the edit log only if
> it has not already been rolled. Otherwise the existing "edits" file will be used for the checkpoint,
> and the primary node will keep accumulating new edits in "edits.new".
> For this to work, the primary node should also ignore any rollFSImage() request that arrives while it
> is already performing one. Otherwise the new image can become corrupted if two secondary
> nodes request rollFSImage() at the same time.
> 2. Also, after the periodic checkpointing patch HADOOP-227 I see pieces of unused code.
> I noticed one data member, SecondaryNameNode.localName, and at least 4 methods in FSEditLog
> that are not used anywhere. We should remove them, and any others like them that we find.
> Supporting unused code is such a waste of time.
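
The modification proposed in the quoted description (roll the edit log only if it
has not been rolled already, and ignore rollFSImage() requests while one is in
progress) can be modelled roughly as below. This is a minimal Python sketch of the
idea, not the Hadoop Java code and not the contents of secondaryRestart.patch. The
class CheckpointState, its fields, its method names, and the True/False return
values are all hypothetical; a boolean stands in for the "edits.new" file.

```python
import threading


class CheckpointState:
    """Toy model of the primary name-node's checkpoint bookkeeping.

    Hypothetical: nothing here is taken from the Hadoop source tree.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.edits_new_exists = False        # stands in for the "edits.new" file
        self.image_roll_in_progress = False  # True while a rollFSImage() is running

    def roll_edit_log(self):
        """Roll the edit log only if it has not been rolled already.

        Returns True if this call rolled the log, False if an earlier roll
        is reused (for example after a secondary died mid-checkpoint).
        """
        with self._lock:
            if self.edits_new_exists:
                # Already rolled: the checkpoint is based on the existing
                # "edits", and new edits keep accumulating in "edits.new".
                return False
            self.edits_new_exists = True
            return True

    def begin_roll_fs_image(self):
        """Return True if the caller may install the new image, False if ignored."""
        with self._lock:
            if self.image_roll_in_progress:
                # A second concurrent request is ignored rather than allowed
                # to interleave with the first one.
                return False
            self.image_roll_in_progress = True
            return True

    def finish_roll_fs_image(self):
        """Complete the roll: "edits.new" takes over as "edits"."""
        with self._lock:
            if not self.image_roll_in_progress:
                return
            self.edits_new_exists = False
            self.image_roll_in_progress = False


state = CheckpointState()
state.roll_edit_log()  # first secondary rolls the log, then fails
state.roll_edit_log()  # restarted secondary: no exception, earlier roll is reused
```

In this model a restarted secondary can carry on from the already-rolled log
without the primary being restarted, which is the behaviour the description asks for.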

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
