hadoop-common-dev mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-1076) Periodic checkpointing cannot resume if the secondary name-node fails.
Date Tue, 28 Aug 2007 08:59:30 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-1076:
-------------------------------------

    Attachment: secondaryRestart.patch

This patch allows the secondary namenode to restart without restarting the primary namenode.

If rollEditLog finds that the edits log already exists, it simply returns success. rollFsImage
fails if it was not preceded by a call to rollEditLog. This lock-step ensures that a stale
instance of a secondary namenode cannot fool the primary namenode into uploading a stale
fsimage file.
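
The lock-step described above can be illustrated with a small in-memory model. This is only a sketch, not the code in secondaryRestart.patch: the class name CheckpointLockStep, the boolean flag, and the exception message are hypothetical, and it assumes that "the edits log already exists" refers to the "edits.new" file mentioned in the issue description below.

```java
import java.io.IOException;

/** Hypothetical in-memory model of the primary namenode's checkpoint state. */
class CheckpointLockStep {
    // Assumption: true while "edits.new" exists, i.e. the edit log has been rolled.
    private boolean editsNewExists = false;

    /** Rolls the edit log; if it was already rolled, does nothing and reports success. */
    synchronized boolean rollEditLog() {
        if (!editsNewExists) {
            editsNewExists = true; // models creating "edits.new"
        }
        return true;
    }

    /** Accepts a new image only if rollEditLog was called first. */
    synchronized void rollFsImage() throws IOException {
        if (!editsNewExists) {
            throw new IOException("rollFsImage was not preceded by rollEditLog");
        }
        editsNewExists = false; // models renaming "edits.new" back to "edits"
    }

    synchronized boolean isEditLogRolled() {
        return editsNewExists;
    }
}
```

In this model, a restarted secondary namenode can call rollEditLog again without error, while a stale instance that calls rollFsImage after the checkpoint has already completed is rejected.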

> Periodic checkpointing cannot resume if the secondary name-node fails.
> ----------------------------------------------------------------------
>
>                 Key: HADOOP-1076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1076
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Konstantin Shvachko
>             Fix For: 0.15.0
>
>         Attachments: secondaryRestart.patch
>
>
> If the secondary name-node fails during checkpointing, the primary node will have two edits files:
> "edits" is the one the current checkpoint is to be based upon.
> "edits.new" is where new name space edits are currently logged.
> The problem is that the primary node cannot do checkpointing as long as the "edits.new" file is in place.
> That is, even if the secondary name-node is restarted, periodic checkpointing is not going to be resumed.
> In fact, the primary node will throw an exception complaining about the existing "edits.new".
> There is only one way to get rid of the edits.new file: restart the primary name-node.
> So, in effect, if the secondary name-node fails, you have to restart the whole cluster.
> Here is a rather simple modification to the current approach, which we discussed with Dhruba.
> When the secondary node requests rollEditLog(), the primary node should roll the edit log only if
> it has not already been rolled. Otherwise the existing "edits" file will be used for checkpointing
> and the primary node will keep accumulating new edits in "edits.new".
> To make this work, the primary node should also ignore any rollFSImage() request while it
> is already performing one. Otherwise the new image can become corrupted if two secondary
> nodes request rollFSImage() at the same time.
> 2. Also, after the periodic checkpointing patch HADOOP-227 I see pieces of unused code.
> I noticed one data member, SecondaryNameNode.localName, and at least 4 methods in FSEditLog
> that are not used anywhere. We should remove them, and any others like them that are found.
> Supporting unused code is such a waste of time.
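
The quoted description also proposes that the primary node ignore a rollFSImage() request while another one is in progress. One possible way to express that guard is sketched below. The class ImageRollGuard and the method tryRollFsImage are hypothetical names used for illustration and do not come from the Hadoop source; the actual image roll is passed in as a Runnable so the sketch stays self-contained.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical guard that lets only one image roll run at a time. */
class ImageRollGuard {
    private final AtomicBoolean rollInProgress = new AtomicBoolean(false);

    /**
     * Runs the given roll action unless another roll is already running.
     * Returns true if the action ran, false if the request was ignored.
     */
    boolean tryRollFsImage(Runnable roll) {
        // Only the caller that flips the flag from false to true proceeds.
        if (!rollInProgress.compareAndSet(false, true)) {
            return false; // another roll is in progress: ignore this request
        }
        try {
            roll.run();
            return true;
        } finally {
            rollInProgress.set(false); // allow the next checkpoint to roll
        }
    }
}
```

A request that arrives while a roll is running is dropped rather than queued, which matches the "ignore" wording of the proposal. Whether the real fix should reject such a request with an error instead is a separate design decision.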

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

