hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1981) When namenode goes down while checkpointing and if is started again subsequent Checkpointing is always failing
Date Sat, 25 Jun 2011 00:52:47 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054781#comment-13054781 ]

Konstantin Shvachko commented on HDFS-1981:
-------------------------------------------

Not sure what introduced it, but the problem is that the NN does not call saveNamespace() when editsNew is present.
This only happens in Ramakrishna's scenario, when editsNew is empty; that is, when you start
the checkpoint and the NN fails before anything in the namespace is modified.

Deleting editsNew is probably valid, but it is not consistent, since at this stage the NN is in read-only
mode. That is, if something goes wrong, we should leave the storage directory in exactly the
same state as it was before startup.

I propose incrementing numEdits if editsNew exists. This will trigger saving the namespace after
loading. So it is just a one-line change:
{code}
  if (editsNew.exists() && editsNew.length() > 0) {
+   numEdits ++;
    edits = new EditLogFileInputStream(editsNew);
    numEdits += loader.loadFSEdits(edits);
    edits.close();
  }
{code}
Well, maybe not one line, as you need to increment even if {{editsNew.length() == 0}}.
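A minimal, hypothetical Java model of the counter logic may make the difference concrete. It is not the FSImage code: the class and method names are invented, the parameters stand in for {{editsNew.exists()}}, {{editsNew.length()}} and the value returned by {{loader.loadFSEdits(edits)}}, and it assumes that a positive numEdits is what triggers saveNamespace() after loading. It keeps loading guarded by the length check and moves only the increment out of it, which is one possible reading of the note above.

```java
/**
 * Hypothetical, simplified model of how an existing edits.new affects the
 * numEdits counter at NN startup. NOT the real FSImage code: the parameters
 * stand in for editsNew.exists(), editsNew.length() and the count returned
 * by loader.loadFSEdits(edits). Assumes numEdits > 0 triggers saveNamespace().
 */
final class EditsNewStartupModel {
    private EditsNewStartupModel() {}

    /** Current behaviour: an empty edits.new is skipped, so numEdits is unchanged. */
    static int numEditsCurrent(int numEdits, boolean editsNewExists,
                               long editsNewLength, int editsLoadedFromEditsNew) {
        int total = numEdits;
        if (editsNewExists && editsNewLength > 0) {
            total += editsLoadedFromEditsNew;
        }
        return total;
    }

    /** Proposed behaviour: the mere presence of edits.new bumps numEdits. */
    static int numEditsProposed(int numEdits, boolean editsNewExists,
                                long editsNewLength, int editsLoadedFromEditsNew) {
        int total = numEdits;
        if (editsNewExists) {
            total++; // forces saveNamespace() even when edits.new is empty
            if (editsNewLength > 0) {
                // Only a non-empty edits.new is actually read.
                total += editsLoadedFromEditsNew;
            }
        }
        return total;
    }

    /** Assumed trigger: any loaded (or forced) edit means the image is saved. */
    static boolean shouldSaveNamespace(int numEdits) {
        return numEdits > 0;
    }
}
```

With edits.new present but empty and no other edits, {{numEditsCurrent(0, true, 0, 0)}} returns 0, so no save happens, while {{numEditsProposed(0, true, 0, 0)}} returns 1 and forces one.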

Your test should work in this case as well. Could you please convert it to JUnit4 and use
{{MiniDFSCluster.Builder}} instead of a direct constructor.

> When namenode goes down while checkpointing and if is started again subsequent Checkpointing is always failing
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-1981
>                 URL: https://issues.apache.org/jira/browse/HDFS-1981
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.22.0
>         Environment: Linux
>            Reporter: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.22.0
>
>         Attachments: HDFS-1981.patch
>
>
> This scenario applies to both the NN and BNN cases.
> When the namenode goes down after creating edits.new, on a subsequent restart divertFileStreams does not happen for edits.new, because the edits.new file is already present and its size is zero.
> So when trying to saveCheckPoint, an exception occurs:
> 2011-05-23 16:38:57,476 WARN org.mortbay.log: /getimage: java.io.IOException: GetImage failed. java.io.IOException: Namenode has an edit log with timestamp of 2011-05-23 16:38:56 but new checkpoint was created using editlog  with timestamp 2011-05-23 16:37:30. Checkpoint Aborted.
> Is this a bug, or is this the expected behaviour?
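The exception in the log is a consistency check: the checkpoint is rejected because it was built from an edit log whose timestamp (16:37:30) differs from that of the edit log the namenode currently has (16:38:56). Below is a minimal sketch of such a check in Java, assuming it is a plain equality comparison of two timestamps; the class name, the method name and the use of long values are hypothetical, and the real HDFS check may differ.

```java
import java.io.IOException;

/**
 * Hypothetical sketch of the check implied by the log message: an uploaded
 * checkpoint is accepted only if it was built from the edit log the NN is
 * currently using. Names and the long-timestamp representation are
 * illustrative assumptions, not the HDFS implementation.
 */
final class CheckpointTimestampCheck {
    private CheckpointTimestampCheck() {}

    static void validate(long namenodeEditLogTimestamp,
                         long checkpointEditLogTimestamp) throws IOException {
        if (namenodeEditLogTimestamp != checkpointEditLogTimestamp) {
            // Mismatch: the checkpoint was not built from the NN's current edit log.
            throw new IOException("Namenode has an edit log with timestamp of "
                    + namenodeEditLogTimestamp
                    + " but new checkpoint was created using editlog with timestamp "
                    + checkpointEditLogTimestamp + ". Checkpoint Aborted.");
        }
    }
}
```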

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
