hadoop-hdfs-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3519) Checkpoint upload may interfere with a concurrent saveNamespace
Date Tue, 13 Jan 2015 00:34:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14274486#comment-14274486 ]

Hadoop QA commented on HDFS-3519:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12691715/HDFS-3519-2.patch
  against trunk revision b78b4a1.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}.  The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

                  org.apache.hadoop.hdfs.qjournal.client.TestQuorumJournalManager
                  org.apache.hadoop.hdfs.server.balancer.TestBalancer
                  org.apache.hadoop.hdfs.TestReplaceDatanodeOnFailure

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9187//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9187//console

This message is automatically generated.

> Checkpoint upload may interfere with a concurrent saveNamespace
> ---------------------------------------------------------------
>
>                 Key: HDFS-3519
>                 URL: https://issues.apache.org/jira/browse/HDFS-3519
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Todd Lipcon
>            Assignee: Ming Ma
>            Priority: Critical
>         Attachments: HDFS-3519-2.patch, HDFS-3519.patch, test-output.txt
>
>
> TestStandbyCheckpoints failed in [precommit build 2620|https://builds.apache.org/job/PreCommit-HDFS-Build/2620//testReport/]
> due to the following issue:
> - Both nodes were in Standby state and configured to checkpoint "as fast as possible".
> - NN1 starts to save its own namespace.
> - NN2 starts to upload a checkpoint for the same txid. Both threads are now writing to
>   the same file, fsimage.ckpt_12, but the actual file contents correspond to the uploading
>   thread's data.
> - NN1 finishes its saveNamespace operation while NN2 is still uploading, so it renames
>   the ckpt file. However, the file is still empty, since NN2 hasn't sent any bytes yet.
> - NN2 finishes the upload, and its rename() call fails, which causes the directory to
>   be marked failed, etc.
> The result is a file fsimage_12 that appears to be a finalized image but is in fact
> incompletely transferred. When the transfer completes, the problem "heals itself", so
> there wouldn't be persistent corruption unless the machine crashed at the same time.
> Even then, we'd still have the earlier checkpoint to restore from.
> I believe this same race could occur in a non-HA setup if a user puts the NN in safe mode
> and issues saveNamespace operations concurrently with a 2NN checkpoint.
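
The interleaving described above can be replayed deterministically on a single thread. The sketch below is only an illustration under stated assumptions: it is not NameNode code and not the approach taken in the attached patch. The class name CheckpointRaceSketch, the Result holder, and the placeholder image contents are all hypothetical, and the behaviour of writing through a handle after the file was renamed assumes a POSIX filesystem.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckpointRaceSketch {

    /** What the replay observed. Hypothetical holder, for illustration only. */
    static final class Result {
        final long sizeAfterFirstRename;
        final boolean uploadRenameFailed;
        final long finalSize;

        Result(long sizeAfterFirstRename, boolean uploadRenameFailed, long finalSize) {
            this.sizeAfterFirstRename = sizeAfterFirstRename;
            this.uploadRenameFailed = uploadRenameFailed;
            this.finalSize = finalSize;
        }
    }

    /** Replays the steps from the issue description inside the given directory. */
    static Result replay(Path dir) throws IOException {
        Path ckpt = dir.resolve("fsimage.ckpt_12");
        Path finalized = dir.resolve("fsimage_12");

        // NN1's saveNamespace writes its image into the shared temporary file.
        Files.write(ckpt, "nn1-image".getBytes(StandardCharsets.UTF_8));

        long sizeAfterFirstRename;
        // NN2's upload opens the same path; the default open options truncate it.
        try (OutputStream upload = Files.newOutputStream(ckpt)) {
            // NN1 finishes first and renames the temporary file to its final name.
            Files.move(ckpt, finalized);
            sizeAfterFirstRename = Files.size(finalized);

            // NN2's bytes arrive late. On POSIX the open handle follows the renamed file.
            upload.write("nn2-image".getBytes(StandardCharsets.UTF_8));
        }

        // NN2 now tries its own rename, but the temporary file is already gone.
        boolean uploadRenameFailed = false;
        try {
            Files.move(ckpt, finalized);
        } catch (IOException e) {
            uploadRenameFailed = true;
        }
        return new Result(sizeAfterFirstRename, uploadRenameFailed, Files.size(finalized));
    }

    public static void main(String[] args) throws IOException {
        Result r = replay(Files.createTempDirectory("ckpt-race"));
        System.out.println("fsimage_12 size right after NN1's rename: " + r.sizeAfterFirstRename);
        System.out.println("NN2's rename failed: " + r.uploadRenameFailed);
        System.out.println("fsimage_12 size after the upload completes: " + r.finalSize);
    }
}
```

On a POSIX filesystem this should show fsimage_12 as empty right after NN1's rename, NN2's rename failing, and the file becoming complete only once the upload finishes, which matches the "heals itself" observation in the description.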



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
