hadoop-hdfs-dev mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-4811) race condition between 2 namenodes in standby that are trying to checkpoint with one another can delete or corrupt a good fsimage
Date Thu, 09 May 2013 21:41:16 GMT
Chris Nauroth created HDFS-4811:
-----------------------------------

             Summary: race condition between 2 namenodes in standby that are trying to checkpoint with one another can delete or corrupt a good fsimage
                 Key: HDFS-4811
                 URL: https://issues.apache.org/jira/browse/HDFS-4811
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: ha
    Affects Versions: 3.0.0, 2.0.5-beta
            Reporter: Chris Nauroth


The problem occurs when a namenode runs its own checkpoint in {{StandbyCheckpointer}}
(thread 1) while concurrently receiving a checkpoint from a different namenode in
{{GetImageServlet}} (thread 2).  Thread 2 can finish writing the checkpoint to the
directory but then be suspended before it renames the file to its final destination
as an fsimage file.  Thread 1 then wakes up and starts writing its own data to the
same checkpoint file.  When thread 2 resumes, it tries to rename the file that thread
1 still holds open for writing.  Depending on the OS, this either moves thread 1's
incomplete checkpoint to fsimage, or it deletes the existing good fsimage outright,
leaving no fsimage in place until thread 1 finishes writing and renames.
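
The interleaving can be replayed in a fixed order to make the first outcome visible. The following is a minimal sketch, not the HDFS code: the class name, the file names {{fsimage.ckpt}} and {{fsimage}}, and the string payloads are all hypothetical, and the two "threads" are simulated sequentially so the ordering is deterministic. On a POSIX-like file system the rename of an open file is expected to succeed and the final file holds the incomplete data; on other platforms the rename may fail instead.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CheckpointRaceSketch {
    // Hypothetical payloads standing in for real image data.
    static final String OLD_GOOD = "OLD-GOOD-IMAGE";
    static final String COMPLETE = "COMPLETE-IMAGE-FROM-PEER";
    static final String PARTIAL = "PARTIAL";

    /** Replays the problematic ordering and returns the final contents of fsimage. */
    static String replayInterleaving(Path dir) throws IOException {
        Path ckpt = dir.resolve("fsimage.ckpt"); // hypothetical shared temporary name
        Path fsimage = dir.resolve("fsimage");   // hypothetical final name

        // An existing good image is already in place.
        Files.write(fsimage, OLD_GOOD.getBytes(StandardCharsets.UTF_8));

        // "Thread 2": finishes writing the uploaded checkpoint, then is suspended
        // before it can rename the file.
        Files.write(ckpt, COMPLETE.getBytes(StandardCharsets.UTF_8));

        // "Thread 1": opens the same file (truncating it) and writes part of its data.
        try (OutputStream out = Files.newOutputStream(ckpt)) {
            out.write(PARTIAL.getBytes(StandardCharsets.UTF_8));
            out.flush();

            // "Thread 2" resumes and renames while thread 1 still holds the file open.
            Files.move(ckpt, fsimage, StandardCopyOption.REPLACE_EXISTING);
        }
        return new String(Files.readAllBytes(fsimage), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ckpt-race");
        System.out.println("fsimage now contains: " + replayInterleaving(dir));
    }
}
```

The sketch only demonstrates why a shared temporary file name is hazardous when two writers can each perform the final rename; it does not model the second, OS-dependent outcome in which the good fsimage is removed.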

