hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
Date Wed, 03 Feb 2010 21:11:28 GMT

    [ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829262#action_12829262 ]

Todd Lipcon commented on HDFS-909:
----------------------------------

Hi Konstantin. Thanks for the in-depth exploration of that race.

Let me ask a question - sorry if I'm being dense here :)

In FSImage.saveFSImage(boolean) (FSImage.java:1243) we create a new "current" edit log file before rolling the FSImage. This seems to be related to the race, and I don't quite understand the design here. It seems to me that the correct operation of saveFSImage should be:

1. Dump the image in IMAGE_NEW
2. Create a new empty edit log in EDITS_NEW (while leaving EDITS alone)
3. Call purgeEditLogs which rolls the new image/edits pair in.
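
For concreteness, here is a rough sketch of that ordering against a single storage directory. The file names mirror the on-disk layout (fsimage.ckpt for IMAGE_NEW, edits.new for EDITS_NEW), but the class and helper methods are hypothetical stand-ins, not the real FSImage/FSEditLog code:

{code}
import java.io.File;
import java.io.IOException;

// Illustrative only: the three-step save order proposed above.
class SaveImageSketch {
  private final File currentDir;

  SaveImageSketch(File currentDir) { this.currentDir = currentDir; }

  synchronized void saveFSImage() throws IOException {
    // 1. Dump the in-memory namespace into IMAGE_NEW.
    dumpNamespace(new File(currentDir, "fsimage.ckpt"));
    // 2. Create an empty EDITS_NEW; the existing edits file is left alone, so a
    //    crash anywhere up to this point still recovers from the old IMAGE/EDITS.
    createEmptyFile(new File(currentDir, "edits.new"));
    // 3. Roll the NEW pair in (what purgeEditLogs/rollFSImage is trusted to do
    //    safely): promote IMAGE_NEW/EDITS_NEW to IMAGE/EDITS.
    rename(new File(currentDir, "fsimage.ckpt"), new File(currentDir, "fsimage"));
    rename(new File(currentDir, "edits.new"), new File(currentDir, "edits"));
  }

  private void dumpNamespace(File image) throws IOException {
    createEmptyFile(image); // placeholder for serializing the namespace
  }

  private void createEmptyFile(File f) throws IOException {
    if (!f.createNewFile() && !f.isFile()) {
      throw new IOException("could not create " + f);
    }
  }

  private void rename(File from, File to) throws IOException {
    if (to.exists() && !to.delete()) {
      throw new IOException("could not remove " + to);
    }
    if (!from.renameTo(to)) {
      throw new IOException("rename failed: " + from + " -> " + to);
    }
  }
}
{code}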

Analyzing the failure cases:
- During step 1, if the NN restarts, it can recover from the preexisting IMAGE/EDITS as if the saveFSImage never started.
- Between steps 1 and 2, it can recover from either the preexisting IMAGE/EDITS or from IMAGE_NEW. IMAGE_NEW is consistent and up to date because saveFSImage is synchronized, and thus it represents the exact state of the namesystem.
- The same goes for step 2 - the empty edit log is correct against the new image. If we decide to recover from current instead of NEW at this point, it would also be correct.
- We must assume that step 3 performs the roll in a way that allows recovery - this is what allows rolling to work safely in general.

In the implementation as I understand it, step 2 is implemented differently: it calls createEditLogFile on EDITS, not just EDITS_NEW. This truncates the edits, which means we can only recover from the NEW image at this point. This makes for tricky recovery logic, since we have an old image with a truncated edit log, plus a new image (which we don't have any great way of knowing is complete). Additionally, I don't think there's a guarantee that steps 1 and 2 happen in that order - the order within saveFSImage is determined by the order of the storage directories in the directory iterator. It seems like edits _usually_ come last, but in the case of failed/recovered storage directories, I don't think that's always true. If the order of steps 1 and 2 is inverted, it could truncate the current edits log before saving the new image, and thus end up losing all those edits if the NN failed in between.
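
To make that ordering concern concrete, here is an illustrative (again, not the actual dirIterator code) walk over the storage directories:

{code}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

// Illustrative only: why the visit order of storage directories matters when
// "step 2" truncates the live edits file rather than just creating EDITS_NEW.
class SaveOrderHazard {
  enum DirType { IMAGE, EDITS }

  static class StorageDir {
    final File root;
    final DirType type;
    StorageDir(File root, DirType type) { this.root = root; this.type = type; }
  }

  void save(List<StorageDir> dirs) throws IOException {
    for (StorageDir sd : dirs) {                          // order comes from the iterator
      if (sd.type == DirType.IMAGE) {
        dumpNamespace(new File(sd.root, "fsimage.ckpt")); // "step 1"
      } else {
        // "step 2" as currently implemented: truncates the current edits file.
        // If an EDITS directory is visited before any IMAGE directory, there is
        // a window where the only complete image on disk is the old one and its
        // edits are already gone - a crash here loses every edit since the last
        // checkpoint.
        new FileOutputStream(new File(sd.root, "edits")).close();
      }
    }
  }

  private void dumpNamespace(File image) throws IOException {
    new FileOutputStream(image).close();                  // placeholder for the real dump
  }
}
{code}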

Am I misunderstanding something? The simple fix to the issue you suggested above is to add a waitForSyncToFinish in createEditLogFile, but I'm worried there may be more subtle issues as described above.

Thanks!

> Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-909
>                 URL: https://issues.apache.org/jira/browse/HDFS-909
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
>         Environment: CentOS
>            Reporter: Cosmin Lehene
>            Assignee: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log file, resulting in the OP_INVALID end-of-file marker being initially overwritten by the concurrent threads (in setReadyToFlush) and then removed twice from the buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() -> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> Any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the FSEditLog.logSync() method level. However, at a lower level, access to EditLogOutputStream setReadyToFlush(), flush() or flushAndSync() is NOT synchronized. These can be called from concurrent threads as in the example above.
> So if a rollEditLog or rollFSImage happens at the same time as a write operation, they can race on EditLogFileOutputStream.setReadyToFlush, which will overwrite the last byte (normally FSEditLog.OP_INVALID, the "end-of-file marker") and then remove it twice (once from each thread) in flushAndSync()! Hence there will be a valid byte missing from the edits log, which leads to a SecondaryNameNode silent failure and a full HDFS failure upon cluster restart.
> We got to this point after investigating a corrupted edits file that made HDFS unable to start with
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is 768.
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
> {code}
> EDIT: moved the logs to a comment to make this readable
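
The double rewind described in the quoted report can be shown in isolation. A standalone sketch (plain java.io, none of the HDFS classes) of how two racing "remove the end-of-file marker" steps leave the file position one byte too far back:

{code}
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative only: flushAndSync() steps the file position back one byte to
// drop the end-of-file marker; two racing flushes step back twice, so the next
// write lands on a byte that belonged to a real edit record.
public class DoubleRewindSketch {
  public static void main(String[] args) throws IOException {
    File f = File.createTempFile("edits", ".tmp");
    try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
      // Layout after a flush: three good op bytes followed by the OP_INVALID marker.
      raf.write(new byte[] {1, 2, 3, /* OP_INVALID */ -1});

      raf.seek(raf.getFilePointer() - 1); // thread A drops the marker: position 3 (correct)
      raf.seek(raf.getFilePointer() - 1); // thread B repeats the rewind: position 2

      raf.write(7);                       // next edit overwrites the good byte '3'
      System.out.println("file length = " + raf.length()); // still 4: one edit byte lost
    }
    f.delete();
  }
}
{code}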

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

