[ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828879#action_12828879
]
Hadoop QA commented on HDFS-909:
--------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12434606/hdfs-909.txt
against trunk revision 905760.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit
warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/217/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/217/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/217/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/217/console
This message is automatically generated.
> Race condition between rollEditLog or rollFSImage ant FSEditsLog.write operations corrupts
edits log
> -----------------------------------------------------------------------------------------------------
>
> Key: HDFS-909
> URL: https://issues.apache.org/jira/browse/HDFS-909
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
> Environment: CentOS
> Reporter: Cosmin Lehene
> Assignee: Todd Lipcon
> Priority: Blocker
> Fix For: 0.21.0, 0.22.0
>
> Attachments: hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt, hdfs-909.txt
>
>
> closing the edits log file can race with write to edits log file operation resulting
in OP_INVALID end-of-file marker being initially overwritten by the concurrent (in setReadyToFlush)
threads and then removed twice from the buffer, losing a good byte from edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> FSEditLog.purgeEditLog()
-> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() ->EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> FSEditLog.purgeEditLog()
-> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() ->EditLogOutputStream.flush()
-> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush()
-> EditLogFileOutputStream.flushAndSync()
> OR
> Any FSEditLog.write
> {code}
> Access on the edits flush operations is synchronized only in the FSEdits.logSync() method
level. However at a lower level access to EditsLogOutputStream setReadyToFlush(), flush()
or flushAndSync() is NOT synchronized. These can be called from concurrent threads like in
the example above
> So if a rollEditLog or rollFSIMage is happening at the same time with a write operation
it can race for EditLogFileOutputStream.setReadyToFlush that will overwrite the the last byte
(normally the FSEditsLog.OP_INVALID which is the "end-of-file marker") and then remove it
twice (from each thread) in flushAndSync()! Hence there will be a valid byte missing from
the edits log that leads to a SecondaryNameNode silent failure and a full HDFS failure upon
cluster restart.
> We got to this point after investigating a corrupted edits file that made HDFS unable
to start with
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is
768.
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
> {code}
> EDIT: moved the logs to a comment to make this readable
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
|