hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-909) Race condition between rollEditLog or rollFSImage ant FSEditsLog.write operations corrupts edits log
Date Tue, 02 Feb 2010 03:36:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828428#action_12828428
] 

Todd Lipcon commented on HDFS-909:
----------------------------------

bq. May be we should synchronize on the EditLogFileOutputStream level, then flushAndSync()
and setReadyToFlush() will not intervene with each other?

If we do that, isn't there a worry that each stream could flush at a slightly different point?
If we do HDFS-903 and add some kind of checksumming/verification to the edit logs (and we
should)
then we rely on them rolling together. (Maybe I'm misunderstanding the locking you're suggesting?)

bq. In TestEditLogRace you call FSEditLog methods directly avoiding the FSNamesystem layer.
Would it be better to involve the FSNamesystem lock into the picture?

I don't think the namesystem lock really matters, since all the write operations are separated
into an internalFoo()
which does the work while synchronized, followed by the logSync unsynchronized. I was trying
to keep the test
as close to a "unit test" as possible though I did have to use a minicluster anyway.

bq. I fear there is a similar race here, which it does not solve...

Let me add a unit test for this one as well and see what I come up with. I had forgotten about
saveNamespace. Thanks

> Race condition between rollEditLog or rollFSImage ant FSEditsLog.write operations  corrupts
edits log
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-909
>                 URL: https://issues.apache.org/jira/browse/HDFS-909
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
>         Environment: CentOS
>            Reporter: Cosmin Lehene
>            Assignee: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hdfs-909-unittest.txt, hdfs-909.txt
>
>
> closing the edits log file can race with write to edits log file operation resulting
in OP_INVALID end-of-file marker being initially overwritten by the concurrent (in setReadyToFlush)
threads and then removed twice from the buffer, losing a good byte from edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> FSEditLog.purgeEditLog()
-> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() ->EditLogOutputStream.setReadyToFlush()

> FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> FSEditLog.purgeEditLog()
-> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() ->EditLogOutputStream.flush()
-> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush()
-> EditLogFileOutputStream.flushAndSync()
> OR 
> Any FSEditLog.write
> {code}
> Access on the edits flush operations is synchronized only in the FSEdits.logSync() method
level. However at a lower level access to EditsLogOutputStream setReadyToFlush(), flush()
or flushAndSync() is NOT synchronized. These can be called from concurrent threads like in
the example above
> So if a rollEditLog or rollFSIMage is happening at the same time with a write operation
it can race for EditLogFileOutputStream.setReadyToFlush that will overwrite the the last byte
(normally the FSEditsLog.OP_INVALID which is the "end-of-file marker") and then remove it
twice (from each thread) in flushAndSync()! Hence there will be a valid byte missing from
the edits log that leads to a SecondaryNameNode silent failure and a full HDFS failure upon
cluster restart. 
> We got to this point after investigating a corrupted edits file that made HDFS unable
to start with 
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is
768. 
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
> {code}
> In the edits file we found the first 2 entries:
> {code:title=edits}
> FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F7570010100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D3132363130383236333138363902
> ...
> {code}
> This is the completeFile operation that's missing the last byte 
> {code:title=completeFile}
> FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F757001??
> {code}
> followed by a rename operation
> {code:Title=rename}
> 0100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D31323631303832363331383639
> {code}
> The first byte of the rename was instead read as part of the completeFile() operation.
This resulted in reading the next operation as 0x00 (OP_ADD) followed by an int (length) which
would have been 0x0000030 which is 768 that was read and failed in the following code
> {code:Title=FSEditLog.java}
>  case OP_ADD:
>         case OP_CLOSE: {
>           // versions > 0 support per file replication
>           // get name and replication
>           int length = in.readInt();
>           if (-7 == logVersion && length != 3||
>               -17 < logVersion && logVersion < -7 && length !=
4 ||
>               logVersion <= -17 && length != 5) {
>               throw new IOException("Incorrect data format."  +
>                                     " logVersion is " + logVersion +
>                                     " but writables.length is " +
>                                     length + ". ");
> {code}
> Filling the missing byte with a value resulted in correct interpretation of the 0x01
(OP_RENAME) and correct parsing for the rest of the edits file (>1MB)
> This theory is also supported by the NameNode log file from the date the corruption took
place:
> {code:title=namenode.log}
> 2009-12-17 12:43:51,276 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll
Edit Log from 10.72.17.162
> 2009-12-17 12:43:51,338 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.165:50010 is added to blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW],
ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
> 2009-12-17 12:43:51,339 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.166:50010 is added to blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW],
ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
> 2009-12-17 12:43:51,342 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock:
/hbase/demo__users/compaction.dir/364051646/8328148719070317279. blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]}
> 2009-12-17 12:43:51,352 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.167:50010 is added to blk_-480585673051114658_235109{blockUCState=COMMITTED,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW],
ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
> 2009-12-17 12:43:51,833 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.163:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
> 2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.164:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
> 2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.166:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
> 2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile:
file /hbase/demo__users/compaction.dir/364051646/8328148719070317279 is closed by DFSClient_-1989779667
> 2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll
FSImage from 10.72.17.162
> 2009-12-17 12:43:51,870 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=hadoop,hadoop       ip=/10.72.17.166        cmd=rename      src=/hbase/demo__users/compaction.dir/364051646/8328148719070317279
    dst=/hbase/demo__users/364051646/test/6917381288304474386       perm=hadoop:supergroup:rw-r--r--
> {code}
> The last 3 entries show a completeFile operation followed by a rollFSIMage operation
followed by a rename operation. This is where most probably the race condition took place.
> Synchronizing access to EditLogOutputStream could fix the problem, however other race
or deadlocks may still occur. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message