hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage ant FSEditsLog.write operations corrupts edits log
Date Tue, 02 Feb 2010 04:02:20 GMT

     [ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Todd Lipcon updated HDFS-909:
-----------------------------

    Attachment: hdfs-909.txt

This patch adds a unit test for saveNamespace concurrent with edit ops.

There is no race here because createEditLogFile is creating a new output stream, and thus
won't race with any of the existing streams. The edit log isn't mutated until the image has
been dumped and then the logs are rolled, so it's the same rollEditLogs fixed originally.

As for the danger of future changes missing the subtlety, I added a giant comment at the top
of FSEditLog in the hopes that it will warn people. A different design with multiple locks
(one protecting the buffers, one protecting sync and stream ops) might make it harder to make
this mistake, but I think what we have isn't so bad as long as it's documented.

> Race condition between rollEditLog or rollFSImage ant FSEditsLog.write operations  corrupts
edits log
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-909
>                 URL: https://issues.apache.org/jira/browse/HDFS-909
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
>         Environment: CentOS
>            Reporter: Cosmin Lehene
>            Assignee: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt
>
>
> closing the edits log file can race with write to edits log file operation resulting
in OP_INVALID end-of-file marker being initially overwritten by the concurrent (in setReadyToFlush)
threads and then removed twice from the buffer, losing a good byte from edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> FSEditLog.purgeEditLog()
-> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() ->EditLogOutputStream.setReadyToFlush()

> FSNameSystem.rollFSImage() -> FSIMage.rollFSImage() -> FSEditLog.purgeEditLog()
-> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() ->EditLogOutputStream.flush()
-> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush()
-> EditLogFileOutputStream.flushAndSync()
> OR 
> Any FSEditLog.write
> {code}
> Access on the edits flush operations is synchronized only in the FSEdits.logSync() method
level. However at a lower level access to EditsLogOutputStream setReadyToFlush(), flush()
or flushAndSync() is NOT synchronized. These can be called from concurrent threads like in
the example above
> So if a rollEditLog or rollFSIMage is happening at the same time with a write operation
it can race for EditLogFileOutputStream.setReadyToFlush that will overwrite the the last byte
(normally the FSEditsLog.OP_INVALID which is the "end-of-file marker") and then remove it
twice (from each thread) in flushAndSync()! Hence there will be a valid byte missing from
the edits log that leads to a SecondaryNameNode silent failure and a full HDFS failure upon
cluster restart. 
> We got to this point after investigating a corrupted edits file that made HDFS unable
to start with 
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is
768. 
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
> {code}
> In the edits file we found the first 2 entries:
> {code:title=edits}
> FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F7570010100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D3132363130383236333138363902
> ...
> {code}
> This is the completeFile operation that's missing the last byte 
> {code:title=completeFile}
> FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F757001??
> {code}
> followed by a rename operation
> {code:Title=rename}
> 0100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D31323631303832363331383639
> {code}
> The first byte of the rename was instead read as part of the completeFile() operation.
This resulted in reading the next operation as 0x00 (OP_ADD) followed by an int (length) which
would have been 0x0000030 which is 768 that was read and failed in the following code
> {code:Title=FSEditLog.java}
>  case OP_ADD:
>         case OP_CLOSE: {
>           // versions > 0 support per file replication
>           // get name and replication
>           int length = in.readInt();
>           if (-7 == logVersion && length != 3||
>               -17 < logVersion && logVersion < -7 && length !=
4 ||
>               logVersion <= -17 && length != 5) {
>               throw new IOException("Incorrect data format."  +
>                                     " logVersion is " + logVersion +
>                                     " but writables.length is " +
>                                     length + ". ");
> {code}
> Filling the missing byte with a value resulted in correct interpretation of the 0x01
(OP_RENAME) and correct parsing for the rest of the edits file (>1MB)
> This theory is also supported by the NameNode log file from the date the corruption took
place:
> {code:title=namenode.log}
> 2009-12-17 12:43:51,276 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll
Edit Log from 10.72.17.162
> 2009-12-17 12:43:51,338 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.165:50010 is added to blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW],
ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
> 2009-12-17 12:43:51,339 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.166:50010 is added to blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW],
ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
> 2009-12-17 12:43:51,342 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock:
/hbase/demo__users/compaction.dir/364051646/8328148719070317279. blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]}
> 2009-12-17 12:43:51,352 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.167:50010 is added to blk_-480585673051114658_235109{blockUCState=COMMITTED,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW],
ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
> 2009-12-17 12:43:51,833 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.163:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
> 2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.164:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
> 2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.166:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
> 2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile:
file /hbase/demo__users/compaction.dir/364051646/8328148719070317279 is closed by DFSClient_-1989779667
> 2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll
FSImage from 10.72.17.162
> 2009-12-17 12:43:51,870 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=hadoop,hadoop       ip=/10.72.17.166        cmd=rename      src=/hbase/demo__users/compaction.dir/364051646/8328148719070317279
    dst=/hbase/demo__users/364051646/test/6917381288304474386       perm=hadoop:supergroup:rw-r--r--
> {code}
> The last 3 entries show a completeFile operation followed by a rollFSIMage operation
followed by a rename operation. This is where most probably the race condition took place.
> Synchronizing access to EditLogOutputStream could fix the problem, however other race
or deadlocks may still occur. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message