hadoop-hdfs-issues mailing list archives

From "Cosmin Lehene (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-909) Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
Date Tue, 02 Feb 2010 08:14:21 GMT

     [ https://issues.apache.org/jira/browse/HDFS-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HDFS-909:
-------------------------------

    Description: 
Closing the edits log file can race with a write to the edits log file, resulting in the
OP_INVALID end-of-file marker first being overwritten by the concurrent threads (in setReadyToFlush)
and then removed twice from the buffer, losing a good byte from the edits log.

Example:
{code}
FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.setReadyToFlush()
FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
OR
FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() ->
FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()

FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() ->
FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush()
-> EditLogFileOutputStream.flushAndSync()

VERSUS

FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush()
FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush() ->
EditLogFileOutputStream.flushAndSync()
OR 
Any FSEditLog.write
{code}

Access to the edits flush operations is synchronized only at the FSEditLog.logSync() method
level. However, at a lower level, access to EditLogOutputStream's setReadyToFlush(), flush(),
and flushAndSync() is NOT synchronized; these can be called from concurrent threads, as in
the example above.
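
To make the gap concrete, here is a minimal sketch of the locking structure described above, using simplified, hypothetical names (SketchEditLog/SketchStream stand in for FSEditLog/EditLogOutputStream; this is not the actual Hadoop code): the sync path serializes on one monitor, while the roll/close path can reach the very same stream without holding it.

{code:title=LockingGapSketch.java (hypothetical illustration)}
// Simplified sketch of the locking gap described above -- not the Hadoop classes.
class SketchStream {                       // stands in for EditLogOutputStream
    void setReadyToFlush() { /* append end-of-file marker, swap buffers */ }
    void flush()           { /* flushAndSync(): write the buffer, drop the marker */ }
}

class SketchEditLog {                      // stands in for FSEditLog
    private final SketchStream stream = new SketchStream();

    synchronized void logSync() {          // writers serialize on this monitor...
        stream.setReadyToFlush();
        stream.flush();
    }

    void closeStream() {                   // ...but the roll/close path can call the
        stream.setReadyToFlush();          // same stream without holding that monitor,
        stream.flush();                    // so the two call sequences can interleave.
    }
}
{code}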

So if a rollEditLog or rollFSImage happens at the same time as a write operation, the two
can race in EditLogFileOutputStream.setReadyToFlush, which overwrites the last byte
(normally FSEditLog.OP_INVALID, the "end-of-file marker") and then removes it twice
(once from each thread) in flushAndSync()! Hence a valid byte is missing from the edits
log, which leads to a silent SecondaryNameNode failure and a full HDFS failure upon
cluster restart.
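
The byte-level effect can be replayed deterministically. The following single-threaded sketch is purely illustrative (hypothetical class, not EditLogFileOutputStream): both racing callers append the marker to the same snapshot of the buffer, so the second append effectively overwrites the first, and the two flushAndSync()-style removals then strip the marker plus one valid byte.

{code:title=EditsMarkerRaceSketch.java (hypothetical illustration)}
import java.util.Arrays;

public class EditsMarkerRaceSketch {
    static final byte OP_INVALID = -1;        // end-of-file marker
    static byte[] buf = {0x09, 0x01};         // trailing bytes of two valid records

    // Lost update: each racing caller appends the marker to ITS snapshot of buf.
    static byte[] appendMarker(byte[] snapshot) {
        byte[] out = Arrays.copyOf(snapshot, snapshot.length + 1);
        out[snapshot.length] = OP_INVALID;
        return out;
    }

    // After syncing, drop what is assumed to be the single trailing marker.
    static void removeTrailingByte() { buf = Arrays.copyOf(buf, buf.length - 1); }

    public static void main(String[] args) {
        byte[] rollSnapshot = buf;            // roll path enters setReadyToFlush()
        byte[] syncSnapshot = buf;            // logSync path enters setReadyToFlush()
        buf = appendMarker(rollSnapshot);     // roll path publishes buf + marker
        buf = appendMarker(syncSnapshot);     // logSync path overwrites it (same content)
        removeTrailingByte();                 // roll path's flushAndSync()
        removeTrailingByte();                 // logSync path's flushAndSync()
        System.out.println(Arrays.toString(buf));  // prints [9]: the valid 0x01 byte is gone
    }
}
{code}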

We got to this point after investigating a corrupted edits file that made HDFS unable to start,
failing with:

{code:title=namenode.log}
java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is 768.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
{code}

EDIT: moved the logs to a comment to make this readable


  was:
Closing the edits log file can race with a write to the edits log file, resulting in the
OP_INVALID end-of-file marker first being overwritten by the concurrent threads (in setReadyToFlush)
and then removed twice from the buffer, losing a good byte from the edits log.

Example:
{code}
FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.setReadyToFlush()
FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
OR
FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() ->
FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()

FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog() ->
FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush()
-> EditLogFileOutputStream.flushAndSync()

VERSUS

FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush()
FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush() ->
EditLogFileOutputStream.flushAndSync()
OR 
Any FSEditLog.write
{code}

Access to the edits flush operations is synchronized only at the FSEditLog.logSync() method
level. However, at a lower level, access to EditLogOutputStream's setReadyToFlush(), flush(),
and flushAndSync() is NOT synchronized; these can be called from concurrent threads, as in
the example above.

So if a rollEditLog or rollFSImage happens at the same time as a write operation, the two
can race in EditLogFileOutputStream.setReadyToFlush, which overwrites the last byte
(normally FSEditLog.OP_INVALID, the "end-of-file marker") and then removes it twice
(once from each thread) in flushAndSync()! Hence a valid byte is missing from the edits
log, which leads to a silent SecondaryNameNode failure and a full HDFS failure upon
cluster restart.

We got to this point after investigating a corrupted edits file that made HDFS unable to start,
failing with:

{code:title=namenode.log}
java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is 768.

        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
{code}

In the edits file we found the first 2 entries:
{code:title=edits}
FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F7570010100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D3132363130383236333138363902
...
{code}
This is the completeFile operation that's missing the last byte 

{code:title=completeFile}
FFFFFFEC0900000005003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F38333238313438373139303730333137323739000133000D31323631303832363331383335000D313236313038323632383039340008363731303838363400000003F6CBB87EF376E3E600000000040000000000000000039665F9549DE069A5735E00000000040000000000000000039665ADCC71A050B16ABF00000000015A179A0000000000039665066861646F6F700A737570657267726F757001??
{code}
followed by a rename operation

{code:title=rename}
0100000003003F2F68626173652F64656D6F5F5F75736572732F636F6D70616374696F6E2E6469722F3336343035313634362F3833323831343837313930373033313732373900352F68626173652F64656D6F5F5F75736572732F3336343035313634362F746573742F36393137333831323838333034343734333836000D31323631303832363331383639
{code}

The first byte of the rename was instead read as part of the completeFile() operation. As a
result, the next operation was read as 0x00 (OP_ADD), followed by an int (length) of
0x00000300, i.e. 768, which was read and then rejected by the following code:

{code:title=FSEditLog.java}
 case OP_ADD:
        case OP_CLOSE: {
          // versions > 0 support per file replication
          // get name and replication
          int length = in.readInt();
          if (-7 == logVersion && length != 3||
              -17 < logVersion && logVersion < -7 && length != 4 ||
              logVersion <= -17 && length != 5) {
              throw new IOException("Incorrect data format."  +
                                    " logVersion is " + logVersion +
                                    " but writables.length is " +
                                    length + ". ");
{code}
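
The shifted read can be checked with a tiny stand-alone snippet (illustration only, not Hadoop code), feeding it the bytes that follow once the rename record's leading 0x01 opcode has been swallowed:

{code:title=ShiftedParseExample.java (illustration)}
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class ShiftedParseExample {
    public static void main(String[] args) throws IOException {
        // First bytes of the rename record after its leading 0x01 opcode has been
        // consumed as the missing last byte of the completeFile record.
        byte[] shifted = {0x00, 0x00, 0x00, 0x03, 0x00};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(shifted));
        byte opcode = in.readByte();   // 0x00 -> misread as OP_ADD
        int length  = in.readInt();    // 0x00000300 = 768 -> trips the length check above
        System.out.println("opcode=" + opcode + ", length=" + length);
    }
}
{code}

This prints "opcode=0, length=768", matching the "writables.length is 768" value in the exception above.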

Filling in the missing byte with a value resulted in correct interpretation of the 0x01 (OP_RENAME)
and correct parsing of the rest of the edits file (>1MB).

This theory is also supported by the NameNode log file from the date the corruption took place:

{code:title=namenode.log}
2009-12-17 12:43:51,276 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit
Log from 10.72.17.162
2009-12-17 12:43:51,338 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.165:50010 is added to blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW],
ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
2009-12-17 12:43:51,339 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.166:50010 is added to blk_-480585673051114658_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW],
ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
2009-12-17 12:43:51,342 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock:
/hbase/demo__users/compaction.dir/364051646/8328148719070317279. blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]}
2009-12-17 12:43:51,352 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.167:50010 is added to blk_-480585673051114658_235109{blockUCState=COMMITTED,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.165:50010|RBW],
ReplicaUnderConstruction[10.72.17.167:50010|RBW]]} size 67108864
2009-12-17 12:43:51,833 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.163:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.164:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498
2009-12-17 12:43:51,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.72.17.166:50010 is added to blk_-5923234476536534337_235109{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.72.17.166:50010|RBW], ReplicaUnderConstruction[10.72.17.164:50010|RBW],
ReplicaUnderConstruction[10.72.17.163:50010|RBW]]} size 22681498





2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile:
file /hbase/demo__users/compaction.dir/364051646/8328148719070317279 is closed by DFSClient_-1989779667
2009-12-17 12:43:51,835 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll FSImage
from 10.72.17.162
2009-12-17 12:43:51,870 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop,hadoop
      ip=/10.72.17.166        cmd=rename      src=/hbase/demo__users/compaction.dir/364051646/8328148719070317279
    dst=/hbase/demo__users/364051646/test/6917381288304474386       perm=hadoop:supergroup:rw-r--r--
{code}

The last 3 entries show a completeFile operation, followed by a rollFSImage operation, followed
by a rename operation. This is most probably where the race condition took place.
Synchronizing access to EditLogOutputStream could fix the problem; however, other races or
deadlocks may still occur.
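
For completeness, one possible shape of such synchronization is sketched below. It only illustrates the idea in the paragraph above with hypothetical names (StreamLike stands in for EditLogOutputStream); the actual fix in the attached patches may well take a different approach.

{code:title=SynchronizedCloseSketch.java (illustrative only)}
import java.io.IOException;

// Not the actual fix -- just the idea from the paragraph above: take one common
// monitor around the whole setReadyToFlush()/flush() sequence so the roll/close
// path and the logSync() path cannot interleave on the same stream.
class SynchronizedCloseSketch {
    interface StreamLike {                    // stands in for EditLogOutputStream
        void setReadyToFlush() throws IOException;
        void flush() throws IOException;
    }

    void closeStream(StreamLike eStream) throws IOException {
        synchronized (eStream) {              // logSync() would need the same monitor
            eStream.setReadyToFlush();
            eStream.flush();
        }
    }
}
{code}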

> Race condition between rollEditLog or rollFSImage and FSEditsLog.write operations corrupts edits log
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-909
>                 URL: https://issues.apache.org/jira/browse/HDFS-909
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.20.1, 0.20.2, 0.21.0, 0.22.0
>         Environment: CentOS
>            Reporter: Cosmin Lehene
>            Assignee: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.21.0, 0.22.0
>
>         Attachments: hdfs-909-unittest.txt, hdfs-909.txt, hdfs-909.txt
>
>
> Closing the edits log file can race with a write to the edits log file, resulting
in the OP_INVALID end-of-file marker first being overwritten by the concurrent threads (in setReadyToFlush)
and then removed twice from the buffer, losing a good byte from the edits log.
> Example:
> {code}
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.rollEditLog() -> FSEditLog.divertFileStreams() -> FSEditLog.closeStream()
-> EditLogOutputStream.flush() -> EditLogFileOutputStream.flushAndSync()
> OR
> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog()
-> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.setReadyToFlush()

> FSNameSystem.rollFSImage() -> FSImage.rollFSImage() -> FSEditLog.purgeEditLog()
-> FSEditLog.revertFileStreams() -> FSEditLog.closeStream() -> EditLogOutputStream.flush()
-> EditLogFileOutputStream.flushAndSync()
> VERSUS
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.setReadyToFlush()
> FSNameSystem.completeFile -> FSEditLog.logSync() -> EditLogOutputStream.flush()
-> EditLogFileOutputStream.flushAndSync()
> OR 
> Any FSEditLog.write
> {code}
> Access to the edits flush operations is synchronized only at the FSEditLog.logSync() method
level. However, at a lower level, access to EditLogOutputStream's setReadyToFlush(), flush(),
and flushAndSync() is NOT synchronized; these can be called from concurrent threads, as in
the example above.
> So if a rollEditLog or rollFSImage happens at the same time as a write operation, the two
can race in EditLogFileOutputStream.setReadyToFlush, which overwrites the last byte
(normally FSEditLog.OP_INVALID, the "end-of-file marker") and then removes it twice
(once from each thread) in flushAndSync()! Hence a valid byte is missing from
the edits log, which leads to a silent SecondaryNameNode failure and a full HDFS failure upon
cluster restart. 
> We got to this point after investigating a corrupted edits file that made HDFS unable
to start, failing with:
> {code:title=namenode.log}
> java.io.IOException: Incorrect data format. logVersion is -20 but writables.length is
768. 
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:450
> {code}
> EDIT: moved the logs to a comment to make this readable

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

