hadoop-hdfs-dev mailing list archives

From "Binglin Chang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-4272) Problem in DFSInputStream read retry logic may cause early failure
Date Wed, 05 Dec 2012 15:02:59 GMT
Binglin Chang created HDFS-4272:
-----------------------------------

             Summary: Problem in DFSInputStream read retry logic may cause early failure
                 Key: HDFS-4272
                 URL: https://issues.apache.org/jira/browse/HDFS-4272
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Binglin Chang
            Assignee: Binglin Chang
            Priority: Minor


Assume the following call sequence:
{noformat} 
readWithStrategy()
  -> blockSeekTo()
  -> readBuffer()
     -> reader.doRead()
     -> seekToNewSource() adds currentNode to deadNodes, hoping to get a different datanode
        -> blockSeekTo()
           -> chooseDataNode()
              -> block missing: clears deadNodes and picks currentNode again
        seekToNewSource() returns false
     readBuffer() re-throws the exception and quits the loop
readWithStrategy() gets the exception and may fail the read call before MaxBlockAcquireFailures attempts have been made.
{noformat} 
Some issues with this logic:
1. seekToNewSource() is broken because deadNodes may be cleared in the middle of the call.
2. The variable "int retries=2" in readWithStrategy() seems to conflict with MaxBlockAcquireFailures.
Should it be removed?
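
The first issue can be illustrated with a toy model. This is a minimal sketch, not the actual DFSInputStream code: the class DeadNodeRetrySketch, its string node names ("dn1", "dn2") and the simplified method bodies are all assumptions made for illustration. It only shows how clearing deadNodes inside chooseDataNode() can hand the same node back to seekToNewSource(), which then reports that no new source was found.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model (hypothetical, NOT the real DFSInputStream) of the dead-node handling
// described in the call sequence above.
public class DeadNodeRetrySketch {
    private final List<String> replicas;              // datanodes holding the block
    private final Set<String> deadNodes = new HashSet<>();
    private String currentNode;

    public DeadNodeRetrySketch(List<String> replicas) {
        this.replicas = replicas;
        this.currentNode = replicas.get(0);
    }

    // Simplified chooseDataNode(): return the first replica not marked dead.
    // If every replica is marked dead ("block missing"), clear deadNodes and
    // start over, which can return the node that just failed.
    public String chooseDataNode() {
        for (String node : replicas) {
            if (!deadNodes.contains(node)) {
                return node;
            }
        }
        deadNodes.clear();
        return replicas.get(0);
    }

    // Simplified seekToNewSource(): mark the current node dead, pick again,
    // and report whether a different node was obtained.
    public boolean seekToNewSource() {
        String oldNode = currentNode;
        deadNodes.add(oldNode);
        currentNode = chooseDataNode();
        return !currentNode.equals(oldNode);
    }

    public int deadNodeCount() {
        return deadNodes.size();
    }

    public static void main(String[] args) {
        // One replica: deadNodes is cleared mid-call and the same node is picked again.
        DeadNodeRetrySketch single = new DeadNodeRetrySketch(List.of("dn1"));
        System.out.println("single replica, new source found: " + single.seekToNewSource());

        // Two replicas: a different node is available, so the seek succeeds.
        DeadNodeRetrySketch pair = new DeadNodeRetrySketch(List.of("dn1", "dn2"));
        System.out.println("two replicas, new source found: " + pair.seekToNewSource());
    }
}
```

In the single-replica case the caller sees a failed seek and re-throws the read exception, even though the retry budget that MaxBlockAcquireFailures is meant to enforce has not been used up.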

I wrote a test to reproduce the scenario. Here is part of the log:

{noformat} 
2012-12-05 22:55:15,135 WARN  hdfs.DFSClient (DFSInputStream.java:readBuffer(596)) - Found Checksum error for BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from 127.0.0.1:50099 at 0
2012-12-05 22:55:15,136 INFO  DataNode.clienttrace (BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: /127.0.0.1:50105, bytes: 4128, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: DS-91625336-192.168.0.101-50099-1354719314603, blockid: BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, duration: 2925000
2012-12-05 22:55:15,136 INFO  hdfs.DFSClient (DFSInputStream.java:chooseDataNode(741)) - Could not obtain BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
2012-12-05 22:55:15,136 WARN  hdfs.DFSClient (DFSInputStream.java:chooseDataNode(756)) - DFS chooseDataNode: got # 1 IOException, will wait for 274.34891931868265 msec.
2012-12-05 22:55:15,413 INFO  DataNode.clienttrace (BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: /127.0.0.1:50106, bytes: 4128, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: DS-91625336-192.168.0.101-50099-1354719314603, blockid: BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, duration: 283000
2012-12-05 22:55:15,414 INFO  hdfs.StateChange (FSNamesystem.java:reportBadBlocks(4761)) - *DIR* reportBadBlocks
2012-12-05 22:55:15,415 INFO  BlockStateChange (CorruptReplicasMap.java:addToCorruptReplicasMap(66)) - BLOCK NameSystem.addToCorruptReplicasMap: blk_-705068286766485620 added as corrupt on 127.0.0.1:50099 by null because client machine reported it
2012-12-05 22:55:15,416 INFO  hdfs.TestClientReportBadBlock (TestDFSInputStream.java:testDFSInputStreamReadRetryTime(94)) - catch IOExceptionorg.apache.hadoop.fs.ChecksumException: Checksum error: /testFile at 0 exp: 809972010 got: -1374622118
2012-12-05 22:55:15,431 INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1411)) - Shutting down the Mini HDFS Cluster
{noformat} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
