From "Xiao Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6532) Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
Date Tue, 04 Oct 2016 22:49:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546919#comment-15546919 ]

Xiao Chen commented on HDFS-6532:
---------------------------------

Looked more into this. In the failed cases, we see the following (copied from the 'select_timeout' attachment):
{noformat}
2016-10-04 22:10:24,365 INFO  hdfs.DFSOutputStream (DFSOutputStream.java:run(1114)) - ======
 
java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[closed]. 28459 millis timeout left.
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2247)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1015)
{noformat}

And in the successful cases, we see:
{noformat}
2016-10-04 15:13:15,271 INFO  hdfs.DFSOutputStream (DFSOutputStream.java:run(1116)) - ======
 
java.io.IOException: Bad response ERROR for block BP-1283991366-172.16.3.181-1475619192335:blk_1073741825_1001 from datanode DatanodeInfoWithStorage[127.0.0.1:61321,DS-720243dd-55b6-49ef-ae55-4462e20260d5,DISK]
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1053)
{noformat}
and 
{noformat}
2016-10-04 15:13:16,084 INFO  hdfs.DFSOutputStream (DFSOutputStream.java:run(1116)) - ======
 
java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2249)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1017)
{noformat}
I printed the exception from [this line|https://github.com/apache/hadoop/blob/44f48ee96ee6b2a3909911c37bfddb0c963d5ffc/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1149].
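(The extra logging was just an info-level dump of whatever exception ends the responder; the following is a hypothetical reconstruction of that temporary diagnostic, with an illustrative stand-in class and logger, not the actual change:)
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Hypothetical stand-in for the temporary diagnostic added to the catch
// block of ResponseProcessor.run(): log the exception that ends the
// responder so failed and successful runs can be compared side by side.
class ResponderExitDiagnostics {
  private static final Log LOG = LogFactory.getLog(ResponderExitDiagnostics.class);

  static void logResponderExit(Exception e) {
    LOG.info("======", e); // produces the "======" lines in the logs above
  }
}
{code}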

So in the failed cases, the responder keeps running in [this loop|https://github.com/apache/hadoop/blob/44f48ee96ee6b2a3909911c37bfddb0c963d5ffc/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L708] until the following exception is thrown:
{noformat}
2016-10-04 22:36:40,403 INFO  datanode.DataNode (BlockReceiver.java:receiveBlock(941)) - Exception for BP-2046749708-172.17.0.1-1475620536833:blk_1073741826_1005
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:42956 remote=/127.0.0.1:56324]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:502)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:900)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:802)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
        at java.lang.Thread.run(Thread.java:745)
2016-10-04 22:36:40,469 INFO  hdfs.DFSOutputStream (DFSOutputStream.java:run(1116)) - ======
 
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2249)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:235)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:1017)
{noformat}
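In other words, the responder is parked in a blocking socket read, and in these runs it is only unblocked when the datanode's 60-second read timeout tears the connection down. A minimal standalone illustration of that unblock-by-peer-close mechanism (plain Java, not Hadoop code; the 1-second close below stands in for the datanode's timeout):
{code}
import java.io.DataInputStream;
import java.io.EOFException;
import java.net.ServerSocket;
import java.net.Socket;

// Standalone illustration, not Hadoop code: a reader blocked on a socket
// only wakes up when the peer closes the connection, at which point the
// read fails -- here with EOFException, just like PipelineAck.readFields()
// in the logs above.
public class BlockedAckReader {
  public static void main(String[] args) throws Exception {
    try (ServerSocket server = new ServerSocket(0);
         Socket client = new Socket("127.0.0.1", server.getLocalPort());
         Socket peer = server.accept()) {
      // Simulate the datanode's read timeout firing and closing its end
      // of the pipeline (scaled down from 60 s to 1 s here).
      new Thread(() -> {
        try {
          Thread.sleep(1000);
          peer.close();
        } catch (Exception ignored) {
        }
      }).start();
      DataInputStream in = new DataInputStream(client.getInputStream());
      in.readInt(); // blocks; only the peer's close unblocks it
    } catch (EOFException e) {
      System.out.println("Reader unblocked only by peer close: " + e);
    }
  }
}
{code}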
Once that exception is thrown, the {{close}} call correctly returns and the test passes. I'm not sure how we could interrupt the client any earlier in this case. Since there's no impact on correctness, maybe we should just extend the test timeout, along the lines of the sketch below. [~kihwal], could you share your thoughts on this? Thanks a lot.
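
For concreteness, extending the timeout would just mean bumping the JUnit annotation on the test. A hypothetical sketch, not the actual patch: 50000 ms is the current value from the stack trace below, and 120000 ms is an illustrative replacement chosen to outlive the datanode's 60-second read timeout.
{code}
import org.junit.Test;

public class TestCrcCorruptionTimeoutSketch {
  // Hypothetical sketch: raise the JUnit timeout past the datanode's
  // 60 s socket read timeout so that close() has time to return once
  // the pipeline is torn down.
  @Test(timeout = 120000) // was: timeout = 50000
  public void testCorruptionDuringWrt() throws Exception {
    // ... original test body unchanged ...
  }
}
{code}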

> Intermittent test failure org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6532
>                 URL: https://issues.apache.org/jira/browse/HDFS-6532
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.4.0
>            Reporter: Yongjun Zhang
>            Assignee: Yiqun Lin
>         Attachments: HDFS-6532.001.patch, HDFS-6532.002.patch, PreCommit-HDFS-Build #16770 test - testCorruptionDuringWrt [Jenkins].pdf, TEST-org.apache.hadoop.hdfs.TestCrcCorruption-select_timeout.xml, TEST-org.apache.hadoop.hdfs.TestCrcCorruption.xml
>
>
> Per https://builds.apache.org/job/Hadoop-Hdfs-trunk/1774/testReport, we had the following failure. A local rerun was successful.
> {code}
> Regression
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> Failing for the past 1 build (Since Failed#1774 )
> Took 50 sec.
> Error Message
> test timed out after 50000 milliseconds
> Stacktrace
> java.lang.Exception: test timed out after 50000 milliseconds
> 	at java.lang.Object.wait(Native Method)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2024)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2008)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2107)
> 	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
> 	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:98)
> 	at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:133)
> {code}
> See the relevant exceptions in the log:
> {code}
> 2014-06-14 11:56:15,283 WARN  datanode.DataNode (BlockReceiver.java:verifyChunks(404)) - Checksum error in block BP-1675558312-67.195.138.30-1402746971712:blk_1073741825_1001 from /127.0.0.1:41708
> org.apache.hadoop.fs.ChecksumException: Checksum error: DFSClient_NONMAPREDUCE_-1139495951_8 at 64512 exp: 1379611785 got: -12163112
> 	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:353)
> 	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:284)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:402)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:537)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
> 	at java.lang.Thread.run(Thread.java:662)
> 2014-06-14 11:56:15,285 WARN  datanode.DataNode (BlockReceiver.java:run(1207)) - IOException in BlockReceiver.run():
> java.io.IOException: Shutting down writer and responder due to a checksum error in received data. The error response has been sent upstream.
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1352)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1278)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1199)
> 	at java.lang.Thread.run(Thread.java:662)
> ...
> {code}


