hadoop-hdfs-issues mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HDFS-148) timeout when writing dfs file causes infinite loop when closing the file
Date Fri, 21 May 2010 17:35:16 GMT

     [ https://issues.apache.org/jira/browse/HDFS-148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Carey updated HDFS-148:
-----------------------------

    Affects Version/s: 0.20.2
             Priority: Critical  (was: Major)

This is affecting me on CDH2: mappers fail to finish, spinning infinitely until the (in my case very long) task timeout.

Speculative execution doesn't work around it, because the framework thinks the task is 'done', yet the reducers are all still waiting for the signal to copy the last bit of output.

This is a rather nasty bug.

The stack trace my infinite loop is spinning on:

{noformat}
"main" prio=10 tid=0x0000000041778000 nid=0x9a5 waiting on condition [0x0000000041078000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3276)
	- locked <0x00002aaac3c41498> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3197)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
{noformat}
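
That sleep is the complete-file retry: close() keeps polling the namenode to ask whether the last block of the file is complete, and never gives up. A minimal sketch of the pattern the trace suggests (the interface, names, and the 400ms constant are illustrative guesses, inferred from the ~400ms spacing of the log lines below, not copied from the Hadoop source):

{code:java}
import java.io.IOException;

// Illustrative stand-in for the namenode RPC the client polls; not the real API.
interface NamenodeStub {
    boolean complete(String src, String clientName) throws IOException;
}

class CompleteFileLoopSketch {
    // An unbounded complete-file retry: no retry cap and no deadline.
    static void completeFile(NamenodeStub namenode, String src, String clientName)
            throws IOException {
        boolean fileComplete = false;
        while (!fileComplete) {
            fileComplete = namenode.complete(src, clientName);
            if (!fileComplete) {
                System.out.println("Could not complete file " + src + " retrying...");
                try {
                    Thread.sleep(400); // would match the ~400ms gap between log lines
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        // If the namenode never reports the last block complete (e.g. after a
        // botched pipeline recovery), nothing ever breaks out of this loop.
    }
}
{code}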

The repeating log lines are:

{noformat}
2010-05-21 09:42:57,405 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /tmp/temp-716117623/tmp-542952532/_temporary/_attempt_201005191706_0492_m_000072_0/part-m-00072 retrying...
2010-05-21 09:42:57,807 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /tmp/temp-716117623/tmp-542952532/_temporary/_attempt_201005191706_0492_m_000072_0/part-m-00072 retrying...
2010-05-21 09:42:58,209 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /tmp/temp-716117623/tmp-542952532/_temporary/_attempt_201005191706_0492_m_000072_0/part-m-00072 retrying...
2010-05-21 09:42:58,611 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /tmp/temp-716117623/tmp-542952532/_temporary/_attempt_201005191706_0492_m_000072_0/part-m-00072 retrying...
{noformat}
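
Until the loop itself is bounded in DFSClient, the only client-side mitigation I can think of is to put a hard deadline around close() so the task fails fast instead of eating the whole task timeout. A rough sketch (entirely hypothetical, the helper and its names are mine, not anything in Hadoop):

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedClose {
    // Run out.close() on a worker thread and give up after the deadline, so a
    // wedged DFSOutputStream.closeInternal() becomes a visible failure instead
    // of a silent spin.
    static void closeWithDeadline(final OutputStream out, long timeout, TimeUnit unit)
            throws IOException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Void> done = executor.submit(new Callable<Void>() {
                public Void call() throws IOException {
                    out.close(); // may never return if the retry loop spins forever
                    return null;
                }
            });
            done.get(timeout, unit);
        } catch (TimeoutException te) {
            throw new IOException("close() did not finish within the deadline", te);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for close()", ie);
        } catch (ExecutionException ee) {
            throw new IOException("close() failed", ee.getCause());
        } finally {
            executor.shutdownNow(); // caveat: an abandoned close thread may leak
        }
    }
}
{code}

That only converts the hang into an error, of course; presumably the real fix is a retry cap (or hard failure) inside the complete-file loop itself.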


The task attempt log leading up to the infinite loop was:

{noformat}
2010-05-21 09:33:15,145 INFO org.apache.hadoop.mapred.MapTask: kvstart = 335542; kvend = 167769; length = 838860
2010-05-21 09:34:29,210 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-2993127957611759033_4581816 java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.102.0.211:37961 remote=/10.102.0.211:50010]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
	at java.io.DataInputStream.readFully(DataInputStream.java:178)
	at java.io.DataInputStream.readLong(DataInputStream.java:399)
	at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:119)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2410)

2010-05-21 09:34:57,055 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-2993127957611759033_4581816 bad datanode[0] 10.102.0.211:50010
2010-05-21 09:34:57,055 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-2993127957611759033_4581816 in pipeline 10.102.0.211:50010, 10.102.0.214:50010, 10.102.0.207:50010: bad datanode 10.102.0.211:50010
2010-05-21 09:34:58,406 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
2010-05-21 09:35:18,471 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-05-21 09:35:18,472 INFO org.apache.hadoop.mapred.MapTask: bufstart = 176220966; bufend = 220167829; bufvoid = 255013696
2010-05-21 09:35:18,472 INFO org.apache.hadoop.mapred.MapTask: kvstart = 167769; kvend = 838857; length = 838860
2010-05-21 09:35:25,542 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4
2010-05-21 09:35:40,440 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2010-05-21 09:35:40,440 INFO org.apache.hadoop.mapred.MapTask: bufstart = 220167829; bufend = 9073254; bufvoid = 255013644
2010-05-21 09:35:40,440 INFO org.apache.hadoop.mapred.MapTask: kvstart = 838857; kvend = 671084; length = 838860
2010-05-21 09:35:48,957 INFO org.apache.hadoop.mapred.MapTask: Finished spill 5
2010-05-21 09:35:56,910 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /tmp/temp-716117623/tmp-542952532/_temporary/_attempt_201005191706_0492_m_000072_0/part-m-00072 retrying...
{noformat}


> timeout when writing dfs file causes infinite loop when closing the file
> ------------------------------------------------------------------------
>
>                 Key: HDFS-148
>                 URL: https://issues.apache.org/jira/browse/HDFS-148
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: Nigel Daley
>            Assignee: Sameer Paranjpye
>            Priority: Critical
>
> If, when writing to a dfs file, I get a timeout exception:
> 06/11/29 11:16:05 WARN fs.DFSClient: Error while writing.
> java.net.SocketTimeoutException: timed out waiting for rpc response
>        at org.apache.hadoop.ipc.Client.call(Client.java:469)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:164)
>        at org.apache.hadoop.dfs.$Proxy0.reportWrittenBlock(Unknown Source)
>        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.internalClose(DFSClient.java:1220)
>        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1175)
>        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.flush(DFSClient.java:1121)
>        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.write(DFSClient.java:1103)
>        at org.apache.hadoop.examples.NNBench2.createWrite(NNBench2.java:107)
>        at org.apache.hadoop.examples.NNBench2.main(NNBench2.java:247)
> then the close() operation on the file appears to go into an infinite loop of retrying:
> 06/11/29 13:11:19 INFO fs.DFSClient: Could not complete file, retrying...
> 06/11/29 13:11:20 INFO fs.DFSClient: Could not complete file, retrying...
> 06/11/29 13:11:21 INFO fs.DFSClient: Could not complete file, retrying...
> 06/11/29 13:11:23 INFO fs.DFSClient: Could not complete file, retrying...
> 06/11/29 13:11:24 INFO fs.DFSClient: Could not complete file, retrying...
> ...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

