hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: DFSClient: Could not complete file
Date Tue, 29 Mar 2011 23:53:49 GMT
Hi Chris,

One thing we've found helpful on ext3 is examining your I/O scheduler.  Make sure it's
set to "deadline", not "CFQ".  This will help prevent nodes from being overloaded; when a
"du -sk" runs while the node is already overloaded, things quickly roll downhill.
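
If it helps, the scheduler can be checked and changed per block device through
sysfs (sdb below is just an example; substitute whatever devices back your HDFS
data directories):

  # the active scheduler is shown in brackets
  cat /sys/block/sdb/queue/scheduler

  # switch to deadline at runtime (as root); add elevator=deadline to the
  # kernel boot parameters to make the change survive reboots
  echo deadline > /sys/block/sdb/queue/scheduler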

Brian

On Mar 29, 2011, at 11:44 AM, Chris Curtin wrote:

> We are narrowing this down. The last few times it hung, we found a 'du -sk'
> process for each of our HDFS disks as the top users of CPU. They are also
> taking a really long time.
> 
> Searching around, I found one example of someone reporting a similar issue
> with du -sk, but they tied it to XFS. We are using ext3.
> 
> Anyone have any other ideas, since it appears to be related to the 'du' not
> coming back? Note that running the command directly finishes in a few
> seconds.
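> 
> For anyone who wants to compare timings, this is roughly what we run by hand
> (the path is only an example; point it at whatever dfs.data.dir lists):
> 
>   # time the same per-directory summary the DataNode runs itself
>   time du -sk /data1/hadoop/dfs/data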
> 
> Thanks,
> 
> Chris
> 
> On Wed, Mar 16, 2011 at 9:41 AM, Chris Curtin <curtin.chris@gmail.com> wrote:
> 
>> Caught something today I missed before:
>> 
>> 11/03/16 09:32:49 INFO hdfs.DFSClient: Exception in createBlockOutputStream
>> java.io.IOException: Bad connect ack with firstBadLink 10.120.41.105:50010
>> 11/03/16 09:32:49 INFO hdfs.DFSClient: Abandoning block
>> blk_-517003810449127046_10039793
>> 11/03/16 09:32:49 INFO hdfs.DFSClient: Waiting to find target node:
>> 10.120.41.103:50010
>> 11/03/16 09:34:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream
>> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
>> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
>> local=/10.120.41.85:34323 remote=/10.120.41.105:50010]
>> 11/03/16 09:34:04 INFO hdfs.DFSClient: Abandoning block
>> blk_2153189599588075377_10039793
>> 11/03/16 09:34:04 INFO hdfs.DFSClient: Waiting to find target node:
>> 10.120.41.105:50010
>> 11/03/16 09:34:55 INFO hdfs.DFSClient: Could not complete file
>> /tmp/hadoop/mapred/system/job_201103160851_0014/job.jar retrying...
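>> 
>> One note on the 69000 ms figure, in case it is useful: assuming the 0.20
>> defaults, the client's pipeline timeout works out to dfs.socket.timeout plus
>> a 3 second extension per datanode in the pipeline:
>> 
>>   60000 ms (dfs.socket.timeout) + 3 * 3000 ms (3-node pipeline) = 69000 ms
>> 
>> so the connection sat for the full timeout rather than failing fast.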
>> 
>> 
>> 
>> On Wed, Mar 16, 2011 at 9:00 AM, Chris Curtin <curtin.chris@gmail.com> wrote:
>> 
>>> Thanks. I spent a lot of time looking at the logs and found nothing on the
>>> reducers until they start complaining about 'could not complete'.
>>> 
>>> Found this in the jobtracker log file:
>>> 
>>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient:
>>> DFSOutputStream ResponseProcessor exception  for block
>>> blk_3829493505250917008_9959810java.io.IOException: Bad response 1 for block
>>> blk_3829493505250917008_9959810 from datanode 10.120.41.103:50010
>>>        at
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2454)
>>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>> Recovery for block blk_3829493505250917008_9959810 bad datanode[2]
>>> 10.120.41.103:50010
>>> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>> Recovery for block blk_3829493505250917008_9959810 in pipeline
>>> 10.120.41.105:50010, 10.120.41.102:50010, 10.120.41.103:50010: bad
>>> datanode 10.120.41.103:50010
>>> 2011-03-16 02:38:53,133 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>>> complete file
>>> /var/hadoop/tmp/2_20110316_pmta_pipe_2_20_50351_2503122/_logs/history/hadnn01.atlis1_1299879680612_job_201103111641_0312_deliv_2_20110316_pmta_pipe*2_20110316_%5B%281%2F3%29+...QUEUED_T
>>> retrying...
>>> 
>>> Looking at the logs from the various times this happens, the 'from
>>> datanode' in the first message is any of the data nodes (roughly equal in
>>> number of times it fails), so I don't think it is one specific node having
>>> problems.
>>> Any other ideas?
>>> 
>>> Thanks,
>>> 
>>> Chris
>>> On Sun, Mar 13, 2011 at 3:45 AM, icebergs <hkmstu@gmail.com> wrote:
>>> 
>>>> You should check the bad reducers' logs carefully. There may be more
>>>> information about it.
>>>> 
>>>> 2011/3/10 Chris Curtin <curtin.chris@gmail.com>
>>>> 
>>>>> Hi,
>>>>> 
>>>>> The last couple of days we have been seeing tens of thousands of these
>>>>> errors in the logs:
>>>>> 
>>>>> INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
>>>>> /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003
>>>>> retrying...
>>>>> When this is going on, the reducer in question is always the last
>>>>> reducer in a job.
>>>>> 
>>>>> Sometimes the reducer recovers. Sometimes hadoop kills that reducer,
>>>>> runs another and it succeeds. Sometimes hadoop kills the reducer and the
>>>>> new one also fails, so it gets killed and the cluster goes into a loop of
>>>>> kill/launch/kill.
>>>>> 
>>>>> At first we thought it was related to the size of the data being
>>>>> evaluated (4+ GB), but we've seen it several times today on < 100 MB.
>>>>> 
>>>>> Searching here or online doesn't show a lot about what this error means
>>>>> and how to fix it.
>>>>> 
>>>>> We are running 0.20.2, r911707
>>>>> 
>>>>> Any suggestions?
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Chris
>>>>> 
>>>> 
>>> 
>>> 
>> 

