hadoop-common-dev mailing list archives

From: Andrzej Bialecki <...@getopt.org>
Subject: Copy errors - unable to complete jobs.
Date: Fri, 31 Mar 2006 23:16:32 GMT
Hi,

I'm getting frequent errors running the Nutch fetcher, but it seems to 
me the problem lies in Hadoop. As a result I'm unable to complete any 
fetching.

The environment seems healthy; it consists of 6 tasktracker/datanode 
machines and one jobtracker/namenode machine.

The map phase works flawlessly. However, reduce tasks report two kinds 
of errors (and in the end the whole job fails):

* from time to time a reduce task reports "No valid local directories 
in property: mapred.local.dir" - but such a property is defined in 
hadoop-site.xml, and it points to a valid location (see the snippet 
after the stack trace below). These errors seem innocuous, and reducing 
continues anyway.

* 060331 055932 task_r_8kge0k copy failed: task_m_67k3s0 from nutch5.xxx.xxx/nnn.nnn.nnn.nnn:50040
java.io.IOException: timed out waiting for response
        at org.apache.hadoop.ipc.Client.call(Client.java:305)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
        at org.apache.hadoop.mapred.$Proxy2.getFile(Unknown Source)
        at org.apache.hadoop.mapred.ReduceTaskRunner.prepare(ReduceTaskRunner.java:106)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:66)
060331 055932 task_r_8kge0k 0.07692308% reduce > copy > task_m_67k3s0@nutch5.xxx.xxx:50040
060331 055932 task_r_8kge0k Got 1 map output locations.
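
For reference, the mapred.local.dir entry in our hadoop-site.xml looks 
roughly like this (the path below is anonymized, but on the real nodes 
it exists and is writable):

  <property>
    <name>mapred.local.dir</name>
    <value>/path/to/mapred/local</value>
  </property>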

The second (copy) error, on the other hand, is ultimately fatal. In the 
logs I can see that the copy is attempted several times before the task 
finally gives up. The same problem occurs frequently for all reduce 
tasks, but most of the time the copy eventually succeeds. If I may 
venture a wild guess, I would say that some of the map output is removed 
prematurely; or perhaps the source node is so busy that it cannot 
respond in a timely fashion?
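
If the timeout in Client.call is configurable - I believe the IPC 
client reads something like ipc.client.timeout, though I haven't 
verified this against the current code - would raising it be a sane 
workaround while the real cause is tracked down? E.g. in 
hadoop-site.xml (the value below, in milliseconds, is just a guess):

  <property>
    <name>ipc.client.timeout</name>
    <value>120000</value>
  </property>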

What is especially irritating about these errors is that they happen at 
the very end of a long-running job, when 99.5% of the work is done 
(including fetching millions of pages from the 'net) - and all of that 
work is irretrievably lost.

I dearly wish for an option in Hadoop that would let me keep at least 
the map results and restart the job so that it performs just the reduce 
step. That way I wouldn't have to refetch a million pages... (which also 
argues for processing smaller segments rather than larger ones).

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


