hadoop-common-user mailing list archives

From monu.o...@richmondinformatics.com
Subject Hadoop errors/delays
Date Sun, 26 Mar 2006 13:02:50 GMT
Hello Team,

I am running nutch with mapreduce/DFS using:

- nutch-0.8-dev (rev="387659" of 2006-03-23) with

- hadoop-0.1-dev (rev="387664") built and merged into <nutchdir>/lib/hadoop

- A cluster of 6 Xeon 3 GHz servers with 2 GB RAM each, running CentOS 4.2,
configured as 1 master and 5 slaves; and

- Sun's jdk1.5.0_06

My initial goal is to create and populate segments of 1 million pages each
(although the eventual requirement will be for 10 x 10 million-page segments,
handling 100 million pages in total).

I'm encountering a number of long delays, apparently resulting from the errors
reproduced below.  Happily, while processes seemed to hang in previous
versions, this recent build of nutch/hadoop appears to recover from the errors.
Unfortunately, each such "recovery" can take several hours.

I have been following the discussions re: HADOOP-83, HADOOP-86 and HADOOP-98,
but don't have the experience to work out whether they are relevant to what
I'm experiencing below.

Currently, the webdb contains:

060326 065335 Statistics for CrawlDb: crawlA/db
060326 065335 TOTAL urls:       31511017
060326 065335 avg score:        1.13
060326 065335 max score:        12844.727
060326 065335 min score:        1.0
060326 065335 retry 0:  31360006
060326 065335 retry 1:  139249
060326 065335 retry 2:  7669
060326 065335 retry 3:  4093
060326 065335 status 1 (DB_unfetched):  27729501
060326 065335 status 2 (DB_fetched):    3715424
060326 065335 status 3 (DB_gone):       66092
060326 065335 CrawlDb statistics: done


** Rough benchmarks:

Fetching 1 million - 9 hrs
	Physical crawling - 5 hrs
	Reducing first 90% - 1 hr
	Reducing last 10% and recovering from errors - up to 3 hrs

Generating 1 million segment - 3.5 hrs
	Reducing first 90% - 1 hr
	Reducing last 10% and recovering from errors - up to 1.5 hrs

updatedb - 45 mins

invertlinks - 32 mins

readdb <db> -stats - 1 hr 10 mins
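
To put the "last 10%" figures in proportion, here is a rough arithmetic sketch
using only the fetch and generate timings listed above.  The "up to" values
are upper bounds, so the resulting share is a worst case and not a measurement;
the variable names are mine.

```python
# Durations in minutes, copied from the rough benchmarks above.
fetch_total = 9 * 60        # "Fetching 1 million - 9 hrs"
generate_total = 3.5 * 60   # "Generating 1 million segment - 3.5 hrs"
fetch_tail = 3 * 60         # last 10% of reduce + recovery, "up to 3 hrs"
generate_tail = 1.5 * 60    # last 10% of reduce + recovery, "up to 1.5 hrs"

tail = fetch_tail + generate_tail
total = fetch_total + generate_total
share = tail / total

# Worst-case share of fetch + generate time spent in the slow tail.
print(f"{tail / 60:.1f}h of {total / 60:.1f}h ({share:.0%})")
```

In other words, if the upper bounds hold, over a third of the fetch and
generate time goes to the final 10% of the reduce phase and error recovery.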

** The following are my investigations, which will, hopefully, be more
meaningful to the team than they are to me.


First, I searched the jobtracker log for "Exception".  Some example errors:

# grep Exception hadoop-root-jobtracker-nutch0.my.domain.log

060324 211051 Error from task_r_761v2w: Timed out.java.io.IOException: Task
process exit with nonzero status.
060324 214242 Error from task_r_9yacml: Timed out.java.io.IOException: Task
process exit with nonzero status.
060324 221452 Error from task_r_eyxuak: Timed out.java.io.IOException: Task
process exit with nonzero status.
java.io.IOException: Not a file:
/user/root/crawlA/segments/20060324135138/crawl_fetch/part-00002/data
java.io.IOException: Not a file:
/user/root/crawlA/segments/20060324135138/parse_data/part-00002/data
060325 015217 Error from task_m_8ftcxp: Timed out.java.io.IOException: Task
process exit with nonzero status.
060325 021657 Error from task_m_e0pg3g: Timed out.java.io.IOException: Task
process exit with nonzero status.
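
For anyone wanting to tally these, a minimal sketch of how the timed-out tasks
could be pulled out of the jobtracker log.  It assumes the line shape seen in
the excerpts above ("<date> <time> Error from <task id>: Timed out...") and
assumes that the "task_m_" / "task_r_" prefixes denote map and reduce tasks;
the function name and sample list are mine.

```python
import re

# Assumed line shape, taken from the log excerpts in this message:
#   "<yymmdd> <hhmmss> Error from <task id>: Timed out..."
TIMEOUT_RE = re.compile(r"^(\d{6}) (\d{6}) Error from (task_[mr]_\w+): Timed out")

def timed_out_tasks(lines):
    """Return (date, time, task_id, kind) for each timed-out task line."""
    found = []
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m:
            date, time, task_id = m.groups()
            # Assumption: task_m_* are map tasks, task_r_* are reduce tasks.
            kind = "map" if task_id.startswith("task_m_") else "reduce"
            found.append((date, time, task_id, kind))
    return found

# A few lines copied from the grep output above.
sample = [
    "060324 211051 Error from task_r_761v2w: Timed out.java.io.IOException: Task",
    "process exit with nonzero status.",
    "java.io.IOException: Not a file:",
    "060325 021657 Error from task_m_e0pg3g: Timed out.java.io.IOException: Task",
]
for row in timed_out_tasks(sample):
    print(row)
```

On a real log, the lines would come from open(...) on the jobtracker log file
instead of the sample list.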

Investigating the last case, for example:

# grep task_m_e0pg3g *

hadoop-root-jobtracker-nutch0.my.domain.log:060325 015227 Adding task
'task_m_e0pg3g' to tip tip_nkkxau, for tracker 'tracker_87920'
hadoop-root-jobtracker-nutch0.my.domain.log:060325 021657 Error from
task_m_e0pg3g: Timed out.java.io.IOException: Task process exit with nonzero
status.
hadoop-root-jobtracker-nutch0.my.domain.log:060325 021657 Task 'task_m_e0pg3g'
has been lost.

Hunt down the slave(s?) on which task_m_e0pg3g was running:

# for h in nutch1 nutch2 nutch3 nutch4 nutch5; do ssh "$h" 'cd ~nutch/nutch-2006-03-23/logs && grep task_m_e0pg3g *'; done

hadoop-root-tasktracker-nutch3.my.domain.log:060325 021425 Task task_m_e0pg3g
timed out.  Killing.
hadoop-root-tasktracker-nutch3.my.domain.log:060325 021425 task_m_e0pg3g Child
Error
hadoop-root-tasktracker-nutch3.my.domain.log:060325 021428 task_m_e0pg3g done;
removing files.

On nutch3

# less ~nutch/nutch-2006-03-23/logs/hadoop-root-tasktracker-nutch3.my.domain.log

..... snipped for brevity ......

060325 015417 task_m_e0pg3g 0.6349619%
/user/root/crawlA/db/current/part-00001/data:64000000+32000000

..... snipped for brevity ......

060325 021425 Task task_m_e0pg3g timed out.  Killing.
060325 021425 Server connection on port 50050 from 193.203.240.120: exiting
060325 021425 Server connection on port 50050 from 193.203.240.120: exiting
060325 021425 task_m_e0pg3g Child Error
java.io.IOException: Task process exit with nonzero status.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:273)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)
060325 021428 task_m_e0pg3g done; removing files.

So the task timed out after 20 minutes.
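
The 20-minute figure can be checked from the two timestamps in the excerpt.
This sketch assumes the log timestamps are "yyMMdd HHmmss", and it treats the
015417 progress line as the last one before the kill, which may not be true
since the log was snipped.

```python
from datetime import datetime

# Assumption: log timestamps are "yyMMdd HHmmss", e.g. "060325 015417".
LOG_TS_FORMAT = "%y%m%d %H%M%S"

def elapsed(start, end):
    """Return the timedelta between two log timestamps."""
    return (datetime.strptime(end, LOG_TS_FORMAT)
            - datetime.strptime(start, LOG_TS_FORMAT))

last_progress = "060325 015417"  # progress line shown for task_m_e0pg3g
killed = "060325 021425"         # "Task task_m_e0pg3g timed out.  Killing."

gap = elapsed(last_progress, killed)
print(gap)  # 0:20:08
```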

** I have tried increasing mapred.task.timeout in hadoop-site.xml to an
hour (3600000 milliseconds):

<property>
  <name>mapred.task.timeout</name>
  <value>3600000</value>
</property>
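
Since these timeouts are given in milliseconds, it is easy to slip a zero.  A
small conversion sketch (the helper names are mine) to sanity-check the values
used in this message:

```python
def minutes_to_millis(minutes):
    """Convert minutes to milliseconds."""
    return minutes * 60 * 1000

def millis_to_minutes(millis):
    """Convert milliseconds to minutes."""
    return millis / 60000

print(minutes_to_millis(60))       # 3600000, i.e. one hour
print(millis_to_minutes(3600000))  # 60.0
print(millis_to_minutes(180000))   # 3.0, the ipc.client.timeout value in this message
```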

I have also tried increasing ipc.client.timeout:

<!-- ipc properties -->

<property>
  <name>ipc.client.timeout</name>
  <value>180000</value>
</property>

But these don't seem to make much difference.

060325 100915 task_r_3kjrdd copy failed: task_m_azvb0x from
nutch3.my.domain/193.203.240.120:50040
java.io.IOException: timed out waiting for response
        at org.apache.hadoop.ipc.Client.call(Client.java:305)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
        at org.apache.hadoop.mapred.$Proxy2.getFile(Unknown Source)
        at
org.apache.hadoop.mapred.ReduceTaskRunner.prepare(ReduceTaskRunner.java:106)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:66)
060325 100915 task_r_3kjrdd 0.2% reduce > copy >
task_m_azvb0x@nutch3.my.domain:50040
060325 100915 task_r_3kjrdd Got 1 map output locations.

** Finally, could it be that my server configurations are "wrong" or less than
optimal?

ulimit is "unlimited"

The size of the heap is the default

I don't think IPv6 is "enabled"

/etc/ssh/sshd_config contains "AcceptEnv HADOOP_CONF_DIR"

conf/hadoop-env.sh - all untouched.

**

I'd be very grateful for any clues as to what I might be doing wrong.

Many thanks,

Monu Ogbe

