hadoop-common-user mailing list archives

From: Michael Stack <st...@duboce.net>
Subject: IOE job exceptions and a 'hang' on invertlinks.
Date: Wed, 08 Mar 2006 17:55:06 GMT
Below are some exceptions gleaned from tasktracker logs.  I was wondering 
if others are seeing similar issues, and whether anyone has suggestions on 
how to avoid or fix them.

First, the general setting: a rack of 35 reasonably fast, lightly loaded 
machines running one child per slave, with the head node running the 
jobtracker and namenode only.  I'm running hadoop revision 383722 -- from 
yesterday or perhaps the day before -- and I'm running nutch as my job.

Most settings are default (including ipc timeout).

1. I saw a good few instances of the exception below (from a tasktracker 
log) during the reduce phase of a job that runs big files of web pages 
through the nutch parsers.  Reading the code, it looks like the namenode 
failed to allocate a new block after the client tried the maximum of 5 
times, with a 100ms sleep between attempts.  I'll dig deeper to work out 
why the namenode is failing, but perhaps others on the list can comment?

060308 053947 task_r_b2avsx java.io.IOException: Cannot obtain 
additional block for file 
/user/stack/nara/outputs/segments/20060307210958/crawl_fetch/part-00028/.data.crc
060308 053947 task_r_b2avsx     at 
org.apache.hadoop.ipc.Client.call(Client.java:301)
060308 053947 task_r_b2avsx     at 
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
060308 053947 task_r_b2avsx     at 
org.apache.hadoop.dfs.$Proxy1.addBlock(Unknown Source)
060308 053947 task_r_b2avsx     at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:579)
060308 053947 task_r_b2avsx     at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:788)
060308 053947 task_r_b2avsx     at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:856)
060308 053947 task_r_b2avsx     at 
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at 
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at 
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at 
org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:78)
060308 053947 task_r_b2avsx     at 
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at 
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at 
java.io.FilterOutputStream.close(FilterOutputStream.java:143)
060308 053947 task_r_b2avsx     at 
org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:137)
060308 053947 task_r_b2avsx     at 
org.apache.hadoop.io.MapFile$Writer.close(MapFile.java:119)
060308 053947 task_r_b2avsx     at 
org.apache.nutch.fetcher.FetcherOutputFormat$1.close(FetcherOutputFormat.java:86)
060308 053947 task_r_b2avsx     at 
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:291)
060308 053947 task_r_b2avsx     at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
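
For my own understanding, here is a rough sketch of the bounded-retry shape 
I think I'm reading in the client -- try the namenode a fixed number of 
times, sleep 100ms between attempts, then give up with the IOE above.  The 
names are made up for illustration; this is not the actual DFSClient source:

import java.io.IOException;
import java.util.concurrent.Callable;

class BoundedRetry {
    // Run the attempt up to maxTries times, sleeping sleepMs between tries;
    // a null result or an exception counts as a failed attempt.
    static <T> T retry(Callable<T> attempt, int maxTries, long sleepMs,
                       String failureMessage) throws IOException {
        for (int i = 0; i < maxTries; i++) {
            try {
                T result = attempt.call();   // e.g. the addBlock() RPC to the namenode
                if (result != null) {
                    return result;           // namenode handed out a block
                }
            } catch (Exception e) {
                // fall through and retry
            }
            try {
                Thread.sleep(sleepMs);       // 100ms in the behaviour described above
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }
        // after maxTries failures this becomes the "Cannot obtain additional
        // block for file ..." IOException in the trace above
        throw new IOException(failureMessage);
    }
}

If that reading is right, the interesting question is why the namenode keeps 
refusing across all five attempts in a row.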


2. I saw just a few instances of the below during the crawldb update, but 
on a slower rack it happens often.  On the slower rack I tried raising the 
ipc timeout, but it didn't seem to make a difference.  I'd be interested in 
others' thoughts on how to avoid it.  On the slower rack it actually stops 
jobs from completing (why would the tasktracker not respond to its child?  
The tasktrackers aren't doing much, or are they?).

060308 114943 task_r_1qxohh 0.9803678% reduce > reduce
060308 115043 task_r_1qxohh  Parent died.  Exiting task_r_1qxohh
060308 115043 task_r_1qxohh java.io.IOException: timed out waiting for 
response
060308 115043 task_r_1qxohh     at 
org.apache.hadoop.ipc.Client.call(Client.java:303)  
060308 115043 task_r_1qxohh     at 
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
060308 115043 task_r_1qxohh     at 
org.apache.hadoop.mapred.$Proxy0.ping(Unknown Source)
060308 115043 task_r_1qxohh     at 
org.apache.hadoop.mapred.TaskTracker$Child$1.run(TaskTracker.java:686)
060308 115043 task_r_1qxohh     at java.lang.Thread.run(Thread.java:595)
060308 115043 task_r_1qxohh 0.9823446% reduce > reduce
060308 115043 Server connection on port 50050 from 207.241.227.179: exiting
060308 115043 Server connection on port 50050 from 207.241.227.179: exiting
060308 115043 KILLING CHILD PROCESS task_r_1qxohh
060308 115043 task_r_1qxohh Child Error
java.io.IOException: Task process exit with nonzero status.
    at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:271)   
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)

(The 'KILLING...' is my addition).
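
My (possibly wrong) mental model of the child side, sketched with made-up 
names rather than the real TaskTracker$Child code: the child pings its 
parent tasktracker over RPC, and if a ping times out it assumes the parent 
is dead and exits with a nonzero status -- which would explain the "Parent 
died" and "exit with nonzero status" pair above.

import java.io.IOException;

class ChildPinger implements Runnable {
    // Hypothetical stand-in for the RPC proxy the child holds to its local
    // tasktracker (port 50050 in the log above).
    interface Umbilical {
        void ping(String taskId) throws IOException;
    }

    private final Umbilical parent;
    private final String taskId;

    ChildPinger(Umbilical parent, String taskId) {
        this.parent = parent;
        this.taskId = taskId;
    }

    public void run() {
        while (true) {
            try {
                parent.ping(taskId);   // blocks until the tasktracker answers
                                       // or the ipc timeout fires
                Thread.sleep(1000);    // assumed ping interval
            } catch (Exception e) {
                // ping timed out: assume the tasktracker is gone and exit
                System.err.println("Parent died.  Exiting " + taskId);
                System.exit(1);        // nonzero status, which the tracker then
                                       // reports as "Task process exit with
                                       // nonzero status"
            }
        }
    }
}

If that is roughly right, then anything that keeps the tasktracker from 
answering a ping inside the ipc timeout is enough to take a reduce down at 
98%, which matches what I'm seeing.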

3. The rack currently looks to be 'hung' on invertlinks.  All reduce tasks 
show the exact same progress, around 0.25 complete.  No output from the 
jobtracker in the last 4 hours.  The namenode log has block reports.  All 
CPUs are quiescent -- even the jobtracker's.  5 reduce tasks had the 
exception below:

060308 132336 task_r_9swl2k  Client connection to 207.241.228.28:8009: 
starting
060308 132336 task_r_9swl2k  parsing 
file:/0/hadoop/nara/app/runtime-conf/hadoop-default.xml
060308 132336 task_r_9swl2k  parsing 
file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
060308 132336 Server connection on port 50050 from 207.241.227.176: starting
060308 132336 task_r_9swl2k 0.75% reduce > reduce
060308 132336 task_r_9swl2k  Client connection to 0.0.0.0:50050: starting
060308 132339 task_r_9swl2k  Error running child
060308 132339 task_r_9swl2k java.lang.RuntimeException: java.io.EOFException
060308 132339 task_r_9swl2k     at 
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:132)
060308 132339 task_r_9swl2k     at 
org.apache.nutch.crawl.LinkDb.reduce(LinkDb.java:108)
060308 132339 task_r_9swl2k     at 
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:283)
060308 132339 task_r_9swl2k     at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
060308 132339 task_r_9swl2k Caused by: java.io.EOFException
060308 132339 task_r_9swl2k     at 
java.io.DataInputStream.readFully(DataInputStream.java:178)
060308 132339 task_r_9swl2k     at 
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
060308 132339 task_r_9swl2k     at 
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
060308 132339 task_r_9swl2k     at 
org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
060308 132339 task_r_9swl2k     at 
org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
060308 132339 task_r_9swl2k     at 
org.apache.nutch.crawl.Inlink.readFields(Inlink.java:36)
060308 132339 task_r_9swl2k     at 
org.apache.nutch.crawl.Inlink.read(Inlink.java:53)  
060308 132339 task_r_9swl2k     at 
org.apache.nutch.crawl.Inlinks.readFields(Inlinks.java:44)
060308 132339 task_r_9swl2k     at 
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:347)
060308 132339 task_r_9swl2k     at 
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:163)
060308 132339 task_r_9swl2k     at 
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:129)
060308 132339 task_r_9swl2k     ... 3 more
060308 132340 Server connection on port 50050 from 207.241.227.176: exiting
060308 132340 KILLING CHILD PROCESS task_r_9swl2k
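
For what it's worth, the EOFException itself is just what readFully does 
when a record claims more bytes than the stream has left.  A toy 
illustration with plain java.io, nothing Nutch- or Hadoop-specific:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TruncatedRecord {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeShort(10);              // length header says 10 bytes follow...
        out.write(new byte[4]);          // ...but only 4 were ever written
        out.close();

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        int len = in.readUnsignedShort();
        byte[] buf = new byte[len];
        in.readFully(buf);               // throws java.io.EOFException, as in
                                         // the trace above
    }
}

Which would point at a short or truncated value somewhere in the reduce 
input, though I haven't confirmed that.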

A thread dump from the jobtracker shows a bunch of threads in this state:

Full thread dump Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode):
   
"Server connection on port 8010 from xxx.xxx.xxx.xxx" daemon prio=1 
tid=0xad324720 nid=0x2074 runnable [0xac07e000..0xac07ee40]    at 
java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
    - locked <0xb4570d08> (a java.io.BufferedInputStream)
    at java.io.DataInputStream.readInt(DataInputStream.java:353)
    at org.apache.hadoop.ipc.Server$Connection.run(Server.java:129)

If I connect to the tasktrackers in question -- the ones at the 
xxx.xxx.xxx.xxx IPs from the jobtracker thread dump -- no children are 
running....
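
Those threads show as 'runnable' even though they're idle: a blocking 
readInt() on a socket sits in the native read until the peer writes the 
next request header, and thread dumps report threads parked in native I/O 
as runnable.  A minimal illustration -- not the ipc.Server code:

import java.io.DataInputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class IdleConnection {
    public static void main(String[] args) throws Exception {
        // accept one connection, then wait for a 4-byte header that never comes
        ServerSocket server = new ServerSocket(8010);   // port chosen to match the dump
        Socket peer = server.accept();
        DataInputStream in = new DataInputStream(peer.getInputStream());
        int nextCall = in.readInt();   // a dump taken here shows this thread
                                       // "runnable" in socketRead0
        System.out.println("next call header: " + nextCall);
    }
}

So my guess is these are just idle connections from tasktrackers waiting 
for their next call, not where the hang actually is.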

Any pointers appreciated.  Meantime will keep digging.

St.Ack
