Message-ID: <440F3CF4.2080904@archive.org>
Date: Wed, 08 Mar 2006 12:22:12 -0800
From: Michael Stack
To: hadoop-user@lucene.apache.org
Subject: More about 'hang' on invertlinks.
References: <440F1A7A.6000103@duboce.net>
In-Reply-To: <440F1A7A.6000103@duboce.net>

I have more information on the hang described below. Looking at the list of reduce tasks in jobdetails, all of them seem to be hung waiting on one host. Here is the status message that all 31 of my reduce tasks exhibit:

  reduce > copy > task_m_5g00f1@ia109309.archive.org:50040

Looking over on the 309 machine, I see that as far as its tasktracker is concerned, task_m_5g00f1 completed fine:

  060308 112952 parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
  060308 112952 task_m_5g00f1 1.0% /user/stack/nara/outputs/segments/2006030721095 20:224000000+32000000
  060308 112953 Task task_m_5g00f1 is done.
  ...

...and then it goes on to do cleanup:

  060308 122603 task_m_5g00f1 done; removing files.

Shortly afterwards I start seeing the likes of the below in the log, and they are unending:

  060308 122626 Server handler 0 on 50040 caught: java.io.FileNotFoundException: /0/hadoop/tmp/task_m_5g00f1/part-19.out
  java.io.FileNotFoundException: /0/hadoop/tmp/task_m_5g00f1/part-19.out
          at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:113)
          at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.j
          at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
          at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
          at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:106)
          at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
          at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:215)

It looks like the ipc Server handlers are stuck trying to read map output parts that have since been removed. Things look a little confused. I will keep digging, but any suggestions as to what might be going on, or things to try, would be appreciated.
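For anyone trying to picture the failure, here is a minimal, hypothetical sketch of the race I think I am seeing -- it is not Hadoop code, just a stand-in temp file playing the part of /0/hadoop/tmp/task_m_5g00f1/part-19.out, deleted by "cleanup" while a "handler" still tries to open it:

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CleanupRace {
    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for a map output part file.
        Path part = Files.createTempFile("part-19", ".out");
        Files.write(part, new byte[]{1, 2, 3});

        // The tasktracker cleanup removes map outputs once the task is "done"...
        Files.delete(part);

        // ...but a handler serving a late reduce-side copy request then fails
        // on open, just as LocalFileSystem.openRaw did in the log above.
        try {
            new FileInputStream(part.toFile());
        } catch (FileNotFoundException e) {
            System.out.println("open after cleanup failed: " + e.getMessage());
        }
    }
}
```

If that is roughly right, the interesting question is why the reducers are still asking for outputs the tasktracker already considers served and removable.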
St.Ack

Michael Stack wrote:
> ...
>
> 3. The rack looks to be currently 'hung' on invertlinks. All reduce tasks
> show the exact same 0.25 or so complete. No emissions out of the
> jobtracker in the last 4 hours. The namenode log has block reports. All
> CPUs are quiescent -- even the jobtracker's. 5 reduce tasks had the below
> exception:
>
> 060308 132336 task_r_9swl2k Client connection to 207.241.228.28:8009: starting
> 060308 132336 task_r_9swl2k parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-default.xml
> 060308 132336 task_r_9swl2k parsing file:/0/hadoop/nara/app/runtime-conf/hadoop-site.xml
> 060308 132336 Server connection on port 50050 from 207.241.227.176: starting
> 060308 132336 task_r_9swl2k 0.75% reduce > reduce
> 060308 132336 task_r_9swl2k Client connection to 0.0.0.0:50050: starting
> 060308 132339 task_r_9swl2k Error running child
> 060308 132339 task_r_9swl2k java.lang.RuntimeException: java.io.EOFException
> 060308 132339 task_r_9swl2k   at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:132)
> 060308 132339 task_r_9swl2k   at org.apache.nutch.crawl.LinkDb.reduce(LinkDb.java:108)
> 060308 132339 task_r_9swl2k   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:283)
> 060308 132339 task_r_9swl2k   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:666)
> 060308 132339 task_r_9swl2k Caused by: java.io.EOFException
> 060308 132339 task_r_9swl2k   at java.io.DataInputStream.readFully(DataInputStream.java:178)
> 060308 132339 task_r_9swl2k   at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
> 060308 132339 task_r_9swl2k   at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
> 060308 132339 task_r_9swl2k   at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
> 060308 132339 task_r_9swl2k   at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
> 060308 132339 task_r_9swl2k   at org.apache.nutch.crawl.Inlink.readFields(Inlink.java:36)
> 060308 132339 task_r_9swl2k   at org.apache.nutch.crawl.Inlink.read(Inlink.java:53)
> 060308 132339 task_r_9swl2k   at org.apache.nutch.crawl.Inlinks.readFields(Inlinks.java:44)
> 060308 132339 task_r_9swl2k   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:347)
> 060308 132339 task_r_9swl2k   at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.getNext(ReduceTask.java:163)
> 060308 132339 task_r_9swl2k   at org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:129)
> 060308 132339 task_r_9swl2k ... 3 more
> 060308 132340 Server connection on port 50050 from 207.241.227.176: exiting
> 060308 132340 KILLING CHILD PROCESS task_r_9swl2k
>
> A thread dump from the jobtracker shows a bunch of threads in this state:
>
> Full thread dump Java HotSpot(TM) Server VM (1.5.0_06-b05 mixed mode):
>
> "Server connection on port 8010 from xxx.xxx.xxx.xxx" daemon prio=1 tid=0xad324720 nid=0x2074 runnable [0xac07e000..0xac07ee40]
>   at java.net.SocketInputStream.socketRead0(Native Method)
>   at java.net.SocketInputStream.read(SocketInputStream.java:129)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
>   - locked <0xb4570d08> (a java.io.BufferedInputStream)
>   at java.io.DataInputStream.readInt(DataInputStream.java:353)
>   at org.apache.hadoop.ipc.Server$Connection.run(Server.java:129)
>
> If I connect to the mentioned tasktrackers -- the tasktrackers at IP
> xxx.xxx.xxx.xxx from the jobtracker thread dump -- no children are
> running.
>
> Any pointers appreciated. Meantime I will keep digging.
>
> St.Ack
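P.S. On the quoted EOFException: that exception is the generic symptom of DataInputStream.readFully() hitting end-of-stream before the promised bytes arrive, i.e. the reducer believed a value was longer than the data it actually had -- a truncated or corrupt map output segment would do it. A minimal, hypothetical reproduction (the length header and payload here are made up, not Hadoop's actual wire format):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;

public class TruncatedRead {
    public static void main(String[] args) throws Exception {
        // A record whose 4-byte length header promises 10 bytes,
        // but only 2 bytes of payload actually follow.
        byte[] truncated = {0, 0, 0, 10, 'a', 'b'};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(truncated));

        int len = in.readInt();   // reads 10 from the header
        byte[] buf = new byte[len];
        try {
            // Same call that blew up in the reduce trace: readFully throws
            // EOFException if the stream ends before the buffer is filled.
            in.readFully(buf);
        } catch (EOFException e) {
            System.out.println("short read: stream ended before " + len + " bytes");
        }
    }
}
```

Which would point at the data the reducer fetched, rather than the reduce logic itself, as the thing to check.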