Date: Tue, 30 Oct 2007 16:35:37 -0400 (EDT)
Subject: Re: Job hangs, strange error messages...help sought...
From: "Stu Hood" <stuhood@webmail.us>
To: hadoop-user@lucene.apache.org
Reply-To: stuhood@webmail.us

The files in the _task* subfolders are expected. Any task executed by Hadoop gets its own working directory and writes its output there. When one attempt of the task completes successfully, Hadoop copies its successful output from its _task* directory to the final destination.

That way, multiple attempts of the same task don't have colliding output files.

Thanks,
Stu


-----Original Message-----
From: C G
Sent: Tuesday, October 30, 2007 4:23pm
To: hadoop-user@lucene.apache.org
Subject: Re: Job hangs, strange error messages...help sought...

Still more weirdness. When I run my code on a single-node Hadoop installation it runs fine, albeit with terrible performance. When I move to 2 nodes, I see the various error messages I've previously described.
The system is also writing directories of stuff to my output when I am only expecting to see reducer output, and logging those writes in the jobtracker log as follows:

2007-10-30 16:14:33,264 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200710301552_0002_r_000007_0' to tip tip_200710301552_0002_r_000007, for tracker 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:33,628 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:41,359 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:41,722 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200710301552_0002_r_000007_0' has completed tip_200710301552_0002_r_000007 successfully.
2007-10-30 16:14:41,729 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200710301552_0002_r_000007_0' to hdfs://10.2.11.1:54310/import/raw_logs/20070929

So instead of reducer output, there is a combination of reducer output and task directories:

/import/raw_logs/20070929/_task_200710301552_0002_r_000001_0             2007-10-30 16:15
/import/raw_logs/20070929/_task_200710301552_0002_r_000002_0             2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000003_0             2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000004_0             2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000006_0             2007-10-30 16:14
/import/raw_logs/20070929/part-00000    2323852    2007-10-30 16:13
/import/raw_logs/20070929/part-00001    2271699    2007-10-30 16:13
/import/raw_logs/20070929/part-00003    2358158    2007-10-30 16:14
/import/raw_logs/20070929/part-00004    2341299    2007-10-30 16:14
/import/raw_logs/20070929/part-00005    2265270    2007-10-30 16:14
/import/raw_logs/20070929/part-00006    2305368    2007-10-30 16:14
/import/raw_logs/20070929/part-00007    2292513    2007-10-30 16:14

Those _task directories contain output that appears, at least in file size, to be identical to the reducer output:

bin/hadoop dfs -ls /import/raw_logs/20070929/_task_200710301552_0002_r_000001_0
Found 1 items
/import/raw_logs/20070929/_task_200710301552_0002_r_000001_0/part-00001    2271699    2007-10-30 16:15

Has anybody seen this behavior before? This is happening for any map/reduce program I run - we have several. All the code is basic stuff - nothing exotic happening at all.

Any advice much appreciated...

Thanks,
C G

C G wrote:
Each node runs 1 datanode instance. I did notice this on "node4" trying to send to "node2":

2007-10-30 14:35:16,876 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_7852052048957161201 to 10.2.11.2:50010 got java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
at java.lang.Thread.run(Thread.java:619)

and then on node2:
2007-10-30 14:35:16,918 INFO org.apache.hadoop.dfs.DataNode: Received block blk_7852052048957161201 from /10.2.11.3
2007-10-30 14:35:16,919 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_7852052048957161201 is valid, and cannot be written to.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:515)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:822)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:727)
at java.lang.Thread.run(Thread.java:619)

Ted Dunning wrote:

Is it possible you have some nodes running multiple instances of datanodes?


On 10/30/07 11:06 AM, "C G" wrote:

> Hi All:
>
> Environment: 4-node grid running hadoop-0.14.1.
>
> With the system shut down, I wiped out the old HDFS directory structure and
> created an empty directory. Did a namenode format, and then brought up the
> system with start-all.sh.
>
> I then set up a directory structure and ran the first job. The job runs
> 100% of the map tasks, completes ~87% of the reduce tasks, and then hangs.
> There are no user-level error messages. All systems go idle.
>
> I started looking at the Hadoop logs; the first strange message is from the
> namenode log:
>
> 2007-10-30 13:48:01,991 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
> 2007-10-30 13:48:01,992 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310, call create(/import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001, DFSClient_task_200710301345_0001_r_000001_0, true, 3, 67108864) from 10.2.11.4:34016: error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
> org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
> at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:788)
> at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:725)
> at org.apache.hadoop.dfs.NameNode.create(NameNode.java:307)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)
>
> The second strange error message comes from the jobtracker log:
> 2007-10-30 13:59:26,190 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
>
> I'm curious about how to proceed. I suspect that my code is OK, as I've run
> it numerous times in both single-node and multi-node grid environments. I've
> never seen these error messages before.
>
> Any help much appreciated....
>
> Thanks,
> C G
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam? Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com