Date: Tue, 30 Oct 2007 16:35:37 -0400 (EDT)
Subject: Re: Job hangs, strange error messages...help sought...
From: "Stu Hood" <stuhood@webmail.us>
To: hadoop-user@lucene.apache.org
Reply-To: stuhood@webmail.us

The files in the _task* subfolders are expected. Any task executed by Hadoop gets its own working directory and writes its output there. When one attempt of the task completes successfully, Hadoop copies its successful output from its _task* directory to the final destination.

That way, multiple attempts of the same task don't have colliding output files.

Thanks,
Stu


-----Original Message-----
From: C G
Sent: Tuesday, October 30, 2007 4:23pm
To: hadoop-user@lucene.apache.org
Subject: Re: Job hangs, strange error messages...help sought...

Still more weirdness. When I run my code on a single-node Hadoop installation it runs fine, albeit with terrible performance. When I move to 2 nodes, I see the various error messages I've previously described.
The system is also writing directories of stuff to my output when I am only expecting to see reducer output, and logging those writes in the jobtracker log as follows:

2007-10-30 16:14:33,264 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200710301552_0002_r_000007_0' to tip tip_200710301552_0002_r_000007, for tracker 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:33,628 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:41,359 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:41,722 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200710301552_0002_r_000007_0' has completed tip_200710301552_0002_r_000007 successfully.
2007-10-30 16:14:41,729 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200710301552_0002_r_000007_0' to hdfs://10.2.11.1:54310/import/raw_logs/20070929

So instead of reducer output, there is a combination of reducer output and task directories:

/import/raw_logs/20070929/_task_200710301552_0002_r_000001_0             2007-10-30 16:15
/import/raw_logs/20070929/_task_200710301552_0002_r_000002_0             2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000003_0             2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000004_0             2007-10-30 16:14
/import/raw_logs/20070929/_task_200710301552_0002_r_000006_0             2007-10-30 16:14
/import/raw_logs/20070929/part-00000    2323852    2007-10-30 16:13
/import/raw_logs/20070929/part-00001    2271699    2007-10-30 16:13
/import/raw_logs/20070929/part-00003    2358158    2007-10-30 16:14
/import/raw_logs/20070929/part-00004    2341299    2007-10-30 16:14
/import/raw_logs/20070929/part-00005    2265270    2007-10-30 16:14
/import/raw_logs/20070929/part-00006    2305368    2007-10-30 16:14
/import/raw_logs/20070929/part-00007    2292513    2007-10-30 16:14

Those _task directories contain output that appears, at least in file size, to be identical to the reducer output:

bin/hadoop dfs -ls /import/raw_logs/20070929/_task_200710301552_0002_r_000001_0
Found 1 items
/import/raw_logs/20070929/_task_200710301552_0002_r_000001_0/part-00001    2271699    2007-10-30 16:15

Has anybody seen this behavior before? This is happening for any map/reduce program I run - we have several. All the code is basic stuff - nothing exotic happening at all.

Any advice much appreciated...

Thanks,
C G

C G wrote:
Each node runs 1 datanode instance. I did notice this on "node4" trying to send to "node2":

2007-10-30 14:35:16,876 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_7852052048957161201 to 10.2.11.2:50010 got java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
at java.lang.Thread.run(Thread.java:619)

and then on node2:
2007-10-30 14:35:16,918 INFO org.apache.hadoop.dfs.DataNode: Received block blk_7852052048957161201 from /10.2.11.3
2007-10-30 14:35:16,919 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_7852052048957161201 is valid, and cannot be written to.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:515)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:822)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:727)
at java.lang.Thread.run(Thread.java:619)

Ted Dunning wrote:

Is it possible you have some nodes running multiple instances of datanodes?


On 10/30/07 11:06 AM, "C G" wrote:

> Hi All:
>
> Environment: 4-node grid running hadoop-0.14.1.
>
> With the system shut down, I wiped out the old HDFS directory structure and
> created an empty directory. Did a namenode format, and then brought up the
> system with start-all.sh.
>
> I then set up a directory structure and ran the first job. The job runs
> 100% of the map tasks, completes ~87% of the reduce tasks, and then hangs.
> There are no user-level error messages. All systems go idle.
>
> I started looking at the Hadoop logs; the first strange message is from the
> namenode log:
>
> 2007-10-30 13:48:01,991 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
> 2007-10-30 13:48:01,992 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310, call create(/import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001, DFSClient_task_200710301345_0001_r_000001_0, true, 3, 67108864) from 10.2.11.4:34016: error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
> org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for DFSClient_task_200710301345_0001_r_000001_0 on client 10.2.11.4 because current leaseholder is trying to recreate file.
> at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:788)
> at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:725)
> at org.apache.hadoop.dfs.NameNode.create(NameNode.java:307)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)
>
> The second strange error message comes from the jobtracker log:
> 2007-10-30 13:59:26,190 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
>
> I'm curious about how to proceed. I suspect that my code is OK, as I've run
> it numerous times in both single-node and multi-node grid environments. I've
> never seen these error messages before.
>
> Any help much appreciated....
>
> Thanks,
> C G
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam? Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com