hadoop-common-user mailing list archives

From "Stu Hood" <stuh...@webmail.us>
Subject Re: Job hangs, strange error messages...help sought...
Date Tue, 30 Oct 2007 20:35:37 GMT
The files in the _task* subfolders are expected. Any task executed by Hadoop gets its own working
directory and writes its output there. When one attempt of the task completes successfully,
Hadoop copies that attempt's output from its _task* directory to the final destination.

That way, multiple attempts of the same task don't have colliding output files.
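
If the job has already finished and stale _task* directories are still sitting in the output
directory (as in your listing below), it should be safe to delete them by hand, assuming no
other job is still writing there. Something like this should work, using one of the paths from
your listing (and assuming your version's dfs shell has -rmr, which removes the directory
recursively, so double-check the path first):

bin/hadoop dfs -rmr /import/raw_logs/20070929/_task_200710301552_0002_r_000001_0

The part-* files sitting next to them should be the promoted reducer output.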


-----Original Message-----
From: C G <parallelguy@yahoo.com>
Sent: Tuesday, October 30, 2007 4:23pm
To: hadoop-user@lucene.apache.org
Subject: Re: Job hangs, strange error messages...help sought...

Still more weirdness.  When I run my code on a single-node Hadoop installation it runs fine,
albeit with terrible performance.  When I move to 2 nodes, I see the various error messages I've
previously described.  The system is also writing task directories to my output when I am only
expecting to see reducer output, and logging those writes in the jobtracker log as follows:
2007-10-30 16:14:33,264 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200710301552_0002_r_000007_0'
to tip tip_200710301552_0002_r_000007, for tracker 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:33,628 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat
from 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:41,359 INFO org.apache.hadoop.mapred.JobTracker: Ignoring 'duplicate' heartbeat
from 'tracker_localhost.localdomain:50050'
2007-10-30 16:14:41,722 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200710301552_0002_r_000007_0'
has completed tip_200710301552_0002_r_000007 successfully.
2007-10-30 16:14:41,729 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200710301552_0002_r_000007_0'
to hdfs://
  So instead of reducer output, there is a combination of reducer output and task directories:
  /import/raw_logs/20070929/_task_200710301552_0002_r_000001_0 <dir>           2007-10-30
/import/raw_logs/20070929/_task_200710301552_0002_r_000002_0 <dir>           2007-10-30
/import/raw_logs/20070929/_task_200710301552_0002_r_000003_0 <dir>           2007-10-30
/import/raw_logs/20070929/_task_200710301552_0002_r_000004_0 <dir>           2007-10-30
/import/raw_logs/20070929/_task_200710301552_0002_r_000006_0 <dir>           2007-10-30
/import/raw_logs/20070929/part-00000 <r 3>   2323852 2007-10-30 16:13
/import/raw_logs/20070929/part-00001 <r 3>   2271699 2007-10-30 16:13
/import/raw_logs/20070929/part-00003 <r 3>   2358158 2007-10-30 16:14
/import/raw_logs/20070929/part-00004 <r 3>   2341299 2007-10-30 16:14
/import/raw_logs/20070929/part-00005 <r 3>   2265270 2007-10-30 16:14
/import/raw_logs/20070929/part-00006 <r 3>   2305368 2007-10-30 16:14
/import/raw_logs/20070929/part-00007 <r 3>   2292513 2007-10-30 16:14

  Those _task directories contain output that appears, at least by file size, to be identical
to the reducer output:
  bin/hadoop dfs -ls /import/raw_logs/20070929/_task_200710301552_0002_r_000001_0
Found 1 items
/import/raw_logs/20070929/_task_200710301552_0002_r_000001_0/part-00001      <r 3>  2271699 2007-10-30 16:15
  Has anybody seen this behavior before? This is happening for every map/reduce program I
run - we have several. All the code is basic stuff - nothing exotic going on at all.
  Any advice much appreciated...
  C G
C G <parallelguy@yahoo.com> wrote:
  Each node runs 1 datanode instance. I did notice this on "node4" trying to send to "node2":

2007-10-30 14:35:16,876 WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_7852052048957161201
to got java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.dfs.DataNode.sendBlock(DataNode.java:1231)
at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:1280)
at java.lang.Thread.run(Thread.java:619)

and then on node2:
2007-10-30 14:35:16,918 INFO org.apache.hadoop.dfs.DataNode: Received block blk_7852052048957161201
from /
2007-10-30 14:35:16,919 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException:
Block blk_7852052048957161201 is valid, and cannot
be written to.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:515)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:822)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:727)
at java.lang.Thread.run(Thread.java:619)

Ted Dunning wrote:

Is it possible you have some nodes running multiple instances of datanodes?

On 10/30/07 11:06 AM, "C G" wrote:

> Hi All:
> Environment: 4 node grid running hadoop-0.14.1.
> With the system shut down, I wiped out the old HDFS directory structure and
> created an empty directory. I did a namenode format, and then brought up the
> system with start-all.sh.
> I then set up a directory structure, and ran the first job. The job runs
> 100% of the map jobs, completes ~ 87% of the reduce jobs, and then hangs.
> There are no user-level error messages. All systems go to idle.
> I started looking at the Hadoop logs; the first strange message is from the
> namenode log:
> 2007-10-30 13:48:01,991 WARN org.apache.hadoop.dfs.StateChange: DIR*
> NameSystem.startFile: failed to create file
> /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for
> DFSClient_task_200710301345_0001_r_000001_0 on client because
> current leaseholder is trying to recreate file.
> 2007-10-30 13:48:01,992 INFO org.apache.hadoop.ipc.Server: IPC Server handler
> 9 on 54310, call create(/import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001,
> DFSClient_task_200710301345_0001_r_000001_0, true, 3, 67108864) from error:
> org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file
> /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for
> DFSClient_task_200710301345_0001_r_000001_0 on client because current
> leaseholder is trying to recreate file.
> org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file
> /import/raw_logs/20070929/_task_200710301345_0001_r_000001_0/part-00001 for
> DFSClient_task_200710301345_0001_r_000001_0 on client because current
> leaseholder is trying to recreate file.
> at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:788)
> at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:725)
> at org.apache.hadoop.dfs.NameNode.create(NameNode.java:307)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)
> The second strange error message comes from the jobtracker log:
> 2007-10-30 13:59:26,190 INFO org.apache.hadoop.mapred.JobTracker: Ignoring
> 'duplicate' heartbeat from 'tracker_localhost.localdomain:50050'
> I'm curious about how to proceed. I suspect that my code is OK as I've run
> it numerous times in both single node and multiple grid environments. I've
> never seen these error messages before.
> Any help much appreciated....
> Thanks,
> C G 
