hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stu Hood" <stuh...@webmail.us>
Subject RE: extremely slow reduce jobs
Date Wed, 08 Aug 2007 17:11:26 GMT


Actually, it turns out that the reason one of my datanodes was "slow" was because it had not
responded at all. The datanode had failed to start because of a namespace mismatch (since
I've been running on hadoop-trunk). After reformatting the datanode the slow host messages
have disappeared.

I still have another issue that I need to resolve, but I'll send a separate subject to the
list about it.

Stu


-----Original Message-----
From: Joydeep Sen Sarma <jssarma@facebook.com>
Sent: Tue, August 7, 2007 8:41 pm
To: hadoop-user@lucene.apache.org, stuhood@webmail.us
Subject: RE: extremely slow reduce jobs

Thanks! This makes a lot of sense - I do see a lot of 'slow' hosts in
the logs.

Is it normal for hosts to be classified as 'slow'? (Given that there's
ample bandwidth and no errors - I don't know why a peer be classified as
such in the first place).

-----Original Message-----
From: Stu Hood [mailto:stuhood@webmail.us] 
Sent: Tuesday, August 07, 2007 2:58 PM
To: hadoop-user@lucene.apache.org
Subject: RE: extremely slow reduce jobs



Hi,

I have noticed the same thing. My cluster is only 2 machines, but I've
noticed "Scheduled 0 of 1 known outputs (1 slow hosts and 0 dup hosts)"
over and over in userlog/task*_r_*/syslog on the namenode machine.

I just found this thread which explains the problem:
[http://mail-archives.apache.org/mod_mbox/lucene-hadoop-dev/200702.mbox/
%3C03af01c75b01$0c3e1a80$2301a8c0@ds.corp.yahoo.com%3E]
http://mail-archives.apache.org/mod_mbox/lucene-hadoop-dev/200702.mbox/%
3C03af01c75b01$0c3e1a80$2301a8c0@ds.corp.yahoo.com%3E

Thanks,
Stu


-----Original Message-----
From: Joydeep Sen Sarma 
Sent: Fri, August 3, 2007 3:17 pm
To: hadoop-user@lucene.apache.org
Subject: extremely slow reduce jobs

I have a fairly simple job with a map, a local combiner and a reduce.
The combiner and the reduce do the equivalent of a group_concat (mysql).
 
 
I have horrible performance in the reduce stage:
- the map jobs are done
- all the reduce jobs claim they are copying data - but the copy rate is
abysmal (0.5MBps)
  - checked the network topology - everything's on GigE and on same
switch. (80 machine cluster)
  - seeing 50+ MBps bandwidth between any pair using scp
- when I look at the machines where reduce is running - vmstat says 0%
cpu util.
 
A sample reducetask log is below. Job conf: 64 way reduce. I specified
the map tasks to the same number - but hadoop is anyway creating 386 map
tasks. 
 
Anyone has some quick hints on what could be going wrong?
 
Thanks,
 
Joydeep
 
2007-08-03 12:06:54,408 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Got 2 known map output location(s); scheduling...
2007-08-03 12:06:54,408 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Scheduled 0 of 2 known outputs (2 slow hosts and 0
dup hosts)
2007-08-03 12:06:59,409 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Need 1 map output(s)
2007-08-03 12:06:59,410 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Got 0 new map outputs from tasktracker and 0 map
outputs from previous failures
2007-08-03 12:06:59,410 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Got 2 known map output location(s); scheduling...
2007-08-03 12:06:59,410 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Scheduled 0 of 2 known outputs (2 slow hosts and 0
dup hosts)
2007-08-03 12:07:04,411 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Need 1 map output(s)
2007-08-03 12:07:04,412 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Got 0 new map outputs from tasktracker and 0 map
outputs from previous failures
2007-08-03 12:07:04,412 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Got 2 known map output location(s); scheduling...
2007-08-03 12:07:04,412 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Scheduled 0 of 2 known outputs (2 slow hosts and 0
dup hosts)
2007-08-03 12:07:09,413 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Need 1 map output(s)
2007-08-03 12:07:09,413 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Got 0 new map outputs from tasktracker and 0 map
outputs from previous failures
2007-08-03 12:07:09,413 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Got 2 known map output location(s); scheduling...
2007-08-03 12:07:09,413 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Scheduled 0 of 2 known outputs (2 slow hosts and 0
dup hosts)
2007-08-03 12:07:14,415 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Need 1 map output(s)
2007-08-03 12:07:14,415 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Got 0 new map outputs from tasktracker and 0 map
outputs from previous failures
2007-08-03 12:07:14,415 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Got 2 known map output location(s); scheduling...
2007-08-03 12:07:14,415 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Scheduled 0 of 2 known outputs (2 slow hosts and 0
dup hosts)
2007-08-03 12:07:19,417 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Need 1 map output(s)
2007-08-03 12:07:19,418 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Got 0 new map outputs from tasktracker and 0 map
outputs from previous failures
2007-08-03 12:07:19,418 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Got 2 known map output location(s); scheduling...
2007-08-03 12:07:19,418 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Scheduled 0 of 2 known outputs (2 slow hosts and 0
dup hosts)
2007-08-03 12:07:24,419 INFO org.apache.hadoop.mapred.ReduceTask:
task_0169_r_000010_0 Need 1 map output(s)

 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message