hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tanveer Faruquie <ftanv...@gmail.com>
Subject Reduce in hadoop streaming takes longer
Date Sat, 26 Jun 2010 08:15:25 GMT
Hi,

I am trying to use Hadoop streaming to evaluate a set of data points
for a given rule. In the end I have to find on how many data points
the rule was true. Each data point is on a single line and the mapper
outputs either 1 or 0 if the rule for that data point is true or false
respectively. The reducer sums the output of mapper to give the final
count of the rule on all the examples.

While running this using hadoop streaming I notice that the reduce
phase takes considerably more time than the mapper. Moreover the
reduce times increases with data size. Since my job is equivalent to
counting the number of words in a file I timed a streaming version of
word count and the regular java version where each word is replaced by
the same word ('foo'). The mapper and reducers are written in python
for streaming. I ran this on 1.5 GB of data.

The streaming version took 29mins while java version took around
2min44sec on the same hadoop cluster (consisting of 3 machines). The
map phase in both the cases took approximately the same amount of time
whereas the reduce phase took much longer for streaming.

Does anyone know what might be the problem and if possible how to
reduce the time taken by reduce phase for streaming?

Regards,
Tanveer

PS: These are the output for the test run.

using java for word count

$ time $HADOOP_HOME/bin/hadoop jar ./wordcount.jar WordCount text text-out
10/06/25 21:58:15 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the
same.
10/06/25 21:58:15 INFO mapred.FileInputFormat: Total input paths to process : 86
10/06/25 21:58:15 INFO mapred.FileInputFormat: Total input paths to process : 86
10/06/25 21:58:16 INFO mapred.JobClient: Running job: job_201006252042_0005
10/06/25 21:58:17 INFO mapred.JobClient:  map 0% reduce 0%
10/06/25 21:58:30 INFO mapred.JobClient:  map 2% reduce 0%
10/06/25 21:58:35 INFO mapred.JobClient:  map 4% reduce 0%
10/06/25 21:58:38 INFO mapred.JobClient:  map 6% reduce 0%
10/06/25 21:58:40 INFO mapred.JobClient:  map 7% reduce 0%
10/06/25 21:58:42 INFO mapred.JobClient:  map 9% reduce 0%
10/06/25 21:58:44 INFO mapred.JobClient:  map 10% reduce 0%
10/06/25 21:58:45 INFO mapred.JobClient:  map 13% reduce 0%
10/06/25 21:58:46 INFO mapred.JobClient:  map 15% reduce 0%
10/06/25 21:58:48 INFO mapred.JobClient:  map 18% reduce 0%
10/06/25 21:58:49 INFO mapred.JobClient:  map 20% reduce 0%
10/06/25 21:58:50 INFO mapred.JobClient:  map 22% reduce 0%
10/06/25 21:58:51 INFO mapred.JobClient:  map 24% reduce 0%
10/06/25 21:58:52 INFO mapred.JobClient:  map 28% reduce 0%
10/06/25 21:58:54 INFO mapred.JobClient:  map 31% reduce 0%
10/06/25 21:58:55 INFO mapred.JobClient:  map 35% reduce 0%
10/06/25 21:58:56 INFO mapred.JobClient:  map 36% reduce 0%
10/06/25 21:58:57 INFO mapred.JobClient:  map 38% reduce 0%
10/06/25 21:58:58 INFO mapred.JobClient:  map 44% reduce 0%
10/06/25 21:58:59 INFO mapred.JobClient:  map 46% reduce 0%
10/06/25 21:59:00 INFO mapred.JobClient:  map 51% reduce 0%
10/06/25 21:59:01 INFO mapred.JobClient:  map 55% reduce 0%
10/06/25 21:59:02 INFO mapred.JobClient:  map 58% reduce 0%
10/06/25 21:59:03 INFO mapred.JobClient:  map 63% reduce 0%
10/06/25 21:59:04 INFO mapred.JobClient:  map 68% reduce 0%
10/06/25 21:59:05 INFO mapred.JobClient:  map 74% reduce 0%
10/06/25 21:59:06 INFO mapred.JobClient:  map 76% reduce 0%
10/06/25 21:59:07 INFO mapred.JobClient:  map 82% reduce 0%
10/06/25 21:59:08 INFO mapred.JobClient:  map 90% reduce 0%
10/06/25 21:59:14 INFO mapred.JobClient:  map 91% reduce 0%
10/06/25 21:59:19 INFO mapred.JobClient:  map 92% reduce 0%
10/06/25 21:59:21 INFO mapred.JobClient:  map 93% reduce 0%
10/06/25 21:59:23 INFO mapred.JobClient:  map 95% reduce 0%
10/06/25 21:59:28 INFO mapred.JobClient:  map 95% reduce 4%
10/06/25 21:59:33 INFO mapred.JobClient:  map 95% reduce 6%
10/06/25 21:59:34 INFO mapred.JobClient:  map 96% reduce 6%
10/06/25 21:59:38 INFO mapred.JobClient:  map 96% reduce 8%
10/06/25 21:59:39 INFO mapred.JobClient:  map 97% reduce 8%
10/06/25 21:59:42 INFO mapred.JobClient:  map 98% reduce 8%
10/06/25 21:59:43 INFO mapred.JobClient:  map 98% reduce 9%
10/06/25 21:59:46 INFO mapred.JobClient:  map 99% reduce 9%
10/06/25 21:59:48 INFO mapred.JobClient:  map 99% reduce 11%
10/06/25 21:59:50 INFO mapred.JobClient:  map 99% reduce 12%
10/06/25 21:59:55 INFO mapred.JobClient:  map 99% reduce 13%
10/06/25 21:59:58 INFO mapred.JobClient:  map 99% reduce 15%
10/06/25 22:00:03 INFO mapred.JobClient:  map 99% reduce 16%
10/06/25 22:00:04 INFO mapred.JobClient:  map 100% reduce 16%
10/06/25 22:00:05 INFO mapred.JobClient:  map 100% reduce 17%
10/06/25 22:00:10 INFO mapred.JobClient:  map 100% reduce 18%
10/06/25 22:00:13 INFO mapred.JobClient:  map 100% reduce 19%
10/06/25 22:00:15 INFO mapred.JobClient:  map 100% reduce 20%
10/06/25 22:00:18 INFO mapred.JobClient:  map 100% reduce 21%
10/06/25 22:00:20 INFO mapred.JobClient:  map 100% reduce 22%
10/06/25 22:00:25 INFO mapred.JobClient:  map 100% reduce 23%
10/06/25 22:00:28 INFO mapred.JobClient:  map 100% reduce 24%
10/06/25 22:00:30 INFO mapred.JobClient:  map 100% reduce 25%
10/06/25 22:00:33 INFO mapred.JobClient:  map 100% reduce 26%
10/06/25 22:00:40 INFO mapred.JobClient:  map 100% reduce 28%
10/06/25 22:00:45 INFO mapred.JobClient:  map 100% reduce 29%
10/06/25 22:00:48 INFO mapred.JobClient:  map 100% reduce 30%
10/06/25 22:00:50 INFO mapred.JobClient:  map 100% reduce 31%
10/06/25 22:00:55 INFO mapred.JobClient:  map 100% reduce 32%
10/06/25 22:00:57 INFO mapred.JobClient: Job complete: job_201006252042_0005
10/06/25 22:00:57 INFO mapred.JobClient: Counters: 17
10/06/25 22:00:57 INFO mapred.JobClient:   File Systems
10/06/25 22:00:57 INFO mapred.JobClient:     HDFS bytes read=1563645386
10/06/25 22:00:58 INFO mapred.JobClient:     HDFS bytes written=14
10/06/25 22:00:58 INFO mapred.JobClient:     Local bytes read=62374
10/06/25 22:00:58 INFO mapred.JobClient:     Local bytes written=68894
10/06/25 22:00:58 INFO mapred.JobClient:   Job Counters
10/06/25 22:00:58 INFO mapred.JobClient:     Launched reduce tasks=1
10/06/25 22:00:58 INFO mapred.JobClient:     Rack-local map tasks=62
10/06/25 22:00:58 INFO mapred.JobClient:     Launched map tasks=116
10/06/25 22:00:58 INFO mapred.JobClient:     Data-local map tasks=46
10/06/25 22:00:58 INFO mapred.JobClient:   Map-Reduce Framework
10/06/25 22:00:58 INFO mapred.JobClient:     Reduce input groups=1
10/06/25 22:00:58 INFO mapred.JobClient:     Combine output records=1001
10/06/25 22:00:58 INFO mapred.JobClient:     Map input records=33325068
10/06/25 22:00:58 INFO mapred.JobClient:     Reduce output records=1
10/06/25 22:00:58 INFO mapred.JobClient:     Map output bytes=1892793728
10/06/25 22:00:58 INFO mapred.JobClient:     Map input bytes=1563555252
10/06/25 22:00:58 INFO mapred.JobClient:     Combine input records=236600216
10/06/25 22:00:58 INFO mapred.JobClient:     Map output records=236599216
10/06/25 22:00:58 INFO mapred.JobClient:     Reduce input records=1

real    2m44.426s
user    0m0.784s
sys     0m0.228s


-----------------

using streaming for word count


$ time /hadoop-0.18.3/bin/hadoop jar
/hadoop-0.18.3/contrib/streaming/hadoop-0.18.3-streaming.jar -file
./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer
./reducer.py -input text -output text-out
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [./mapper.py, ./reducer.py,
/home/user/hroot/tmp/hadoop-unjar539717104050104853/] []
/tmp/streamjob5333570157243247862.jar tmpDir=null
10/06/25 22:02:22 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the
same.
10/06/25 22:02:22 INFO mapred.FileInputFormat: Total input paths to process : 86
10/06/25 22:02:22 INFO mapred.FileInputFormat: Total input paths to process : 86
10/06/25 22:02:22 INFO streaming.StreamJob: getLocalDirs():
[/home/user/hroot/tmp/mapred/local]
10/06/25 22:02:22 INFO streaming.StreamJob: Running job: job_201006252042_0006
10/06/25 22:02:22 INFO streaming.StreamJob: To kill this job, run:
10/06/25 22:02:22 INFO streaming.StreamJob:
/home/user/hadoop-0.18.3/bin/../bin/hadoop job
-Dmapred.job.tracker=hadoop1:19001 -kill job_201006252042_0006
10/06/25 22:02:22 INFO streaming.StreamJob: Tracking URL:
http://hadoop1:50030/jobdetails.jsp?jobid=job_201006252042_0006
10/06/25 22:02:24 INFO streaming.StreamJob:  map 0%  reduce 0%
10/06/25 22:02:34 INFO streaming.StreamJob:  map 1%  reduce 0%
10/06/25 22:02:39 INFO streaming.StreamJob:  map 2%  reduce 0%
10/06/25 22:02:44 INFO streaming.StreamJob:  map 3%  reduce 0%
10/06/25 22:02:49 INFO streaming.StreamJob:  map 5%  reduce 0%
10/06/25 22:02:50 INFO streaming.StreamJob:  map 6%  reduce 0%
10/06/25 22:02:55 INFO streaming.StreamJob:  map 7%  reduce 0%
10/06/25 22:02:56 INFO streaming.StreamJob:  map 10%  reduce 0%
10/06/25 22:02:57 INFO streaming.StreamJob:  map 12%  reduce 0%
10/06/25 22:02:59 INFO streaming.StreamJob:  map 13%  reduce 0%
10/06/25 22:03:01 INFO streaming.StreamJob:  map 14%  reduce 0%
10/06/25 22:03:02 INFO streaming.StreamJob:  map 15%  reduce 0%
10/06/25 22:03:03 INFO streaming.StreamJob:  map 21%  reduce 0%
10/06/25 22:03:04 INFO streaming.StreamJob:  map 23%  reduce 0%
10/06/25 22:03:07 INFO streaming.StreamJob:  map 26%  reduce 0%
10/06/25 22:03:09 INFO streaming.StreamJob:  map 27%  reduce 0%
10/06/25 22:03:10 INFO streaming.StreamJob:  map 28%  reduce 0%
10/06/25 22:03:11 INFO streaming.StreamJob:  map 30%  reduce 0%
10/06/25 22:03:12 INFO streaming.StreamJob:  map 33%  reduce 0%
10/06/25 22:03:15 INFO streaming.StreamJob:  map 36%  reduce 0%
10/06/25 22:03:16 INFO streaming.StreamJob:  map 40%  reduce 0%
10/06/25 22:03:17 INFO streaming.StreamJob:  map 43%  reduce 0%
10/06/25 22:03:20 INFO streaming.StreamJob:  map 48%  reduce 0%
10/06/25 22:03:21 INFO streaming.StreamJob:  map 50%  reduce 0%
10/06/25 22:03:22 INFO streaming.StreamJob:  map 52%  reduce 0%
10/06/25 22:03:23 INFO streaming.StreamJob:  map 54%  reduce 0%
10/06/25 22:03:26 INFO streaming.StreamJob:  map 56%  reduce 0%
10/06/25 22:03:27 INFO streaming.StreamJob:  map 58%  reduce 0%
10/06/25 22:03:28 INFO streaming.StreamJob:  map 62%  reduce 0%
10/06/25 22:03:30 INFO streaming.StreamJob:  map 64%  reduce 0%
10/06/25 22:03:31 INFO streaming.StreamJob:  map 67%  reduce 0%
10/06/25 22:03:32 INFO streaming.StreamJob:  map 68%  reduce 0%
10/06/25 22:03:35 INFO streaming.StreamJob:  map 70%  reduce 0%
10/06/25 22:03:36 INFO streaming.StreamJob:  map 76%  reduce 0%
10/06/25 22:03:37 INFO streaming.StreamJob:  map 79%  reduce 3%
10/06/25 22:03:40 INFO streaming.StreamJob:  map 82%  reduce 3%
10/06/25 22:03:41 INFO streaming.StreamJob:  map 84%  reduce 3%
10/06/25 22:03:42 INFO streaming.StreamJob:  map 89%  reduce 5%
10/06/25 22:03:49 INFO streaming.StreamJob:  map 92%  reduce 8%
10/06/25 22:03:53 INFO streaming.StreamJob:  map 93%  reduce 9%
10/06/25 22:03:56 INFO streaming.StreamJob:  map 94%  reduce 9%
10/06/25 22:03:59 INFO streaming.StreamJob:  map 95%  reduce 10%
10/06/25 22:04:01 INFO streaming.StreamJob:  map 96%  reduce 10%
10/06/25 22:04:05 INFO streaming.StreamJob:  map 97%  reduce 13%
10/06/25 22:04:10 INFO streaming.StreamJob:  map 98%  reduce 15%
10/06/25 22:04:15 INFO streaming.StreamJob:  map 99%  reduce 16%
10/06/25 22:04:20 INFO streaming.StreamJob:  map 99%  reduce 18%
10/06/25 22:04:21 INFO streaming.StreamJob:  map 100%  reduce 18%
10/06/25 22:04:25 INFO streaming.StreamJob:  map 100%  reduce 19%
10/06/25 22:04:30 INFO streaming.StreamJob:  map 100%  reduce 20%
10/06/25 22:04:35 INFO streaming.StreamJob:  map 100%  reduce 22%
10/06/25 22:04:45 INFO streaming.StreamJob:  map 100%  reduce 23%
10/06/25 22:04:50 INFO streaming.StreamJob:  map 100%  reduce 25%
10/06/25 22:04:55 INFO streaming.StreamJob:  map 100%  reduce 26%
10/06/25 22:05:00 INFO streaming.StreamJob:  map 100%  reduce 28%
10/06/25 22:05:05 INFO streaming.StreamJob:  map 100%  reduce 29%
10/06/25 22:05:10 INFO streaming.StreamJob:  map 100%  reduce 30%
10/06/25 22:05:15 INFO streaming.StreamJob:  map 100%  reduce 31%
10/06/25 22:05:20 INFO streaming.StreamJob:  map 100%  reduce 32%
10/06/25 22:05:30 INFO streaming.StreamJob:  map 100%  reduce 33%
10/06/25 22:07:41 INFO streaming.StreamJob:  map 100%  reduce 67%
10/06/25 22:31:52 INFO streaming.StreamJob:  map 100%  reduce 100%
10/06/25 22:31:52 INFO streaming.StreamJob: Job complete: job_201006252042_0006
10/06/25 22:31:52 INFO streaming.StreamJob: Output: text-out

real    29m31.166s
user    0m1.060s
sys     0m0.216s

Mime
View raw message