hadoop-common-user mailing list archives

From Vinod Kumar Vavilapalli <vino...@hortonworks.com>
Subject Re: more reduce tasks
Date Fri, 04 Jan 2013 04:45:56 GMT

Is it that you want the parallelism but a single final output? Assuming your first job's reducers
generate small outputs, a second single-reducer stage is the way to go. If not, a second stage won't help.
What exactly are your objectives?

Thanks,
+Vinod
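
Since each reducer's part file in the example below holds just a per-reducer byte count, the merge can even be done locally instead of with a second job. A minimal sketch (the temp directory and file names are illustrative; it simulates the five part files from the thread and sums them the way a second single-reducer stage would):

```shell
# Simulate the five reducer outputs from the thread (part-00000 .. part-00004),
# then merge them: sum the per-part byte counts into one total.
tmp=$(mktemp -d)
i=0
for n in 472173052 165736187 201719914 184376668 163872819; do
  printf '%s\n' "$n" > "$tmp/part-0000$i"
  i=$((i + 1))
done
# One combined result from five partial counts.
cat "$tmp"/part-* | awk '{s += $1} END {print s}'   # prints 1187878640
rm -r "$tmp"
```

On the cluster the equivalent would be `hadoop dfs -cat 1gb.wc/part-* | awk '{s += $1} END {print s}'`, or simply re-running with `-D mapred.reduce.tasks=1` when a single reducer is acceptable.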

On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:

>   Hello,
> I'd like to use more than one reduce task with Hadoop Streaming, but I'd like to have
only one result. Is that possible? Or should I run one more job to merge the results? And is
it the same with non-streaming jobs? Below you can see I get 5 results for mapred.reduce.tasks=5.
> 
> $ hadoop jar /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
> .
> .
> .
> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete: job_201301021717_0038
> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
> $ hadoop dfs -cat 1gb.wc/part-*
> 472173052
> 165736187
> 201719914
> 184376668
> 163872819
> $
> 
> where /tmp/wcc contains
> #!/bin/bash
> wc -c
> 
> Thanks for any answer,
>  Pavel Hančar

