hadoop-mapreduce-user mailing list archives

From Chen He <airb...@gmail.com>
Subject Re: more reduce tasks
Date Fri, 04 Jan 2013 05:32:05 GMT
Hi Bejoy,

Thank you for your idea.

The Hadoop patch I mentioned would make this merge happen during the
output-writing process itself.

Regards!

Chen
On Jan 3, 2013 11:25 PM, <bejoy.hadoop@gmail.com> wrote:

> Hi Chen,
>
> You do have an option in Hadoop to achieve this if you want the merged
> file in the local file system (LFS).
>
> 1) Run your job with n reducers, and you'll have n files in the output
> dir.
>
> 2) Issue a hadoop fs -getmerge command to merge the files in the output
> dir into a single file in the LFS.
> (In recent versions use 'hdfs dfs -getmerge'.)
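>
> For example, a quick sketch (the output dir name is taken from the job
> example later in this thread; the local destination path is my own):
>
>   # merge all part-* files under the HDFS dir 1gb.wc into one local file
>   hadoop fs -getmerge 1gb.wc /tmp/1gb.wc.merged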
>
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> From: Chen He <airbots@gmail.com>
> Date: Thu, 3 Jan 2013 22:55:36 -0600
> To: <user@hadoop.apache.org>
> Reply-To: user@hadoop.apache.org
> Subject: Re: more reduce tasks
>
> Sounds like you want more reducers to reduce the execution time, but only
> want a single output file.
>
> Is this what you want?
>
> You can use as many reducers as you want (though it may not be optimal)
> when you run your job. Once the program is done, write a small Perl,
> Python, or shell script to concatenate those part-* files, as sketched
> below.
>
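> A minimal sketch of such a shell script (the output dir matches the job
> example later in this thread; the merged filename is illustrative):
>
>   #!/bin/bash
>   # stream every reducer output out of HDFS and concatenate it
>   # into a single file on the local file system
>   hadoop dfs -cat 1gb.wc/part-* > /tmp/1gb.wc.merged
>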
> If you do not want to write your own script to concatenate those files
> and would rather have Hadoop automatically generate a single file, that
> may require patches to current Hadoop. I am not sure whether they are
> ready or not.
>
> On Thu, Jan 3, 2013 at 10:45 PM, Vinod Kumar Vavilapalli <
> vinodkv@hortonworks.com> wrote:
>
>>
>> Is it that you want the parallelism but a single final output? Assuming
>> your first job's reducers generate a small output, another stage is the way
>> to go. If not, a second stage won't help. What exactly are your objectives?
>>
>>  Thanks,
>> +Vinod
>>
>> On Jan 3, 2013, at 1:11 PM, Pavel Hančar wrote:
>>
>>   Hello,
>> I'd like to use more than one reduce task with Hadoop Streaming and I'd
>> like to have only one result. Is it possible? Or should I run one more job
>> to merge the results? And is it the same with non-streaming jobs? Below you
>> can see that I get 5 result files with mapred.reduce.tasks=5.
>>
>> $ hadoop jar
>> /packages/run.64/hadoop-0.20.2-cdh3u1/contrib/streaming/hadoop-streaming-0.20.2-cdh3u1.jar
>> -D mapred.reduce.tasks=5 -mapper /bin/cat -reducer /tmp/wcc -file /tmp/wcc
>> -file /bin/cat -input /user/hadoopnlp/1gb -output 1gb.wc
>> .
>> .
>> .
>> 13/01/03 22:00:03 INFO streaming.StreamJob:  map 100%  reduce 100%
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Job complete:
>> job_201301021717_0038
>> 13/01/03 22:00:07 INFO streaming.StreamJob: Output: 1gb.wc
>> $ hadoop dfs -cat 1gb.wc/part-*
>> 472173052
>> 165736187
>> 201719914
>> 184376668
>> 163872819
>> $
>>
>> where /tmp/wcc contains
>> #!/bin/bash
>> wc -c
>>
>> Thanks for any answer,
>>  Pavel Hančar
