hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Multiple Mappers and One Reducer
Date Wed, 07 Sep 2011 11:19:54 GMT
Praveenesh,

The JIRA https://issues.apache.org/jira/browse/MAPREDUCE-369
introduced the Job-based MultipleInputs and carries a patch that I
think would apply without much trouble on your cluster's sources. You
can mail me directly if you need help applying the patch.

Alternatively, you can download 0.21, where it is found, pull out the
particular source files, and add them to your project's source tree
with their license and package names intact (which I think is a legal
requirement? others can correct me if I'm wrong). You can then use it
as a regular import.
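
For reference, here is a minimal plain-Java sketch of the pattern asked
about in the quoted thread below: two different mappers, one per input
file, feeding a single reducer that matches records on a shared key. It
deliberately uses no Hadoop classes, so it only illustrates the data flow
and is not the MultipleInputs API itself. The class name, method names,
and the "CV1"/"CV5" tags are all hypothetical, and it assumes each key
appears at most once per file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: simulates "two mappers, one reducer" in memory.
// Nothing here is a Hadoop API; all names are made up for the sketch.
public class TwoMapperJoinSketch {

    /** A mapper output value, tagged with the column it came from. */
    static final class Tagged {
        final String tag;
        final double value;
        Tagged(String tag, double value) { this.tag = tag; this.value = value; }
    }

    /** Shared join key: Date,counter1,counter2 (the first three columns). */
    static String key(String[] f) {
        return f[0].trim() + "," + f[1].trim() + "," + f[2].trim();
    }

    /** Stand-in for the shuffle: groups mapper output by key. */
    static void emit(Map<String, List<Tagged>> shuffle, String k, Tagged v) {
        shuffle.computeIfAbsent(k, x -> new ArrayList<>()).add(v);
    }

    /** Mapper for File1 rows: Date,counter1,counter2,CV1,CV2 -> (key, CV1). */
    static void mapFile1(String line, Map<String, List<Tagged>> shuffle) {
        String[] f = line.split(",");
        emit(shuffle, key(f), new Tagged("CV1", Double.parseDouble(f[3].trim())));
    }

    /** Mapper for File2 rows: Date,counter1,counter2,CV3,CV4,CV5 -> (key, CV5). */
    static void mapFile2(String line, Map<String, List<Tagged>> shuffle) {
        String[] f = line.split(",");
        emit(shuffle, key(f), new Tagged("CV5", Double.parseDouble(f[5].trim())));
    }

    /** Single reducer: CV6 = (CV1*CV5)/100, or null if one side is missing. */
    static String reduce(String k, List<Tagged> values) {
        Double cv1 = null;
        Double cv5 = null;
        for (Tagged t : values) {
            if (t.tag.equals("CV1")) cv1 = t.value; else cv5 = t.value;
        }
        if (cv1 == null || cv5 == null) return null; // key seen in only one file
        return k + "," + (cv1 * cv5) / 100;
    }

    /** Runs both mappers, then the reducer, and returns the output rows. */
    static List<String> run(List<String> file1, List<String> file2) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        for (String line : file1) mapFile1(line, shuffle);
        for (String line : file2) mapFile2(line, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle.entrySet()) {
            String row = reduce(e.getKey(), e.getValue());
            if (row != null) out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> file1 = List.of("2011-09-07,1,2,50,7");
        List<String> file2 = List.of("2011-09-07,1,2,3,4,20");
        run(file1, file2).forEach(System.out::println);
    }
}
```

In a real job, the per-path mapper wiring is what MultipleInputs does for
you, and the framework's shuffle replaces the in-memory map used here.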

HTH.

On Wed, Sep 7, 2011 at 3:34 PM, praveenesh kumar <praveenesh@gmail.com> wrote:
> Harsh, can you please tell us how we can use MultipleInputs with a Job object on
> hadoop 0.20.2? As you can see, MultipleInputs there uses a JobConf object.
> I want to use the Job object, as in the new hadoop 0.21 API.
> I remember you talked about pulling things out of the new API and adding them to
> our project.
> Can you please shed more light on how we can do this?
>
> Thanks ,
> Praveenesh.
>
> On Wed, Sep 7, 2011 at 2:57 AM, Harsh J <harsh@cloudera.com> wrote:
>>
>> Sahana,
>>
>> Yes this is possible as well. Please take a look at the MultipleInputs
>> API @
>> http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/MultipleInputs.html
>>
>> It will allow you to add a path each with its own mapper
>> implementation, and you can then have a common reducer since the key
>> is what you'll be matching against.
>>
>> On Wed, Sep 7, 2011 at 3:02 PM, Sahana Bhat <sana.bhat@gmail.com> wrote:
>> > Hi,
>> > I understand that given a file, the file is split across 'n' mapper
>> > instances, which is the normal case.
>> > The scenario I have is:
>> > 1. Two files which are not totally identical in terms of number of columns
>> > (but have data that is similar in a few columns) need to be processed, and
>> > after computation a single output file has to be generated.
>> > Note: CV = computed value
>> > File1, belonging to one dataset, has data for:
>> > Date,counter1,counter2,CV1,CV2
>> > File2, belonging to another dataset, has data for:
>> > Date,counter1,counter2,CV3,CV4,CV5
>> > Computation to be carried out on these two files is:
>> > CV6 = (CV1*CV5)/100
>> > And the final emitted output file should have data in the sequence:
>> > Date,counter1,counter2,CV6
>> > The idea is to have two mappers (not instances), one run on each of the
>> > files, and a single reducer that emits the final result file.
>> > Thanks,
>> > Sahana
>> > On Wed, Sep 7, 2011 at 2:40 PM, Harsh J <harsh@cloudera.com> wrote:
>> >>
>> >> Sahana,
>> >>
>> >> Yes. But, isn't that how it is normally? What makes you question this
>> >> capability?
>> >>
>> >> On Wed, Sep 7, 2011 at 2:37 PM, Sahana Bhat <sana.bhat@gmail.com> wrote:
>> >> > Hi,
>> >> > Is it possible to have multiple mappers where each mapper is
>> >> > operating on a different input file and whose result (which is a key
>> >> > value pair from different mappers) is processed by a single reducer?
>> >> > Regards,
>> >> > Sahana



-- 
Harsh J
