hadoop-common-user mailing list archives

From Alexandre Rochette <alexr...@yahoo-inc.com>
Subject Re: Map reduce classes
Date Wed, 16 Apr 2008 20:44:40 GMT
The HashMap solution won't scale very well, as the output data has to fit 
entirely in the heap space of a single machine.

You establish the threshold only after you're done with all the keys, 
right? That's the reason you cannot do something like:
if (frequency < threshold)
    output.collect(...);

in the reducer?
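
If the threshold were known before the job ran (say, passed in through the 
JobConf), that filter could indeed live in the reducer. A minimal sketch 
against the 0.16-era API; the property name "frequency.threshold" and the 
Text/IntWritable types are illustrative assumptions, not anything from your 
code:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class FilterReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private int threshold;

  public void configure(JobConf conf) {
    // Hypothetical property carrying a threshold that is known up front.
    threshold = conf.getInt("frequency.threshold", 100);
  }

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    // Sum the partial counts for this key.
    int frequency = 0;
    while (values.hasNext()) {
      frequency += values.next().get();
    }
    // Emit only the keys that stay under the threshold.
    if (frequency < threshold) {
      output.collect(key, new IntWritable(frequency));
    }
  }
}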

If that's the case, doing a second simple map-reduce pass on your data 
to eliminate frequent keys is probably the most scalable solution.
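
Concretely, the second pass can be a map-only job whose mapper loads the 
frequent-key list written by the first job and drops matching records. A 
rough sketch, same old API; the property name "frequent.keys.file", the 
one-key-per-line file format, and the Text/IntWritable record types are all 
assumptions for illustration:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class EliminateFrequentMap extends MapReduceBase
    implements Mapper<Text, IntWritable, Text, IntWritable> {

  private final Set<String> frequent = new HashSet<String>();

  public void configure(JobConf conf) {
    // Hypothetical property naming the key list the first job wrote out.
    String listFile = conf.get("frequent.keys.file");
    try {
      FileSystem fs = FileSystem.get(conf);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path(listFile))));
      String line;
      while ((line = in.readLine()) != null) {
        frequent.add(line.trim());
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Could not load frequent-key list", e);
    }
  }

  public void map(Text key, IntWritable value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // Pass every record through except those whose key is frequent.
    if (!frequent.contains(key.toString())) {
      output.collect(key, value);
    }
  }
}

Loading the set in configure() is the same trick Ted describes below for 
reading a hand-built hash table before the map or reduce phase starts.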

alex.r.

Aayush Garg wrote:
> We cannot read the HashMap in the configure method of the reducer, because
> configure is called before the reduce phase runs.
> I need to eliminate rows from the HashMap once all the keys have been read.
> Also, my concern is: if the dataset is large, will this HashMap approach work?
>
>
> On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning <tdunning@veoh.com> wrote:
>
>> That design is fine.
>>
>> You should read your map in the configure method of the reducer.
>>
>> There is a MapFile format supported by Hadoop, but it tends to be pretty
>> slow.  I usually find it better to just load my hash table by hand.  If you
>> do this, you should use whatever format you like.
>>
>>
>> On 4/16/08 12:41 PM, "Aayush Garg" <aayush.garg@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> The current structure of my program is:
>>>
>>> public class Upper {
>>>   static class Reduce {
>>>     void reduce(K1 key, Iterator<V1> values, OutputCollector<K2, V2> output) {
>>>       // I count the frequency for each key.
>>>       // Add the output to a HashMap(key, value) instead of calling
>>>       // output.collect().
>>>     }
>>>   }
>>>
>>>   void run() {
>>>     runJob();
>>>     // Now eliminate the top-frequency keys in the HashMap built in the
>>>     // reduce function here, because only now is the HashMap complete.
>>>     // Write this HashMap to a file in such a format that I can use it
>>>     // in the next MapReduce job, with the HashMap's keys taken as the
>>>     // keys in that job's mapper. How and which format should I choose?
>>>     // Is this design and approach OK?
>>>   }
>>>
>>>   public static void main(String[] args) {}
>>> }
>>>
>>> I hope you have got my question.
>>>
>>> Thanks,
>>>
>>>
>>> On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat <amarrk@yahoo-inc.com> wrote:
>>>> Aayush Garg wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Are you sure that another MR job is required for eliminating some rows?
>>>>> Can't I just somehow eliminate them from main() once I know which keys
>>>>> need to be removed?
>>>>>
>>>> Can you provide some more details on how exactly you are filtering?
>>>> Amar

