hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Map reduce classes
Date Wed, 16 Apr 2008 20:45:00 GMT

The easiest solution is not to worry too much about running an extra MR step.

So,

- run a first pass to get the counts.  Use word count as the pattern.  Store
the results in a file.

- run the second pass.  You can now read the hash-table from the file you
stored in pass 1.
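As a minimal sketch of what "read the hash-table from the file" could look like, assuming the pass-1 word count wrote tab-separated `word<TAB>count` lines (the class and method names here are hypothetical, and a real job would read from HDFS or the DistributedCache in the reducer's configure method rather than from a StringReader):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Sketch: load the pass-1 counts into an in-memory map. In a real job this
// parsing would happen in the reducer's configure() method, reading the
// counts file from HDFS or the DistributedCache.
public class CountsLoader {

    // Parse tab-separated "word<TAB>count" lines into a HashMap.
    static Map<String, Long> parseCounts(BufferedReader in) throws IOException {
        Map<String, Long> counts = new HashMap<String, Long>();
        String line;
        while ((line = in.readLine()) != null) {
            int tab = line.indexOf('\t');
            if (tab < 0) continue; // skip malformed lines
            counts.put(line.substring(0, tab),
                       Long.parseLong(line.substring(tab + 1)));
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the counts file produced by the word-count pass.
        String countsFile = "the\t1042\nhadoop\t17\nreduce\t9\n";
        Map<String, Long> counts =
            parseCounts(new BufferedReader(new StringReader(countsFile)));
        System.out.println(counts.get("the"));    // prints 1042
        System.out.println(counts.get("hadoop")); // prints 17
    }
}
```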

Another approach is to do the counting in your maps as specified and then,
before exiting, emit special records for each key you want to suppress.  With
the correct sort and partition functions, you can make these killer records
appear first in the reduce input.  Then, if your reducer sees the kill flag
at the front of the values, it can skip processing the rest of that key's data.
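The ordering trick above can be sketched in plain Java (this is not the Hadoop comparator/partitioner API, just the idea behind it): give each record a composite key of (word, flag), where flag 0 marks a kill record and flag 1 ordinary data, and sort on word first, then flag, so the kill marker always lands at the front of its key's group:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch of "killer record" ordering: records are (word, flag) pairs where
// flag 0 marks a key to suppress and flag 1 is ordinary data. Sorting by
// word, then flag, puts each kill marker ahead of that word's data records,
// so a reducer can check the first value and skip the rest of the group.
public class KillerRecordOrder {

    static class Rec {
        final String word;
        final int flag; // 0 = kill marker, 1 = data
        Rec(String word, int flag) { this.word = word; this.flag = flag; }
        public String toString() { return word + "/" + flag; }
    }

    // Sort so that kill markers come first within each word's group.
    static List<Rec> sortKillFirst(List<Rec> recs) {
        Collections.sort(recs, new Comparator<Rec>() {
            public int compare(Rec a, Rec b) {
                int c = a.word.compareTo(b.word);
                return c != 0 ? c : Integer.compare(a.flag, b.flag);
            }
        });
        return recs;
    }

    public static void main(String[] args) {
        List<Rec> recs = new ArrayList<Rec>();
        recs.add(new Rec("the", 1));
        recs.add(new Rec("hadoop", 1));
        recs.add(new Rec("the", 0)); // suppress "the"
        System.out.println(sortKillFirst(recs)); // prints [hadoop/1, the/0, the/1]
    }
}
```

In Hadoop terms, the partition function would hash only on the word so marker and data reach the same reducer, while the sort comparator orders on the full (word, flag) pair.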

In general, it is better not to try to communicate between map and reduce
except via the expected mechanisms.
  


On 4/16/08 1:33 PM, "Aayush Garg" <aayush.garg@gmail.com> wrote:

> We cannot read the HashMap in the configure method of the reducer because
> it is called before the reduce job.
> I need to eliminate rows from the HashMap once all the keys have been read.
> Also, my concern is: if the dataset is large, will this HashMap approach
> work?
> 
> 
> On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning <tdunning@veoh.com> wrote:
> 
>> 
>> That design is fine.
>> 
>> You should read your map in the configure method of the reducer.
>> 
>> There is a MapFile format supported by Hadoop, but MapFiles tend to be
>> pretty slow.  I usually find it better to just load my hash table by hand.
>> If you do this, you can use whatever format you like.
>> 
>> 
>> On 4/16/08 12:41 PM, "Aayush Garg" <aayush.garg@gmail.com> wrote:
>> 
>>> HI,
>>> 
>>> The current structure of my program is:
>>> 
>>> public class Upper {
>>>   static class Reduce {
>>>     reduce(K1, V1, K2, V2) {
>>>       // I count the frequency for each key
>>>       // and add the output to a HashMap(key, value) instead of
>>>       // calling output.collect()
>>>     }
>>>   }
>>> 
>>>   void run() {
>>>     runJob();
>>>     // Now eliminate the top-frequency keys from the HashMap built in the
>>>     // reduce function, because only now is the HashMap complete.
>>>     // Then write this HashMap to a file in a format the next MapReduce
>>>     // job can read, so that its keys become the keys in that job's
>>>     // mapper. How and which format should I choose? Is this design and
>>>     // approach ok?
>>>   }
>>> 
>>>   public static void main() {}
>>> }
>>> I hope you have got my question.
>>> 
>>> Thanks,
>>> 
>>> 
>>> On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat <amarrk@yahoo-inc.com> wrote:
>>> 
>>>> Aayush Garg wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Are you sure that another MR is required for eliminating some rows?
>>>>> Can't I just somehow eliminate them from main() when I know which keys
>>>>> need to be removed?
>>>>> 
>>>>> 
>>>>> 
>>>> Can you provide some more details on how exactly are you filtering?
>>>> Amar
>>>> 
>>>> 
>>>> 
>> 
>> 
> 

