hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Map reduce classes
Date Thu, 17 Apr 2008 15:54:48 GMT

Don't assume that any variables are shared between reducers or between maps,
or between maps and reducers.

If you want to share data, put it into HDFS.
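
As a rough illustration of the advice above (this is not code from the thread), the sketch below writes a table of counts to a file at the end of one pass and has each later task rebuild its own private copy from that file. It is a minimal sketch under stated assumptions: a local temporary file stands in for a file in HDFS, and the class name SharedCountsSketch, the methods writeCounts, loadCounts and keepInfrequent, and the tab-separated format are all hypothetical. In a real job the file would be opened through Hadoop's FileSystem API rather than java.nio.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a local file stands in for a file stored in HDFS.
final class SharedCountsSketch {

    // End of pass 1: persist the counts as "key<TAB>count" lines.
    static void writeCounts(Map<String, Integer> counts, Path file) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        try {
            Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Start of each pass-2 task: rebuild a private hash table from the file.
    // Nothing is shared in memory between tasks; only the file is shared.
    static Map<String, Integer> loadCounts(Path file) {
        Map<String, Integer> counts = new HashMap<>();
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                int tab = line.lastIndexOf('\t');
                counts.put(line.substring(0, tab),
                           Integer.parseInt(line.substring(tab + 1)));
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return counts;
    }

    // Pass-2 filtering: keep only keys whose count is at or below the cutoff.
    static List<String> keepInfrequent(List<String> keys,
                                       Map<String, Integer> counts,
                                       int cutoff) {
        List<String> kept = new ArrayList<>();
        for (String key : keys) {
            if (counts.getOrDefault(key, 0) <= cutoff) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("counts", ".tsv");
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 1000);
        counts.put("hadoop", 3);
        writeCounts(counts, file);

        // A later task sees only the file, never the original map object.
        Map<String, Integer> loaded = loadCounts(file);
        System.out.println(
            keepInfrequent(Arrays.asList("the", "hadoop", "hdfs"), loaded, 10));
        Files.delete(file);
    }
}
```

Each task pays the cost of reloading the table, so this only suits tables that fit comfortably in one task's memory.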


On 4/17/08 4:01 AM, "Aayush Garg" <aayush.garg@gmail.com> wrote:

> One more thing:
> Will the HashMap that I am generating in the reduce phase be on a single node
> or on multiple nodes in the distributed environment? If my dataset is large,
> will this approach work? If not, what can I do about it?
> Also, does the same apply to the file that I am writing in the run function
> (simply opened with a FileStream)?
> 
> 
> 
> On Thu, Apr 17, 2008 at 6:04 AM, Amar Kamat <amarrk@yahoo-inc.com> wrote:
> 
>> Ted Dunning wrote:
>> 
>>> The easiest solution is to not worry too much about running an extra MR
>>> step.
>>> 
>>> So,
>>> 
>>> - run a first pass to get the counts.  Use word count as the pattern.
>>>   Store the results in a file.
>>> 
>>> - run the second pass.  You can now read the hash-table from the file
>>>   you stored in pass 1.
>>> 
>>> Another approach is to do the counting in your maps as specified and
>>> then before exiting, you can emit special records for each key to
>>> suppress.  With the correct sort and partition functions, you can make
>>> these killer records appear first in the reduce input.  Then, if your
>>> reducer sees the kill flag in the front of the values, it can avoid
>>> processing any extra data.
>>> 
>>> 
>>> 
>> Ted,
>> Will this work for the case where the cutoff frequency/count requires a
>> global picture? I guess not.
>> 
>>> In general, it is better to not try to communicate between map and reduce
>>> except via the expected mechanisms.
>>> 
>>> 
>>> On 4/16/08 1:33 PM, "Aayush Garg" <aayush.garg@gmail.com> wrote:
>>> 
>>> 
>>> 
>>>> We can not read HashMap in the configure method of the reducer because
>>>> it is called before reduce job.
>>>> I need to eliminate rows from the HashMap when all the keys are read.
>>>> Also my concern is if dataset is large will this HashMap thing work??
>>>> 
>>>> 
>>>> On Wed, Apr 16, 2008 at 10:07 PM, Ted Dunning <tdunning@veoh.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> That design is fine.
>>>>> 
>>>>> You should read your map in the configure method of the reducer.
>>>>> 
>>>>> There is a MapFile format supported by Hadoop, but they tend to be
>>>>> pretty slow.  I usually find it better to just load my hash table by
>>>>> hand.  If you do this, you should use whatever format you like.
>>>>> 
>>>>> 
>>>>> On 4/16/08 12:41 PM, "Aayush Garg" <aayush.garg@gmail.com> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> HI,
>>>>>> 
>>>>>> The current structure of my program is::
>>>>>> Upper class{
>>>>>> class Reduce{
>>>>>>  reduce function(K1,V1,K2,V2){
>>>>>>        // I count the frequency for each key
>>>>>>        // Add output in HashMap(Key,value) instead of output.collect()
>>>>>>   }
>>>>>>  }
>>>>>> 
>>>>>> void run()
>>>>>>  {
>>>>>>      runjob();
>>>>>>     // Now eliminate top frequency keys in HashMap built in reduce
>>>>>>     // function here because only now hashmap is complete.
>>>>>>     // Write this hashmap to a file in such a format so that I can
>>>>>>     // use this hashmap in next MapReduce job and key of this hashmap
>>>>>>     // is taken as key in mapper function of that Map Reduce. ??
>>>>>>     // How and which format should I choose??? Is this design and
>>>>>>     // approach ok?
>>>>>> 
>>>>>>  }
>>>>>> 
>>>>>>  public static void main() {}
>>>>>> }
>>>>>> I hope you have got my question.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> 
>>>>>> On Wed, Apr 16, 2008 at 8:33 AM, Amar Kamat <amarrk@yahoo-inc.com> wrote:
>>>>>> 
>>>>>>> Aayush Garg wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Are you sure that another MR is required for eliminating some rows?
>>>>>>>> Can't I just somehow eliminate from main() when I know the keys which
>>>>>>>> are needed to remove?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> Can you provide some more details on how exactly are you filtering?
>>>>>>> Amar
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 

