hadoop-common-user mailing list archives

From Amareshwari Sriramadasu <amar...@yahoo-inc.com>
Subject Re: Hadoop streaming performance: elements vs. vectors
Date Mon, 06 Apr 2009 04:17:52 GMT
You can add your jar to the distributed cache and add it to the classpath by
passing it in the configuration property "mapred.job.classpath.archives".

-Amareshwari
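
A rough sketch of what that suggestion might look like on a streaming command line. It only builds the argument list and does not run anything. The namenode URI, jar paths, script names and combiner class name are all hypothetical placeholders, and the pairing of -cacheArchive with the property is an assumption that has not been verified against 0.18.3:

```python
# Sketch only: assembles (does not execute) a Hadoop streaming command.
# Every path, host and class name passed in is a hypothetical placeholder.

def build_streaming_command(streaming_jar, namenode, jar_hdfs_path,
                            combiner_class, mapper, reducer,
                            input_path, output_path):
    """Return an argv list for a streaming job that ships a combiner jar."""
    return [
        "hadoop", "jar", streaming_jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
        "-combiner", combiner_class,
        # Assumption: ship the jar to the tasks through the distributed cache.
        "-cacheArchive", namenode + jar_hdfs_path + "#combiner",
        # The property named in the reply above, pointing at the same jar.
        "-jobconf", "mapred.job.classpath.archives=" + jar_hdfs_path,
    ]


if __name__ == "__main__":
    argv = build_streaming_command(
        "hadoop-streaming.jar", "hdfs://namenode:9000",
        "/user/pete/combiner.jar", "org.example.RowVectorCombiner",
        "mapper.py", "reducer.py", "/data/in", "/data/out")
    print(" ".join(argv))
```

Keeping the jar path in one variable is deliberate: the cache entry and the classpath entry presumably need to refer to the same file for the match to work.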
Peter Skomoroch wrote:
> If I need to use a custom streaming combiner jar in Hadoop 0.18.3, is there a
> way to add it to the classpath without the following patch?
>
> https://issues.apache.org/jira/browse/HADOOP-3570
>
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200809.mbox/%3C48CF78E3.10807@yahoo-inc.com%3E
>
> On Sat, Mar 28, 2009 at 2:28 PM, Peter Skomoroch
> <peter.skomoroch@gmail.com> wrote:
>
>   
>> Paco,
>>
>> Thanks, good ideas on the combiner.  I'm going to tweak things a bit as you
>> suggest and report back later...
>>
>> -Pete
>>
>>
>> On Sat, Mar 28, 2009 at 11:43 AM, Paco NATHAN <ceteri@gmail.com> wrote:
>>
>>     
>>> hi peter,
>>> thinking aloud on this -
>>>
>>> trade-offs may depend on:
>>>
>>>   * how much grouping would be possible (tracking a PDF would be
>>> interesting for metrics)
>>>   * locality of key/value pairs (distributed among mapper and reducer
>>> tasks)
>>>
>>> to that point, will there be much time spent in the shuffle?  if so,
>>> it's probably cheaper to shuffle/sort the grouped row vectors than the
>>> many small key,value pairs
>>>
>>> in any case, when i had a similar situation on a large data set (2-3
>>> TB shuffle) a good pattern to follow was:
>>>
>>>   * mapper emitted small key,value pairs
>>>   * combiner grouped into row vectors
>>>
>>> that combiner may get invoked both at the end of the map phase and at
>>> the beginning of the reduce phase (more benefit)
>>>
>>> also, using byte arrays if possible to represent values may be able to
>>> save much shuffle time
>>>
>>> best,
>>> paco
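
The grouping step in the pattern above could be sketched roughly as follows. This is Python purely to illustrate the logic. The thread suggests the actual combiner would be a Java class shipped in a jar, so this would need porting or could be applied inside the mapper instead. The tab-separated "i, j, value" input layout and the "j:value" output layout are assumptions, not formats taken from the thread:

```python
from itertools import groupby


def parse_element(line):
    """Parse one 'i<TAB>j<TAB>value' line (assumed layout) into (i, j, value)."""
    i, j, value = line.rstrip("\n").split("\t")
    return i, int(j), float(value)


def group_into_rows(lines):
    """Group key-sorted element lines into one row-vector line per key.

    Elements sharing the same (i, j) are summed, so the output also acts
    as a partial sum of the matrix. Input must already be sorted by i.
    """
    elements = (parse_element(line) for line in lines if line.strip())
    for i, group in groupby(elements, key=lambda e: e[0]):
        row = {}
        for _, j, value in group:
            row[j] = row.get(j, 0.0) + value
        cells = ",".join("%d:%r" % (j, row[j]) for j in sorted(row))
        yield "%s\t%s" % (i, cells)


if __name__ == "__main__":
    import sys
    for out in group_into_rows(sys.stdin):
        print(out)
```

Because it only sums and regroups, the same function can be run more than once over its own kind of data (after re-expanding rows), which fits the point that a combiner may be invoked on both the map and reduce sides.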
>>>
>>>
>>> On Sat, Mar 28, 2009 at 01:51, Peter Skomoroch
>>> <peter.skomoroch@gmail.com> wrote:
>>>       
>>>> Hadoop streaming question: If I am forming a matrix M by summing a number of
>>>> elements generated on different mappers, is it better to emit tons of lines
>>>> from the mappers with small key,value pairs for each element, or should I
>>>> group them into row vectors before sending to the reducers?
>>>>
>>>> For example, say I'm summing frequency count matrices M for each user on a
>>>> different map task, and the reducer combines the resulting sparse user count
>>>> matrices for use in another calculation.
>>>>
>>>> Should I emit the individual elements:
>>>>
>>>> i (j, Mij) \n
>>>> 3 (1, 3.4) \n
>>>> 3 (2, 3.4) \n
>>>> 3 (3, 3.4) \n
>>>> 4 (1, 2.3) \n
>>>> 4 (2, 5.2) \n
>>>>
>>>> Or posting list style vectors?
>>>>
>>>> 3 ((1, 3.4), (2, 3.4), (3, 3.4)) \n
>>>> 4 ((1, 2.3), (2, 5.2)) \n
>>>>
>>>> Using vectors will at least save some message space, but are there any other
>>>> benefits to this approach in terms of Hadoop streaming overhead (sorts
>>>> etc.)?  I think buffering issues will not be a huge concern since the length
>>>> of the vectors has a reasonable upper bound and will be in a sparse
>>>> format...
>>>>
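
One way to get a feel for the "message space" question is to serialize the same rows both ways and compare record and byte counts. The sketch below uses the two layouts from the post, with a tab between key and value as an assumption. The helper names are made up for illustration:

```python
# Sketch: compare the two record layouts from the question on the same data.
# A "matrix" here is a list of (row_key, [(column, value), ...]) tuples.

def element_records(matrix):
    """One record per nonzero element: 'i<TAB>(j, Mij)'."""
    return ["%s\t(%d, %s)" % (i, j, v) for i, row in matrix for j, v in row]


def vector_records(matrix):
    """One record per row: 'i<TAB>((j, Mij), (j, Mij), ...)'."""
    return ["%s\t(%s)" % (i, ", ".join("(%d, %s)" % (j, v) for j, v in row))
            for i, row in matrix]


def total_bytes(records):
    """Bytes on the wire, counting one newline per record."""
    return sum(len(r.encode("utf-8")) + 1 for r in records)


SAMPLE = [("3", [(1, 3.4), (2, 3.4), (3, 3.4)]),
          ("4", [(1, 2.3), (2, 5.2)])]

if __name__ == "__main__":
    for name, make in (("elements", element_records),
                       ("vectors", vector_records)):
        recs = make(SAMPLE)
        print(name, len(recs), "records,", total_bytes(recs), "bytes")
```

The byte savings of the vector layout come from not repeating the row key, so they grow with key length and row density. With one-character keys like the toy rows above, the extra delimiters roughly cancel that out, and the more visible difference is the number of records the sort has to handle.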
>>>>
>>>> --
>>>> Peter N. Skomoroch
>>>> 617.285.8348
>>>> http://www.datawrangling.com
>>>> http://delicious.com/pskomoroch
>>>> http://twitter.com/peteskomoroch
>>>>
>>>>         
>>
>> --
>> Peter N. Skomoroch
>> 617.285.8348
>> http://www.datawrangling.com
>> http://delicious.com/pskomoroch
>> http://twitter.com/peteskomoroch
>>
>>     
>
>
>
>   

