hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "He Yongqiang (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-1639) Grouping using hashing instead of sorting
Date Thu, 20 Oct 2011 01:14:10 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131246#comment-13131246
] 

He Yongqiang commented on MAPREDUCE-1639:
-----------------------------------------

yeah, actually we tried that with the native comparator code from Baidu (see patches for HCE),
and the difference is not very big, sometimes is worse (maybe because of some jni cost, but
havn't look into it). 
                
> Grouping using hashing instead of sorting
> -----------------------------------------
>
>                 Key: MAPREDUCE-1639
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1639
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Joydeep Sen Sarma
>
> most applications of map-reduce care about grouping and not sorting. Sorting is a (relatively
expensive) way to achieve grouping. In order to achieve just grouping - one can:
> - replace the sort on the Mappers with a HashTable - and maintain lists of key-values
against each hash-bucket.
> - key-value tuples inside each hash bucket are sorted - before spilling or sending to
Reducer. Anytime this is done - Combiner can be invoked.
> - HashTable is serialized by hash-bucketid. So merges (of either spills or Map Outputs)
works similar to today (at least there's no change in overall compute complexity of merge)
> Of course this hashtable has nothing to do with partitioning. it's just a replacement
for map-side sort.
> --
> this is (pretty much) straight from the MARS project paper: http://www.cse.ust.hk/catalac/papers/mars_pact08.pdf.
They report a 45% speedup in inverted index calculation using hashing instead of sorting (reference
implementation is NOT against Hadoop though).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message