hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1639) Grouping using hashing instead of sorting
Date Sat, 27 Mar 2010 18:42:27 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850562#action_12850562

Joydeep Sen Sarma commented on MAPREDUCE-1639:

following up on the previous comment - reading the paper again closely - it's no longer clear
to me that the paper actually implements what i filed or whether it's closer to how Owen interpreted
it. But i think it's true that sorting only within hash buckets only is fundamentally lower
complexity than sorting everything together.


i am also wondering how different this is from current data path. today we anyway bucket map
outputs into one bucket per reducer and sort therein. this is simply saying that we have lots
of buckets per reducer (not one) - we are still sorting inside each of those - but because
we are sorting much smaller sets - the complexity is lowered. 

so on the mapper side - we can (sort of) just pretend that the number of reducers is much
much higher. Except when it's time to send off the data to the reducer (or spill) - we send
a bunch of buckets. but this is not transparent to the merge code - it now has to merge taking
the bucketid into consideration. if we prefixed each map-output key with the bucketid - then
it's transparent to the merge code. so definitely some changes required - but maybe tractable?

> Grouping using hashing instead of sorting
> -----------------------------------------
>                 Key: MAPREDUCE-1639
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1639
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Joydeep Sen Sarma
> most applications of map-reduce care about grouping and not sorting. Sorting is a (relatively
expensive) way to achieve grouping. In order to achieve just grouping - one can:
> - replace the sort on the Mappers with a HashTable - and maintain lists of key-values
against each hash-bucket.
> - key-value tuples inside each hash bucket are sorted - before spilling or sending to
Reducer. Anytime this is done - Combiner can be invoked.
> - HashTable is serialized by hash-bucketid. So merges (of either spills or Map Outputs)
works similar to today (at least there's no change in overall compute complexity of merge)
> Of course this hashtable has nothing to do with partitioning. it's just a replacement
for map-side sort.
> --
> this is (pretty much) straight from the MARS project paper: http://www.cse.ust.hk/catalac/papers/mars_pact08.pdf.
They report a 45% speedup in inverted index calculation using hashing instead of sorting (reference
implementation is NOT against Hadoop though).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message