hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1639) Grouping using hashing instead of sorting
Date Sun, 28 Mar 2010 18:52:27 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850714#action_12850714

Joydeep Sen Sarma commented on MAPREDUCE-1639:

@Luke - like this line of thinking. I am familiar with Hive - it already does map side partial
aggregates for (most) group-bys and emits them (what you mentioned). However - for (most)
joins and cluster-by - it depends on the sorting provided by the map-reduce framework. We
could alter the these query plans to execute this algorithm. the downside is pretty obvious:
the benefits would only be available to Hive users (even as the cost of implementation remains
fairly high). it would not be reusable by Pig for example - nor streaming users, etc.

extrapolating ur idea a bit - we could try to implement this as a library - something that
intercepts the output of the mapper (and combiner) and the input to the reduce function (that
could be inserted into the data path via config settings). (There's already a configurable
output collector interface on the map output side. not sure about reducer input). We could
turn off map-side sorting in the map-reduce framework itself. the main concern i would have
is that it would seem too hard for these output collectors to implements spills (/merges)
to (/from) disk. Perhaps if that functionality can be extracted out of the map-reduce core
and provided as a library that these output collectors can use - then it's all feasible.

> Grouping using hashing instead of sorting
> -----------------------------------------
>                 Key: MAPREDUCE-1639
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1639
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Joydeep Sen Sarma
> most applications of map-reduce care about grouping and not sorting. Sorting is a (relatively
expensive) way to achieve grouping. In order to achieve just grouping - one can:
> - replace the sort on the Mappers with a HashTable - and maintain lists of key-values
against each hash-bucket.
> - key-value tuples inside each hash bucket are sorted - before spilling or sending to
Reducer. Anytime this is done - Combiner can be invoked.
> - HashTable is serialized by hash-bucketid. So merges (of either spills or Map Outputs)
works similar to today (at least there's no change in overall compute complexity of merge)
> Of course this hashtable has nothing to do with partitioning. it's just a replacement
for map-side sort.
> --
> this is (pretty much) straight from the MARS project paper: http://www.cse.ust.hk/catalac/papers/mars_pact08.pdf.
They report a 45% speedup in inverted index calculation using hashing instead of sorting (reference
implementation is NOT against Hadoop though).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message