hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-199) Locality hints for Reduce
Date Mon, 10 Sep 2012 18:48:09 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452299#comment-13452299

Karthik Kambatla commented on MAPREDUCE-199:

This might not be the use case Harsh was thinking of, but here is a use case from my summer
internship a couple of years ago:

Our use case: We were building a topic-based pub/sub system. The published events were in
one HBase table, and the subscriptions were in another table. While the published events were
stored by their published time-stamp, the subscriptions were stored by <Topic ID: Subscription
ID> as the key. Matching the published events to subscriptions required a join of the two
tables on the topic.

Approach: The map phase reads all the published events and emits (topic, event) pairs. The
reduce's input essentially is all events for a topic - the reduce reads all the subscriptions
of that topic and matches. Now, it would save a lot of communication if the reduce (for topic
A) were scheduled on the same node that had the subscriptions for the same topic A. Hence,
the need for reduce  data-locality.

We achieved this data locality through ugly hacks to the JT to store HBase region (key-range):
host mapping and overloading the partitioner to push each <key, value> pair to appropriate
reducers. I don't remember the exact speedups, but it was quite significant. (if my memory
is not wrong ~2x) 
> Locality hints for Reduce
> -------------------------
>                 Key: MAPREDUCE-199
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-199
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: applicationmaster, mrv2
>            Reporter: Benjamin Reed
>            Assignee: Harsh J
>         Attachments: MAPREDUCE-199.patch, MAPREDUCE-199.patch
> It would be nice if we could add method to OutputFormat that would allow a job to indicate
where a reducer for a given partition should should run. This is similar to the getSplits()
method on InputFormat. In our application the reducer is using other data in addition to the
map outputs during processing and data accesses could be made more efficient if the JobTracker
scheduled the reducers to run on specific hosts.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message