hadoop-mapreduce-issues mailing list archives

From "Harsh J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-199) Locality hints for Reduce
Date Mon, 10 Sep 2012 18:42:08 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452293#comment-13452293 ]

Harsh J commented on MAPREDUCE-199:
-----------------------------------

bq. Harsh - I'm not familiar with the HBase case; can you please add more colour?

Surely!

bq. In this case, won't it be sufficient to schedule maps on the RS? If the data is already
sorted, why would you try to schedule reduces instead?

We have this concept of bulkloads in HBase, for example, where the maps read in data from
a raw source (such as a delimited text file) and pass it to a reducer (partitioned by TotalOrderPartitioner
based on the region distribution of the table in HBase). The sorted data is then written to
a file on HDFS and later injected into the /hbase directory structure for serving.

There are small gains (but gains nevertheless) if the data written by the reducer is local to
the RegionServer hosting that specific partition (region) before we bulkload it in.
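
To make the partition-to-region relationship concrete, here is a minimal sketch in plain Java. It
is not code from the patch, from Hadoop, or from HBase: the class and method names are hypothetical,
keys are simplified to Strings (HBase row keys are byte arrays), and the hosts list stands in for
whatever would report which RegionServer serves each region. It only shows that a total-order
partitioner's binary search over the region start keys yields a partition index that can be used
to look up a preferred host for that reducer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Hadoop or HBase.
final class RegionLocalitySketch {
    // Sorted start keys of every region except the first one.
    // N split points divide the key space into N + 1 partitions.
    private final String[] splitPoints;
    // hosts[i] is the RegionServer assumed to serve region (partition) i.
    private final String[] hosts;

    RegionLocalitySketch(List<String> splitPoints, List<String> hosts) {
        if (hosts.size() != splitPoints.size() + 1) {
            throw new IllegalArgumentException("need exactly one host per partition");
        }
        this.splitPoints = splitPoints.toArray(new String[0]);
        this.hosts = hosts.toArray(new String[0]);
    }

    // Binary search over the split points, as a total-order partitioner would do.
    int partitionFor(String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Hit: the key is a region start key, so it belongs to the region after that split.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the insertion point is the partition.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // The host a scheduler could prefer when placing the reducer for this partition.
    String preferredHostFor(int partition) {
        return hosts[partition];
    }
}
```

With split points "g" and "p" and one host per region, a key such as "h" falls into partition 1,
so the reducer for partition 1 would ideally run on the host serving the second region.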

Likewise, if people have HBase jobs doing a reduce phase for whatever reason, and wish to
achieve locality such that the reducer tasks (which emit the keys) are local to the RegionServer
serving the region for those keys, they can do so via a pre-configured job.

There are some use-cases outside of HBase as well (I'll let those who've asked for this comment),
but maybe with YARN those can now be handled outside of MR.

Or maybe, in the long term, HBase can get a custom AM to do its work in a more efficient manner
than the current MR (MR is easy to use, though).

I just think using YARN to write a new app for everything is a slightly longer path to take
if MR can be harmlessly tweaked a bit more to do the same thing along with the other good
things it already does.

bq. My concern with adding APIs/config is that it becomes part of the user interface and I'd like
to think through its implications, and whether it's really necessary, before we commit to
it. Makes sense?

Yes, makes sense on the API side. That is partly why I went with a simple config-based option for
doing this.
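
As a rough illustration of what a config-based hint could look like, the sketch below parses a
single string value that maps reduce partition numbers to preferred hosts. Everything in it is an
assumption made for illustration: the property name, the "partition=host,host;..." format, and the
class itself are hypothetical and are not taken from the attached patch or from any Hadoop release.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not the format used by the MAPREDUCE-199 patch.
final class ReduceLocalityHints {
    // Made-up property name, used here only to show where such a value might live.
    static final String HYPOTHETICAL_KEY = "example.reduce.locality.hints";

    // Parses values such as "0=rs1.example.com;1=rs2.example.com,rs3.example.com".
    static Map<Integer, List<String>> parse(String value) {
        Map<Integer, List<String>> hints = new HashMap<>();
        if (value == null || value.trim().isEmpty()) {
            return hints; // no hints configured: the scheduler is free to place reducers anywhere
        }
        for (String entry : value.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Malformed hint: " + trimmed);
            }
            int partition = Integer.parseInt(trimmed.substring(0, eq).trim());
            List<String> hosts = new ArrayList<>();
            for (String host : trimmed.substring(eq + 1).split(",")) {
                if (!host.trim().isEmpty()) {
                    hosts.add(host.trim());
                }
            }
            hints.put(partition, hosts);
        }
        return hints;
    }

    // Preferred hosts for a partition, or an empty list when no hint was given.
    static List<String> hostsFor(Map<Integer, List<String>> hints, int partition) {
        return hints.getOrDefault(partition, Collections.emptyList());
    }
}
```

The appeal of a plain string property like this is that a job can set it without any new method
appearing on a public interface, which is the trade-off discussed above.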
                
> Locality hints for Reduce
> -------------------------
>
>                 Key: MAPREDUCE-199
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-199
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: applicationmaster, mrv2
>            Reporter: Benjamin Reed
>            Assignee: Harsh J
>         Attachments: MAPREDUCE-199.patch, MAPREDUCE-199.patch
>
>
> It would be nice if we could add a method to OutputFormat that would allow a job to indicate
where a reducer for a given partition should run. This is similar to the getSplits()
method on InputFormat. In our application the reducer is using other data in addition to the
map outputs during processing and data accesses could be made more efficient if the JobTracker
scheduled the reducers to run on specific hosts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
