hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Extension points available for data locality
Date Tue, 21 Aug 2012 09:39:44 GMT

(Am assuming you've done enough research to know that there's benefit
in what you're attempting to do.)

Locality of tasks is determined by the job's InputFormat class.
Specifically, the locality information returned by the InputSplit
objects via the InputFormat#getSplits(…) API is what the MR scheduler
looks at when trying to launch data-local tasks.

You can tweak your InputFormat (the one that uses this DB as input?)
to return relevant locations based on your "DB Cluster", in order to
achieve this.
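To make the idea concrete, here is a minimal, self-contained sketch of the approach. In a real job you would extend org.apache.hadoop.mapreduce.InputFormat and InputSplit and implement getSplits()/getLocations(); the classes, the shard-to-host map, and the hostnames below are illustrative assumptions, not Hadoop's actual types.

```java
import java.util.*;

// Sketch only: stand-ins for org.apache.hadoop.mapreduce.InputSplit /
// InputFormat. The shard-to-host mapping is a made-up example.
class DbShardSplit {
    private final int shardId;
    private final String host; // MySQL node holding this shard (assumed)

    DbShardSplit(int shardId, String host) {
        this.shardId = shardId;
        this.host = host;
    }

    // Mirrors InputSplit#getLocations(): the hosts where the split's data
    // lives, which the scheduler uses to prefer data-local task slots.
    String[] getLocations() {
        return new String[] { host };
    }

    int getShardId() { return shardId; }
}

class DbShardInputFormat {
    // Mirrors InputFormat#getSplits(): one split per DB shard, each
    // advertising the node that stores that shard.
    static List<DbShardSplit> getSplits(Map<Integer, String> shardHosts) {
        List<DbShardSplit> splits = new ArrayList<>();
        for (Map.Entry<Integer, String> e : shardHosts.entrySet()) {
            splits.add(new DbShardSplit(e.getKey(), e.getValue()));
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<Integer, String> shardHosts = new LinkedHashMap<>();
        shardHosts.put(0, "mysql-node-1"); // hypothetical hostnames
        shardHosts.put(1, "mysql-node-2");
        for (DbShardSplit s : getSplits(shardHosts)) {
            System.out.println("shard " + s.getShardId()
                + " -> " + Arrays.toString(s.getLocations()));
        }
    }
}
```

If a TaskTracker runs on mysql-node-1, the scheduler will try to assign the shard-0 mapper there, which is exactly the data-locality behavior HDFS gets from block locations.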

On Tue, Aug 21, 2012 at 2:36 PM, Tharindu Mathew <mccloud35@gmail.com> wrote:
> Hi,
> I'm doing some research that involves pulling data stored in a MySQL cluster
> directly for a MapReduce job, without storing the data in HDFS.
> I'd like to run Hadoop TaskTracker nodes directly on the MySQL cluster
> nodes. The purpose of this is to start mappers directly on the node
> closest to the data if possible (data locality).
> I notice that with HDFS, since the NameNode knows exactly where each data
> block is, it uses this to achieve data locality.
> Is there a way to achieve my requirement, possibly by extending the NameNode
> or otherwise?
> Thanks in advance.
> --
> Regards,
> Tharindu
> blog: http://mackiemathew.com/

Harsh J
