hadoop-user mailing list archives

From Tharindu Mathew <mcclou...@gmail.com>
Subject Re: Extension points available for data locality
Date Tue, 21 Aug 2012 18:40:15 GMT
On Tue, Aug 21, 2012 at 7:49 PM, Michael Segel <michael_segel@hotmail.com> wrote:

>
> On Aug 21, 2012, at 8:54 AM, Tharindu Mathew <mccloud35@gmail.com> wrote:
>
> Yes, Michael. You are thinking along the right lines.
>
> I just want to understand the inner workings of this, so I can rule out
> guesswork when it comes to making my implementation reliable.
>
> For example, if a node in the mysql cluster goes down and the failover
> node takes over, I want to make sure Hadoop picks the failover node to pull
> the data from and doesn't fail the job because the original node is
> unavailable.
>
> That could be problematic.
> I mean, just shooting from the hip now... If you are running a job and,
> midstream, you lose the connection to that shard, then your task will time
> out and fail. As it gets restarted, you could catch an exception that
> indicates the server is down and then go to the backup.
>
> Again this code will be all yours so you would have to write this feature
> in.
>
That feels inefficient. I assume FileInputFormat handles this much more
efficiently. This is the reason I ask whether I have to modify the namenode,
so that it inherently knows the replicated locations of my data.

OTOH, based on the answers in this thread, I assume that through the InputFormat
API I can feed in the available node dynamically if a node goes down.
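
For a failure in the middle of a running task, I suppose the restarted attempt
could still do what Michael suggests: try the primary shard host first and fall
back to the standby when the primary doesn't respond. A rough sketch of that
connection logic (the class name and the idea of handing each task both hosts
are just illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class FailoverShardConnector {

  /**
   * Tries each candidate host in order (primary first, then standby) and
   * returns the first connection that can be opened. A restarted task
   * attempt that hits a dead primary would therefore end up on the backup.
   */
  public static Connection connect(String[] candidateHosts, String database,
                                   String user, String password) throws SQLException {
    SQLException lastFailure = null;
    for (String host : candidateHosts) {
      try {
        return DriverManager.getConnection(
            "jdbc:mysql://" + host + ":3306/" + database, user, password);
      } catch (SQLException e) {
        lastFailure = e; // this host is down or unreachable; try the next one
      }
    }
    throw lastFailure; // every candidate failed; let the task attempt fail
  }
}

The InputFormat would hand each task the host list for its shard, so this
fallback never has to involve the namenode at all.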

>
> Hence, my extensive questions on this matter. As you said, of course you
> need to have the metadata to know which node holds what. Let's assume that
> metadata is available.
>
> At a minimum, the metadata should be available. How else do you partition
> the data in the first place?
> Also your cluster's configuration data has to be available.
>
> HTH
>
> On Tue, Aug 21, 2012 at 6:58 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>
>> Interesting....
>>
>> You have a MySQL cluster, which is a bit different from a single data
>> source.
>>
>> When you say data locality, you mean that you want to launch your job and
>> then have each mapper pull data from the local shard.
>>
>> So you have a couple of issues.
>>
>> 1) You will need to set up Hadoop on the same cluster.
>> This is doable, you just have to account for the memory and disk on your
>> system.
>>
>> 2) You will need to look at HBase's TableInputFormat class. (What's the
>> difference between looking at a region server versus a shard?)
>>
>> 3) You will need to make sure that you have enough metadata to help
>> determine where your data is located.
>>
>>
>> Outside of that, it's doable.
>> Right?
>>
>>
>> Note that since you're not running HBase, Hadoop is a bit more tolerant
>> of swapping, but not by much.
>>
>> Good luck.
>>
>> On Aug 21, 2012, at 7:44 AM, Tharindu Mathew <mccloud35@gmail.com> wrote:
>>
>> Dino, Feng,
>>
>> Thanks for the options, but I guess I need to do it myself.
>>
>> Harsh,
>>
>> What you said was the initial impression I got, but I thought I needed to
>> do something more with the name node. Thanks for clearing that up.
>>
>> My guess is that this probably works by using getLocations() and mapping
>> that location's IP (or host) to the IP (or host) of the task tracker. Is
>> this correct?
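
Concretely, I'm picturing a split along these lines (ShardSplit and its fields
are made up); its getLocations() returns the hostname of the MySQL node holding
the shard, and I assume that string has to match the hostname the task tracker
reports for the scheduler to count the task as data-local:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class ShardSplit extends InputSplit implements Writable {

  private Text host = new Text(); // MySQL node currently serving this shard
  private long rowCount;          // rough size, used only as a scheduling hint

  public ShardSplit() {}          // no-arg constructor needed for deserialization

  public ShardSplit(String host, long rowCount) {
    this.host = new Text(host);
    this.rowCount = rowCount;
  }

  @Override
  public long getLength() {
    return rowCount;
  }

  @Override
  public String[] getLocations() {
    // Hostname(s) the scheduler compares against task tracker hosts.
    return new String[] { host.toString() };
  }

  @Override
  public void write(DataOutput out) throws IOException {
    host.write(out);
    out.writeLong(rowCount);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    host.readFields(in);
    rowCount = in.readLong();
  }
}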
>>
>>
>> On Tue, Aug 21, 2012 at 3:14 PM, feng lu <amuseme.lu@gmail.com> wrote:
>>
>>> Hi Tharindu
>>>
>>> Maybe you can try Gora. The Apache Gora open source framework provides
>>> an in-memory data model and persistence for big data. Gora supports
>>> persisting to column stores, key-value stores, document stores and RDBMSs,
>>> and analyzing the data with extensive Apache Hadoop MapReduce support.
>>>
>>> It now supports MySQL in the gora-sql module.
>>>
>>>  http://gora.apache.org/
>>>
>>>
>>> On Tue, Aug 21, 2012 at 5:39 PM, Harsh J <harsh@cloudera.com> wrote:
>>>
>>>> Tharindu,
>>>>
>>>> (Am assuming you've done enough research to know that there's benefit
>>>> in what you're attempting to do.)
>>>>
>>>> Locality of tasks is determined by the job's InputFormat class.
>>>> Specifically, the locality information returned by the InputSplit
>>>> objects via the InputFormat#getSplits(…) API is what the MR scheduler
>>>> looks at when trying to launch data-local tasks.
>>>>
>>>> You can tweak your InputFormat (the one that uses this DB as input?)
>>>> to return relevant locations based on your "DB Cluster", in order to
>>>> achieve this.
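
Putting that together, I imagine the getSplits() side looking roughly like the
sketch below. It assumes a made-up shardCatalog() helper that reads my cluster's
metadata (shard name to currently live host) and reuses the ShardSplit sketched
earlier in the thread; the locations are only scheduling hints, so a stale entry
should cost locality, not correctness.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public abstract class ShardInputFormat extends InputFormat<LongWritable, Text> {

  /**
   * Made-up hook: returns shard name -> hostname of the replica that is
   * currently live, read from whatever metadata store the cluster keeps.
   */
  protected abstract Map<String, String> shardCatalog(JobContext context)
      throws IOException;

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Map.Entry<String, String> shard : shardCatalog(context).entrySet()) {
      // One split per shard; ShardSplit is the custom split sketched earlier.
      // The location is only a hint, so the scheduler can still run the task
      // elsewhere if the preferred node has no free slots.
      splits.add(new ShardSplit(shard.getValue(), 1L));
    }
    return splits;
  }

  // createRecordReader(...) is left out here; it would open the JDBC
  // connection to the shard, e.g. with the failover logic sketched above.
}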
>>>>
>>>> On Tue, Aug 21, 2012 at 2:36 PM, Tharindu Mathew <mccloud35@gmail.com>
>>>> wrote:
>>>> > Hi,
>>>> >
>>>> > I'm doing some research that involves pulling data stored in a MySQL
>>>> > cluster directly for a MapReduce job, without storing the data in HDFS.
>>>> >
>>>> > I'd like to run Hadoop task tracker nodes directly on the MySQL cluster
>>>> > nodes. The purpose of this is to start mappers directly on the node
>>>> > closest to the data, if possible (data locality).
>>>> >
>>>> > I notice that with HDFS, since the name node knows exactly where each
>>>> > data block is, it uses this to achieve data locality.
>>>> >
>>>> > Is there a way to achieve my requirement, possibly by extending the
>>>> > name node or otherwise?
>>>> >
>>>> > Thanks in advance.
>>>> >
>>>> > --
>>>> > Regards,
>>>> >
>>>> > Tharindu
>>>> >
>>>> > blog: http://mackiemathew.com/
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Harsh J
>>>>
>>>
>>>
>>>
>>> --
>>> Don't Grow Old, Grow Up... :-)
>>>
>>
>>
>>
>> --
>> Regards,
>>
>> Tharindu
>>
>> blog: http://mackiemathew.com/
>>
>>
>>
>
>
> --
> Regards,
>
> Tharindu
>
> blog: http://mackiemathew.com/
>
>
>


-- 
Regards,

Tharindu

blog: http://mackiemathew.com/
