hbase-user mailing list archives

From arnaud but <sdnetw...@gmail.com>
Subject Re: hbase map/reduce questions
Date Mon, 09 Apr 2012 22:06:20 GMT
Thank you very much. I will take a look at these links, but I think I
understand now: I did not know the role that getLocations() plays in the
distribution of the map tasks.

On 09/04/2012 19:45, Suraj Varma wrote:
> Take a look at InputSplit:
> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapreduce/InputSplit.java#InputSplit.getLocations%28%29
>
> Then take a look at how TableSplit is implemented (getLocations method
> in particular):
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.5/org/apache/hadoop/hbase/mapreduce/TableSplit.java#TableSplit.getLocations%28%29
>
> Also look at TableInputFormatBase#getSplits method to see how the
> region locations are populated.
>
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.4/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#TableInputFormatBase.getSplits%28org.apache.hadoop.hbase.mapreduce.JobContext%29
>
> In your case, if you want to run your maps on all available nodes
> regardless of the fact that only two of those nodes contain your
> regions ... you would implement a custom InputSplit that returns an
> empty String[] in the getLocations() method.
> --Suraj
>
> On Mon, Apr 9, 2012 at 1:29 AM, arnaud but <sdnetwork@gmail.com> wrote:
>> ok thanks,
>>
>>
>>> Yes - if you do a custom split, and have sufficient map slots in your
>>> cluster
>>
>> If I understand correctly, even if the rows are stored on only two nodes of
>> my cluster, I can distribute the map tasks to the other nodes?
>>
>> e.g.
>> I have 10 nodes in the cluster and I wrote a custom split that splits every
>> 100 rows.
>> All rows are stored on only two nodes, and my map/reduce job generates 10 map
>> tasks because I have 1000 rows.
>> Will all nodes receive a map task to execute, or only the two
>> nodes where the 1000 rows are stored?
>>
>>
>>> you can parallelize the map tasks to run on other nodes as
>>> well
>>
>> How can I do that? I do not see how I can say programmatically that this
>> split will be executed on this node.
>>
>> On 08/04/2012 18:37, Suraj Varma wrote:
>>
>>>> If I write a custom input format that splits the table every 100 rows,
>>>> can I manually distribute each part to a node regardless of where the
>>>> data is?
>>>
>>>
>>> Yes - if you do a custom split, and have sufficient map slots in your
>>> cluster, you can parallelize the map tasks to run on other nodes as
>>> well. But if you are using HBase as the sink / source, these map tasks
>>> will still reach back to the region server node holding that row. So -
>>> if you have all your rows in two nodes, all the map tasks will still
>>> reach out to those two nodes. Depending on what your map tasks are
>>> doing (intensive crunching vs I/O) this may or may not help with what
>>> you are doing.
>>> --Suraj
>>>
>>>
>>>
>>> On Thu, Apr 5, 2012 at 6:44 AM, Arnaud Le-roy <sdnetwork@gmail.com> wrote:
>>>>
>>>> Yes, I know, but it's just an example. We could use the same example with
>>>> one billion rows, although in that case you could rightly tell me that
>>>> the rows would be stored on all nodes.
>>>>
>>>> Maybe it's not possible to manually distribute the tasks across the
>>>> cluster?
>>>> And maybe it's not a good idea, but I would like to know in order to
>>>> design the best schema for my data.
>>>>
>>>> On 5 April 2012 15:08, Doug Meil <doug.meil@explorysmedical.com> wrote:
>>>>>
>>>>>
>>>>> If you only have 1000 rows, why use MapReduce?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 4/5/12 6:37 AM, "Arnaud Le-roy" <sdnetwork@gmail.com> wrote:
>>>>>
>>>>>> But do you think that I can change the default behavior?
>>>>>>
>>>>>> For example, I have ten nodes in my cluster and my table, which has
>>>>>> 1000 rows, is stored on only two nodes.
>>>>>> With the default behavior only two nodes will work on a map/reduce
>>>>>> job, won't they?
>>>>>>
>>>>>> If I write a custom input format that splits the table every 100 rows,
>>>>>> can I manually distribute each part to a node regardless of where the
>>>>>> data is?
>>>>>>
>>>>>> On 5 April 2012 00:36, Doug Meil <doug.meil@explorysmedical.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> The default behavior is that the input splits are where the data is
>>>>>>> stored.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 4/4/12 5:24 PM, "sdnetwork" <sdnetwork@gmail.com> wrote:
>>>>>>>
>>>>>>>> ok thanks,
>>>>>>>>
>>>>>>>> but I can't find the information that tells me how the resulting
>>>>>>>> splits are distributed across the different nodes of the cluster:
>>>>>>>>
>>>>>>>> 1) randomly?
>>>>>>>> 2) where the data is stored?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>
>>
>
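
As a rough illustration of the getLocations() behaviour discussed in this thread, the following self-contained Java sketch models how location hints can steer map task placement. It does not use Hadoop or HBase: ToySplit, SplitPlacementDemo, makeSplits, place, the node names and the figure of 5 map slots per node are all hypothetical, and the placement rule (take a hinted node with a free slot, otherwise the least-loaded node) is a simplifying assumption rather than the real scheduling logic. The 10 nodes, 1000 rows and 100 rows per split come from the example in the thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical stand-in types, not Hadoop or HBase classes.
final class ToySplit {
    final int firstRow;        // first row index covered by this split (inclusive)
    final int lastRow;         // last row index covered by this split (inclusive)
    final String[] locations;  // plays the role of InputSplit.getLocations()

    ToySplit(int firstRow, int lastRow, String[] locations) {
        this.firstRow = firstRow;
        this.lastRow = lastRow;
        this.locations = locations;
    }
}

final class SplitPlacementDemo {

    // Cut rows [0, totalRows) into splits of at most rowsPerSplit rows,
    // each carrying the same location hints.
    static List<ToySplit> makeSplits(int totalRows, int rowsPerSplit, String[] hints) {
        List<ToySplit> splits = new ArrayList<>();
        for (int start = 0; start < totalRows; start += rowsPerSplit) {
            int end = Math.min(start + rowsPerSplit, totalRows) - 1;
            splits.add(new ToySplit(start, end, hints));
        }
        return splits;
    }

    // Assumed placement rule: a hinted node with a free slot wins;
    // otherwise the least-loaded node is used. Returns tasks per node.
    static Map<String, Integer> place(List<ToySplit> splits, List<String> nodes, int slotsPerNode) {
        Map<String, Integer> used = new LinkedHashMap<>();
        for (String node : nodes) {
            used.put(node, 0);
        }
        for (ToySplit split : splits) {
            String chosen = null;
            for (String hint : split.locations) {
                Integer count = used.get(hint);
                if (count != null && count < slotsPerNode) {
                    chosen = hint;
                    break;
                }
            }
            if (chosen == null) {
                // No usable hint: fall back to the node running the fewest tasks.
                for (String node : nodes) {
                    if (chosen == null || used.get(node) < used.get(chosen)) {
                        chosen = node;
                    }
                }
                if (chosen == null || used.get(chosen) >= slotsPerNode) {
                    throw new IllegalStateException("no free map slot");
                }
            }
            used.put(chosen, used.get(chosen) + 1);
        }
        return used;
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            nodes.add("node" + i);
        }
        // Splits that report the two nodes holding the rows.
        List<ToySplit> hinted = makeSplits(1000, 100, new String[] {"node1", "node2"});
        // Splits that report an empty location array, as suggested in the thread.
        List<ToySplit> unhinted = makeSplits(1000, 100, new String[0]);
        System.out.println("hinted:   " + place(hinted, nodes, 5));
        System.out.println("unhinted: " + place(unhinted, nodes, 5));
    }
}
```

In this toy model, the splits that name node1 and node2 all run on those two nodes, while the splits with an empty location array end up one per node. As Suraj points out, in a real job the map tasks would still read their rows from the region servers holding them, so spreading the tasks mainly helps when the map work is CPU-bound.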


