hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Splitting up an HBase Table into partitions
Date Tue, 17 Mar 2015 20:57:48 GMT
Ok… 

Let's take a step back. 

If you're writing your own MapReduce program against HBase, you will get one split
per region. 
If your scan doesn't specify a start or stop row, you will scan every row in the table.


The splits provide parallelism. 
So when you launch your job and you have 10 regions, you’ll have 10 splits. 

Going from memory: if your scan has a start/stop row, then for any region that holds no
data in that range (e.g. the region's start row isn't inside the scope of your scan), the
mapper created for it completes quickly, with no rows scanned and none returned in the result set. 
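The overlap check that makes those mappers finish immediately can be sketched in plain Java. This is a hedged illustration, not HBase source: the class and method names are hypothetical, and the unsigned lexicographic compare mimics what HBase's Bytes.compareTo does. As in HBase, an empty byte[] means "unbounded" on that side of a range.

```java
import java.util.Arrays;

// Sketch (not HBase source): why a mapper whose region lies outside the
// scan's [start, stop) range ends up scanning no rows.
public class RegionScanOverlap {

    // Lexicographic, unsigned byte comparison, in the spirit of
    // HBase's Bytes.compareTo.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // True if region [regionStart, regionEnd) can contain any row of
    // scan [scanStart, scanStop). Empty arrays are open-ended.
    static boolean overlaps(byte[] regionStart, byte[] regionEnd,
                            byte[] scanStart, byte[] scanStop) {
        boolean startsBeforeScanEnds =
            scanStop.length == 0 || compare(regionStart, scanStop) < 0;
        boolean endsAfterScanStarts =
            regionEnd.length == 0 || compare(scanStart, regionEnd) < 0;
        return startsBeforeScanEnds && endsAfterScanStarts;
    }

    public static void main(String[] args) {
        byte[] b = "b".getBytes(), d = "d".getBytes(), f = "f".getBytes();
        byte[] open = new byte[0];
        // Scan [b, d): region [d, f) starts at the scan's stop row -> no rows.
        System.out.println(overlaps(d, f, b, d));          // false
        // Region [b, d) covers exactly the scan range -> rows may match.
        System.out.println(overlaps(b, d, b, d));          // true
        // An unbounded scan overlaps every region.
        System.out.println(overlaps(d, open, open, open)); // true
    }
}
```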

I think what you’re looking for is already done for you. 
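To make "already done for you" concrete, here is a hedged sketch of the split computation TableInputFormat effectively performs: one candidate split per region, clipped to the scan's [start, stop) range, and dropped when the intersection is empty. On a real cluster the region boundaries come from table metadata (e.g. HTable#getStartEndKeys in the 0.98/1.0-era client API), not from scanning rows; here they are passed in as strings so the example runs without an HBase dependency, and all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (not HBase source) of per-region split clipping.
// "" stands in for HBase's empty byte[] meaning "unbounded".
public class SplitSketch {

    // regions: array of {startKey, endKey} pairs, in key order.
    static List<String[]> splits(String[][] regions,
                                 String scanStart, String scanStop) {
        List<String[]> out = new ArrayList<>();
        for (String[] r : regions) {
            // Clip the scan range to this region's range.
            String start = max(r[0], scanStart);
            String stop = minStop(r[1], scanStop);
            // Keep the split only if the clipped range is non-empty.
            if (stop.isEmpty() || start.compareTo(stop) < 0) {
                out.add(new String[] { start, stop });
            }
        }
        return out;
    }

    static String max(String a, String b) {
        return a.compareTo(b) >= 0 ? a : b;
    }

    // The smaller stop row, where "" (unbounded) counts as largest.
    static String minStop(String a, String b) {
        if (a.isEmpty()) return b;
        if (b.isEmpty()) return a;
        return a.compareTo(b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        String[][] regions = { { "", "c" }, { "c", "f" }, { "f", "" } };
        // Scan [b, e): the first two regions contribute clipped splits,
        // the third region is entirely out of range and is dropped.
        for (String[] s : splits(regions, "b", "e")) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

Running this prints `[b, c)` and `[c, e)`: two splits whose start/stop rows follow the region boundaries, which is the behavior you get for free from the framework.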

HTH

-Mike

> On Mar 17, 2015, at 2:09 PM, Gokul Balakrishnan <royalgok@gmail.com> wrote:
> 
> Hi Michael,
> 
> Thanks for the reply. Yes, I do realise that HBase has regions, perhaps my
> usage of the term partitions was misleading. What I'm looking for is
> exactly what you've mentioned - a means of creating splits based on
> regions, without having to iterate over all rows in the table through the
> client API. Do you have any idea how I might achieve this?
> 
> Thanks,
> 
> On Tuesday, March 17, 2015, Michael Segel <michael_segel@hotmail.com> wrote:
> 
>> Hbase doesn't have partitions.  It has regions.
>> 
>> The split occurs against the regions so that if you have n regions, you
>> have n splits.
>> 
>> Please don't confuse partitions and regions because they are not the same
>> or synonymous.
>> 
>>> On Mar 17, 2015, at 7:30 AM, Gokul Balakrishnan <royalgok@gmail.com
>> <javascript:;>> wrote:
>>> 
>>> Hi,
>>> 
>>> My requirement is to partition an HBase Table and return a group of
>> records
>>> (i.e. rows having a specific format) without having to iterate over all
>> of
>>> its rows. These partitions (which should ideally be along regions) will
>>> eventually be sent to Spark but rather than use the HBase or Hadoop RDDs
>>> directly, I'll be using a custom RDD which recognizes partitions as the
>>> aforementioned group of records.
>>> 
>>> I was looking at achieving this through creating InputSplits through
>>> TableInputFormat.getSplits(), as being done in the HBase RDD [1] but I
>>> can't figure out a way to do this without having access to the mapred
>>> context etc.
>>> 
>>> Would greatly appreciate if someone could point me in the right
>> direction.
>>> 
>>> [1]
>>> 
>> https://github.com/tmalaska/SparkOnHBase/blob/master/src/main/scala/com/cloudera/spark/hbase/HBaseScanRDD.scala
>>> 
>>> Thanks,
>>> Gokul
>> 
>> The opinions expressed here are mine, while they may reflect a cognitive
>> thought, that is purely accidental.
>> Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com

The opinions expressed here are mine, while they may reflect a cognitive thought, that is
purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com





