hbase-user mailing list archives

From Shahab Yunus <shahab.yu...@gmail.com>
Subject Re: Splitting up an HBase Table into partitions
Date Tue, 17 Mar 2015 19:16:40 GMT
If you know the row key range of your data, you can choose the split
points yourself and then use the HBase API to perform the splits.

E.g., if you know that your row keys span the range A - Z (a very
contrived example), you can pick every 5th letter as a split point and
then call the HBaseAdmin.split method to do the split for you. This way
you don't have to iterate over your data.
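
A minimal sketch of the contrived A - Z example, in case it helps. It only
computes the split points; the class and method names (SplitPoints,
everyNthLetter) are made up for illustration, and the actual call to
HBaseAdmin.split is left as a comment because its exact signature depends
on your HBase version.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: derives split points for a
// single-letter row key range by stepping through the alphabet.
class SplitPoints {

    // Returns every step-th letter after 'first', up to and including 'last'.
    // 'first' itself is skipped because the first region already starts there.
    static List<String> everyNthLetter(char first, char last, int step) {
        if (step <= 0 || first > last) {
            throw new IllegalArgumentException("need step > 0 and first <= last");
        }
        List<String> points = new ArrayList<>();
        for (int c = first + step; c <= last; c += step) {
            points.add(String.valueOf((char) c));
        }
        return points;
    }

    public static void main(String[] args) {
        for (String point : everyNthLetter('A', 'Z', 5)) {
            // HBase row keys are byte arrays, so convert before use.
            byte[] splitKey = point.getBytes(StandardCharsets.UTF_8);
            // Assumption: this is where each key would be handed to the
            // admin API (e.g. HBaseAdmin.split); omitted so the sketch
            // runs without a cluster.
            System.out.println(new String(splitKey, StandardCharsets.UTF_8));
        }
    }
}
```

For A - Z with a step of 5 this prints F, K, P, U and Z, one per line,
which would give six regions once the splits are applied.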

Or are you saying that you don't have the row key range?

Regards,
Shahab

On Tue, Mar 17, 2015 at 3:12 PM, Mikhail Antonov <olorinbant@gmail.com>
wrote:

> Not sure what you mean by "a means of creating splits based on
> regions, without having to iterate over all rows in the table through the
> client API". Could you elaborate?
>
> -Mikhail
>
> On Tue, Mar 17, 2015 at 12:09 PM, Gokul Balakrishnan <royalgok@gmail.com>
> wrote:
> > Hi Michael,
> >
> > Thanks for the reply. Yes, I do realise that HBase has regions, perhaps
> > my usage of the term partitions was misleading. What I'm looking for is
> > exactly what you've mentioned - a means of creating splits based on
> > regions, without having to iterate over all rows in the table through the
> > client API. Do you have any idea how I might achieve this?
> >
> > Thanks,
> >
> > On Tuesday, March 17, 2015, Michael Segel <michael_segel@hotmail.com>
> > wrote:
> >
> >> HBase doesn't have partitions.  It has regions.
> >>
> >> The split occurs against the regions so that if you have n regions, you
> >> have n splits.
> >>
> >> Please don't confuse partitions and regions because they are not the
> >> same or synonymous.
> >>
> >> > On Mar 17, 2015, at 7:30 AM, Gokul Balakrishnan <royalgok@gmail.com>
> >> > wrote:
> >> >
> >> > Hi,
> >> >
> >> > My requirement is to partition an HBase Table and return a group of
> >> > records (i.e. rows having a specific format) without having to iterate
> >> > over all of its rows. These partitions (which should ideally be along
> >> > regions) will eventually be sent to Spark but rather than use the HBase
> >> > or Hadoop RDDs directly, I'll be using a custom RDD which recognizes
> >> > partitions as the aforementioned group of records.
> >> >
> >> > I was looking at achieving this through creating InputSplits through
> >> > TableInputFormat.getSplits(), as being done in the HBase RDD [1] but I
> >> > can't figure out a way to do this without having access to the mapred
> >> > context etc.
> >> >
> >> > Would greatly appreciate if someone could point me in the right
> >> > direction.
> >> >
> >> > [1]
> >> > https://github.com/tmalaska/SparkOnHBase/blob/master/src/main/scala/com/cloudera/spark/hbase/HBaseScanRDD.scala
> >> >
> >> > Thanks,
> >> > Gokul
> >>
> >> The opinions expressed here are mine, while they may reflect a cognitive
> >> thought, that is purely accidental.
> >> Use at your own risk.
> >> Michael Segel
> >> michael_segel (AT) hotmail.com
> >>
>
>
>
> --
> Thanks,
> Michael Antonov
>
