Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of paul.konyves@gmail.com
 designates 209.85.219.49 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <C232D8C9-7A72-4EFA-9F27-456126FCF85C@gmail.com>
References: 
 <CANVO1E0KXoJOw33SOxcxnjwJLg1Z+B4=fk9sCn_ZTysqKnaL+A@mail.gmail.com>
 <C232D8C9-7A72-4EFA-9F27-456126FCF85C@gmail.com>
From: Pal Konyves <paul.konyves@gmail.com>
Date: Sat, 20 Apr 2013 22:11:24 +0200
Message-ID: 
 <CANVO1E1ZuEb8LnsHh8Cq+w0RVACvHX0DkFNLvxyujB70yhLgCw@mail.gmail.com>
Subject: Re: default region splitting on which value?
To: user <user@hbase.apache.org>
Content-Type: multipart/alternative; boundary=089e01184d72ebadcb04dad07141

--089e01184d72ebadcb04dad07141
Content-Type: text/plain; charset=UTF-8

Hi Ted,
There is only one column family; my data is simple key-value. I want to run
sequential scans, though, so hashing the key is not an option.
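
For the pre-split at the peaks, here is a minimal sketch of how the split
keys could be picked from a sample of the row keys. It is only an
illustration, not something HBase ships: the class name SplitPoints and the
method quantileSplits are made up for this mail, and it assumes the keys are
fixed-width ASCII strings (e.g. zero-padded numbers), so that Java String
order matches the byte order HBase sorts by. It takes the keys at evenly
spaced positions of the sorted sample, so the split points end up dense where
the keys are dense.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the HBase API.
final class SplitPoints {

    /**
     * Returns up to (numRegions - 1) split keys taken at evenly spaced
     * positions of the sorted sample, so that each key range holds roughly
     * the same number of sampled keys. Duplicate split keys are dropped,
     * which means fewer splits come back when the sample has few distinct keys.
     */
    static List<String> quantileSplits(List<String> sampleKeys, int numRegions) {
        if (numRegions < 1) {
            throw new IllegalArgumentException("numRegions must be >= 1");
        }
        // Sort a copy so the caller's list is left untouched.
        List<String> sorted = new ArrayList<>(sampleKeys);
        Collections.sort(sorted);

        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            // Position of the i-th quantile boundary in the sorted sample.
            int idx = (int) ((long) i * sorted.size() / numRegions);
            if (idx >= sorted.size()) {
                break; // only reached when the sample is empty
            }
            String candidate = sorted.get(idx);
            // Skip a candidate that repeats the previous split key.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }
}
```

The keys it returns, converted to byte arrays, could then be handed to
HBaseAdmin.createTable(HTableDescriptor, byte[][]) when creating the table.
Since the rows stay in their natural order inside each region, sequential
scans still work.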


On Sat, Apr 20, 2013 at 10:07 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> How many column families do you have?
>
> For #3, pre-splitting the table at the row keys corresponding to the peaks
> makes sense.
>
> On Apr 20, 2013, at 10:52 AM, Pal Konyves <paul.konyves@gmail.com> wrote:
>
> > Hi,
> >
> > I am reading about region splitting. As I understand it, HBase splits
> > regions automatically by default. What I cannot picture is which key it
> > picks as the split point.
> >
> > 1) For example, if I write the MD5 hashes of my row keys, they are most
> > probably evenly distributed from 000000... to FFFFF..., right? When HBase
> > starts with one region, all writes go into that region. When the HFile
> > gets too big, does HBase take, for example, the median of the stored
> > keys and split the region at that key?
> >
> > 2) I want to bulk load tons of data using put operations of the HBase
> > Java client API, and I want it to perform well. My keys are sequential
> > numeric values, which, as I learned from this post, I should not load
> > into HBase sequentially, because the HBase tables are going to be sad:
> > http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
> > So I thought I would pre-split the table into regions and load the data
> > in random order. That way I get a good distribution among the region
> > servers in terms of network IO from the beginning. Is that a good idea?
> >
> > 3) Suppose my row keys are not evenly distributed in the keyspace but show
> > peaks or bursts, e.g. the range is 000-999 but most of the keys cluster
> > around 020 and 060. Is it a good idea to put the pre-split points at
> > those peaks?
> >
> > Thanks in advance,
> > Pal
>

--089e01184d72ebadcb04dad07141--