hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Row Keys
Date Sun, 30 Jan 2011 05:50:16 GMT
Hey,

So variable-length keys and lexicographic sorting make it a little
tricky to do Scans and get exactly what you want.  This has a lot to
do with the ASCII table too, and the numerical values.  Let's consult
(http://www.asciitable.com/) while we work this example through:

Take a separation character of | as your code uses.  This is decimal
124, placing it way above both the lower and upper case letters AND
numbers, that is good.

Now you have something like this:

1234|a_string
1234|other_string

now we want to find all rows "belonging to" 1234, so we do a start row
of '1234|', but what for the end key? Well, let's try... '1234}', that
might work, oh wait, here is another key:

12345|foo

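ok so '5' < '|', which means it actually sorts like so:
12345|foo
1234|a_string
1234|other_string

hmm well how do our start and end rows compare? '12345|foo' sorts
before the start row '1234|', so it is left out, and since '}' (125)
is the very next character after '|' (124), the range '1234|' to
'1234}' holds only the '1234' related rows.  But this works only
because the start row ends with the separator and the stop row bumps
that separator by exactly one: a start row of '1234' with a stop row
of '1235' would incorrectly include '12345|foo', and the longer ID now
sorts ahead of the shorter one, which is easy to get wrong.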

Ok, well maybe a better solution is to pick a lower ASCII character?
Well outside of the control characters, space is the lowest character
at 32, 33 is '!' so perhaps ! would be a better choice.  So you could
use the next character, a double quote (34), as in '1234"' to define
your 'stop row'.  Now you would be prohibited from using any character
smaller than 33 in your strings, which is kind of a non-ideal solution.

This is all pretty clumsy, and doesn't work great with these
variable-length separated strings.
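
If you want to check how a particular separator sorts, it is easy to try
on a few sample keys.  Here is a minimal sketch in plain Java, not HBase
code: it assumes ASCII-only keys (so String order matches byte order) and
uses a TreeSet as a stand-in for the sorted table.  The class and method
names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class SeparatorOrderDemo {
    // Build a sorted "table" of row keys (stand-in for HBase's sorted rows).
    public static TreeSet<String> table(String... keys) {
        TreeSet<String> rows = new TreeSet<>();
        for (String key : keys) {
            rows.add(key);
        }
        return rows;
    }

    // Rows in [startRow, stopRow): start inclusive, stop exclusive, like a Scan.
    public static List<String> scan(TreeSet<String> rows, String startRow, String stopRow) {
        return new ArrayList<>(rows.subSet(startRow, true, stopRow, false));
    }

    public static void main(String[] args) {
        TreeSet<String> rows = table("1234|a_string", "1234|other_string", "12345|foo");

        // Full sort order: '5' (53) < '|' (124), so 12345|foo comes first.
        System.out.println(rows);

        // Start row ends with the separator, stop row uses the next character.
        System.out.println(scan(rows, "1234|", "1234}"));

        // Dropping the separator from the range also pulls in 12345|foo.
        System.out.println(scan(rows, "1234", "1235"));
    }
}
```

Swapping in a different separator (for example '!') in the sample keys and
in the start/stop rows shows how the order and the scan results change.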

The ultimate solution is to use the PrefixFilter, which is configured as such:
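// needs org.apache.hadoop.hbase.client.Scan, org.apache.hadoop.hbase.filter.PrefixFilter, org.apache.hadoop.hbase.util.Bytes
byte[] start_row = Bytes.toBytes("1234|");
Scan s = new Scan(start_row);              // start at the first possible '1234|' row
s.setFilter(new PrefixFilter(start_row));  // keep only rows that begin with '1234|'
// do scan.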

that way no matter how your separator sorts, you will get
the answer you want every time.
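
For intuition, a prefix scan matches the same rows as a range scan whose
stop row is the prefix with its last byte incremented.  A minimal sketch
in plain Java follows; the helper name stopRowForPrefix is hypothetical
and is not an HBase API.  It returns an empty array when no stop row
exists, which I am assuming you would treat as "scan to the end of the
table".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PrefixStopRow {
    // Hypothetical helper: the smallest key that sorts after every key
    // starting with the given prefix (compared as unsigned bytes).
    public static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        // Trailing 0xFF bytes cannot be incremented, so drop them first.
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            // Empty or all-0xFF prefix: there is no finite stop row.
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;  // bump the last remaining byte by one
        return stop;
    }

    public static void main(String[] args) {
        byte[] prefix = "1234|".getBytes(StandardCharsets.US_ASCII);
        byte[] stop = stopRowForPrefix(prefix);
        // '|' is 124 and '}' is 125, so this prints 1234}
        System.out.println(new String(stop, StandardCharsets.US_ASCII));
    }
}
```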



Another way to do compound keys is to go pure-binary.  For example I
want a key that is 2 integers, so I can do this:
int part1 = ... ;
int part2 = ... ;
byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));

Now you can also search for all rows starting with 'target' like such:
int target = ... ;
// start key is 'target', stop key is 'target+1'
Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));

And you get exactly what you want, nothing more or less (all rows
starting with 'target'), as long as target is non-negative and
target+1 does not overflow.
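
To see why the target / target+1 range works, here is a small sketch in
plain Java that mimics it without HBase.  It assumes the same 4-byte
big-endian layout (ByteBuffer is big-endian by default) and uses
Arrays.compareUnsigned (Java 9+) as a stand-in for the byte-wise row
comparison.  The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BinaryCompoundKey {
    // 4-byte big-endian encoding of a single int.
    public static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Compound key: two big-endian ints back to back (8 bytes).
    public static byte[] rowKey(int part1, int part2) {
        return ByteBuffer.allocate(8).putInt(part1).putInt(part2).array();
    }

    // True if row is in [startRow, stopRow) under unsigned byte-wise order.
    public static boolean inRange(byte[] row, byte[] startRow, byte[] stopRow) {
        return Arrays.compareUnsigned(row, startRow) >= 0
            && Arrays.compareUnsigned(row, stopRow) < 0;
    }

    public static void main(String[] args) {
        int target = 1234;  // assumed non-negative and below Integer.MAX_VALUE
        byte[] start = encode(target);
        byte[] stop = encode(target + 1);

        System.out.println(inRange(rowKey(1234, 7), start, stop));   // true
        System.out.println(inRange(rowKey(12345, 7), start, stop));  // false
        System.out.println(inRange(rowKey(1235, 0), start, stop));   // false
    }
}
```

Because every first part is exactly 4 bytes, 12345 can never look like a
"longer 1234", which is the problem the string keys had.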

The lexicographic comparison is very tricky sometimes. One quick tip
is that if your numbers (longs, ints) are big endian encoded (all the
utilities in Bytes.java do so), then for non-negative values the
lexicographic sorting is equal to the numeric sorting.  Otherwise if
you do strings you end up with:
1
11
2
3

and things are 'out of order'... if that is important, you can pad it
with 0s - don't forget to use the proper amount, which is 10 digits for
ints, and 19 for longs.  Or consider using binary encoding as above.
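
A quick sketch of the zero-padding idea in plain Java, assuming
non-negative values only (a minus sign would break the ordering).  The
class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PaddedKeys {
    // 10 digits covers the largest int, 2147483647.
    public static String padInt(int value) {
        return String.format(Locale.ROOT, "%010d", value);
    }

    // 19 digits covers the largest long, 9223372036854775807.
    public static String padLong(long value) {
        return String.format(Locale.ROOT, "%019d", value);
    }

    public static void main(String[] args) {
        // Unpadded strings sort as 1, 11, 2, 3.
        List<String> plain = new ArrayList<>(List.of("2", "11", "3", "1"));
        Collections.sort(plain);
        System.out.println(plain);

        // Padded strings sort in numeric order.
        List<String> padded = new ArrayList<>();
        for (int n : new int[] {2, 11, 3, 1}) {
            padded.add(padInt(n));
        }
        Collections.sort(padded);
        System.out.println(padded);
    }
}
```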

-ryan

On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <tatsuya6502@gmail.com> wrote:
> Hi Pete,
>
> You're right. If you use random keys, you will never know the start /
> end keys for a scan. What you really want to do is to design a key that
> will distribute well for writes but also has a certain locality for
> scans.
>
> You probably have the ideal key already (ID|Date). If you don't make
> the entire key random but just the ID part, you could get a good
> distribution at write time because writes for different IDs will be
> distributed across the regions, and you also could get a good scan
> performance when you scan between certain dates for a specific ID
> because rows for the ID will be stored together in one region.
>
> Thanks,
> Tatsuya
>
>
> 2011/1/29 Peter Haidinyak <phaidinyak@local.com>:
>> I know they are always sorted but if they are how do you know which row key belongs
>> to which data? Currently I use a row key of ID|Date so I always know what the startrow and
>> endrow should be. I know I'm missing something really fundamental here. :-(
>>
>> Thanks
>>
>> -Pete
>>
>> -----Original Message-----
>> From: tsuna [mailto:tsunanet@gmail.com]
>> Sent: Friday, January 28, 2011 12:14 PM
>> To: user@hbase.apache.org
>> Subject: Re: Row Keys
>>
>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <phaidinyak@local.com> wrote:
>>> This is going to seem like a dumb question but it is recommended that
>>> you use a random key to spread the insert/read load among your region servers. My question
>>> is if I am using a scan with startrow and endrow how does that work with random row keys?
>>
>> The keys are always sorted.  So if you generate random keys, you'll
>> get your data back in a random order.
>> What is recommended depends on the specific problem you're trying to
>> solve.  But generally, one of the strengths of HBase is that the rows
>> are sorted, so sequential scanning is efficient (thanks to data
>> locality).
>>
>> --
>> Benoit "tsuna" Sigoure
>> Software Engineer @ www.StumbleUpon.com
>>
>
>
>
> --
> 河野 達也
> Tatsuya Kawano (Mr.)
> Tokyo, Japan
>
> twitter: http://twitter.com/tatsuya6502
>
