hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Row Keys
Date Mon, 31 Jan 2011 06:10:07 GMT
Hey,

I don't understand the 'random scan' question... if you want to scan a
random key, just scan! For example:

byte [] random_key = generateRandomKeyUsingRandomNumberGenerator();
Scan s = new Scan(random_key);

But you must mean something else... perhaps you could illuminate me?

-ryan

On Sun, Jan 30, 2011 at 10:06 PM, Lars George <lars.george@gmail.com> wrote:
> Hi Pete,
>
> Look into the Mozilla Socorro project
> (http://code.google.com/p/socorro/) for how to "salt" the keys to get
> better load balancing across sequential keys. The principle is to add
> a salt, in this case a number reflecting the number of servers
> available (some multiple of that to allow for growth) and then prefix
> the sequential key with it so that writes are spread across all
> servers. For example "<salt>-<yyyyMMddhhmm>". When reading you need to
> open N scanners where N is the number of distinct salt values and scan
> each subset with them while eventually combining the result in client
> code. Assuming you want to scan all values in January and you have
> salts 0, 1, 2, 3, 4, and 5, you have one scanner for "0-201101010000"
> to "0-201102010000", then another for "1-201101010000" to
> "1-201102010000" and so on. Then do the scans (multithreaded for
> example) and combine the results client side. The Socorro code shows
> one way to implement this.
>
> Lars
>
>
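The per-salt scan ranges described in the message above can be sketched with plain JDK code. This is only an illustration under stated assumptions: the class and method names (SaltedRanges, saltedKey, scanRanges) are hypothetical and are not taken from Socorro or the HBase API, and the salt is assumed to be a single digit so that every key has the same prefix width.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class SaltedRanges {
    // Builds "<salt>-<timestamp>", the key layout described in the message.
    static String saltedKey(int salt, String timestamp) {
        return salt + "-" + timestamp;
    }

    // Returns one {startRow, stopRow} pair per salt value.
    // The stop row is treated as exclusive, as in an HBase scan.
    static List<String[]> scanRanges(int saltCount, String from, String to) {
        List<String[]> ranges = new ArrayList<>();
        for (int salt = 0; salt < saltCount; salt++) {
            ranges.add(new String[] { saltedKey(salt, from), saltedKey(salt, to) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // All of January 2011 across salts 0..5.
        for (String[] r : scanRanges(6, "201101010000", "201102010000")) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Each pair would feed one scanner, and the client merges the results. With ten or more salt values the salt would need zero padding to a fixed width so the prefixes stay comparable.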
> On Mon, Jan 31, 2011 at 6:20 AM, Pete Haidinyak <javamann@cox.net> wrote:
>> Sorry my id is '<date>|<id>' with date being in the format 'YYYY-MM-DD'
>> My start row is '<date>|<id>| ' (with a space ascii 32) and end row is
>> '<date>|<id>|~' (tilde character) and this has worked for my data set.
>> Unfortunately the key is not distributed very well. That is why I was
>> wondering how you do a scan (using start and end row) with a random row key.
>>
>> Thanks
>>
>> -Pete
>>
>> PS. I use <date>|<id> since the id is variable length and this was my
>> first attempt. I now have a month's worth of data and for my next phase
>> I will probably reverse the <date> <id> order since it will work either way.
>>
>>
>> On Sat, 29 Jan 2011 21:50:16 -0800, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>
>>> Hey,
>>>
>>> So variable length keys and lexicographical sorting make it a little
>>> tricky to do Scans and get exactly what you want.  This has a lot to
>>> do with the ascii table too, and the numerical values.  Let's consult
>>> (http://www.asciitable.com/) while we work this example through:
>>>
>>> Take a separation character of | as your code uses.  This is decimal
>>> 124, placing it way above both the lower and upper case letters AND
>>> numbers, that is good.
>>>
>>> Now you have something like this:
>>>
>>> 1234|a_string
>>> 1234|other_string
>>>
>>> now we want to find all rows "belonging to" 1234, so we do a start row
>>> of '1234|', but what for the end key? Well, let's try... '1234}', that
>>> might work, oh wait, here is another key:
>>>
>>> 12345|foo
>>>
>>> ok so '5' < '|' so it should sort like so:
>>> 12345|foo
>>> 1234|a_string
>>> 1234|other_string
>>>
>>> hmm well how does our end row compare? well '5' < '}' so '1234}' is
>>> "larger" than '12345|foo', but so is the start row '1234|', so that
>>> row falls before the scan range.  It would only be incorrectly
>>> included if the start row were the bare '1234' with no separator.
>>> Note that the rows are no longer in id order either.
>>>
>>> Ok, well maybe a better solution is to pick a lower ascii?  Well
>>> outside of the control characters, space is the lowest character at
>>> 32, 33 is '!' so perhaps ! would be a better choice.  So you could
>>> choose an end double quote as in '1234"' to define your 'stop row'.
>>> Now you would be prohibited from using any character smaller than '33'
>>> in your strings, which is kind of a non ideal solution.
>>>
>>> This is all pretty clumsy, and doesn't work great in these variable
>>> length separated strings.
>>>
>>> The ultimate solution is to use the PrefixFilter, which is configured as
>>> such:
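>>> import org.apache.hadoop.hbase.client.Scan;
>>> import org.apache.hadoop.hbase.filter.PrefixFilter;
>>> import org.apache.hadoop.hbase.util.Bytes;
>>>
>>> byte[] start_row = Bytes.toBytes("1234|");
>>> Scan s = new Scan(start_row);             // start at the first possible match
>>> s.setFilter(new PrefixFilter(start_row)); // keep only rows with this prefix
>>> // do scan.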
>>>
>>> that way no matter what sortability your separator is, you will get
>>> the answer you want every time.
>>>
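For a plain start/stop range instead of a filter, one common approach is to derive the exclusive stop row from the prefix by incrementing its last byte. The sketch below uses only the JDK. PrefixRange, stopRowFor, inRange and ascii are hypothetical names, and compare is a stand-in for the unsigned byte-by-byte ordering that HBase applies to row keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class PrefixRange {
    // Unsigned byte-by-byte comparison; a shorter key sorts before a longer
    // key that it is a prefix of.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Smallest key that is greater than every key starting with prefix:
    // drop trailing 0xFF bytes, then increment the last remaining byte.
    // An empty result means "no upper bound" (the prefix was all 0xFF).
    static byte[] stopRowFor(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xff) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    // True when start <= key < stop (an empty stop means unbounded).
    static boolean inRange(byte[] key, byte[] start, byte[] stop) {
        return compare(key, start) >= 0
            && (stop.length == 0 || compare(key, stop) < 0);
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }
}
```

With the prefix '1234|' this yields the stop row '1234}'. The key '12345|foo' compares lower than '1234|', because '5' (53) is below '|' (124), so it falls outside that range.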
>>>
>>>
>>> Another way to do compound keys is to go pure-binary.  For example I
>>> want a key that is 2 integers, so I can do this:
>>> int part1 = ... ;
>>> int part2 = ... ;
>>> byte[] row_key = Bytes.add(Bytes.toBytes(part1), Bytes.toBytes(part2));
>>>
>>> Now you can also search for all rows starting with 'target' like such:
>>> int target = ... ;
>>> // start key is 'target', stop key is 'target+1'
>>> Scan s = new Scan(Bytes.toBytes(target), Bytes.toBytes(target+1));
>>>
>>> And you get exactly what you want, nothing more or less (all rows
>>> starting with 'target').
>>>
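The fixed-width binary key and the [target, target+1) range can be tried out without an HBase cluster. The sketch below is JDK-only and makes two assumptions: ByteBuffer's default big-endian order stands in for the Bytes utilities, and target is non-negative and smaller than Integer.MAX_VALUE. A negative int starts with a 0xFF byte and sorts after the positives, and target+1 overflows at the maximum. CompoundKey and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration only.
final class CompoundKey {
    // 8-byte row key: two ints, each written big-endian (ByteBuffer's default).
    static byte[] key(int part1, int part2) {
        return ByteBuffer.allocate(8).putInt(part1).putInt(part2).array();
    }

    // 4-byte big-endian encoding of the leading int.
    static byte[] prefix(int part1) {
        return ByteBuffer.allocate(4).putInt(part1).array();
    }

    // Unsigned byte-by-byte comparison, standing in for row key ordering.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // True when key lies in [prefix(target), prefix(target + 1)).
    // Assumes 0 <= target < Integer.MAX_VALUE.
    static boolean matches(byte[] key, int target) {
        return compare(key, prefix(target)) >= 0
            && compare(key, prefix(target + 1)) < 0;
    }
}
```

Because every key has the same width, no separator character is needed, and the range check holds whatever the second int contains.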
>>> The lexicographic comparison is very tricky sometimes. One quick tip
>>> is that if your numbers (longs, ints) are big endian encoded (all the
>>> utilities in Bytes.java do so), then the lexicographic sorting is
>>> equal to the numeric sorting, as long as the values are non-negative.
>>> Otherwise if you do strings you end up with:
>>> 1
>>> 11
>>> 2
>>> 3
>>>
>>> and things are 'out of order'... if that is important, you can pad it
>>> with 0s - don't forget to use the proper amount, which is 10 digits for
>>> ints, and 19 for longs.  Or consider using binary encoding as above.
>>>
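The effect of zero padding on string-encoded numbers can be checked with ordinary JDK sorting, which orders ASCII digit strings the same way a byte-wise comparison would. This is a minimal sketch assuming non-negative ints. PaddedKeys, pad and sorted are hypothetical names.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class PaddedKeys {
    // Left-pads a non-negative int with zeros to 10 digits, the width of
    // Integer.MAX_VALUE (2147483647). Locale.ROOT keeps the digits ASCII.
    static String pad(int n) {
        return String.format(Locale.ROOT, "%010d", n);
    }

    // Returns a lexicographically sorted copy of the given keys.
    static String[] sorted(String... keys) {
        String[] copy = keys.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // Unpadded: 11 sorts between 1 and 2.
        System.out.println(Arrays.toString(sorted("1", "11", "2", "3")));
        // Padded: lexicographic order matches numeric order.
        System.out.println(Arrays.toString(sorted(pad(1), pad(11), pad(2), pad(3))));
    }
}
```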
>>> -ryan
>>>
>>> On Sat, Jan 29, 2011 at 12:50 AM, Tatsuya Kawano <tatsuya6502@gmail.com>
>>> wrote:
>>>>
>>>> Hi Pete,
>>>>
>>>> You're right. If you use random keys, you will never know the start /
>>>> end keys for scan. What you really want to do is to design a key that
>>>> will distribute well for writes but also has a certain locality for
>>>> scans.
>>>>
>>>> You probably have the ideal key already (ID|Date). If you don't make
>>>> the entire key random but just the ID part, you could get a good
>>>> distribution at write time because writes for different IDs will be
>>>> distributed across the regions, and you also could get a good scan
>>>> performance when you scan between certain dates for a specific ID
>>>> because rows for the ID will be stored together in one region.
>>>>
>>>> Thanks,
>>>> Tatsuya
>>>>
>>>>
>>>> 2011/1/29 Peter Haidinyak <phaidinyak@local.com>:
>>>>>
>>>>> I know they are always sorted but if they are how do you know which row
>>>>> key belongs to which data? Currently I use a row key of ID|Date so I always
>>>>> know what the startrow and endrow should be. I know I'm missing something
>>>>> really fundamental here. :-(
>>>>>
>>>>> Thanks
>>>>>
>>>>> -Pete
>>>>>
>>>>> -----Original Message-----
>>>>> From: tsuna [mailto:tsunanet@gmail.com]
>>>>> Sent: Friday, January 28, 2011 12:14 PM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: Row Keys
>>>>>
>>>>> On Fri, Jan 28, 2011 at 12:09 PM, Peter Haidinyak <phaidinyak@local.com>
>>>>> wrote:
>>>>>>
>>>>>>       This is going to seem like a dumb question but it is recommended
>>>>>> that you use a random key to spread the insert/read load among your
>>>>>> region servers. My question is if I am using a scan with startrow and
>>>>>> endrow how does that work with random row keys?
>>>>>
>>>>> The keys are always sorted.  So if you generate random keys, you'll
>>>>> get your data back in a random order.
>>>>> What is recommended depends on the specific problem you're trying to
>>>>> solve.  But generally, one of the strengths of HBase is that the rows
>>>>> are sorted, so sequential scanning is efficient (thanks to data
>>>>> locality).
>>>>>
>>>>> --
>>>>> Benoit "tsuna" Sigoure
>>>>> Software Engineer @ www.StumbleUpon.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> 河野 達也
>>>> Tatsuya Kawano (Mr.)
>>>> Tokyo, Japan
>>>>
>>>> twitter: http://twitter.com/tatsuya6502
>>>>
>>
>>
>
