hbase-user mailing list archives

From James Taylor <jtay...@salesforce.com>
Subject Re: Help in designing row key
Date Wed, 03 Jul 2013 11:42:48 GMT
Sure, but FYI Phoenix is not just faster, but much easier as well (as 
this email chain shows).

On 07/03/2013 04:25 AM, Flavio Pompermaier wrote:
> No, I've never seen Phoenix, but it looks like a very useful project!
> However, I don't have such strict performance requirements in my use case; I just
> want my regions to be as balanced as possible.
> So I think that in this case I will still use Bytes concatenation, if
> someone can confirm I'm doing it the right way.
>
>
> On Wed, Jul 3, 2013 at 12:33 PM, James Taylor <jtaylor@salesforce.com> wrote:
>
>> Hi Flavio,
>> Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)?
>> It will allow you to model your multi-part row key like this:
>>
>> CREATE TABLE flavio.analytics (
>>      source INTEGER,
>>      type INTEGER,
>>      qual VARCHAR,
>>      hash VARCHAR,
>>      ts DATE
>>      CONSTRAINT pk PRIMARY KEY (source, type, qual, hash, ts)
>>      -- defines the columns that make up the row key
>> )
>>
>> Then you can issue SQL queries like this (to query for the last 7 days
>> worth of data):
>> SELECT * FROM flavio.analytics WHERE source IN (1,2,5) AND type IN (55,66)
>> AND ts > CURRENT_DATE() - 7
>>
>> This will internally take advantage of our SkipScan
>> (http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html)
>> to jump through your key space similar to FuzzyRowFilter, but in parallel
>> from the client taking into account your region boundaries.
>>
>> Or do more complex GROUP BY queries like this (to aggregate over the last
>> 30 days worth of data, bucketized by day):
>> SELECT type,COUNT(*) FROM flavio.analytics WHERE ts > CURRENT_DATE() - 30
>> GROUP BY type,TRUNCATE(ts,'DAY')
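>>
>> Both queries run over plain JDBC (java.sql), since Phoenix ships a JDBC
>> driver. Roughly like this (sketch; assumes the driver is on your classpath
>> and your zookeeper quorum is on localhost):
>>
>> Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
>> ResultSet rs = conn.createStatement().executeQuery(
>>     "SELECT * FROM flavio.analytics " +
>>     "WHERE source IN (1,2,5) AND type IN (55,66) AND ts > CURRENT_DATE() - 7");
>> while (rs.next()) {
>>     // columns come back typed - no byte[] decoding on your side
>>     System.out.println(rs.getInt("source") + " " + rs.getDate("ts"));
>> }
>> rs.close();
>> conn.close();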
>>
>> No need to worry about lexicographical sort order, flipping sign bits,
>> normalizing/padding integer values, and all the other nuances of working
>> with an API that works at the level of bytes. No need to write and manage
>> installation of your own coprocessors to make aggregation efficient,
>> perform topN queries, etc.
>>
>> HTH.
>>
>> Regards,
>> James
>> @JamesPlusPlus
>>
>>
>> On 07/03/2013 02:58 AM, Anoop John wrote:
>>
>>> When you build the RK and convert the int parts into byte[] (use
>>> org.apache.hadoop.hbase.util.Bytes#toBytes(int)), it will give 4 bytes
>>> for every int. Be careful about the ordering: when you convert a +ve and
>>> a -ve integer into byte[] and do a lexicographical compare (as done in
>>> HBase), you will see the -ve number sorting greater than the +ve one. If
>>> you don't have to deal with -ve numbers, no issue :)
>>>
>>> And when all the parts of the RK are of fixed width, will you need any
>>> separator at all?
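>>>
>>> For instance, a quick check of that ordering (rough sketch with
>>> org.apache.hadoop.hbase.util.Bytes; flipping the sign bit is one common
>>> fix if you do have -ve values):
>>>
>>>   byte[] neg = Bytes.toBytes(-1);   // FF FF FF FF
>>>   byte[] pos = Bytes.toBytes(1);    // 00 00 00 01
>>>   // Lexicographically (as HBase compares row keys) -1 sorts AFTER 1:
>>>   System.out.println(Bytes.compareTo(neg, pos) > 0);   // true
>>>
>>>   // XOR-ing the sign bit restores numeric order for signed ints:
>>>   byte[] negFixed = Bytes.toBytes(-1 ^ Integer.MIN_VALUE);
>>>   byte[] posFixed = Bytes.toBytes(1 ^ Integer.MIN_VALUE);
>>>   System.out.println(Bytes.compareTo(negFixed, posFixed) < 0);  // true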
>>>
>>> -Anoop-
>>>
>>> On Wed, Jul 3, 2013 at 2:44 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>>
>>>> Yeah, I was thinking to use a normalization step in order to allow the
>>>> use of FuzzyRowFilter, but what is not clear to me is if integers must
>>>> also be normalized or not.
>>>> I will explain myself better. Suppose that I follow your advice and I
>>>> produce keys like:
>>>>    - 1|1|somehash|sometimestamp
>>>>    - 55|555|somehash|sometimestamp
>>>>
>>>> Would they match the same pattern or do I have to normalize them to the
>>>> following?
>>>>    - 001|001|somehash|sometimestamp
>>>>    - 055|555|somehash|sometimestamp
>>>>
>>>> Moreover, I noticed that you used dots ('.') to separate things instead of
>>>> pipes ('|'). Is there a reason for that (maybe performance or whatever) or
>>>> is it just your favourite separator?
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>>
>>>> On Wed, Jul 3, 2013 at 10:12 AM, Mike Axiak <mike@axiak.net> wrote:
>>>>
>>>>   I'm not sure if you're eliding this fact or not, but you'd be much
>>>>> better off if you used a fixed-width format for your keys. So in your
>>>>> example, you'd have:
>>>>>
>>>>> PATTERN: source(4-byte-int).type(4-byte-int or smaller).fixed 128-bit
>>>>> hash.8-byte timestamp
>>>>>
>>>>> Example: \x00\x00\x00\x01\x00\x00\x02\x03....
>>>>>
>>>>> The advantage of this is not only that it's significantly less data
>>>>> (remember your key is stored on each KeyValue), but also you can now
>>>>> use FuzzyRowFilter and other techniques to quickly perform scans. The
>>>>> disadvantage is that you have to normalize the source -> integer mapping,
>>>>> but I find I can either store that in an enum or cache it for a long time,
>>>>> so it's not a big issue.
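>>>>>
>>>>> For example, with the 4 + 4 + 16 + 8 = 32-byte layout above, building the
>>>>> key and a fuzzy scan over one source+type looks roughly like this (sketch;
>>>>> sourceId/typeId/hashBytes/ts are placeholders):
>>>>>
>>>>> // uses org.apache.hadoop.hbase.util.{Bytes,Pair},
>>>>> // org.apache.hadoop.hbase.filter.FuzzyRowFilter, java.util.Arrays
>>>>> byte[] rowKey = Bytes.add(
>>>>>     Bytes.add(Bytes.toBytes(sourceId), Bytes.toBytes(typeId)),
>>>>>     Bytes.add(hashBytes, Bytes.toBytes(ts)));  // 16-byte hash + 8-byte ts
>>>>>
>>>>> byte[] fuzzyKey = Bytes.add(
>>>>>     Bytes.add(Bytes.toBytes(sourceId), Bytes.toBytes(typeId)),
>>>>>     new byte[24]);                             // hash + ts positions, values ignored
>>>>> byte[] fuzzyMask = new byte[32];
>>>>> Arrays.fill(fuzzyMask, 8, 32, (byte) 1);       // 0 = must match, 1 = don't care
>>>>>
>>>>> Scan scan = new Scan();
>>>>> scan.setFilter(new FuzzyRowFilter(
>>>>>     Arrays.asList(new Pair<byte[], byte[]>(fuzzyKey, fuzzyMask))));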
>>>>>
>>>>> -Mike
>>>>>
>>>>> On Wed, Jul 3, 2013 at 4:05 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>>>>
>>>>>> Thank you very much for the great support!
>>>>>> This is how I thought to design my key:
>>>>>>
>>>>>> PATTERN: source|type|qualifier|hash(name)|timestamp
>>>>>> EXAMPLE:
>>>>>> google|appliance|oven|be9173589a7471a7179e928adc1a86f7|1372837702753
>>>>>>
>>>>>> Do you think my key would be good for my use case (my searches will be
>>>>>> essentially by source or source|type)?
>>>>>> Another point is that initially I will not have many sources, so I will
>>>>>> probably have only google|*, but in the next phases there could be more
>>>>>> sources.
>>>>>>
>>>>>> Best,
>>>>>> Flavio
>>>>>>
>>>>>> On Tue, Jul 2, 2013 at 7:53 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>
>>>>>>> For #1, yes - the client receives less data after filtering.
>>>>>>> For #2, please take a look at TestMultiVersions
>>>>>>> (./src/test/java/org/apache/hadoop/hbase/TestMultiVersions.java in 0.94)
>>>>>>> for time range:
>>>>>>>       scan = new Scan();
>>>>>>>
>>>>>>>       scan.setTimeRange(1000L, Long.MAX_VALUE);
>>>>>>> For row key selection, you need a filter. Take a look at
>>>>>>> FuzzyRowFilter.java
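>>>>>>>
>>>>>>> For the "key starts with 'someid-' and newer than some timestamp" case,
>>>>>>> a start/stop row plus the time range should be enough - roughly (sketch;
>>>>>>> 'someid-' and someTimestamp are placeholders):
>>>>>>>
>>>>>>>      Scan scan = new Scan();
>>>>>>>      scan.setStartRow(Bytes.toBytes("someid-"));
>>>>>>>      // '.' is the byte right after '-', so this stops at the end of the prefix
>>>>>>>      scan.setStopRow(Bytes.toBytes("someid."));
>>>>>>>      scan.setTimeRange(someTimestamp, Long.MAX_VALUE);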
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Tue, Jul 2, 2013 at 10:35 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>>>>>>> Thanks for the reply! I thus have two more questions:
>>>>>>>>
>>>>>>>> 1) Is it true that filtering on timestamps doesn't affect performance?
>>>>>>>> 2) Could you send me a little snippet of how you would do such a filter
>>>>>>>> (by row key + timestamp)? For example, get all rows whose key starts
>>>>>>>> with 'someid-' and whose timestamp is greater than some timestamp?
>>>>>>>> Best,
>>>>>>>> Flavio
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 2, 2013 at 6:25 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>>>
>>>>>>>>   bq. Using timestamp in row-keys is discouraged
>>>>>>>>> The above is true.
>>>>>>>>> Prefixing row key with timestamp would create hot region.
>>>>>>>>>
>>>>>>>>> bq. should I filter by a simpler row-key plus a filter on timestamp?
>>>>>>>>>
>>>>>>>>> You can do the above.
>>>>>>>>> On Tue, Jul 2, 2013 at 9:13 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>>>>>>>>>> Hi to everybody,
>>>>>>>>>>
>>>>>>>>>> in my use case I have to perform batch analysis skipping old data.
>>>>>>>>>> For example, I want to process all rows created after a certain
>>>>>>>>>> timestamp, passed as a parameter.
>>>>>>>>>>
>>>>>>>>>> What is the most effective way to do this?
>>>>>>>>>> Should I design my row-key to embed the timestamp?
>>>>>>>>>> Or is just filtering by the timestamp of the row fast as well? Or what
>>>>>>>>>> else?
>>>>>>>>>> Initially I was thinking to compose my key as:
>>>>>>>>>> timestamp|source|title|type
>>>>>>>>>>
>>>>>>>>>> but:
>>>>>>>>>>
>>>>>>>>>> 1) Using timestamp in row-keys is discouraged
>>>>>>>>>> 2) If this design is ok, using this approach I still have problems
>>>>>>>>>> filtering by timestamp, because I cannot find a way to filter
>>>>>>>>>> numerically (instead of alphanumerically/by string). For example,
>>>>>>>>>> 1372776400441|something has a timestamp lesser than
>>>>>>>>>> 1372778470913|somethingelse, but I cannot filter all rows whose key is
>>>>>>>>>> "numerically" greater than 1372776400441. Is it possible to overcome
>>>>>>>>>> this issue?
>>>>>>>>>> 3) If this design is not ok, should I filter by a simpler row-key plus
>>>>>>>>>> a filter on timestamp? Or what else?
>>>>>>>>>> Best,
>>>>>>>>>> Flavio
>>>>>>>>>>
>>>>>>>>>>
>
> --
>
> Flavio Pompermaier
> Development Department
> OKKAM Srl - www.okkam.it
>
> Phone: +(39) 0461 283 702
> Fax: +(39) 0461 186 6433
> Email: f.pompermaier@okkam.it
> Headquarters: Trento (Italy), fraz. Villazzano, Salita dei Molini 2
> Registered office: Trento (Italy), via Segantini 23
>

