hbase-user mailing list archives

From Christian Schäfer <syrious3...@yahoo.de>
Subject Re: How to query by rowKey-infix
Date Thu, 02 Aug 2012 12:23:19 GMT
OK,

first I will try the scans.

If that's too slow, I will have to upgrade HBase (currently 0.90.4-cdh3u2) to be able to use
coprocessors.

Currently I'm stuck with the scans because they require two steps (hence some kind of filter
chaining).

The key:  userId-dateInMillis-sessionId

First I need to extract dateInMillis with a regex or substring (using special delimiters
around the date).

Second, the extracted value must be parsed as a Long and compared via a RowFilter
comparator, like this:
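
(Sketch only -- the table name "sessions", the '#' delimiters around the date and the
client-side range check are assumptions on my part; as far as I can see the stock
comparators can't do a numeric range check on a key infix, so the regex just pre-filters
the key shape and the Long comparison stays on the client.)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScan {

    public static void main(String[] args) throws IOException {
        long from = Long.parseLong(args[0]);   // lower bound in millis
        long to   = Long.parseLong(args[1]);   // upper bound in millis

        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "sessions");   // assumed table name
        try {
            Scan scan = new Scan();
            // Server-side pre-filter on the key shape userId#dateInMillis#sessionId;
            // the actual range check happens client-side below.
            scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
                    new RegexStringComparator("^[^#]+#\\d+#.+$")));

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result result : scanner) {
                    String key = Bytes.toString(result.getRow());
                    long date = Long.parseLong(key.split("#")[1]);
                    if (date >= from && date <= to) {
                        // inside the time range -> collect for the daily report
                        System.out.println(key);
                    }
                }
            } finally {
                scanner.close();
            }
        } finally {
            table.close();
        }
    }
}

With a FilterList one could combine the regex filter with further filters, but the Long
comparison itself still ends up on the client this way.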





----- Original Message -----
From: Michael Segel <michael_segel@hotmail.com>
To: user@hbase.apache.org
CC: 
Sent: Wednesday, 1 August 2012, 13:52
Subject: Re: How to query by rowKey-infix

Actually, with coprocessors you can create a secondary index in short order. 
Then your cost is going to be 2 fetches. Trying to do a partial table scan will be more
expensive.
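
For illustration, the read path could look roughly like the sketch below -- the index
table "sessions_by_date", its key layout (8-byte big-endian date followed by the rest)
and the i:key column holding the primary row key are made-up names, and the coprocessor's
job would be to maintain that index on writes:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexLookup {

    // Fetch 1: range scan on the index table (date is the leading key part there).
    // Fetch 2: point Gets of the real rows in the data table.
    static List<Result> fetchByDateRange(HTable indexTable, HTable dataTable,
                                         long from, long to) throws IOException {
        List<Result> rows = new ArrayList<Result>();
        // 8-byte big-endian millis sort correctly for non-negative timestamps;
        // stopRow is exclusive, so use to + 1 to keep date == to in the range.
        Scan idxScan = new Scan(Bytes.toBytes(from), Bytes.toBytes(to + 1));
        ResultScanner scanner = indexTable.getScanner(idxScan);
        try {
            for (Result idx : scanner) {
                byte[] primaryKey = idx.getValue(Bytes.toBytes("i"), Bytes.toBytes("key"));
                rows.add(dataTable.get(new Get(primaryKey)));
            }
        } finally {
            scanner.close();
        }
        return rows;
    }
}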


On Jul 31, 2012, at 12:41 PM, Matt Corgan <mcorgan@hotpads.com> wrote:

> When deciding between a table scan vs secondary index, you should try to
> estimate what percent of the underlying data blocks will be used in the
> query.  By default, each block is 64KB.
> 
> If each user's data is small and you are fitting multiple users per block,
> then you're going to need all the blocks, so a table scan is better because
> it's simpler.  If each user has 1MB+ data then you will want to pick out
> the individual blocks relevant to each date.  The secondary index will help
> you go directly to those sparse blocks, but with a cost in complexity,
> consistency, and extra denormalized data that knocks primary data out of
> your block cache.
> 
> If latency is not a concern, I would start with the table scan.  If that's
> too slow you add the secondary index, and if you still need it faster you
> do the primary key lookups in parallel as Jerry mentions.
> 
> Matt
> 
> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <chilinglam@gmail.com> wrote:
> 
>> Hi Chris:
>> 
>> I'm thinking about building a secondary index for primary key lookups, then
>> querying using the primary keys in parallel.
>> 
>> I'm interested to see if there are other options too.
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3000@yahoo.de
>>> wrote:
>> 
>>> Hello there,
>>> 
>>> I designed a row key for queries that need best performance (~100 ms)
>>> which looks like this:
>>> 
>>> userId-date-sessionId
>>> 
>>> These queries (scans) are always based on a userId and sometimes
>>> additionally on a date.
>>> That's no problem with the key above.
>>> 
>>> However, another kind of query shall be based on a given time range,
>>> where the leftmost userId is not given or known.
>>> In this case I need to get all rows covering the given time range with
>>> their date to create a daily report.
>>> 
>>> As I can't set a wildcard at the beginning of a left-anchored row key for
>>> the scan, I only see the possibility to scan the whole table to collect
>>> the rowKeys that are inside the time range I'm interested in.
>>> 
>>> Is there a more elegant way to collect rows within time range X?
>>> (Unfortunately, the date attribute is not equal to the timestamp that is
>>> stored by hbase automatically.)
>>> 
>>> Could/should one maybe leverage some kind of row key caching to
>>> accelerate the collection process?
>>> Is that covered by the block cache?
>>> 
>>> Thanks in advance for any advice.
>>> 
>>> regards
>>> Chris
>>> 
>> 
