hbase-user mailing list archives

From Em <mailformailingli...@yahoo.de>
Subject Re: Scan triggered per page-request, performance-impacts?
Date Tue, 05 Jun 2012 04:50:14 GMT

What do you mean by endpoint?

It would look more like

T2 {
   rowkey: t1_id-(Long.MAX_VALUE - time)
      family: qualifier = dummyDataSinceOnlyTheRowkeyMatters
}
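
Roughly, composing such a key with the HBase Bytes utility might look like the
sketch below (assuming the t1 id and the time are plain longs; the "-" above is
only notation, the two parts are simply concatenated as fixed-width bytes):

import org.apache.hadoop.hbase.util.Bytes;

public class T2KeyUtil {

    // Sketch only: the T2 rowkey is the 8-byte t1 id followed by
    // (Long.MAX_VALUE - time), so the newest entry per t1 id sorts first.
    public static byte[] makeRowKey(long t1Id, long timeMillis) {
        return Bytes.add(Bytes.toBytes(t1Id), Bytes.toBytes(Long.MAX_VALUE - timeMillis));
    }
}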

For every t1_id associated with a specific object, one gets the newest
entry in the T2 table (newest in relation to the key, not the internal
timestamp of creation).
These rowkeys are then sorted by their time part to get the top N.
Finally, those N records are fetched from t1 again.

At least, that's what I had in mind, though I am not sure that this is
the most efficient way.
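
To make that concrete, a rough sketch with the plain client API could look like
this, assuming the rowkey layout sketched above, non-negative t1 ids, the
placeholder table names "t1" and "t2", and (for simplicity) that the t1 rowkey
is just the 8-byte t1 id:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TopNSketch {

    // For each t1 id: read the newest T2 entry (the first row in that id's key
    // range), sort the collected rowkeys by their reversed-time suffix, keep
    // the top N, and fetch those N records from t1 in one batched multi-get.
    public static Result[] topN(Configuration conf, List<Long> t1Ids, int n)
            throws IOException {
        HTable t2 = new HTable(conf, "t2");
        List<byte[]> newest = new ArrayList<byte[]>();
        for (long id : t1Ids) {
            // Restrict the scan to this id's key range; with non-negative ids,
            // id and id + 1 keep their order as big-endian bytes.
            Scan scan = new Scan(Bytes.toBytes(id), Bytes.toBytes(id + 1));
            scan.setFilter(new FirstKeyOnlyFilter()); // only the rowkey matters
            scan.setCaching(1);
            ResultScanner scanner = t2.getScanner(scan);
            try {
                Result first = scanner.next(); // newest entry for this id, if any
                if (first != null) {
                    newest.add(first.getRow());
                }
            } finally {
                scanner.close();
            }
        }
        t2.close();

        // A smaller (Long.MAX_VALUE - time) suffix means a newer entry.
        Collections.sort(newest, new Comparator<byte[]>() {
            public int compare(byte[] a, byte[] b) {
                long ta = Bytes.toLong(a, 8);
                long tb = Bytes.toLong(b, 8);
                return ta < tb ? -1 : (ta == tb ? 0 : 1);
            }
        });

        HTable t1 = new HTable(conf, "t1");
        List<Get> gets = new ArrayList<Get>();
        for (byte[] row : newest.subList(0, Math.min(n, newest.size()))) {
            gets.add(new Get(Bytes.toBytes(Bytes.toLong(row, 0))));
        }
        Result[] page = t1.get(gets); // one batched round of Gets instead of N single calls
        t1.close();
        return page;
    }
}

Whether N small scans like this are fast enough for a page-request probably
depends on how many t1 ids are involved; for more than a few hundred it might
be worth combining them or moving the top-N selection server-side.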

Kind regards,

On 05.06.2012 04:33, NNever wrote:
> Does the schema look like this:
> T2{
>   rowkey: rs-time row
>    {
>        family:qualifier =  t1's row
>    }
> }
> Then you Scan the newest 1000 rows from T2, read each one's t1Row, and
> then do 1000 Gets from T1 for one page?
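
If T2 stores the t1 rowkey in the cell like that, a rough sketch of the
scan-then-get flow might look like the code below (assuming rs-time means a
reversed timestamp so the newest rows sort first, a placeholder column f:t1row,
and a fixed page size of 1000):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PageSketch {

    // Scan the newest 1000 rows from T2 (newest first thanks to the reversed
    // time in the key), collect the t1 rowkeys from the cells, and fetch the
    // page from T1 with a single batched get call.
    public static Result[] onePage(Configuration conf) throws IOException {
        HTable t2 = new HTable(conf, "t2");
        Scan scan = new Scan();
        scan.setCaching(1000);
        ResultScanner scanner = t2.getScanner(scan);
        List<Get> gets = new ArrayList<Get>();
        for (Result r : scanner.next(1000)) {
            byte[] t1Row = r.getValue(Bytes.toBytes("f"), Bytes.toBytes("t1row"));
            gets.add(new Get(t1Row));
        }
        scanner.close();
        t2.close();

        HTable t1 = new HTable(conf, "t1");
        Result[] page = t1.get(gets); // the client groups these Gets per region server
        t1.close();
        return page;
    }
}
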
> 2012/6/5 NNever <nneverwei@gmail.com>
>> '- I'd like to do the top N stuff on the server side to reduce traffic,
>> will this be possible? '
>> Endpoint?
>> 2012/6/5 Em <mailformailinglists@yahoo.de>
>>> Hello list,
>>> let's say I have to fetch a lot of rows for a page-request (say
>>> 1,000-2,000).
>>> The row-keys are a composition of a fixed id of an object and a
>>> sequential, ever-increasing id. Salting those keys for balancing may be
>>> worth considering.
>>> I want to do a Join like this one expressed in SQL:
>>> SELECT t1.columns FROM t1
>>> JOIN t2 ON (t1.id = t2.id)
>>> WHERE t2.id = fixedID-prefix
>>> I know that HBase does not support that out of the box.
>>> My approach is to have all the fixed-ids as columns of a row in t1.
>>> When selecting a row, I fetch those columns that are of interest to me,
>>> where each column contains a fixedID for t2.
>>> Now I do a scan on t2 for each fixedID, which should return exactly one
>>> value per fixedID (it's a kind of reverse-timestamp approach like the one
>>> in the HBase book).
>>> Furthermore I am really only interested in the key itself. I don't care
>>> about the columns (t2 is more like an index).
>>> Having fetched a row per fixedID, I sort based on the sequential part of
>>> their key and get the top N.
>>> For those top N I'll fetch data from t1.
>>> The use case is to fetch the top N most recent entities of t1 that are
>>> associated with a specific entity in t1, using t2 as an index.
>>> T2 has one extra benefit over t1: you can do range scans, if necessary.
>>> Questions:
>>> - since this is triggered by a page-request: Will this return with low
>>> latency?
>>> - is there a possibility to do those Scans in a batch? Maybe I can
>>> combine them into one big scanner, using a custom filter for what I want?
>>> - do you have thoughts on improving this type of request?
>>> - I'd like to do the top N stuff on the server side to reduce traffic,
>>> will this be possible?
>>> - I am not sure whether a Scan is really what I want. Maybe a MultiGet
>>> combined with a RowFilter would fit my needs better?
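
For the "one big scanner with a filter" idea from the two questions above, a
rough sketch could be a FilterList with one PrefixFilter per fixedID plus a
KeyOnlyFilter (assuming the t2 rowkeys start with the fixedID as a fixed-width
long); the client would still keep only the first row it sees per prefix, since
all rows of a matching prefix pass the filter:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class CombinedScanSketch {

    // One scan over t2 instead of one scan per fixedID: a row passes if it
    // matches any of the id prefixes, and KeyOnlyFilter strips the values
    // since only the rowkeys are of interest.
    public static ResultScanner openScanner(HTable t2, List<Long> fixedIds)
            throws IOException {
        FilterList anyPrefix = new FilterList(FilterList.Operator.MUST_PASS_ONE);
        for (long id : fixedIds) {
            anyPrefix.addFilter(new PrefixFilter(Bytes.toBytes(id)));
        }
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(anyPrefix);
        filters.addFilter(new KeyOnlyFilter());
        Scan scan = new Scan();
        scan.setFilter(filters);
        return t2.getScanner(scan);
    }
}

Whether this actually beats many small scans is something to measure: the
filters mainly reduce what is shipped back to the client, not what the region
servers have to read.
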
>>> I really work hard on finding the best approach for mapping this
>>> m:n relation to an HBase schema, so any help is appreciated.
>>> Please note: I haven't written a single line of HBase code so far.
>>> Currently I am studying books, blog posts, slides and the mailing lists
>>> to learn more about HBase.
>>> Thanks!
>>> Kind regards,
>>> Em
