hbase-user mailing list archives

From Em <mailformailingli...@yahoo.de>
Subject Re: Scan triggered per page-request, performance-impacts?
Date Wed, 06 Jun 2012 10:33:34 GMT
Hi NN,

answers are inline.

On 06.06.2012 03:37, NNever wrote:
>>  Am I able to do this with one scan?
> No, I don't think so (unless you define a custom filter, but that may
> not be fast enough). And you may have misunderstood the Scan in step 2.
> For example, say you subscribe to Lars and Stack. Then there will be 2
> Scans with startRow/stopRow, that is:
> 
> Scan scan1 = new Scan();
> scan1.setStartRow(Bytes.toBytes("Lars"));
> scan1.setStopRow(Bytes.toBytes("Lars" + X));   // X stands for a big enough char
> ...doScan... (with a filter or something to get the 1st record in the result)
> 
> Scan scan2 = new Scan();
> scan2.setStartRow(Bytes.toBytes("Stack"));
> scan2.setStopRow(Bytes.toBytes("Stack" + X));   // X stands for a big enough char
> ...doScan... (with a filter or something to get the 1st record in the result)
> 
> So, if the data in the Index Table are like this:
> Lars_72345
> Lars_72440
> Lucy_13231
> Lucy_23211
> Lucy_24111
> Stack_64561
> Stack_65552
> ...

Okay, I understand. Is there a way to batch these Scans more
efficiently, i.e. to execute more than one Scan per RPC round-trip?
As it stands I would have to make n Scans, which results in n RPC
round-trips, where n is the number of subscriptions I have.
For large n this would kill the performance.
Something like 2,000 Gets (which is not possible here, as you said)
might perform better, since the client can group the Gets per region,
which leads to fewer RPC round-trips. Is there an equivalent for Scans?
It should be possible to predict where a Scan will be executed,
given its start and stop rows.
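
Just to illustrate what I mean by grouping: a batched multi-get could
look roughly like the sketch below (the table name "index" and the two
rowkeys are only placeholders from your example, and of course it
assumes the exact keys are known - which is exactly what I don't have):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable index = new HTable(conf, "index");

// One Get per subscribed blog. The client groups the Gets by region
// server internally, so n Gets cost far fewer than n RPC round-trips.
List<Get> gets = new ArrayList<Get>();
gets.add(new Get(Bytes.toBytes("Lars_72345")));
gets.add(new Get(Bytes.toBytes("Stack_64561")));

Result[] results = index.get(gets);
for (Result r : results) {
    if (r != null && !r.isEmpty()) {
        System.out.println(Bytes.toString(r.getRow()));
    }
}
index.close();

I am looking for something with the same round-trip behaviour, but for
range reads instead of exact keys.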

> Then the Result should be:
> Lars_72345,
> Stack_64561.
> 
>> Unfortunately I haven't found that much about when to do a Scan and when
>> to do a Get
> When you know exactly what the rowkey is, do a Get (just like when you
> use an ID: the ID maps to a single row, and you fetch only that row).
> When you only know the prefix of the rowkey, do a Scan. (A Scan can
> return more than one result.)
Thanks!
Do you know when to use a RowFilter for Gets?
I don't see why they exist (I think they were mentioned for Gets
somewhere).
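
To make my mental model explicit: for a prefix match I would currently
reach for a RowFilter on a Scan, not on a Get - something like the
sketch below (the table name "index" is a placeholder; the start row is
there so the scanner doesn't walk the whole table before the filter
kicks in):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable index = new HTable(conf, "index");

// Match every rowkey starting with "Lars_".
Scan scan = new Scan(Bytes.toBytes("Lars_"));
scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
        new BinaryPrefixComparator(Bytes.toBytes("Lars_"))));

ResultScanner scanner = index.getScanner(scan);
try {
    // First row = newest entry, thanks to the reversed timestamps.
    Result first = scanner.next();
    if (first != null) {
        System.out.println(Bytes.toString(first.getRow()));
    }
} finally {
    scanner.close();
}
index.close();

So I wonder what a RowFilter buys me on a Get, where the row is already
pinned down by the key.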

> 
>> You mean instead of designing the key [blogId]_[timestamp]  I should do it
> this way:  [timestamp]_[blogId]?
> No no... As I said, 'Sort all fetched IndexRows'; those fetched rows are
> the Scan results above:
> Lars_72345 and Stack_64561.
> Sort them by time, then you get:
> Stack_64561,
> Lars_72345

I do not see how this differs from my explanation. Or do we mean the
same thing? :)

You've led me to an idea for improving my scenario.
One of the last steps in my concept was to retrieve the top N
index-entries, sort them by their timestamp and do a Get for each of the
N to retrieve the N most recent blogposts for them.
Well, there is a better option:
Instead of returning only the most recent entry for Lars via a Scan, I
can retrieve the top N most recent entries for Lars from the index-table.
If I have to run my Scans one after another, since they are not
executable in a batch out of the box, I could specify a tighter key
range for "Stack":

Say Lars's third-most-recent blogpost's key in the index-table looks like:

Lars_74244

The top N Scan for Stack then only has to cover the range from Stack_ up
to Stack_74244 (with the reversed timestamps, newer posts sort before
that key), since I only care about blogposts by Stack that are newer
than the ones I already have from Lars.
I think the Scan should return instantly if there are no rowkeys in
that range, shouldn't it?
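
In code, I picture that bounded Scan roughly like this (a sketch only;
the table name "index" is a placeholder and the keys come from the
example above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable index = new HTable(conf, "index");

// With the reversed timestamps, Stack's newest posts sort first, so
// everything newer than Lars's third post lies in [Stack_, Stack_74244).
Scan scan = new Scan(Bytes.toBytes("Stack_"), Bytes.toBytes("Stack_74244"));
scan.setCaching(10); // fetch a few index rows per RPC round-trip

ResultScanner scanner = index.getScanner(scan);
try {
    // If Stack has no post newer than Lars_74244, the range is empty
    // and next() returns null immediately.
    for (Result r = scanner.next(); r != null; r = scanner.next()) {
        System.out.println(Bytes.toString(r.getRow()));
    }
} finally {
    scanner.close();
}
index.close();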

Kind regards,
Em

> 
> 
> 
> 2012/6/6 Em <mailformailinglists@yahoo.de>
> 
>> Thanks for your feedback!
>>
>>> 2.  Scan the newest from the Index table for each subscribed blogID
>> Am I able to do this with one scan?
>> Since all my subscribed blogs are relevant, this could lead to start and
>> stop rows spanning a range that almost every other blog in the database
>> falls into (think of one subscribed blog starting with "a" and another
>> starting with "z" - both are in my subscription list).
>> I thought that filtering only scales linearly with the size of the
>> table itself?
>>
>> Unfortunately I haven't found that much about when to do a Scan and when
>> to do a Get. Especially if the keys all start differently.
>>
>>> 3.  Sort all fetched IndexRows by publication-date (in fact, not sort
>>> all of them, but get the newest 3, which may be faster)
>> You mean instead of designing the key
>> [blogId]_[timestamp]
>> I should do it this way:
>> [timestamp]_[blogId]?
>>
>> Well, it depends on your scenario.
>> If you want to know the most recent blogposts globally or within a
>> period of time, you are absolutely right. But if you want to know them
>> for a particular user/blog, limited to his subscriptions, this could be
>> really slow if the most recent blog posts relevant to this user are
>> relatively old.
>>
>> Did you mean that or something different?
>>
>> Kind regards,
>> Em
>>
>> On 05.06.2012 11:18, NNever wrote:
>>> Very clear now :).
>>> Only one problem,
>>>
>>> blog {//this is t1 of my example
>>>    blogposts {//the column family
>>>       05.05.2012_something { the blog post },//this is a column
>>>       06.05.2012_anything  { the blog post },
>>>       05.06.2012_nothing   { the blog post }
>>>    },
>>> ...
>>>
>>> Here, 05.05.2012_something does not sort easily: you would have to
>>> fetch all posts and sort them, and when the post count becomes huge,
>>> this can be terrible. Just change it to reverseTime_postTitle; then you
>>> can use ColumnPaginationFilter to easily fetch the newest 3 columns.
>>>
>>>
>>> After all, when you want to get the most recent subscribed posts for
>>> someone, you do:
>>> 1.  Get all values of Blog.subscribed_blogs (the subscribed blogIDs)
>>> ---------------------- get by row, not slow
>>> 2.  Scan the newest from the Index table for each subscribed blogID
>>> -------------------- Scan with startRow and stopRow, not slow
>>> 3.  Sort all fetched IndexRows by publication-date (in fact, not sort
>>> all of them, but get the newest 3, which may be faster) ---------------- not slow
>>> 4.  Using the 3 blogIDs from above, get the 3 newest columns of CF
>>> blogposts for each on table Blog ------------------ use dictionary-order, use
>>> ColumnPaginationFilter, not slow, but I wonder how fast it will be
>>> 5.  Compare those 9 posts, get the newest 3 ------------ not slow
>>> Overall, almost all lookups use the rowkey, so the whole process may
>>> not have much delay, I think.
>>>
>>>
>>> Best Regards,
>>> NN
>>>
>>>
>>>
>>> 2012/6/5 Em <mailformailinglists@yahoo.de>
>>>
>>>> Correction of my last sentences:
>>>>> So, what happens if TechCrunch writes a new blog post?
>>>>> It will create a new column in its row's blogposts-CF and trigger a
>>>>> million writes in the index-table (which only writes keys and empty
>>>>> values of 0-byte length - I assume that's the cheapest write I can do).
>>>> Of course I mean it will NOT trigger a million writes in the index
>>>> table, but only ONE write for this post.
>>>>
>>>> Kind regards,
>>>> Em
>>>>
>>>> On 05.06.2012 10:00, Em wrote:
>>>>> NN,
>>>>>
>>>>> thanks for the pointer to Coprocessors. I'll take a look at them!
>>>>>
>>>>> Okay, I see that my descriptions are confusing.
>>>>>
>>>>> Let me give you an example of some simplified entities in my tables:
>>>>>
>>>>> blog {//this is t1 of my example
>>>>>     blogposts {//the column family
>>>>>        05.05.2012_something { the blog post },//this is a column
>>>>>        06.05.2012_anything  { the blog post },
>>>>>        05.06.2012_nothing   { the blog post }
>>>>>     },
>>>>>     subscribed_blogs {
>>>>>        Wil_Wheaton's Blog { date_of_subscription },
>>>>>        Sheldon's Blog     { date_of_subscription },
>>>>>        Penny's Blog       { date_of_subscription },
>>>>>        ... hundreds of other blogs ...
>>>>>     }
>>>>> }
>>>>>
>>>>> This blog has 3 blogposts. Each column of the blog's blogposts
>>>>> column-family contains a blogpost, where the column-name contains the
>>>>> date and the title. This way columns can be accessed ordered by date.
>>>>> Now this blog (or better said, its author) is following some other blogs.
>>>>> I do not want to get the posts of the subscribed blogs and write them
>>>>> into the blog's row (duplicating the posts of the followed blogs).
>>>>> The reason is that you would have to keep in sync with the original
>>>>> posts. Furthermore, a very popular blog could trigger millions of
>>>>> writes (at least one write per user). This is too much.
>>>>>
>>>>> So I want to build an index, t2. Let's call this table "index".
>>>>>
>>>>> index {
>>>>>     dummy_column_family {
>>>>>        dummy_column { I only care about the rowkey. }
>>>>>    }
>>>>> }
>>>>>
>>>>> If a blog writes a new post, I'll write that post into the blogposts
>>>>> table for the blog and additionally in the index table.
>>>>> The rowkey would look like:
>>>>> [blog_id]_[Long.MAX_VALUE - publication-date] (doing it this way, they
>>>>> are sorted by HBase in LIFO-order).
>>>>> Note: The publication date could be in the future! So it's not the date
>>>>> of creation.
>>>>>
>>>>> Now, if I have a blog and I have subscribed to 1,000 other blogs as
>>>>> well: To generate a list of the most recent blog-posts of the blogs I
>>>>> subscribed to, I do the following:
>>>>>
>>>>> Read every column of my subscribed_blogs column family (they contain
>>>>> the other blogs' ids).
>>>>>
>>>>> For each column, I want to do a lookup in my index-table (Scan or Get,
>>>>> I am not sure which to use, since one may be able to batch them):
>>>>> Get: blog_id* (the "*" means that the rowkey should start with the
>>>>> specified blog_id).
>>>>> I want to fetch only the most recent per blog_id.
>>>>> Now I have 1,000 rowkeys, each containing a blog_id and a timestamp.
>>>>> Let's sort by timestamp and get the top 3 (maybe I can do some part of
>>>>> this work on the server side).
>>>>> I see that my top 3 list contains a post from Wil Wheaton, one from
>>>>> Sheldon and another one from TechCrunch.
>>>>>
>>>>> Now I'll do three Gets in my blog-table:
>>>>> One for Wil Wheaton's blog, another for Sheldon's and one for
>>>>> TechCrunch's.
>>>>> Since their columns are sorted by date, I'll fetch the latest 3
>>>>> blog-posts of each blog, returning 9 blog-posts in sum.
>>>>> Now I am able to sort these 9 blog-posts by their date and display the
>>>>> top 3.
>>>>>
>>>>> Why do I fetch the top 3 of each blog and sort them again?
>>>>> Well, if Wil Wheaton wrote a blog-post yesterday, TechCrunch wrote one
>>>>> two days ago and Sheldon wrote two posts today, and I want to get the
>>>>> three most recent posts of all my subscribed blogs, then TechCrunch is
>>>>> out of this list, since it has the 4th-most-recent blog-post.
>>>>>
>>>>> I hope the scenario is clearer now.
>>>>>
>>>>> So, what happens if TechCrunch writes a new blog post?
>>>>> It will create a new column in its row's blogposts-CF and trigger a
>>>>> million writes in the index-table (which only writes keys and empty
>>>>> values of 0-byte length - I assume that's the cheapest write I can do).
>>>>>
>>>>> Kind regards,
>>>>> Em
>>>>>
>>>>>
>>>>> On 05.06.2012 08:07, NNever wrote:
>>>>>> 1. Endpoint is a kind of Coprocessor; it was added in 0.92. You can
>>>>>> think of it a little like a relational database's stored procedure:
>>>>>> some logic that runs on the HBase server side. With it you may reduce
>>>>>> your app's RPC calls, or, as you said, reduce traffic.
>>>>>> You can get some help on Coprocessors/Endpoints here:
>>>>>> https://blogs.apache.org/hbase/entry/coprocessor_introduction
>>>>>> 2. I am still a little confused about what exactly you want with this
>>>>>> table structure (sorry for that, but my mother tongue is not English).
>>>>>> You mean t1 is the original data of some objects,
>>>>>> and t2 keeps something about the objects in t1? (Like logs: 10:11 em
>>>>>> checks t1obj1; 10:13 em buys t1obj1; 10:30 em takes away t1obj1.)
>>>>>> 3. You said 'This data is then sorted by the time part of the returned
>>>>>> rowkeys to get the Top N of these.' Well, there may be no need to do
>>>>>> the sort. HBase keeps data in dictionary order, so you just fetch N of
>>>>>> them; they are already ordered.
>>>>>> 4. I haven't used HBase for long - in fact, I'm still a noob at it :).
>>>>>> I would be glad if anything here can help you.
>>>>>>
>>>>>> Best Regards,
>>>>>> NN
>>>>>>
>>>>>>
>>>>>> 2012/6/5 Em <mailformailinglists@yahoo.de>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> what do you mean by endpoint?
>>>>>>>
>>>>>>> It would look more like
>>>>>>>
>>>>>>> T2 {
>>>>>>>   rowkey: t1_id-(Long.MAX_VALUE - time)
>>>>>>>   {
>>>>>>>      family: qualifier = dummyDataSinceOnlyTheRowkeyMatters
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> For every t1_id associated with a specific object, one gets the
>>>>>>> newest entry in the T2-table (newest in relation to the key, not the
>>>>>>> internal timestamp of creation).
>>>>>>> This data is then sorted by the time part of the returned rowkeys to
>>>>>>> get the Top N of these.
>>>>>>> And then you get N records from t1 again.
>>>>>>>
>>>>>>> At least, that's what I thought about, though I am not sure that this
>>>>>>> is the most efficient way.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Em
>>>>>>>
>>>>>>> On 05.06.2012 04:33, NNever wrote:
>>>>>>>> Does the schema look like this:
>>>>>>>>
>>>>>>>> T2{
>>>>>>>>   rowkey: rs-time row
>>>>>>>>    {
>>>>>>>>        family:qualifier =  t1's row
>>>>>>>>    }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Then you Scan the newest 1000 from T2, get each one's t1Row, and
>>>>>>>> then do 1000 Gets from T1 for one page?
>>>>>>>>
>>>>>>>> 2012/6/5 NNever <nneverwei@gmail.com>
>>>>>>>>
>>>>>>>>> '- I'd like to do the top N stuff on the server side to reduce
>>>>>>>>> traffic, will this be possible?'
>>>>>>>>>
>>>>>>>>> Endpoint?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2012/6/5 Em <mailformailinglists@yahoo.de>
>>>>>>>>>
>>>>>>>>>> Hello list,
>>>>>>>>>>
>>>>>>>>>> let's say I have to fetch a lot of rows for a page-request (say
>>>>>>>>>> 1,000-2,000).
>>>>>>>>>> The row-keys are a composition of a fixed id of an object and a
>>>>>>>>>> sequential, ever-increasing id. Salting those keys for balancing
>>>>>>>>>> may be taken into consideration.
>>>>>>>>>>
>>>>>>>>>> I want to do a Join like this one expressed in SQL:
>>>>>>>>>>
>>>>>>>>>> SELECT t1.columns FROM t1
>>>>>>>>>> JOIN t2 ON (t1.id = t2.id)
>>>>>>>>>> WHERE t2.id = fixedID-prefix
>>>>>>>>>>
>>>>>>>>>> I know that HBase does not support that out of the box.
>>>>>>>>>> My approach is to have all the fixed-ids as columns of a row in t1.
>>>>>>>>>> Selecting a row, I fetch those columns that are of interest for me,
>>>>>>>>>> where each column contains a fixedID for t2.
>>>>>>>>>> Now I do a scan on t2 for each fixedID, which should return me
>>>>>>>>>> exactly one value per fixedID (it's kind of a reverse-timestamp
>>>>>>>>>> approach like in the HBase book).
>>>>>>>>>> Furthermore, I am really only interested in the key itself. I
>>>>>>>>>> don't care about the columns (t2 is more like an index).
>>>>>>>>>> Having fetched a row per fixedID, I sort based on the sequential
>>>>>>>>>> part of their key and get the top N.
>>>>>>>>>> For those top N I'll fetch data from t1.
>>>>>>>>>>
>>>>>>>>>> The usecase is to fetch the top N most recent entities of t1 that
>>>>>>>>>> are associated with a specific entity in t1, using t2 as an index.
>>>>>>>>>> T2 has one extra benefit over t1: you can do range-scans, if
>>>>>>>>>> necessary.
>>>>>>>>>>
>>>>>>>>>> Questions:
>>>>>>>>>> - since this is triggered by a page-request: will this return with
>>>>>>>>>> low latency?
>>>>>>>>>> - is there a possibility to do those Scans in a batch? Maybe I can
>>>>>>>>>> combine them into one big scanner, using a custom filter for what
>>>>>>>>>> I want?
>>>>>>>>>> - do you have thoughts on improving this type of request?
>>>>>>>>>> - I'd like to do the top N stuff on the server side to reduce
>>>>>>>>>> traffic; will this be possible?
>>>>>>>>>> - I am not sure whether a Scan is really what I want. Maybe a
>>>>>>>>>> Multiget combined with a RowFilter will fit my needs better?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I really work hard on finding the best approach for mapping this
>>>>>>>>>> m:n-relation to an HBase schema - so any help is appreciated.
>>>>>>>>>>
>>>>>>>>>> Please note: I haven't written a single line of HBase code so far.
>>>>>>>>>> Currently I am studying books, blog-posts, slides and the mailing
>>>>>>>>>> lists to learn more about HBase.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Em
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
> 
