hbase-user mailing list archives

From Em <mailformailingli...@yahoo.de>
Subject Re: Scan triggered per page-request, performance-impacts?
Date Tue, 05 Jun 2012 08:00:21 GMT

Thanks for pointing me to Coprocessors. I'll take a look at them!

Okay, I see that my descriptions are confusing.

Let me give you an example of some simplified entities in my tables:

blog { //this is t1 of my example
    blogposts { //the column family
       05.05.2012_something { the blog post }, //this is a column
       06.05.2012_anything  { the blog post },
       05.06.2012_nothing   { the blog post }
    },
    subscribed_blogs {
       Wil_Wheaton's Blog { date_of_subscription },
       Sheldon's Blog     { date_of_subscription },
       Penny's Blog       { date_of_subscription },
       ... hundreds of other blogs ...
    }
}

This blog has 3 blogposts. Each column of the user's blogposts
column-family contains a blogpost, where the column-name contains the
date and the title. This way columns can be accessed ordered by date.
Now this blog (or rather its author) is following some other blogs.
I do not want to fetch the posts of the subscribed blogs and write them
into the blog's row (duplicating the posts of the followed blogs).
The reason is that you would have to keep them in sync with the
original posts. Furthermore, a very popular blog could trigger millions
of writes (at least one write per subscriber). This is too much.

So I want to build an index, t2. Let's call this table "index".

index {
    dummy_column_family {
       dummy_column { i only care about the rowkey }
    }
}

If a blog writes a new post, I'll write that post into the blog's
blogposts column-family and additionally into the index table.
The rowkey would look like:
[blog_id]_[Long.MAX_VALUE - publication-date] (doing it this way, the
rows are sorted by HBase in LIFO order, newest first).
Note: The publication date could be in the future! So it's not the date
of creation.
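As a sanity check on this key design, here is a minimal sketch showing that the newest post gets the lexicographically smallest key (the "_" delimiter and the 8-byte big-endian encoding of the reversed date are my assumptions for illustration, not fixed by the design above):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class IndexRowKey {
    // Build [blog_id]_[Long.MAX_VALUE - publicationDate] as raw bytes.
    // Delimiter and big-endian long encoding are assumptions of this sketch.
    static byte[] makeKey(String blogId, long publicationMillis) {
        byte[] prefix = (blogId + "_").getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(prefix.length + 8)
                .put(prefix)
                .putLong(Long.MAX_VALUE - publicationMillis)
                .array();
    }

    // Unsigned lexicographic byte comparison, the order HBase sorts rows in.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] newer = makeKey("blog42", 2_000L); // published later
        byte[] older = makeKey("blog42", 1_000L); // published earlier
        // The newer post sorts first (LIFO), as described above.
        System.out.println(compare(newer, older) < 0); // prints true
    }
}
```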

Now, if I have a blog and I subscribed to 1,000 other blogs: to
generate a list of the most recent blog-posts of the blogs I subscribed
to, I do the following:

Read every column of my subscribed_blogs column family (they contain the
other blogs' ids).

For each column, I want to do a lookup in my index-table (Scan or Get,
I am not sure which to use, since one may be able to batch these):
Get: blog_id* (the "*" means that the rowkey should start with the
specified blog_id).
I want to fetch only the most recent row per blog_id.
Now I have 1.000 rowkeys, each containing a blog_id and a timestamp.
Let's sort by timestamp and get the top 3 (maybe I can do some part of
this work on the server side).
I see that my top 3 list contains a post from Wil Wheaton, one from
Sheldon and another one from TechCrunch.
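To illustrate why a single row per blog_id suffices, here is a toy model of that prefix lookup. A TreeMap stands in for the lexicographically sorted index table (in real HBase this would be a Scan starting at the "blog_id_" prefix); the keys and method names below are made up for the sketch:

```java
import java.util.TreeMap;

public class IndexLookup {
    // The first key at or after "blogId_" is the newest post for that blog,
    // because the key embeds Long.MAX_VALUE - publicationDate.
    static String newestRow(TreeMap<String, String> index, String blogId) {
        String key = index.ceilingKey(blogId + "_");
        return (key != null && key.startsWith(blogId + "_")) ? key : null;
    }

    public static void main(String[] args) {
        TreeMap<String, String> index = new TreeMap<>();
        index.put("wil_" + (Long.MAX_VALUE - 300), ""); // newer post (t=300)
        index.put("wil_" + (Long.MAX_VALUE - 100), ""); // older post (t=100)
        index.put("sheldon_" + (Long.MAX_VALUE - 200), "");
        // Prints the key of Wil's newer post (t=300), not the older one.
        System.out.println(newestRow(index, "wil"));
    }
}
```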

Now I'll do three Gets in my blog-table:
one for Wil Wheaton's blog, another for Sheldon's and one for TechCrunch.
Since their columns are sorted by date, I'll fetch the latest 3
blog-posts of each blog, returning 9 blog-posts in sum.
Now I am able to sort these 9 blog-posts by their date and display the
top 3.

Why do I fetch the top 3 of each blog and sort them again?
Well, if Wil Wheaton wrote a blog-post yesterday, TechCrunch wrote one
two days ago and Sheldon wrote two posts today, and I want to get the
three most recent posts of all my subscribed blogs, then TechCrunch is
out of this list, since it only has the 4th-most-recent blog-post.
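That merge step can be sketched as plain client-side code. The day numbers are arbitrary stand-ins for this example; a real implementation would use the dates parsed from the rowkeys:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopNMerge {
    static class Post {
        final String blog; final long day;
        Post(String blog, long day) { this.blog = blog; this.day = day; }
    }

    // Pool the per-blog top-3 lists, sort newest-first, keep the top n.
    static List<Post> topN(List<Post> pooled, int n) {
        pooled.sort(Comparator.comparingLong((Post p) -> p.day).reversed());
        return pooled.subList(0, Math.min(n, pooled.size()));
    }

    public static void main(String[] args) {
        List<Post> pooled = new ArrayList<>();
        pooled.add(new Post("Sheldon", 100));    // today
        pooled.add(new Post("Sheldon", 100));    // today
        pooled.add(new Post("Wil Wheaton", 99)); // yesterday
        pooled.add(new Post("TechCrunch", 98));  // two days ago
        // TechCrunch only holds the 4th-most-recent post, so it drops out.
        for (Post p : topN(pooled, 3)) System.out.println(p.blog);
    }
}
```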

I hope the scenario is clearer now.

So, what happens if TechCrunch writes a new blog post?
It will create a new column in its row's blogposts-CF and trigger a
million writes in the index-table (which only writes keys and empty
values of 0-byte length - I assume that's the cheapest write I can do).
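A rough back-of-envelope for why such empty-value writes are cheap (all sizes below are my own illustrative assumptions, not measurements): with a 0-byte value, the payload per cell is essentially just the rowkey plus the dummy family and qualifier.

```java
public class WriteCost {
    public static void main(String[] args) {
        // All sizes are assumptions for illustration, not measurements.
        int rowKeyBytes = 16 + 1 + 8; // assumed blog id + "_" + reversed long
        int famQualBytes = 1 + 1;     // one-byte dummy family and qualifier
        int valueBytes = 0;           // empty value, as described above
        long writes = 1_000_000L;     // the "million writes" scenario
        long payload = writes * (rowKeyBytes + famQualBytes + valueBytes);
        System.out.println(payload + " bytes of raw cell data"); // 27000000
    }
}
```

HBase stores additional per-cell metadata (timestamp, type, length fields) on top of this, so the real on-disk cost is somewhat higher, but the value contributes nothing.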

Kind regards,

On 05.06.2012 08:07, NNever wrote:
> 1. Endpoint is a kind of Coprocessor; it was added in 0.92. You can think
> of it a little like a relational database's stored procedure: it is logic
> that runs on the HBase server side. With it you may reduce your app's RPC
> calls, or, as you said, reduce traffic.
> You can get some help on Coprocessors/Endpoints from here:
> https://blogs.apache.org/hbase/entry/coprocessor_introduction
> 2. I am still a little confused about what exactly you want with this
> table structure (sorry for that, but English is not my mother tongue).
> You mean t1 is the original data of some objects,
> and t2 keeps something about the objects in t1? (Like logs: 10:11 em
> checked t1obj1; 10:13 em bought t1obj1; 10:30 em took away t1obj1.)
> 3. You said 'This data is then sorted by the time part of the returned
> rowkeys to get
> the Top N of these.' Well, there may be no need to do the sort. HBase
> keeps data in dictionary order, so you just fetch N of them; they are
> already ordered.
> 4. I have not used HBase for long and am frankly still a noob at it :). I
> would be glad if anything here helps you.
> Best Regards,
> NN
> 2012/6/5 Em <mailformailinglists@yahoo.de>
>> Hi,
>> what do you mean by endpoint?
>> It would look more like
>> T2 {
>>   rowkey: t1_id-(Long.MAX_VALUE - time)
>>   {
>>      family: qualifier = dummyDataSinceOnlyTheRowkeyMatters
>>   }
>> }
>> For every t1_id associated with a specific object, one gets the newest
>> entry in the T2-table (newest in relation to the key, not the internal
>> timestamp of creation).
>> This data is then sorted by the time part of the returned rowkeys to get
>> the Top N of these.
>> And then you get N records from t1 again.
>> At least, that's what I thought of, though I am not sure that this is
>> the most efficient way.
>> Kind regards,
>> Em
>> On 05.06.2012 04:33, NNever wrote:
>>> Does the schema look like this:
>>> T2{
>>>   rowkey: rs-time row
>>>    {
>>>        family:qualifier =  t1's row
>>>    }
>>> }
>>> Then you Scan the newest 1000 from T2, read each one's t1 row, and
>>> then do 1000 Gets from T1 for one page?
>>> 2012/6/5 NNever <nneverwei@gmail.com>
>>>> '- I'd like to do the top N stuff on the server side to reduce traffic,
>>>> will this be possible? '
>>>> Endpoint?
>>>> 2012/6/5 Em <mailformailinglists@yahoo.de>
>>>>> Hello list,
>>>>> let's say I have to fetch a lot of rows for a page-request (say
>>>>> 1,000-2,000).
>>>>> The row-keys are a composition of a fixed id of an object and a
>>>>> sequential ever-increasing id. Salting those keys for balancing may be
>>>>> taken into consideration.
>>>>> I want to do a Join like this one expressed in SQL:
>>>>> SELECT t1.columns FROM t1
>>>>> JOIN t2 ON (t1.id = t2.id)
>>>>> WHERE t2.id = fixedID-prefix
>>>>> I know that HBase does not support that out of the box.
>>>>> My approach is to have all the fixed-ids as columns of a row in t1.
>>>>> Selecting a row, I fetch those columns that are of interest for me,
>>>>> where each column contains a fixedID for t2.
>>>>> Now I do a scan on t2 for each fixedID which should return me exactly
>>>>> one value per fixedID (it's kind of a reverse-timestamp approach like
>>>>> in the HBase-book).
>>>>> Furthermore I am really only interested in the key itself. I don't care
>>>>> about the columns (t2 is more like an index).
>>>>> Having fetched a row per fixedID, I sort based on the sequential part
>>>>> of their key and get the top N.
>>>>> For those top N I'll fetch data from t1.
>>>>> The use case is to fetch the top N most recent entities of t1 that
>>>>> are associated with a specific entity in t1 by using t2 as an index.
>>>>> T2 has one extra benefit over t1: you can do range-scans, if
>>>>> necessary.
>>>>> Questions:
>>>>> - since this is triggered by a page-request: Will this return with low
>>>>> latency?
>>>>> - is there a possibility to do those Scans in a batch? Maybe I can
>>>>> combine them into one big scanner, using a custom filter for what I
>>>>> want?
>>>>> - do you have thoughts on improving this type of request?
>>>>> - I'd like to do the top N stuff on the server side to reduce traffic,
>>>>> will this be possible?
>>>>> - I am not sure whether a Scan is really what I want. Maybe a
>>>>> Multiget combined with a RowFilter will fit my needs better?
>>>>> I am working hard on finding the best approach for mapping this
>>>>> m:n-relation to an HBase schema - so any help is appreciated.
>>>>> Please note: I haven't written a single line of HBase code so far.
>>>>> Currently I am studying books, blog-posts, slides and the mailing
>>>>> lists to learn more about HBase.
>>>>> Thanks!
>>>>> Kind regards,
>>>>> Em
