hbase-user mailing list archives

From Em <mailformailingli...@yahoo.de>
Subject Re: Scan triggered per page-request, performance-impacts?
Date Tue, 05 Jun 2012 08:03:36 GMT
Correction to my last sentences:
> So, what happens if TechCrunch writes a new blog post?
> It will create a new column in its row's blogposts-CF and trigger a
> million writes in the index-table (which only writes keys and empty
> values of 0-byte length - I assume that's the cheapest write I can do).
Of course I meant that it will NOT trigger a million writes in the index
table, but only ONE write for this post.
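
To make the corrected write path concrete, here is a minimal sketch in
Python. Plain dicts stand in for the two tables; `publish_post`,
`blog_table` and `index_table` are names made up for this illustration and
are not HBase API, and the millisecond timestamp in the column name is an
assumption. The only point is that the number of writes does not depend on
the number of subscribers.

```python
# Stand-ins for the two HBase tables (plain dicts, not the HBase client API).
blog_table = {}   # row key: blog_id -> {column name: post body}
index_table = {}  # row key: "<blog_id>_<reversed date>" -> empty value

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def publish_post(blog_id, publication_millis, title, body):
    """Store one post and return the number of writes performed."""
    writes = 0

    # Write 1: a new column in the blog's own row (blogposts column family).
    column = f"{publication_millis}_{title}"
    blog_table.setdefault(blog_id, {})[column] = body
    writes += 1

    # Write 2: one key-only row in the index table, with a 0-byte value.
    reversed_ts = LONG_MAX - publication_millis
    index_table[f"{blog_id}_{reversed_ts:019d}"] = b""
    writes += 1

    return writes
```

Calling `publish_post` once returns 2, no matter how many blogs have
subscribed to the author.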

Kind regards,
Em

On 05.06.2012 10:00, Em wrote:
> NN,
> 
> thanks for pointing me to Coprocessors. I'll take a look at them!
> 
> Okay, I see that my descriptions are confusing.
> 
> Let me give you an example of some simplified entities in my tables:
> 
> blog {//this is t1 of my example
>     blogposts {//the column family
>        05.05.2012_something { the blog post },//this is a column
>        06.05.2012_anything  { the blog post },
>        05.06.2012_nothing   { the blog post }
>     },
>     subscribed_blogs {
>        Wil_Wheaton's Blog { date_of_subscription },
>        Sheldon's Blog     { date_of_subscription },
>        Penny's Blog       { date_of_subscription },
>        ... hundreds of other blogs ...
>     }
> }
> 
> This blog has 3 blogposts. Each column of the user's blogposts
> column-family contains a blogpost, where the column-name contains the
> date and the title. This way columns can be accessed ordered by date.
> Now this blog (or rather its author) is following some other blogs.
> I do not want to get the posts of the subscribed blogs and write them
> into the blog's row (duplicating the posts of the followed blogs).
> The reason is that you have to keep them in sync with the original posts.
> Furthermore a very popular blog could trigger millions of writes (at
> least one write per user). This is too much.
> 
> So I want to build an index, t2. Let's call this table "index".
> 
> index {
>     dummy_column_family {
>        dummy_column { I only care about the rowkey. }
>     }
> }
> 
> If a blog writes a new post, I'll write that post into the blogposts
> table for the blog and additionally in the index table.
> The rowkey would look like:
> [blog_id]_[Long.MAX_VALUE - publication-date] (doing it this way, they
> are sorted by HBase in LIFO-order).
> Note: The publication date could be in the future! So it's not the date
> of creation.
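
A small sketch of that key layout, in Python for brevity. The function name
`index_rowkey`, the string encoding and the zero-padding to 19 digits are
assumptions of this sketch, not something HBase prescribes; the padding is
only a precaution so that the lexicographic order of the keys always matches
the numeric order of the reversed dates.

```python
LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE


def index_rowkey(blog_id, publication_millis):
    """Build '<blog_id>_<Long.MAX_VALUE - publication date>' as a string key."""
    reversed_ts = LONG_MAX - publication_millis
    # Zero-pad to 19 digits (the width of Long.MAX_VALUE) so that
    # lexicographic order of the keys matches numeric order.
    return f"{blog_id}_{reversed_ts:019d}"


# Three posts of one blog; the last one is scheduled far in the future.
keys = [index_rowkey("blog42", ts) for ts in (1_000, 2_000, 9_000_000_000_000)]

# Sorting the keys as strings puts the latest publication date first.
newest_first = sorted(keys)
```

When scanning by prefix, including the separator in the prefix (`blog42_`
rather than `blog42`) avoids also matching ids such as `blog420`.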
> 
> Now, if I have a blog and I subscribed to 1,000 other blogs as well: To
> generate a list of the most recent blog-posts of the blogs I subscribed to,
> I do the following:
> 
> Read every column of my subscribed_blogs column family (they contain the
> other blogs' ids).
> 
> For each column, I want to do a lookup in my index-table (Scan or Get, I
> am not sure what to use, since one may be able to batch the stuff):
> Get: blog_id* (the "*" means that the rowkey should start with the
> specified blog_id).
> I want to fetch only the most recent row per blog_id.
> Now I have 1,000 rowkeys, each containing a blog_id and a timestamp.
> Let's sort by timestamp and get the top 3 (maybe I can do some part of
> this work on the server side).
> I see that my top 3 list contains a post from Wil Wheaton, one from
> Sheldon and another one from TechCrunch.
> 
> Now I'll do three Gets in my blog-table:
> One for Wil Wheaton's blog, another for Sheldon's and one for TechCrunch.
> Since their columns are sorted by date, I'll fetch the latest 3
> blog-posts of each blog, returning 9 blog-posts in total.
> Now I am able to sort these 9 blog-posts by their date and display the
> top-3.
> 
> Why do I fetch the top-3 of each blog and sort them again?
> Well, if Wil Wheaton wrote a blog-post yesterday, TechCrunch wrote one
> two days ago and Sheldon wrote two posts today and I want to get the
> three most recent posts of all my subscribed blogs, then TechCrunch is
> out of this list, since it has the 4th-most-recent blog-post.
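
The read path described above can be sketched as follows, in Python with a
sorted list standing in for the index table. All names (`index_rows`,
`newest_key`, `posts_of`, `top_n`) and the four-digit "reversed dates" are
invented for this illustration. For simplicity, the sketch also reads the
per-blog posts from the index stand-in, whereas the plan above fetches them
from the blog table.

```python
import bisect
import heapq

# Sorted list as a stand-in for the index table: HBase keeps rows sorted by key.
# Keys are "<blog_id>_<reversed date>"; a smaller reversed date is a newer post.
index_rows = sorted([
    "sheldon_0001", "sheldon_0002",   # two posts today
    "techcrunch_0005",                # one post two days ago
    "wil_0003",                       # one post yesterday
])


def newest_key(blog_id):
    """Return the first index row starting with '<blog_id>_', or None."""
    prefix = blog_id + "_"
    pos = bisect.bisect_left(index_rows, prefix)
    if pos < len(index_rows) and index_rows[pos].startswith(prefix):
        return index_rows[pos]
    return None


def posts_of(blog_id, limit):
    """Return up to `limit` (reversed_date, blog_id) pairs, newest first."""
    prefix = blog_id + "_"
    start = bisect.bisect_left(index_rows, prefix)
    rows = []
    for key in index_rows[start:]:
        if not key.startswith(prefix) or len(rows) == limit:
            break
        rows.append((key[len(prefix):], blog_id))
    return rows


def top_n(subscribed, n):
    # Step 1: one lookup per subscribed blog for its newest index row.
    newest = [k for k in map(newest_key, subscribed) if k is not None]
    # Step 2: rank the blogs by the date part of that row and keep n blogs.
    best = heapq.nsmallest(n, newest, key=lambda k: k.rsplit("_", 1)[1])
    blogs = [k.rsplit("_", 1)[0] for k in best]
    # Step 3: fetch n posts from each candidate blog and merge them by date.
    candidates = [post for blog in blogs for post in posts_of(blog, n)]
    return heapq.nsmallest(n, candidates)


result = top_n(["wil", "sheldon", "techcrunch"], 3)
```

With this sample data, `result` holds Sheldon's two posts and Wil Wheaton's
post; the TechCrunch post is dropped in the final merge.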
> 
> I hope the scenario is clearer now.
> 
> So, what happens if TechCrunch writes a new blog post?
> It will create a new column in its row's blogposts-CF and trigger a
> million writes in the index-table (which only writes keys and empty
> values of 0-byte length - I assume that's the cheapest write I can do).
> 
> Kind regards,
> Em
> 
> 
> On 05.06.2012 08:07, NNever wrote:
>> 1. An Endpoint is a kind of Coprocessor; it was added in 0.92. You can think
>> of it a little like a relational database's stored procedure: it is logic
>> that runs on the HBase server side. With it you may reduce your app's RPC
>> calls or, as you said, reduce traffic.
>> You can get some help on Coprocessors/Endpoints here:
>> https://blogs.apache.org/hbase/entry/coprocessor_introduction
>> 2. I am still a little confused about what exactly you want with this table
>> structure (sorry for that, but my mother tongue is not English).
>> You mean t1 holds the original data of some objects,
>> and t2 keeps something about the objects in t1 (like logs: 10:11 em checks
>> t1obj1; 10:13 em buys t1obj1; 10:30 em takes away t1obj1)?
>> 3. You said 'This data is then sorted by the time part of the returned
>> rowkeys to get the Top N of these.' Well, there may be no need to do the
>> sort. HBase keeps data in dictionary order, so you just fetch N of them and
>> they are already ordered.
>> 4. I have not used HBase for long; in fact I am still a noob at it :). I
>> would be glad if anything here helps you.
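
A short illustration of point 3, in Python with made-up keys of the form
`<blog_id>_<reversed date>`: dictionary order sorts rows by the whole key, so
rows are ordered by date only within one blog_id prefix. Across several
prefixes, a sort (or merge) by the date part still seems to be needed.

```python
# Index rows in the order a sorted store would return them (by row key).
# Key layout: "<blog_id>_<reversed date>"; a smaller number is a newer post.
rows = sorted(["wil_0003", "penny_0005", "sheldon_0001"])

# Reading the first N rows yields rows ordered by blog id,
# not the N newest posts overall.
first_two_by_key = rows[:2]

# Sorting by the date part of the key gives the real "newest first" order.
first_two_by_date = sorted(rows, key=lambda k: k.rsplit("_", 1)[1])[:2]
```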
>>
>> Best Regards,
>> NN
>>
>>
>> 2012/6/5 Em <mailformailinglists@yahoo.de>
>>
>>> Hi,
>>>
>>> what do you mean by endpoint?
>>>
>>> It would look more like
>>>
>>> T2 {
>>>   rowkey: t1_id-(Long.MAX_VALUE - time)
>>>   {
>>>      family: qualifier = dummyDataSinceOnlyTheRowkeyMatters
>>>   }
>>> }
>>>
>>> For every t1_id associated with a specific object, one gets the newest
>>> entry in the T2-table (newest in relation to the key, not the internal
>>> timestamp of creation).
>>> This data is then sorted by the time part of the returned rowkeys to get
>>> the Top N of these.
>>> And then you get N records from t1 again.
>>>
>>> At least, that's what I thought about, though I am not sure that this is
>>> the most efficient way.
>>>
>>> Kind regards,
>>> Em
>>>
>>> On 05.06.2012 04:33, NNever wrote:
>>>> Does the schema look like this:
>>>>
>>>> T2{
>>>>   rowkey: rs-time row
>>>>    {
>>>>        family:qualifier =  t1's row
>>>>    }
>>>> }
>>>>
>>>> Then you Scan the newest 1000 rows from T2, get the t1 row from each, and
>>>> then do 1000 Gets on T1 for one page?
>>>>
>>>> 2012/6/5 NNever <nneverwei@gmail.com>
>>>>
>>>>> '- I'd like to do the top N stuff on the server side to reduce traffic,
>>>>> will this be possible? '
>>>>>
>>>>> Endpoint?
>>>>>
>>>>>
>>>>> 2012/6/5 Em <mailformailinglists@yahoo.de>
>>>>>
>>>>>> Hello list,
>>>>>>
>>>>>> let's say I have to fetch a lot of rows for a page-request (say
>>>>>> 1,000-2,000).
>>>>>> The row-keys are a composition of a fixed id of an object and a
>>>>>> sequential ever-increasing id. Salting those keys for balancing may be
>>>>>> taken into consideration.
>>>>>>
>>>>>> I want to do a Join like this one expressed in SQL:
>>>>>>
>>>>>> SELECT t1.columns FROM t1
>>>>>> JOIN t2 ON (t1.id = t2.id)
>>>>>> WHERE t2.id = fixedID-prefix
>>>>>>
>>>>>> I know that HBase does not support that out of the box.
>>>>>> My approach is to have all the fixed-ids as columns of a row in t1.
>>>>>> Selecting a row, I fetch those columns that are of interest for me,
>>>>>> where each column contains a fixedID for t2.
>>>>>> Now I do a scan on t2 for each fixedID which should return me exactly
>>>>>> one value per fixedID (it's kind of a reverse-timestamp approach like
>>>>>> in the HBase book).
>>>>>> Furthermore I am really only interested in the key itself. I don't care
>>>>>> about the columns (t2 is more like an index).
>>>>>> Having fetched a row per fixedID, I sort based on the sequential part
>>>>>> of their key and get the top N.
>>>>>> For those top N I'll fetch data from t1.
>>>>>>
>>>>>> The use case is to fetch the top N most recent entities of t1 that are
>>>>>> associated with a specific entity in t1 by using t2 as an index.
>>>>>> T2 has one extra benefit over t1: You can do range-scans, if necessary.
>>>>>>
>>>>>> Questions:
>>>>>> - since this is triggered by a page-request: Will this return with low
>>>>>> latency?
>>>>>> - is there a possibility to do those Scans in a batch? Maybe I can
>>>>>> combine them into one big scanner, using a custom filter for what I
>>>>>> want?
>>>>>> - do you have thoughts on improving this type of request?
>>>>>> - I'd like to do the top N stuff on the server side to reduce traffic,
>>>>>> will this be possible?
>>>>>> - I am not sure whether a Scan is really what I want. Maybe a Multiget
>>>>>> will fit my needs better combined with a RowFilter?
>>>>>>
>>>>>>
>>>>>> I am working hard on finding the best approach for mapping this
>>>>>> m:n-relation to an HBase schema - so any help is appreciated.
>>>>>>
>>>>>> Please note: I haven't written a single line of HBase code so far.
>>>>>> Currently I am studying books, blog-posts, slides and the mailing lists
>>>>>> to learn more about HBase.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Kind regards,
>>>>>> Em
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
