hbase-user mailing list archives

From NNever <nnever...@gmail.com>
Subject Re: Scan triggered per page-request, performance-impacts?
Date Wed, 06 Jun 2012 01:37:19 GMT
>  Am I able to do this with one scan?
No, I don't think so (unless you define a custom filter, but that may not be
fast enough). And you may have misunderstood the Scan in step 2.
For example, say you subscribe to Lars and Stack. Then there will be 2 Scans
with StartRow/StopRow, that is:

Scan scan1 = new Scan();
scan1.setStartRow(Bytes.toBytes("Lars"));
scan1.setStopRow(Bytes.toBytes("Lars" + X));   // X stands for a big enough char
...doScan... (with a filter or something to get the 1st record in the result)

Scan scan2 = new Scan();
scan2.setStartRow(Bytes.toBytes("Stack"));
scan2.setStopRow(Bytes.toBytes("Stack" + X));   // X stands for a big enough char
...doScan... (with a filter or something to get the 1st record in the result; see the sketch below)
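
If it helps, the "...doScan..." part could look roughly like the sketch below
(assuming the 0.92-era Java client API; the table handle, the helper name and
the '_' separator are only illustrative, not something fixed in your schema).
PageFilter(1) just limits how many rows each region returns, since we only
read the first result anyway:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Returns the rowkey of the newest index entry for one blogId, or null if there is none.
public static String newestIndexRow(HTable indexTable, String blogId) throws IOException {
  Scan scan = new Scan();
  scan.setStartRow(Bytes.toBytes(blogId + "_"));
  scan.setStopRow(Bytes.toBytes(blogId + "_" + '\uffff'));  // "big enough" upper bound
  scan.setFilter(new PageFilter(1));  // each region returns at most 1 row
  scan.setCaching(1);
  ResultScanner scanner = indexTable.getScanner(scan);
  try {
    // The 1st row is the newest entry, because the time part of the key is reversed.
    Result first = scanner.next();
    return first == null ? null : Bytes.toString(first.getRow());
  } finally {
    scanner.close();
  }
}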

So, if the data in the Index Table are like this:
Lars_72345
Lars_72440
Lucy_13231
Lucy_23211
Lucy_24111
Stack_64651
Stack_65552
...

Then the Result should be:
Lars_72345,
Stack_64651.

> Unfortunately I haven't found that much about when to do a Scan and when to
do a Get
When you know exactly what the rowkey is, do a Get (just like using an ID: an
ID maps to a single row, and you fetch only that row).
When you only know the prefix of the rowkey, do a Scan (a Scan can return more
than one result).
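
For example (a minimal sketch; 'blogTable' and 'indexTable' are assumed HTable
instances, the rowkeys are only illustrative, and the imports are the same as
in the sketch above plus org.apache.hadoop.hbase.client.Get):

// Get: you know the exact rowkey, so you fetch exactly that one row.
Get get = new Get(Bytes.toBytes("Stack"));
Result blogRow = blogTable.get(get);

// Scan: you only know the prefix, so you scan a range; it may return several rows.
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("Stack_"));
scan.setStopRow(Bytes.toBytes("Stack_" + '\uffff'));  // "big enough" upper bound
ResultScanner results = indexTable.getScanner(scan);  // iterate over it, then close() it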

> You mean instead of designing the key [blogId]_[timestamp]  I should do it
this way:  [timestamp]_[blogId]?
No no... As I said, 'Sort all fetched IndexRows'. Those fetched rows are the
Scan results above:
Lars_72345 and Stack_64651.
Sort them by time, then you get:
Stack_64651,
Lars_72345

Then you can use 'Stack' and 'Lars' to each do a Get on table 'Blog'.
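
A rough sketch of that sort-then-Get step (assuming the <blogId>_<reversedTime>
key format from above with a numeric, string-encoded time part; 'blogTable' is
an assumed HTable for table 'Blog', with imports from java.util plus the client
classes above):

// The fetched index rows, e.g. the two Scan results from above.
List<String> indexRows = new ArrayList<String>(Arrays.asList("Lars_72345", "Stack_64651"));

// Smaller time part = newer post, because the time part is a reversed timestamp.
Collections.sort(indexRows, new Comparator<String>() {
  public int compare(String a, String b) {
    long ta = Long.parseLong(a.substring(a.indexOf('_') + 1));
    long tb = Long.parseLong(b.substring(b.indexOf('_') + 1));
    return ta < tb ? -1 : (ta == tb ? 0 : 1);
  }
});

// Take the blogId part of each key and Get those rows from table 'Blog' in one batch.
List<Get> gets = new ArrayList<Get>();
for (String row : indexRows) {
  gets.add(new Get(Bytes.toBytes(row.substring(0, row.indexOf('_')))));
}
Result[] blogRows = blogTable.get(gets);  // same order as 'gets': Stack first, then Lars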





2012/6/6 Em <mailformailinglists@yahoo.de>

> Thanks for your feedback!
>
> > 2.  Scan the newest from Index table for each subscribed blogID
> Am I able to do this with one scan?
> Since all my blogs are relevant, this could lead to a start and stop row
> spanning a range where almost every other blog in the database fits in
> (think of a blog starting with a and another blog starting with z - both
> are in my subscription list).
> I thought that filtering would only scale linearly with the size of the
> table itself?
>
> Unfortunately I haven't found that much about when to do a Scan and when
> to do a Get, especially if the keys all start differently.
>
> > 3.  Sort all fetched IndexRows by publication-date (in fact not sort all,
> > but just get the newest 3, which may be faster)
> You mean instead of designing the key
> [blogId]_[timestamp]
> I should do it this way:
> [timestamp]_[blogId]?
>
> Well, it depends on your scenario.
> If you want to know the most recent blogposts globally or within a
> period of time you are absolutely right. But if you want to know it for
> a specific user/blog, limited to his subscriptions, this could be really
> slow if the most recent blog posts relevant to this user are relatively
> old.
>
> Did you mean that or something different?
>
> Kind regards,
> Em
>
> Am 05.06.2012 11:18, schrieb NNever:
> > Very clear now :).
> > Only one problem,
> >
> > blog {//this is t1 of my example
> >    blogposts {//the column family
> >       05.05.2012_something { the blog post },//this is a column
> >       06.05.2012_anything  { the blog post },
> >       05.06.2012_nothing   { the blog post }
> >    },
> > ...
> >
> > Here, 05.05.2012_something does not sort easily: you would have to fetch
> > out all posts and sort them, and when the post count becomes huge this can
> > be terrible. Just change it to reverseTime_postTitle; then you can use
> > ColumnPaginationFilter to easily fetch the newest 3 columns.
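> >
> > (A minimal sketch of that, assuming the qualifiers in CF 'blogposts' start
> > with the reversed time and that 'blogTable' is an HTable for table 'blog';
> > the names are only illustrative:)
> >
> > Get get = new Get(Bytes.toBytes(blogId));
> > get.addFamily(Bytes.toBytes("blogposts"));
> > // The first 3 qualifiers in dictionary order are the 3 newest posts,
> > // because each qualifier begins with reverseTime.
> > get.setFilter(new ColumnPaginationFilter(3, 0));  // limit 3, offset 0
> > Result newest3 = blogTable.get(get);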
> >
> >
> > After all, when you want to get the most recent subscribed posts for someone,
> > you do:
> > 1.  Get all values of Blog.subscribed_blogs (the subscribed blogIDs)
> > ---------------------- get by row, not slow
> > 2.  Scan the newest from the Index table for each subscribed blogID
> > -------------------- Scan with startRow and stopRow, not slow
> > 3.  Sort all fetched IndexRows by publication-date (in fact not sort all,
> > but just get the newest 3, which may be faster) ---------------- not slow
> > 4.  Use the 3 blogIDs above, on table Blog, to get the 3 newest columns of CF
> > blogposts for each ------------------ use dictionary-order, use
> > ColumnPaginationFilter, not slow, but I wonder how fast it will be
> > 5.  Compare those 9 posts, get the newest 3 ------------ not slow
> > Overall, almost all of these lookups use the rowkey, so the whole process
> > should not have much delay, I think.
> >
> >
> > Best Regards,
> > NN
> >
> >
> >
> > 2012/6/5 Em <mailformailinglists@yahoo.de>
> >
> >> Correction of my last sentences:
> >>> So, what happens if Techcrunch writes a new blog post?
> >>> It will create a new column in its row's blogposts-CF and trigger a
> >>> million writes in the index-table (which only writes keys and empty
> >>> values of 0-byte length - I assume that's the cheapest write I can do).
> >> Of course I mean it will NOT trigger a million writes in the index
> >> table, but only ONE write for this post.
> >>
> >> Kind regards,
> >> Em
> >>
> >> Am 05.06.2012 10:00, schrieb Em:
> >>> NN,
> >>>
> >>> thanks for the pointer to Coprocessors. I'll take a look at them!
> >>>
> >>> Okay, I see that my descriptions are confusing.
> >>>
> >>> Let me give you an example of some simplified entities in my tables:
> >>>
> >>> blog {//this is t1 of my example
> >>>     blogposts {//the column family
> >>>        05.05.2012_something { the blog post },//this is a column
> >>>        06.05.2012_anything  { the blog post },
> >>>        05.06.2012_nothing   { the blog post }
> >>>     },
> >>>     subscribed_blogs {
> >>>        Wil_Wheaton's Blog { date_of_subscription },
> >>>        Sheldon's Blog     { date_of_subscription },
> >>>        Penny's Blog       { date_of_subscription },
> >>>        ... hundreds of other blogs ...
> >>>     }
> >>> }
> >>>
> >>> This blog has 3 blogposts. Each column of the user's blogposts
> >>> column-family contains a blogpost, where the column-name contains the
> >>> date and the title. This way columns can be accessed ordered by date.
> >>> Now this blog (or better, its author) is following some other blogs.
> >>> I do not want to fetch the posts of the subscribed blogs and write them into
> >>> the blog's row (duplicating the posts of the followed blogs).
> >>> The reason is that you would have to keep them in sync with the original posts.
> >>> Furthermore a very popular blog could trigger millions of writes (at
> >>> least one write per user). This is too much.
> >>>
> >>> So I want to build an index, t2. Let's call this table "index".
> >>>
> >>> index {
> >>>     dummy_column_family {
> >>>        dummy_column { I only care about the rowkey. }
> >>>     }
> >>> }
> >>>
> >>> If a blog writes a new post, I'll write that post into the blog's blogposts
> >>> column-family and additionally into the index table.
> >>> The rowkey would look like:
> >>> [blog_id]_[Long.MAX_VALUE - publication-date] (doing it this way, they
> >>> are sorted by HBase in LIFO-order).
> >>> Note: The publication date could be in the future! So it's not the date
> >>> of creation.
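> >>>
> >>> (A minimal sketch of that index write, purely illustrative - 'blogId',
> >>> 'publicationDate' and 'indexTable' are assumed variables, and encoding the
> >>> time part with Bytes.toBytes(long) is my assumption, not part of the schema:)
> >>>
> >>> // Rowkey: [blog_id]_[Long.MAX_VALUE - publication-date]; the value stays
> >>> // empty, only the rowkey matters.
> >>> byte[] rowkey = Bytes.add(Bytes.toBytes(blogId + "_"),
> >>>                           Bytes.toBytes(Long.MAX_VALUE - publicationDate));
> >>> Put indexPut = new Put(rowkey);
> >>> indexPut.add(Bytes.toBytes("dummy_column_family"), Bytes.toBytes("dummy_column"),
> >>>              new byte[0]);  // 0-byte value
> >>> indexTable.put(indexPut);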
> >>>
> >>> Now, if I have a blog and I subscribed to 1.000 other blogs as well: to
> >>> generate a list of the most recent blog-posts of the blogs I subscribed to,
> >>> I do the following:
> >>>
> >>> Read every column of my subscribed_blogs column family (they contain the
> >>> other blogs' ids).
> >>>
> >>> For each column, I want to do a lookup in my index-table (Scan or Get, I
> >>> am not sure what to use, since one may be able to batch the stuff):
> >>> Get: blog_id* (the "*" means that the rowkey should start with the
> >>> specified blog_id).
> >>> I want to fetch only the most recent per blog_id.
> >>> Now I have 1.000 rowkeys, each containing a blog_id and a timestamp.
> >>> Let's sort by timestamp and get the top 3 (maybe I can do some part of
> >>> this work on the server side).
> >>> I see that my top 3 list contains a post from Wil Wheaton, one from
> >>> Sheldon and another one from Techcrunch.
> >>>
> >>> Now I'll do three Gets in my blog-table:
> >>> One for Wil Wheaton's blog, another for Sheldon's and one for Techcrunch.
> >>> Since their columns are sorted by date, I'll fetch the latest 3
> >>> blog-posts of each blog, returning 9 blog-posts in sum.
> >>> Now I am able to sort these 9 blog-posts by their date and display the
> >>> top-3.
> >>>
> >>> Why do I fetch the top-3 of each blog and sort them again?
> >>> Well, if Wil Wheaton wrote a blog-post yesterday, Techcrunch wrote one
> >>> two days ago and Sheldon wrote two posts today, and I want to get the
> >>> three most recent posts of all my subscribed blogs, then Techcrunch is
> >>> out of this list, since it has the 4th-most-recent blog-post.
> >>>
> >>> I hope the scenario is clearer now.
> >>>
> >>> So, what happens if Techcrunch writes a new blog post?
> >>> It will create a new column in its row's blogposts-CF and trigger a
> >>> million writes in the index-table (which only writes keys and empty
> >>> values of 0-byte length - I assume that's the cheapest write I can do).
> >>>
> >>> Kind regards,
> >>> Em
> >>>
> >>>
> >>> Am 05.06.2012 08:07, schrieb NNever:
> >>>> 1. Endpoint is a kind of Coprocessor; it was added in 0.92. You can think
> >>>> of it a little like a relational database's stored procedure: it is logic
> >>>> that runs on the HBase server side. With it you may reduce your app's RPC
> >>>> calls, or, as you said, reduce traffic.
> >>>> You can get some help on Coprocessor/Endpoint here:
> >>>> https://blogs.apache.org/hbase/entry/coprocessor_introduction
> >>>> 2. I am still a little confused about what exactly you want with this
> >>>> table structure (sorry for that, but my mother language is not English).
> >>>> You mean t1 is the original data of some objects,
> >>>> and t2 keeps something about the objects in t1? (like logs: 10:11 em checks
> >>>> t1obj1; 10:13 em buys t1obj1; 10:30 em takes away t1obj1)?
> >>>> 3. You said 'This data is then sorted by the time part of the returned
> >>>> rowkeys to get the Top N of these.' Well, there may be no need to do the
> >>>> sort: HBase keeps data in dictionary order, so you just fetch N of them and
> >>>> they are already ordered.
> >>>> 4. I have not used HBase for long and am in fact still a noob at it :).
> >>>> I would be glad if anything here helps you.
> >>>>
> >>>> Best Regards,
> >>>> NN
> >>>>
> >>>>
> >>>> 2012/6/5 Em <mailformailinglists@yahoo.de>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> what do you mean by endpoint?
> >>>>>
> >>>>> It would look more like
> >>>>>
> >>>>> T2 {
> >>>>>   rowkey: t1_id-(Long.MAX_VALUE - time)
> >>>>>   {
> >>>>>      family: qualifier = dummyDataSinceOnlyTheRowkeyMatters
> >>>>>   }
> >>>>> }
> >>>>>
> >>>>> For every t1_id associated with a specific object, one gets the newest
> >>>>> entry in the T2-table (newest in relation to the key, not the internal
> >>>>> timestamp of creation).
> >>>>> This data is then sorted by the time part of the returned rowkeys to get
> >>>>> the Top N of these.
> >>>>> And then you get N records from t1 again.
> >>>>>
> >>>>> At least, that's what I thought about, though I am not sure that this is
> >>>>> the most efficient way.
> >>>>>
> >>>>> Kind regards,
> >>>>> Em
> >>>>>
> >>>>> Am 05.06.2012 04:33, schrieb NNever:
> >>>>>> Does the schema look like this:
> >>>>>>
> >>>>>> T2{
> >>>>>>   rowkey: rs-time row
> >>>>>>    {
> >>>>>>        family:qualifier =  t1's row
> >>>>>>    }
> >>>>>> }
> >>>>>>
> >>>>>> Then you Scan the newest 1000 from T2, for each get its t1Row, and then do
> >>>>>> 1000 Gets from T1 for one page?
> >>>>>>
> >>>>>> 2012/6/5 NNever <nneverwei@gmail.com>
> >>>>>>
> >>>>>>> '- I'd like to do the top N stuff on the server side to reduce traffic,
> >>>>>>> will this be possible?'
> >>>>>>>
> >>>>>>> Endpoint?
> >>>>>>>
> >>>>>>>
> >>>>>>> 2012/6/5 Em <mailformailinglists@yahoo.de>
> >>>>>>>
> >>>>>>>> Hello list,
> >>>>>>>>
> >>>>>>>> let's say I have to fetch a lot of rows for a page-request (say
> >>>>>>>> 1.000-2.000).
> >>>>>>>> The row-keys are a composition of a fixed id of an object and a
> >>>>>>>> sequential, ever-increasing id. Salting those keys for balancing may be
> >>>>>>>> taken into consideration.
> >>>>>>>>
> >>>>>>>> I want to do a Join like this one expressed in SQL:
> >>>>>>>>
> >>>>>>>> SELECT t1.columns FROM t1
> >>>>>>>> JOIN t2 ON (t1.id = t2.id)
> >>>>>>>> WHERE t2.id = fixedID-prefix
> >>>>>>>>
> >>>>>>>> I know that HBase does not support that out of the box.
> >>>>>>>> My approach is to have all the fixed-ids as columns of a row in t1.
> >>>>>>>> Selecting a row, I fetch those columns that are of interest for me,
> >>>>>>>> where each column contains a fixedID for t2.
> >>>>>>>> Now I do a scan on t2 for each fixedID which should return me exactly
> >>>>>>>> one value per fixedID (it's kind of a reverse-timestamp approach like in
> >>>>>>>> the HBase book).
> >>>>>>>> Furthermore I am really only interested in the key itself. I don't care
> >>>>>>>> about the columns (t2 is more like an index).
> >>>>>>>> Having fetched a row per fixedID, I sort based on the sequential part of
> >>>>>>>> their key and get the top N.
> >>>>>>>> For those top N I'll fetch data from t1.
> >>>>>>>>
> >>>>>>>> The use case is to fetch the top N most recent entities of t1 that are
> >>>>>>>> associated with a specific entity in t1 by using t2 as an index.
> >>>>>>>> T2 has one extra benefit over t1: You can do range-scans, if necessary.
> >>>>>>>>
> >>>>>>>> Questions:
> >>>>>>>> - since this is triggered by a page-request: Will this return with low
> >>>>>>>> latency?
> >>>>>>>> - is there a possibility to do those Scans in a batch? Maybe I can
> >>>>>>>> combine them into one big scanner, using a custom filter for what I want?
> >>>>>>>> - do you have thoughts on improving this type of request?
> >>>>>>>> - I'd like to do the top N stuff on the server side to reduce traffic,
> >>>>>>>> will this be possible?
> >>>>>>>> - I am not sure whether a Scan is really what I want. Maybe a Multiget
> >>>>>>>> will fit my needs better, combined with a RowFilter?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I am really working hard on finding the best approach for mapping this
> >>>>>>>> m:n-relation to an HBase schema - so any help is appreciated.
> >>>>>>>>
> >>>>>>>> Please note: I haven't written a single line of HBase code so far.
> >>>>>>>> Currently I am studying books, blog-posts, slides and the mailing lists
> >>>>>>>> to learn more about HBase.
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> Kind regards,
> >>>>>>>> Em
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> >
>
