hbase-user mailing list archives

From NNever <nnever...@gmail.com>
Subject Re: Scan triggered per page-request, performance-impacts?
Date Thu, 07 Jun 2012 06:25:11 GMT
> Okay, I understand. Is there a way to batch these Scans more efficiently?
In my opinion, no. But you can move those scans into one Endpoint method; that
way you make only a single call to the HBase server, sending all the rowkeys
(for the startRow/stopRow pairs) to the server together.
This may reduce the number of calls to the server, but not necessarily the
number of RPC calls, since the Endpoint itself makes RPC calls too.
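For illustration, here is a minimal client-side sketch of what those per-prefix
scans look like without an Endpoint (one Scan, and at least one RPC round trip,
per prefix). It assumes the 0.94-era Java client API; the table name "index"
and the prefixes are made up:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PerPrefixScans {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable index = new HTable(conf, "index");                // hypothetical index table
        List<String> prefixes = Arrays.asList("Lars", "Stack");  // subscribed blog ids

        List<byte[]> newestPerBlog = new ArrayList<byte[]>();
        for (String prefix : prefixes) {
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes(prefix));
            scan.setStopRow(Bytes.toBytes(prefix + '\uffff'));   // "big enough" terminator char
            scan.setCaching(1);                                  // only the first row is needed
            ResultScanner scanner = index.getScanner(scan);
            try {
                Result first = scanner.next();                   // newest entry for this prefix
                if (first != null) {
                    newestPerBlog.add(first.getRow());
                }
            } finally {
                scanner.close();
            }
        }
        // An Endpoint could take the whole prefix list and run this loop on the
        // server side, per region, instead of opening one scanner per prefix here.
        index.close();
    }
}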

> Do you know when to use a RowFilter for Gets? I don't see why they exist
(I think they were mentioned for Gets somewhere).
I have never used a RowFilter with a Get. When we do a Get we already know the
rowkey, so a RowFilter seems to serve no purpose there.
Maybe someone who has used a RowFilter with a Get before can help you with
this question.

Yours,
NN


2012/6/6 Em <mailformailinglists@yahoo.de>

> Hi NN,
>
> answers are inline.
>
> On 06.06.2012 03:37, NNever wrote:
> >> Am I able to do this with one scan?
> > No, I don't think so (unless you define a custom filter, but that may not
> > be fast enough). And you may have misunderstood the Scan in step 2.
> > For example, suppose you subscribe to Lars and Stack. Then there will be 2
> > Scans with startRow/stopRow, that is:
> >
> > Scan scan1 = new Scan();
> > scan1.setStartRow(Bytes.toBytes("Lars"));
> > scan1.setStopRow(Bytes.toBytes("Lars" + X));   // X stands for a big enough char
> > ...doScan... (with a filter or something to get the 1st record in the result)
> >
> > Scan scan2 = new Scan();
> > scan2.setStartRow(Bytes.toBytes("Stack"));
> > scan2.setStopRow(Bytes.toBytes("Stack" + X));   // X stands for a big enough char
> > ...doScan... (with a filter or something to get the 1st record in the result)
> >
> > So, if the data in the Index Table are like this:
> > Lars_72345
> > Lars_72440
> > Lucy_13231
> > Lucy_23211
> > Lucy_24111
> > Stack_64561
> > Stack_65552
> > ...
>
> Okay, I understand. Is there a way to batch these Scans more
> efficiently? i.e. executing more than one Scan per RPC-roundtrip?
> This way I would have to make n Scans which results in n RPC-roundtrips,
> where n is the number of subscriptions I have.
> For large n this would kill the performance.
> In contrast, having 2.000 Gets (which is not possible here, as you said)
> might be more performant, since you can group the Gets per region, which
> leads to fewer RPC round trips. Is there an equivalent for Scans?
> I think you should be able to predict where a Scan will be executed,
> given the start- and stop-rows.
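A small sketch of that idea, grouping the scans' start rows by the region that
would serve them; this assumes the 0.94-era HTable API, and the table name and
prefixes are made up for illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class GroupScansByRegion {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable index = new HTable(conf, "index");   // hypothetical index table

        List<String> prefixes = Arrays.asList("Lars", "Lucy", "Stack");  // subscribed blog ids

        // Bucket the scan start rows by the name of the region that holds them.
        Map<String, List<String>> byRegion = new HashMap<String, List<String>>();
        for (String prefix : prefixes) {
            String region = index.getRegionLocation(Bytes.toBytes(prefix))
                                 .getRegionInfo().getRegionNameAsString();
            List<String> bucket = byRegion.get(region);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                byRegion.put(region, bucket);
            }
            bucket.add(prefix);
        }
        // Each bucket could then be handled by a single server-side call (e.g. an
        // Endpoint) to that region, rather than one client-side Scan per prefix.
        System.out.println(byRegion);
        index.close();
    }
}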
>
> > Then the Result should be:
> > Lars_72345,
> > Stack_64561.
> >
> >> Unfortunately I haven't found that much about when to do a Scan and when
> >> to do a Get
> > When you know exactly what the rowkey is, do a Get (just like when you use
> > an ID: an ID maps to a single row, and you fetch only that row).
> > When you only know the prefix of the rowkey, do a Scan. (A Scan can return
> > more than one result.)
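As a minimal illustration of that distinction (the rowkey and prefix are made
up from the example above):

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class GetVsScan {
    public static void main(String[] args) {
        // Exact rowkey known -> a Get addresses exactly one row.
        Get get = new Get(Bytes.toBytes("Lars_72345"));

        // Only the rowkey prefix known -> a Scan over [prefix, prefix + terminator)
        // can match any number of rows that start with the prefix.
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("Lars"));
        scan.setStopRow(Bytes.toBytes("Lars" + '\uffff'));

        System.out.println(get + " vs " + scan);
    }
}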
> Thanks!
> Do you know when to use a RowFilter for Gets?
> I don't see why they exist (I think they were mentioned for Gets
> somewhere).
>
> >
> >> You mean instead of designing the key [blogId]_[timestamp] I should do it
> >> this way: [timestamp]_[blogId]?
> > No no... As I said, 'Sort all fetched IndexRows'; those fetched rows are the
> > Scan results above:
> > Lars_72345 and Stack_64561.
> > Sort them by time, then you get:
> > Stack_64561,
> > Lars_72345
>
> I do not see how this differs from my explanation. Or do we mean the same
> thing? :)
>
> You led me to an idea of how to improve my scenario.
> One of the later steps in my concept was to retrieve the top N
> index-entries, sort them by their timestamp and do a Get for each of the
> N to retrieve the N most recent blogposts.
> Well, there is a better option:
> Instead of returning only the most recent entry for Lars by a Scan, I can
> retrieve the top N most recent entries for Lars from the index-table.
> If I have to do my Scans one after another, since they are not executable in
> a batch out of the box, I could specify a tighter scan range for "Stack":
>
> Say Lars's third-most-recent blogpost's key in the index-table looks like:
>
> Lars_74244
>
> The scan for Stack's top N then only has to cover the range up to the
> stop-row Stack_74244, since I only care about blogposts by Stack that are
> more recent than the ones I already have from Lars (with reverse timestamps,
> those sort before Stack_74244).
> I think that the Scan should return instantly if there are no rowkeys
> that fit this criterion, shouldn't it?
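A small sketch of that tightened second scan, assuming the reverse-timestamp
rowkeys from the example (the cutoff value is made up):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BoundedStackScan {
    public static Scan build() {
        // Hypothetical cutoff: Lars's third-most-recent index key is "Lars_74244".
        // With reverse timestamps, a smaller suffix means a newer post.
        String cutoffSuffix = "_74244";

        // Only Stack entries newer than the cutoff can still make the overall
        // top 3, so the scan stops at "Stack" + cutoff instead of covering
        // Stack's whole key range. If nothing falls in this range, the scanner
        // simply returns no rows.
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("Stack_"));
        scan.setStopRow(Bytes.toBytes("Stack" + cutoffSuffix));
        return scan;
    }
}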
>
> Kind regards,
> Em
>
> >
> >
> >
> > 2012/6/6 Em <mailformailinglists@yahoo.de>
> >
> >> Thanks for your feedback!
> >>
> >>> 2.  Scan the newest from Index table for each subscribed blogID
> >> Am I able to do this with one scan?
> >> Since all my blogs are relevant, this could lead to a start and stop row
> >> with a range where almost every other blog in the database fits in
> >> (think of a blog starting with a and another blog starting with z - both
> >> are in my subscription list).
> >> I thought that filtering would only scale linearly with the size of the
> >> table itself?
> >>
> >> Unfortunately I haven't found that much about when to do a Scan and when
> >> to do a Get. Especially if the keys all start differently.
> >>
> >>> 3.  Sort all fetched IndexRows by publication-date (in fact, not sort all,
> >>> but get the newest 3; that may be faster)
> >> You mean instead of designing the key
> >> [blogId]_[timestamp]
> >> I should do it this way:
> >> [timestamp]_[blogId]?
> >>
> >> Well, it depends on your scenario.
> >> If you want to know the most recent blogposts globally or within a
> >> period of time, you are absolutely right. But if you want to know it for
> >> a specific user/blog, limited to his subscriptions, this could be really
> >> slow if the most recent blog posts relevant to this user are relatively
> >> old.
> >>
> >> Did you mean that or something different?
> >>
> >> Kind regards,
> >> Em
> >>
> >> On 05.06.2012 11:18, NNever wrote:
> >>> Very clear now :).
> >>> Only one problem,
> >>>
> >>> blog {//this is t1 of my example
> >>>    blogposts {//the column family
> >>>       05.05.2012_something { the blog post },//this is a column
> >>>       06.05.2012_anything  { the blog post },
> >>>       05.06.2012_nothing   { the blog post }
> >>>    },
> >>> ...
> >>>
> >>> here, 05.05.2012_something may not sort easily. You would have to fetch
> >>> out all posts and sort them; when the post count becomes huge, this can be
> >>> terrible. Just change it to reverseTime_postTitle, then you may use a
> >>> ColumnPaginationFilter to easily fetch the newest 3 columns.
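A minimal sketch of that fetch, assuming a handle on the blog table and the
reverseTime_postTitle column naming (the table handle, family name and blog id
are illustrative):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class NewestPosts {
    // Fetch the 3 newest posts of one blog row. With reverseTime_postTitle
    // column names, the newest posts are the first columns in dictionary order,
    // so ColumnPaginationFilter(limit=3, offset=0) returns exactly those.
    static Result newestThree(HTable blogTable, String blogId) throws IOException {
        Get get = new Get(Bytes.toBytes(blogId));
        get.addFamily(Bytes.toBytes("blogposts"));
        get.setFilter(new ColumnPaginationFilter(3, 0));
        return blogTable.get(get);
    }
}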
> >>>
> >>>
> >>> All in all, when you want to get the most recent subscribed posts for
> >>> someone, you do:
> >>> 1.  Get all values of Blog.subscribed_blogs (the subscribed blogIDs)
> >>> ---------------------- get by row, not slow
> >>> 2.  Scan the newest from Index table for each subscribed blogID
> >>> -------------------- Scan with startRow and stopRow, not slow
> >>> 3.  Sort all fetched IndexRows by publication-date (in fact, not sort all,
> >>> but get the newest 3; that may be faster) ---------------- not slow
> >>> 4.  Using the 3 blogIDs above, on table Blog get the 3 newest columns of
> >>> CF blogposts for each ------------------ uses dictionary order and a
> >>> ColumnPaginationFilter; not slow, but I wonder how fast it will be
> >>> 5.  Compare those 9 posts and take the newest 3 ------------ not slow
> >>> Overall, almost all lookups use the rowkey, so the whole process should
> >>> not have much delay, I think.
> >>>
> >>>
> >>> Best Regards,
> >>> NN
> >>>
> >>>
> >>>
> >>> 2012/6/5 Em <mailformailinglists@yahoo.de>
> >>>
> >>>> Correction of my last sentences:
> >>>>> So, what happens if TechCrunch writes a new blog post?
> >>>>> It will create a new column in its row's blogposts-CF and trigger a
> >>>>> million writes in the index-table (which only writes keys and empty
> >>>>> values of 0-byte length - I assume that's the cheapest write I can do).
> >>>> Of course I mean it will NOT trigger a million writes in the index
> >>>> table, but only ONE write for this post.
> >>>>
> >>>> Kind regards,
> >>>> Em
> >>>>
> >>>> On 05.06.2012 10:00, Em wrote:
> >>>>> NN,
> >>>>>
> >>>>> thanks for pointing me to Coprocessors. I'll take a look at them!
> >>>>>
> >>>>> Okay, I see that my descriptions are confusing.
> >>>>>
> >>>>> Let me give you an example of some simplified entities in my tables:
> >>>>>
> >>>>> blog {//this is t1 of my example
> >>>>>     blogposts {//the column family
> >>>>>        05.05.2012_something { the blog post },//this is a column
> >>>>>        06.05.2012_anything  { the blog post },
> >>>>>        05.06.2012_nothing   { the blog post }
> >>>>>     },
> >>>>>     subscribed_blogs {
> >>>>>        Wil_Wheaton's Blog { date_of_subscription },
> >>>>>        Sheldon's Blog     { date_of_subscription },
> >>>>>        Penny's Blog       { date_of_subscription },
> >>>>>        ... hundreds of other blogs ...
> >>>>>     }
> >>>>> }
> >>>>>
> >>>>> This blog has 3 blogposts. Each column of the user's blogposts
> >>>>> column-family contains a blogpost, where the column-name contains the
> >>>>> date and the title. This way columns can be accessed ordered by date.
> >>>>> Now this blog (or rather its author) is following some other blogs.
> >>>>> I do not want to get the posts of the subscribed blogs and write them
> >>>>> into the blog's row (duplicating the posts of the followed blogs).
> >>>>> The reason is that you would have to keep in sync with the original
> >>>>> posts.
> >>>>> Furthermore, a very popular blog could trigger millions of writes (at
> >>>>> least one write per user). This is too much.
> >>>>>
> >>>>> So I want to build an index, t2. Let's call this table "index".
> >>>>>
> >>>>> index {
> >>>>>     dummy_column_family {
> >>>>>        dummy_column { I only care about the rowkey }
> >>>>>     }
> >>>>> }
> >>>>>
> >>>>> If a blog writes a new post, I'll write that post into the blog's
> >>>>> blogposts CF and additionally into the index table.
> >>>>> The rowkey would look like:
> >>>>> [blog_id]_[Long.MAX_VALUE - publication-date] (doing it this way, the
> >>>>> entries are sorted by HBase in LIFO order).
> >>>>> Note: The publication date could be in the future! So it's not the date
> >>>>> of creation.
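A minimal sketch of composing such an index rowkey, assuming the blog id is a
plain string and the publication date is given as epoch milliseconds (both
illustrative):

import org.apache.hadoop.hbase.util.Bytes;

public class IndexKey {
    // Compose [blog_id]_[Long.MAX_VALUE - publicationDate] so that HBase's
    // ascending key order yields the newest post of a blog first.
    static byte[] indexRowKey(String blogId, long publicationDateMillis) {
        long reverseTs = Long.MAX_VALUE - publicationDateMillis;
        // Zero-pad to a fixed width so lexicographic byte order matches numeric order.
        return Bytes.toBytes(blogId + "_" + String.format("%019d", reverseTs));
    }
}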
> >>>>>
> >>>>> Now, if I have a blog and I subscribed to 1.000 other blogs as well: To
> >>>>> generate a list of the most recent blog-posts of the blogs I subscribed
> >>>>> to, I do the following:
> >>>>>
> >>>>> Read every column of my subscribed_blogs column family (they contain the
> >>>>> other blogs' ids).
> >>>>>
> >>>>> For each column, I want to do a lookup in my index-table (Scan or Get, I
> >>>>> am not sure what to use, since one may be able to batch the stuff):
> >>>>> Get: blog_id* (the "*" means that the rowkey should start with the
> >>>>> specified blog_id).
> >>>>> I want to fetch only the most recent per blog_id.
> >>>>> Now I have 1.000 rowkeys, each containing a blog_id and a timestamp.
> >>>>> Let's sort by timestamp and get the top 3 (maybe I can do some part of
> >>>>> this work on the server side).
> >>>>> I see that my top 3 list contains a post from Wil Wheaton, one from
> >>>>> Sheldon and another one from TechCrunch.
> >>>>>
> >>>>> Now I'll do three Gets in my blog-table:
> >>>>> One for Wil Wheaton's blog, another for Sheldon's and one for
> >>>>> TechCrunch.
> >>>>> Since their columns are sorted by date, I'll fetch the latest 3
> >>>>> blog-posts of each blog, returning 9 blog-posts in sum.
> >>>>> Now I am able to sort these 9 blog-posts by their date and display the
> >>>>> top 3.
> >>>>>
> >>>>> Why do I fetch the top 3 of each blog and sort them again?
> >>>>> Well, if Wil Wheaton wrote a blog-post yesterday, TechCrunch wrote one
> >>>>> two days ago and Sheldon wrote two posts today, and I want to get the
> >>>>> three most recent posts of all my subscribed blogs, then TechCrunch is
> >>>>> out of this list, since it has the 4th-most-recent blog-post.
> >>>>>
> >>>>> I hope the scenario is clearer now.
> >>>>>
> >>>>> So, what happens if TechCrunch writes a new blog post?
> >>>>> It will create a new column in its row's blogposts-CF and trigger a
> >>>>> million writes in the index-table (which only writes keys and empty
> >>>>> values of 0-byte length - I assume that's the cheapest write I can do).
> >>>>>
> >>>>> Kind regards,
> >>>>> Em
> >>>>>
> >>>>>
> >>>>> On 05.06.2012 08:07, NNever wrote:
> >>>>>> 1. Endpoint is a kind of Coprocessor; it was added in 0.92. You can
> >>>>>> think of it a little like a relational database's stored procedure: it
> >>>>>> is logic that runs on the HBase server side. With it you may reduce your
> >>>>>> app's RPC calls, or, as you said, reduce traffic.
> >>>>>> You can get some help on Coprocessors/Endpoints from here:
> >>>>>> https://blogs.apache.org/hbase/entry/coprocessor_introduction
> >>>>>> 2. I am still a little confused about what exactly you want with this
> >>>>>> table structure (sorry for that, but my mother language is not English).
> >>>>>> You mean t1 is the original data of some objects,
> >>>>>> and t2 keeps something about the objects in t1? (Like logs: 10:11 em
> >>>>>> checked t1obj1; 10:13 em bought t1obj1; 10:30 em took away t1obj1.)
> >>>>>> 3. You said 'This data is then sorted by the time part of the returned
> >>>>>> rowkeys to get the Top N of these.' Well, there may be no need to do the
> >>>>>> sort. HBase keeps data in dictionary order, so you just fetch N of them
> >>>>>> and they are already ordered.
> >>>>>> 4. I have not used HBase for long; in fact I'm still a noob at it :). I
> >>>>>> would be glad if anything here helps you.
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> NN
> >>>>>>
> >>>>>>
> >>>>>> 2012/6/5 Em <mailformailinglists@yahoo.de>
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> what do you mean by endpoint?
> >>>>>>>
> >>>>>>> It would look more like
> >>>>>>>
> >>>>>>> T2 {
> >>>>>>>   rowkey: t1_id-(Long.MAX_VALUE - time)
> >>>>>>>   {
> >>>>>>>      family: qualifier = dummyDataSinceOnlyTheRowkeyMatters
> >>>>>>>   }
> >>>>>>> }
> >>>>>>>
> >>>>>>> For every t1_id associated with a specific object, one gets the newest
> >>>>>>> entry in the T2-table (newest in relation to the key, not the internal
> >>>>>>> timestamp of creation).
> >>>>>>> This data is then sorted by the time part of the returned rowkeys to get
> >>>>>>> the Top N of these.
> >>>>>>> And then you get N records from t1 again.
> >>>>>>>
> >>>>>>> At least, that's what I thought about, though I am not sure that this
> >>>>>>> is the most efficient way.
> >>>>>>>
> >>>>>>> Kind regards,
> >>>>>>> Em
> >>>>>>>
> >>>>>>> On 05.06.2012 04:33, NNever wrote:
> >>>>>>>> Does the schema look like this:
> >>>>>>>>
> >>>>>>>> T2{
> >>>>>>>>   rowkey: rs-time row
> >>>>>>>>    {
> >>>>>>>>        family:qualifier =  t1's row
> >>>>>>>>    }
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> Then you Scan the newest 1000 from T2, get each one's t1Row, and then
> >>>>>>>> do 1000 Gets from T1 for one page?
> >>>>>>>>
> >>>>>>>> 2012/6/5 NNever <nneverwei@gmail.com>
> >>>>>>>>
> >>>>>>>>> '- I'd like to do the top N stuff on the server side to reduce
> >>>>>>>>> traffic, will this be possible?'
> >>>>>>>>>
> >>>>>>>>> Endpoint?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2012/6/5 Em <mailformailinglists@yahoo.de>
> >>>>>>>>>
> >>>>>>>>>> Hello list,
> >>>>>>>>>>
> >>>>>>>>>> let's say I have to fetch a lot of rows for a page-request (say
> >>>>>>>>>> 1.000-2.000).
> >>>>>>>>>> The row-keys are a composition of a fixed id of an object and a
> >>>>>>>>>> sequential ever-increasing id. Salting those keys for balancing may
> >>>>>>>>>> be taken into consideration.
> >>>>>>>>>>
> >>>>>>>>>> I want to do a Join like this one expressed in SQL:
> >>>>>>>>>>
> >>>>>>>>>> SELECT t1.columns FROM t1
> >>>>>>>>>> JOIN t2 ON (t1.id = t2.id)
> >>>>>>>>>> WHERE t2.id = fixedID-prefix
> >>>>>>>>>>
> >>>>>>>>>> I know that HBase does not support that out of the box.
> >>>>>>>>>> My approach is to have all the fixed-ids as columns of a row in t1.
> >>>>>>>>>> Selecting a row, I fetch those columns that are of interest for me,
> >>>>>>>>>> where each column contains a fixedID for t2.
> >>>>>>>>>> Now I do a scan on t2 for each fixedID, which should return me exactly
> >>>>>>>>>> one value per fixedID (it's kind of a reverse-timestamp approach like
> >>>>>>>>>> in the HBase book).
> >>>>>>>>>> Furthermore I am really only interested in the key itself. I don't
> >>>>>>>>>> care about the columns (t2 is more like an index).
> >>>>>>>>>> Having fetched a row per fixedID, I sort based on the sequential part
> >>>>>>>>>> of their key and get the top N.
> >>>>>>>>>> For those top N I'll fetch data from t1.
> >>>>>>>>>>
> >>>>>>>>>> The use case is to fetch the top N most recent entities of t1 that are
> >>>>>>>>>> associated with a specific entity in t1 by using t2 as an index.
> >>>>>>>>>> T2 has one extra benefit over t1: You can do range-scans, if
> >>>>>>>>>> necessary.
> >>>>>>>>>>
> >>>>>>>>>> Questions:
> >>>>>>>>>> - since this is triggered by a page-request: Will this return with
> >>>>>>>>>> low latency?
> >>>>>>>>>> - is there a possibility to do those Scans in a batch? Maybe I can
> >>>>>>>>>> combine them into one big scanner, using a custom filter for what I
> >>>>>>>>>> want?
> >>>>>>>>>> - do you have thoughts on improving this type of request?
> >>>>>>>>>> - I'd like to do the top N stuff on the server side to reduce traffic,
> >>>>>>>>>> will this be possible?
> >>>>>>>>>> - I am not sure whether a Scan is really what I want. Maybe a Multiget
> >>>>>>>>>> will fit my needs better combined with a RowFilter?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I am really working hard on finding the best approach for mapping
> >>>>>>>>>> this m:n-relation to an HBase schema - so any help is appreciated.
> >>>>>>>>>>
> >>>>>>>>>> Please note: I haven't written a single line of HBase code so far.
> >>>>>>>>>> Currently I am studying books, blog-posts, slides and the mailing
> >>>>>>>>>> lists to learn more about HBase.
> >>>>>>>>>>
> >>>>>>>>>> Thanks!
> >>>>>>>>>>
> >>>>>>>>>> Kind regards,
> >>>>>>>>>> Em
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> >
>
