From: NNever
Date: Tue, 5 Jun 2012 17:18:43 +0800
Subject: Re: Scan triggered per page-request, performance-impacts?
To: user@hbase.apache.org

Very clear now :). Only one problem:

blog {                                        // this is t1 of my example
  blogposts {                                 // the column family
    05.05.2012_something { the blog post },   // this is a column
    06.05.2012_anything  { the blog post },
    05.06.2012_nothing   { the blog post }
  },
  ...
}

Here, column names like 05.05.2012_something are not easy to sort: the
day-first date does not sort chronologically in HBase's byte order, so you
would have to fetch all posts and sort them on the client, and once the
number of posts gets huge that becomes terrible. Just change the column name
to reverseTime_postTitle, then you can use a ColumnPaginationFilter to fetch
the newest 3 columns directly.
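For example, something like this (a rough, untested sketch -- the table,
family and row key names are just taken from your example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable blogTable = new HTable(conf, "blog");

    // Column qualifiers start with (Long.MAX_VALUE - publicationDate), so
    // inside the "blogposts" family the newest posts sort first.
    Get get = new Get(Bytes.toBytes("techcrunch"));     // some blog's row key
    get.addFamily(Bytes.toBytes("blogposts"));
    get.setFilter(new ColumnPaginationFilter(3, 0));    // limit 3 columns, offset 0
    Result newest3 = blogTable.get(get);                // the 3 newest posts of this blog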
After all, when you want to get the most recent subscribed posts for someone,
you do:

1. Get all values of Blog.subscribed_blogs (the subscribed blogIDs)
   ---------------------- get by row, not slow
2. Scan the newest row from the Index table for each subscribed blogID
   ---------------------- Scan with startRow and stopRow, not slow
3. Sort the fetched index rows by publication date (in fact you don't sort
   them all, you only keep the newest 3, which may be faster)
   ---------------------- not slow
4. Use the 3 blogIDs from above and, for each of them, get the 3 newest
   columns of the blogposts CF from the Blog table
   ---------------------- dictionary order plus ColumnPaginationFilter,
   not slow, but I wonder how fast it will be
5. Compare those 9 posts and keep the newest 3
   ---------------------- not slow

Overall, almost every lookup goes through the row key, so the whole process
should not add much delay, I think.
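For step 2, the scan for one subscribed blog could look roughly like this
(again only a sketch; conf is the Configuration from the snippet above,
blogId is a placeholder, and the "index" table plus its key layout are the
ones from Em's mail quoted below):

    import java.util.Arrays;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    HTable indexTable = new HTable(conf, "index");
    String blogId = "wil_wheatons_blog";           // one subscribed blog id (placeholder)

    // Index row key: <blogId>_<Long.MAX_VALUE - publicationDate>, so the
    // newest post of a blog is the first row inside that blog's key range.
    byte[] startRow = Bytes.toBytes(blogId + "_");
    byte[] stopRow  = Arrays.copyOf(startRow, startRow.length);
    stopRow[stopRow.length - 1]++;                 // end of the "<blogId>_" prefix range (exclusive)

    Scan scan = new Scan(startRow, stopRow);
    scan.setCaching(1);                            // we only need the first row
    ResultScanner scanner = indexTable.getScanner(scan);
    Result newestForThisBlog = scanner.next();     // null if the blog has no posts yet
    scanner.close();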
Best Regards,
NN

2012/6/5 Em

> Correction of my last sentences:
>
> > So, what happens if Techcrunch writes a new blog post?
> > It will create a new column in its row's blogposts-CF and trigger a
> > million writes in the index-table (which only writes keys and empty
> > values of 0-byte length - I assume that's the cheapest write I can do).
>
> Of course I meant it will NOT trigger a million writes in the index
> table, but only ONE write for this post.
>
> Kind regards,
> Em
>
> Am 05.06.2012 10:00, schrieb Em:
> > NN,
> >
> > thanks for the pointer to Coprocessors. I'll take a look at them!
> >
> > Okay, I see that my descriptions were confusing.
> >
> > Let me give you an example of some simplified entities in my tables:
> >
> > blog {                                        // this is t1 of my example
> >   blogposts {                                 // the column family
> >     05.05.2012_something { the blog post },   // this is a column
> >     06.05.2012_anything  { the blog post },
> >     05.06.2012_nothing   { the blog post }
> >   },
> >   subscribed_blogs {
> >     Wil_Wheaton's Blog { date_of_subscription },
> >     Sheldon's Blog     { date_of_subscription },
> >     Penny's Blog       { date_of_subscription },
> >     ... hundreds of other blogs ...
> >   }
> > }
> >
> > This blog has 3 blog posts. Each column of the blog's blogposts
> > column family contains a blog post, where the column name contains the
> > date and the title. This way columns can be accessed ordered by date.
> > Now this blog (or rather its author) is following some other blogs.
> > I do not want to fetch the posts of the subscribed blogs and write them
> > into the blog's row (duplicating the posts of the followed blogs).
> > The reason is that you would have to keep them in sync with the
> > original posts. Furthermore, a very popular blog could trigger millions
> > of writes (at least one write per user). This is too much.
> >
> > So I want to build an index, t2. Let's call this table "index".
> >
> > index {
> >   dummy_column_family {
> >     dummy_column { I only care about the rowkey. }
> >   }
> > }
> >
> > If a blog writes a new post, I'll write that post into the blogposts
> > table for the blog and additionally into the index table.
> > The rowkey would look like:
> > [blog_id]_[Long.MAX_VALUE - publication-date] (doing it this way, the
> > rows are sorted by HBase in LIFO order).
> > Note: the publication date could be in the future! So it's not the date
> > of creation.
> >
> > Now, if I have a blog and I have subscribed to 1,000 other blogs as
> > well, to generate a list of the most recent blog posts of the blogs I
> > subscribed to, I do the following:
> >
> > Read every column of my subscribed_blogs column family (they contain
> > the other blogs' ids).
> >
> > For each column, I want to do a lookup in my index table (Scan or Get,
> > I am not sure what to use, since one may be able to batch the stuff):
> > Get: blog_id* (the "*" means that the rowkey should start with the
> > specified blog_id).
> > I want to fetch only the most recent row per blog_id.
> > Now I have 1,000 rowkeys, each containing a blog_id and a timestamp.
> > Let's sort them by timestamp and take the top 3 (maybe I can do some
> > part of this work on the server side).
> > I see that my top-3 list contains a post from Wil Wheaton, one from
> > Sheldon and another one from Techcrunch.
> >
> > Now I'll do three Gets on my blog table:
> > one for Wil Wheaton's blog, another for Sheldon's and one for
> > Techcrunch.
> > Since their columns are sorted by date, I'll fetch the latest 3
> > blog posts of each blog, returning 9 blog posts in sum.
> > Now I am able to sort these 9 blog posts by their date and display the
> > top 3.
> >
> > Why do I fetch the top 3 of each blog and sort them again?
> > Well, if Wil Wheaton wrote a blog post yesterday, Techcrunch wrote one
> > two days ago and Sheldon wrote two posts today, and I want to get the
> > three most recent posts of all my subscribed blogs, then Techcrunch is
> > out of this list, since it only has the 4th-most-recent blog post.
> >
> > I hope the scenario is clearer now.
> >
> > So, what happens if Techcrunch writes a new blog post?
> > It will create a new column in its row's blogposts-CF and trigger a
> > million writes in the index-table (which only writes keys and empty
> > values of 0-byte length - I assume that's the cheapest write I can do).
> >
> > Kind regards,
> > Em
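(A side note on the write path Em describes above: publishing one post then
means one Put into the blog table and exactly one Put with an empty value
into the index table. Roughly, and again untested -- blogId, title, postBody
and publicationDate are placeholders, blogTable/indexTable are the handles
from the sketches further up:)

    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    String blogId = "techcrunch";                         // placeholder values
    String title = "something";
    String postBody = "the blog post";
    long publicationDate = System.currentTimeMillis();
    long reverseTs = Long.MAX_VALUE - publicationDate;

    // 1) the post itself; the column name starts with the reverse timestamp,
    //    so the newest post sorts first inside the blogposts family
    Put postPut = new Put(Bytes.toBytes(blogId));
    postPut.add(Bytes.toBytes("blogposts"),
                Bytes.toBytes(reverseTs + "_" + title),
                Bytes.toBytes(postBody));
    blogTable.put(postPut);

    // 2) exactly one index entry per post, with an empty value --
    //    only the row key matters here
    Put indexPut = new Put(Bytes.toBytes(blogId + "_" + reverseTs));
    indexPut.add(Bytes.toBytes("dummy_column_family"),
                 Bytes.toBytes("dummy_column"),
                 HConstants.EMPTY_BYTE_ARRAY);
    indexTable.put(indexPut);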
> > Am 05.06.2012 08:07, schrieb NNever:
> >> 1. An Endpoint is a kind of Coprocessor; it was added in 0.92. You can
> >> think of it a little like a relational database's stored procedure:
> >> some logic that runs on the HBase server side. With it you may reduce
> >> your app's RPC calls or, as you said, reduce traffic.
> >> You can get some help on Coprocessors/Endpoints here:
> >> https://blogs.apache.org/hbase/entry/coprocessor_introduction
> >> 2. I am still a little confused about what exactly you want with this
> >> table structure (sorry for that, but my mother language is not
> >> English). You mean t1 is the original data of some objects, and t2
> >> keeps something about the objects in t1 (like logs: 10:11 em checks
> >> t1obj1; 10:13 em buys t1obj1; 10:30 em takes away t1obj1)?
> >> 3. You said 'This data is then sorted by the time part of the returned
> >> rowkeys to get the Top N of these.' Well, there may be no need to do
> >> that sort. HBase keeps data in dictionary order, so you just fetch N
> >> of them and they are already ordered.
> >> 4. I have not used HBase for long and am in fact still a noob at it :).
> >> I would be glad if anything here helps you.
> >>
> >> Best Regards,
> >> NN
> >>
> >> 2012/6/5 Em
> >>
> >>> Hi,
> >>>
> >>> what do you mean by endpoint?
> >>>
> >>> It would look more like
> >>>
> >>> T2 {
> >>>   rowkey: t1_id-(Long.MAX_VALUE - time)
> >>>   {
> >>>     family: qualifier = dummyDataSinceOnlyTheRowkeyMatters
> >>>   }
> >>> }
> >>>
> >>> For every t1_id associated with a specific object, one gets the newest
> >>> entry in the T2 table (newest in relation to the key, not the internal
> >>> timestamp of creation).
> >>> This data is then sorted by the time part of the returned rowkeys to
> >>> get the Top N of these.
> >>> And then you get N records from t1 again.
> >>>
> >>> At least, that's what I thought about, though I am not sure that this
> >>> is the most efficient way.
> >>>
> >>> Kind regards,
> >>> Em
> >>>
> >>> Am 05.06.2012 04:33, schrieb NNever:
> >>>> Does the schema look like this:
> >>>>
> >>>> T2 {
> >>>>   rowkey: rs-time row
> >>>>   {
> >>>>     family:qualifier = t1's row
> >>>>   }
> >>>> }
> >>>>
> >>>> Then you scan the newest 1,000 rows from T2, take each one's t1 row,
> >>>> and do 1,000 Gets from T1 for one page?
> >>>>
> >>>> 2012/6/5 NNever
> >>>>
> >>>>> '- I'd like to do the top N stuff on the server side to reduce
> >>>>> traffic, will this be possible?'
> >>>>>
> >>>>> Endpoint?
> >>>>>
> >>>>> 2012/6/5 Em
> >>>>>
> >>>>>> Hello list,
> >>>>>>
> >>>>>> let's say I have to fetch a lot of rows for a page request (say
> >>>>>> 1,000-2,000).
> >>>>>> The row keys are a composition of a fixed id of an object and a
> >>>>>> sequential, ever-increasing id. Salting those keys for balancing
> >>>>>> may be taken into consideration.
> >>>>>>
> >>>>>> I want to do a join like this one, expressed in SQL:
> >>>>>>
> >>>>>> SELECT t1.columns FROM t1
> >>>>>> JOIN t2 ON (t1.id = t2.id)
> >>>>>> WHERE t2.id = fixedID-prefix
> >>>>>>
> >>>>>> I know that HBase does not support that out of the box.
> >>>>>> My approach is to have all the fixed ids as columns of a row in t1.
> >>>>>> Selecting a row, I fetch those columns that are of interest to me,
> >>>>>> where each column contains a fixedID for t2.
> >>>>>> Now I do a scan on t2 for each fixedID, which should return exactly
> >>>>>> one value per fixedID (it's kind of a reverse-timestamp approach
> >>>>>> like in the HBase book).
> >>>>>> Furthermore, I am really only interested in the key itself. I don't
> >>>>>> care about the columns (t2 is more like an index).
> >>>>>> Having fetched a row per fixedID, I sort based on the sequential
> >>>>>> part of their keys and take the top N.
> >>>>>> For those top N I'll fetch data from t1.
> >>>>>>
> >>>>>> The use case is to fetch the top N most recent entities of t1 that
> >>>>>> are associated with a specific entity in t1, using t2 as an index.
> >>>>>> T2 has one extra benefit over t1: you can do range scans, if
> >>>>>> necessary.
> >>>>>>
> >>>>>> Questions:
> >>>>>> - Since this is triggered by a page request: will this return with
> >>>>>> low latency?
> >>>>>> - Is there a possibility to do those scans in a batch? Maybe I can
> >>>>>> combine them into one big scanner, using a custom filter for what
> >>>>>> I want?
> >>>>>> - Do you have thoughts on improving this type of request?
> >>>>>> - I'd like to do the top N stuff on the server side to reduce
> >>>>>> traffic, will this be possible?
> >>>>>> - I am not sure whether a Scan is really what I want. Maybe a
> >>>>>> multi-get combined with a RowFilter will fit my needs better?
> >>>>>>
> >>>>>> I am working hard on finding the best approach for mapping this
> >>>>>> m:n relation to an HBase schema - so any help is appreciated.
> >>>>>>
> >>>>>> Please note: I haven't written a single line of HBase code so far.
> >>>>>> Currently I am studying books, blog posts, slides and the mailing
> >>>>>> lists to learn more about HBase.
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> Kind regards,
> >>>>>> Em
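(One more note on the batching question in the original mail: the Gets of
step 4 in my list can at least be sent as one multi-get, so they don't cost
one round trip each. A rough, untested sketch -- top3BlogIds stands for the
3 blog ids picked in step 3, blogTable is the handle from the first snippet:)

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    List<String> top3BlogIds =
        Arrays.asList("wil_wheatons_blog", "sheldons_blog", "techcrunch");

    List<Get> gets = new ArrayList<Get>();
    for (String id : top3BlogIds) {
        Get g = new Get(Bytes.toBytes(id));
        g.addFamily(Bytes.toBytes("blogposts"));
        g.setFilter(new ColumnPaginationFilter(3, 0));  // newest 3 columns per blog
        gets.add(g);
    }
    Result[] newestPerBlog = blogTable.get(gets);       // batched in one call,
                                                        // grouped per region server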