From: NNever
Date: Tue, 5 Jun 2012 17:18:43 +0800
Subject: Re: Scan triggered per page-request, performance-impacts?
To: user@hbase.apache.org

Very clear now :). Only one problem:

blog {                                        // this is t1 of my example
  blogposts {                                 // the column family
    05.05.2012_something { the blog post },   // this is a column
    06.05.2012_anything  { the blog post },
    05.06.2012_nothing   { the blog post }
  },
  ...
}

Here, column names like 05.05.2012_something are not easy to sort: the
day-first date does not sort chronologically in HBase's byte order, so you
would have to fetch all posts and sort them on the client, and once the
number of posts gets huge that becomes terrible. Just change the column name
to reverseTime_postTitle, then you can use a ColumnPaginationFilter to fetch
the newest 3 columns directly.
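For example, something like this (a rough, untested sketch -- the table,
family and row key names are just taken from your example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable blogTable = new HTable(conf, "blog");

    // Column qualifiers start with (Long.MAX_VALUE - publicationDate), so
    // inside the "blogposts" family the newest posts sort first.
    Get get = new Get(Bytes.toBytes("techcrunch"));     // some blog's row key
    get.addFamily(Bytes.toBytes("blogposts"));
    get.setFilter(new ColumnPaginationFilter(3, 0));    // limit 3 columns, offset 0
    Result newest3 = blogTable.get(get);                // the 3 newest posts of this blog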
After all, when you want to get the most recent subscribed posts for someone,
you do:

1. Get all values of Blog.subscribed_blogs (the subscribed blogIDs)
   ---------------------- get by row, not slow
2. Scan the newest row from the Index table for each subscribed blogID
   ---------------------- Scan with startRow and stopRow, not slow
3. Sort the fetched index rows by publication date (in fact you don't sort
   them all, you only keep the newest 3, which may be faster)
   ---------------------- not slow
4. Use the 3 blogIDs from above and, for each of them, get the 3 newest
   columns of the blogposts CF from the Blog table
   ---------------------- dictionary order plus ColumnPaginationFilter,
   not slow, but I wonder how fast it will be
5. Compare those 9 posts and keep the newest 3
   ---------------------- not slow

Overall, almost every lookup goes through the row key, so the whole process
should not add much delay, I think.
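For step 2, the scan for one subscribed blog could look roughly like this
(again only a sketch; conf is the Configuration from the snippet above,
blogId is a placeholder, and the "index" table plus its key layout are the
ones from Em's mail quoted below):

    import java.util.Arrays;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    HTable indexTable = new HTable(conf, "index");
    String blogId = "wil_wheatons_blog";           // one subscribed blog id (placeholder)

    // Index row key: <blogId>_<Long.MAX_VALUE - publicationDate>, so the
    // newest post of a blog is the first row inside that blog's key range.
    byte[] startRow = Bytes.toBytes(blogId + "_");
    byte[] stopRow  = Arrays.copyOf(startRow, startRow.length);
    stopRow[stopRow.length - 1]++;                 // end of the "<blogId>_" prefix range (exclusive)

    Scan scan = new Scan(startRow, stopRow);
    scan.setCaching(1);                            // we only need the first row
    ResultScanner scanner = indexTable.getScanner(scan);
    Result newestForThisBlog = scanner.next();     // null if the blog has no posts yet
    scanner.close();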
Best Regards,
NN

2012/6/5 Em

> Correction of my last sentences:
>
> > So, what happens if Techcrunch writes a new blog post?
> > It will create a new column in its row's blogposts-CF and trigger a
> > million writes in the index-table (which only writes keys and empty
> > values of 0-byte length - I assume that's the cheapest write I can do).
>
> Of course I meant it will NOT trigger a million writes in the index
> table, but only ONE write for this post.
>
> Kind regards,
> Em
>
> Am 05.06.2012 10:00, schrieb Em:
> > NN,
> >
> > thanks for the pointer to Coprocessors. I'll take a look at them!
> >
> > Okay, I see that my descriptions were confusing.
> >
> > Let me give you an example of some simplified entities in my tables:
> >
> > blog {                                        // this is t1 of my example
> >   blogposts {                                 // the column family
> >     05.05.2012_something { the blog post },   // this is a column
> >     06.05.2012_anything  { the blog post },
> >     05.06.2012_nothing   { the blog post }
> >   },
> >   subscribed_blogs {
> >     Wil_Wheaton's Blog { date_of_subscription },
> >     Sheldon's Blog     { date_of_subscription },
> >     Penny's Blog       { date_of_subscription },
> >     ... hundreds of other blogs ...
> >   }
> > }
> >
> > This blog has 3 blog posts. Each column of the blog's blogposts
> > column family contains a blog post, where the column name contains the
> > date and the title. This way columns can be accessed ordered by date.
> > Now this blog (or rather its author) is following some other blogs.
> > I do not want to fetch the posts of the subscribed blogs and write them
> > into the blog's row (duplicating the posts of the followed blogs).
> > The reason is that you would have to keep them in sync with the
> > original posts. Furthermore, a very popular blog could trigger millions
> > of writes (at least one write per user). This is too much.
> >
> > So I want to build an index, t2. Let's call this table "index".
> >
> > index {
> >   dummy_column_family {
> >     dummy_column { I only care about the rowkey. }
> >   }
> > }
> >
> > If a blog writes a new post, I'll write that post into the blogposts
> > table for the blog and additionally into the index table.
> > The rowkey would look like:
> > [blog_id]_[Long.MAX_VALUE - publication-date] (doing it this way, the
> > rows are sorted by HBase in LIFO order).
> > Note: the publication date could be in the future! So it's not the date
> > of creation.
> >
> > Now, if I have a blog and I have subscribed to 1,000 other blogs as
> > well, to generate a list of the most recent blog posts of the blogs I
> > subscribed to, I do the following:
> >
> > Read every column of my subscribed_blogs column family (they contain
> > the other blogs' ids).
> >
> > For each column, I want to do a lookup in my index table (Scan or Get,
> > I am not sure what to use, since one may be able to batch the stuff):
> > Get: blog_id* (the "*" means that the rowkey should start with the
> > specified blog_id).
> > I want to fetch only the most recent row per blog_id.
> > Now I have 1,000 rowkeys, each containing a blog_id and a timestamp.
> > Let's sort them by timestamp and take the top 3 (maybe I can do some
> > part of this work on the server side).
> > I see that my top-3 list contains a post from Wil Wheaton, one from
> > Sheldon and another one from Techcrunch.
> >
> > Now I'll do three Gets on my blog table:
> > one for Wil Wheaton's blog, another for Sheldon's and one for
> > Techcrunch.
> > Since their columns are sorted by date, I'll fetch the latest 3
> > blog posts of each blog, returning 9 blog posts in sum.
> > Now I am able to sort these 9 blog posts by their date and display the
> > top 3.
> >
> > Why do I fetch the top 3 of each blog and sort them again?
> > Well, if Wil Wheaton wrote a blog post yesterday, Techcrunch wrote one
> > two days ago and Sheldon wrote two posts today, and I want to get the
> > three most recent posts of all my subscribed blogs, then Techcrunch is
> > out of this list, since it only has the 4th-most-recent blog post.
> >
> > I hope the scenario is clearer now.
> >
> > So, what happens if Techcrunch writes a new blog post?
> > It will create a new column in its row's blogposts-CF and trigger a
> > million writes in the index-table (which only writes keys and empty
> > values of 0-byte length - I assume that's the cheapest write I can do).
> >
> > Kind regards,
> > Em
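(A side note on the write path Em describes above: publishing one post then
means one Put into the blog table and exactly one Put with an empty value
into the index table. Roughly, and again untested -- blogId, title, postBody
and publicationDate are placeholders, blogTable/indexTable are the handles
from the sketches further up:)

    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    String blogId = "techcrunch";                         // placeholder values
    String title = "something";
    String postBody = "the blog post";
    long publicationDate = System.currentTimeMillis();
    long reverseTs = Long.MAX_VALUE - publicationDate;

    // 1) the post itself; the column name starts with the reverse timestamp,
    //    so the newest post sorts first inside the blogposts family
    Put postPut = new Put(Bytes.toBytes(blogId));
    postPut.add(Bytes.toBytes("blogposts"),
                Bytes.toBytes(reverseTs + "_" + title),
                Bytes.toBytes(postBody));
    blogTable.put(postPut);

    // 2) exactly one index entry per post, with an empty value --
    //    only the row key matters here
    Put indexPut = new Put(Bytes.toBytes(blogId + "_" + reverseTs));
    indexPut.add(Bytes.toBytes("dummy_column_family"),
                 Bytes.toBytes("dummy_column"),
                 HConstants.EMPTY_BYTE_ARRAY);
    indexTable.put(indexPut);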
> > Am 05.06.2012 08:07, schrieb NNever:
> >> 1. An Endpoint is a kind of Coprocessor; it was added in 0.92. You can
> >> think of it a little like a relational database's stored procedure:
> >> some logic that runs on the HBase server side. With it you may reduce
> >> your app's RPC calls or, as you said, reduce traffic.
> >> You can get some help on Coprocessors/Endpoints here:
> >> https://blogs.apache.org/hbase/entry/coprocessor_introduction
> >> 2. I am still a little confused about what exactly you want with this
> >> table structure (sorry for that, but my mother language is not
> >> English). You mean t1 is the original data of some objects, and t2
> >> keeps something about the objects in t1 (like logs: 10:11 em checks
> >> t1obj1; 10:13 em buys t1obj1; 10:30 em takes away t1obj1)?
> >> 3. You said 'This data is then sorted by the time part of the returned
> >> rowkeys to get the Top N of these.' Well, there may be no need to do
> >> that sort. HBase keeps data in dictionary order, so you just fetch N
> >> of them and they are already ordered.
> >> 4. I have not used HBase for long and am in fact still a noob at it :).
> >> I would be glad if anything here helps you.
> >>
> >> Best Regards,
> >> NN
> >>
> >> 2012/6/5 Em
> >>
> >>> Hi,
> >>>
> >>> what do you mean by endpoint?
> >>>
> >>> It would look more like
> >>>
> >>> T2 {
> >>>   rowkey: t1_id-(Long.MAX_VALUE - time)
> >>>   {
> >>>     family: qualifier = dummyDataSinceOnlyTheRowkeyMatters
> >>>   }
> >>> }
> >>>
> >>> For every t1_id associated with a specific object, one gets the newest
> >>> entry in the T2 table (newest in relation to the key, not the internal
> >>> timestamp of creation).
> >>> This data is then sorted by the time part of the returned rowkeys to
> >>> get the Top N of these.
> >>> And then you get N records from t1 again.
> >>>
> >>> At least, that's what I thought about, though I am not sure that this
> >>> is the most efficient way.
> >>>
> >>> Kind regards,
> >>> Em
> >>>
> >>> Am 05.06.2012 04:33, schrieb NNever:
> >>>> Does the schema look like this:
> >>>>
> >>>> T2 {
> >>>>   rowkey: rs-time row
> >>>>   {
> >>>>     family:qualifier = t1's row
> >>>>   }
> >>>> }
> >>>>
> >>>> Then you scan the newest 1,000 rows from T2, take each one's t1 row,
> >>>> and do 1,000 Gets from T1 for one page?
> >>>>
> >>>> 2012/6/5 NNever
> >>>>
> >>>>> '- I'd like to do the top N stuff on the server side to reduce
> >>>>> traffic, will this be possible?'
> >>>>>
> >>>>> Endpoint?
> >>>>>
> >>>>> 2012/6/5 Em
> >>>>>
> >>>>>> Hello list,
> >>>>>>
> >>>>>> let's say I have to fetch a lot of rows for a page request (say
> >>>>>> 1,000-2,000).
> >>>>>> The row keys are a composition of a fixed id of an object and a
> >>>>>> sequential, ever-increasing id. Salting those keys for balancing
> >>>>>> may be taken into consideration.
> >>>>>>
> >>>>>> I want to do a join like this one, expressed in SQL:
> >>>>>>
> >>>>>> SELECT t1.columns FROM t1
> >>>>>> JOIN t2 ON (t1.id = t2.id)
> >>>>>> WHERE t2.id = fixedID-prefix
> >>>>>>
> >>>>>> I know that HBase does not support that out of the box.
> >>>>>> My approach is to have all the fixed ids as columns of a row in t1.
> >>>>>> Selecting a row, I fetch those columns that are of interest to me,
> >>>>>> where each column contains a fixedID for t2.
> >>>>>> Now I do a scan on t2 for each fixedID, which should return exactly
> >>>>>> one value per fixedID (it's kind of a reverse-timestamp approach
> >>>>>> like in the HBase book).
> >>>>>> Furthermore, I am really only interested in the key itself. I don't
> >>>>>> care about the columns (t2 is more like an index).
> >>>>>> Having fetched a row per fixedID, I sort based on the sequential
> >>>>>> part of their keys and take the top N.
> >>>>>> For those top N I'll fetch data from t1.
> >>>>>>
> >>>>>> The use case is to fetch the top N most recent entities of t1 that
> >>>>>> are associated with a specific entity in t1, using t2 as an index.
> >>>>>> T2 has one extra benefit over t1: you can do range scans, if
> >>>>>> necessary.
> >>>>>>
> >>>>>> Questions:
> >>>>>> - Since this is triggered by a page request: will this return with
> >>>>>> low latency?
> >>>>>> - Is there a possibility to do those scans in a batch? Maybe I can
> >>>>>> combine them into one big scanner, using a custom filter for what
> >>>>>> I want?
> >>>>>> - Do you have thoughts on improving this type of request?
> >>>>>> - I'd like to do the top N stuff on the server side to reduce
> >>>>>> traffic, will this be possible?
> >>>>>> - I am not sure whether a Scan is really what I want. Maybe a
> >>>>>> multi-get combined with a RowFilter will fit my needs better?
> >>>>>>
> >>>>>> I am working hard on finding the best approach for mapping this
> >>>>>> m:n relation to an HBase schema - so any help is appreciated.
> >>>>>>
> >>>>>> Please note: I haven't written a single line of HBase code so far.
> >>>>>> Currently I am studying books, blog posts, slides and the mailing
> >>>>>> lists to learn more about HBase.
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> Kind regards,
> >>>>>> Em
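(One more note on the batching question in the original mail: the Gets of
step 4 in my list can at least be sent as one multi-get, so they don't cost
one round trip each. A rough, untested sketch -- top3BlogIds stands for the
3 blog ids picked in step 3, blogTable is the handle from the first snippet:)

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    List<String> top3BlogIds =
        Arrays.asList("wil_wheatons_blog", "sheldons_blog", "techcrunch");

    List<Get> gets = new ArrayList<Get>();
    for (String id : top3BlogIds) {
        Get g = new Get(Bytes.toBytes(id));
        g.addFamily(Bytes.toBytes("blogposts"));
        g.setFilter(new ColumnPaginationFilter(3, 0));  // newest 3 columns per blog
        gets.add(g);
    }
    Result[] newestPerBlog = blogTable.get(gets);       // batched in one call,
                                                        // grouped per region server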