From: Bruno Dumon
Date: Mon, 14 Feb 2011 18:28:06 +0100
Subject: Re: HBase and Lucene for realtime search
To: user@hbase.apache.org

On Mon, Feb 14, 2011 at 12:37 AM, Jason Rutherglen <jason.rutherglen@gmail.com> wrote:

> > Another
> > issue is that maybe the scalability needs for search might be
> > different. An HBase region is always only active in one region server, there
> > are no active replicas, while often for search you need replicas to scale,
> > since a search will typically hit all partitions.
>
> Really? That seems odd.

Yep, really. Replication happens only at the HDFS level. For HBase, this is not much of a problem as long as requests are not strongly skewed towards one region (which requires users to choose row keys carefully), but for search it could be a real issue.

Also, HBase and Lucene might differ in how many rows/documents they can handle on one server or in one region (an HBase region is typically only 256 MB), leading to difficult choices (optimizing region size for HBase vs. for Lucene).

> > to be the main action and all what follows just secondary side-effects (i.e.
> > there's no rollback).
>
> I think inside a Coprocessor you could block the HBase 'commit' until
> a successful updateDoc call to Lucene (which is only an update to RAM
> anyways)?

Yes, that should work. But doesn't it assume that the index is updated synchronously with the HBase row? I can imagine this will sometimes be an issue, e.g. if the update involves expensive content extraction (Tika) or analysis.

BTW, something we do in Lily, and which might be interesting to think about in this context as well, is denormalization: the Lucene document for an HBase row also stores information from related (linked) rows. This requires that, when one row changes, you find out which other rows denormalize information from it and update the Lucene documents of those rows as well. Just bringing this up as a random feature to think about ;-)

--
Bruno Dumon
Outerthought
http://outerthought.org/
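
To make the denormalization point above a bit more concrete, here is a minimal in-memory sketch of the bookkeeping it implies. It is not how Lily implements it: plain dicts stand in for HBase rows and Lucene documents, and every name (`DenormalizingIndexer`, `put`, `dependents`) is hypothetical. It also assumes only one level of linking, with no transitive denormalization and no deletes.

```python
from collections import defaultdict


class DenormalizingIndexer:
    """Toy model: dicts stand in for HBase rows and Lucene documents."""

    def __init__(self):
        # row key -> {"fields": dict, "links": list of linked row keys}
        self.rows = {}
        # row key -> flattened "index document" (a dict)
        self.index = {}
        # row key -> keys of rows whose documents copy data from that row
        self.dependents = defaultdict(set)

    def put(self, key, fields, links=()):
        """Store a row, then reindex it and every row that denormalizes from it."""
        old = self.rows.get(key)
        if old is not None:
            # Forget reverse links recorded for the previous version of the row.
            for target in old["links"]:
                self.dependents[target].discard(key)
        self.rows[key] = {"fields": dict(fields), "links": list(links)}
        for target in links:
            self.dependents[target].add(key)

        # The changed row itself, plus all rows that embed its fields.
        to_reindex = [key] + sorted(self.dependents[key])
        for k in to_reindex:
            self._reindex(k)
        return to_reindex

    def _reindex(self, key):
        row = self.rows[key]
        doc = dict(row["fields"])
        # Copy the fields of each linked row into this row's document.
        for target in row["links"]:
            linked = self.rows.get(target)
            if linked is not None:
                for name, value in linked["fields"].items():
                    doc[f"{target}.{name}"] = value
        self.index[key] = doc


indexer = DenormalizingIndexer()
indexer.put("author1", {"name": "Ann"})
indexer.put("doc1", {"title": "Intro"}, links=["author1"])
# Changing the author row also forces a reindex of doc1.
changed = indexer.put("author1", {"name": "Ann B."})
```

The reverse map (`dependents`) is the piece that answers "which other rows denormalize info from this row"; in a real system it would itself have to be stored and kept consistent somewhere, which is part of what makes the feature non-trivial.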