Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <AANLkTi=aCw5m-ebw18jhCWLghhptbjT0uMYamkET__Xt@mail.gmail.com>
References: <AANLkTinZwuyBea=bM2EP0A2hEBDBuRCwutowmOqWWZ5D@mail.gmail.com>
 <AANLkTikyLxgr+-ymLRzhtw+wo0dWgYZ2z5HFNH0hbfpK@mail.gmail.com>
 <AANLkTi=KhTOzhbhGfHYqfCTvV8CTRDqoLRRm9UcHApqy@mail.gmail.com>
 <AANLkTinp6dGHuq5PThdn8VXi+B2PmtKWEaW4K43Dmp+t@mail.gmail.com>
 <AANLkTin5cFdwEszJDMw0_recTGupVgg3-2unuBP=uAsi@mail.gmail.com>
 <AANLkTin1RXkJ0dVHOMVHtpYHjyeCakBxw8QbnhE3+YTx@mail.gmail.com>
 <AANLkTikoGnHPCiDpTpZwz6T+gB45dt0-vT4rqTRKU8-3@mail.gmail.com>
 <AANLkTinW6tPAB0kJg4LS4kMtNmrs8_wiVdAbLxhVGhGv@mail.gmail.com>
 <AANLkTi==4oL-m06yx1bewoJr+LU9-auYKLDB+hn5MM4u@mail.gmail.com>
 <AANLkTi=oxOaqXz9YAQXMkzFmCWQcjFF_Fvgwa0MwV6v+@mail.gmail.com>
 <AANLkTimxqP1GXOXTeXuWPc8dKHdkzi1PWai9oi3Vcx0M@mail.gmail.com>
 <AANLkTi=QhiQDgOsArUFsAEcgPcAWtObqPYvTNLYcpgRW@mail.gmail.com>
 <AANLkTi=aCw5m-ebw18jhCWLghhptbjT0uMYamkET__Xt@mail.gmail.com>
From: Bruno Dumon <bruno@outerthought.org>
Date: Mon, 14 Feb 2011 18:28:06 +0100
Message-ID: <AANLkTikP9RCACfD0-BWgDA4aSkgy7QOwM73UnzSyEWLN@mail.gmail.com>
Subject: Re: HBase and Lucene for realtime search
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=0015175cba6623c742049c416098

--0015175cba6623c742049c416098
Content-Type: text/plain; charset=ISO-8859-1

On Mon, Feb 14, 2011 at 12:37 AM, Jason Rutherglen <jason.rutherglen@gmail.com> wrote:

> > Another issue is that maybe the scalability needs for search might be
> > different. An HBase region is always only active in one region server,
> > there are no active replicas, while often for search you need replicas
> > to scale, since a search will typically hit all partitions.
>
>
> Really?  That seems odd.
>

Yep, really. Replication happens only at the HDFS level. For HBase this is
not much of a problem as long as requests are not strongly skewed towards
one region (which requires users to choose their row keys carefully), but
for search it could be a real issue.
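
To make the difference concrete, here is a minimal sketch (plain Python,
not HBase code). The region boundaries, server names and function names
are all made up for illustration, and it assumes one search index per
region:

```python
from bisect import bisect_right

# Hypothetical layout: each region is served by exactly one server and
# covers the row keys from its start key up to the next start key.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs4"]


def server_for_row(row_key):
    """A get or put touches only the single region holding the row key."""
    idx = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[idx]


def servers_for_search():
    """Assumption: with one index per region, a search consults them all."""
    return list(REGION_SERVERS)


def load_per_server(row_lookups, searches):
    """Count the requests landing on each server for a mixed workload."""
    load = {server: 0 for server in REGION_SERVERS}
    for key in row_lookups:
        load[server_for_row(key)] += 1
    for _ in range(searches):
        for server in servers_for_search():
            load[server] += 1
    return load
```

With well-spread row keys the lookups divide over the servers, but every
search adds one request to every server, so adding regions does not
reduce the per-server search load. Only replicas would.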

Also, HBase and Lucene might differ in how many rows/documents they can
handle on one server, or in one region (an HBase region is typically only
256MB), leading to difficult choices (optimize region size for HBase vs.
for Lucene).
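
A rough back-of-the-envelope calculation shows why the region size
matters. The 256MB figure is the one mentioned above; the 1 TB table size
and the one-index-per-region layout are only assumptions for the sake of
the example:

```python
MB = 1024 * 1024


def region_count(table_bytes, region_bytes):
    """Number of regions needed to hold the table (ceiling division)."""
    return -(-table_bytes // region_bytes)


# Assumed example: a 1 TB table split into 256MB regions.
table_size = 1024 * 1024 * MB
regions = region_count(table_size, 256 * MB)
print(regions)
```

If each region carried its own index, that would be one small index per
region for a search to visit, which is probably far more and far smaller
indexes than you would choose when sizing for Lucene alone.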


> > to be the main action and all that follows just secondary side-effects
> > (i.e. there's no rollback).
>
> I think inside a Coprocessor you could block the HBase 'commit' until
> a successful updateDoc call to Lucene (which is only an update to RAM
> anyways)?
>

Yes, that should work. But doesn't it assume that the index is updated
synchronously with the HBase row? I can imagine this will sometimes be an
issue, e.g. if it involves expensive content extraction (Tika) or
analysis.
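
A toy sketch of the two options (plain Python, not the coprocessor or
Lucene API; the Store class and its methods are hypothetical). In
synchronous mode the put only returns once the index is updated; in
asynchronous mode the put returns immediately and the expensive work
happens later, at the cost of searches lagging behind:

```python
from collections import deque


class Store:
    """Stand-in for a table plus its search index."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.rows = {}          # row key -> text
        self.index = {}         # term -> set of row keys
        self.pending = deque()  # row keys waiting to be indexed

    def put(self, row_key, text):
        self.rows[row_key] = text
        if self.synchronous:
            # The write does not complete until the index is updated.
            self._index_row(row_key)
        else:
            # The write completes now; indexing is deferred.
            self.pending.append(row_key)

    def _index_row(self, row_key):
        # Placeholder for extraction and analysis. For brevity this
        # never removes stale terms when a row is overwritten.
        for term in self.rows[row_key].lower().split():
            self.index.setdefault(term, set()).add(row_key)

    def drain(self):
        """Index everything that is still pending."""
        while self.pending:
            self._index_row(self.pending.popleft())

    def search(self, term):
        return sorted(self.index.get(term.lower(), set()))
```

In asynchronous mode a row is not searchable until drain() has run, which
is the trade-off for keeping slow extraction off the write path.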

BTW, something we do in Lily, and which might be interesting to think about
in this context as well, is denormalization: the Lucene document of an
HBase row also stores information from related (linked) rows. This means
that, when one row changes, you need to find out which other rows
denormalize info from this row, and update the Lucene documents of those
rows as well. Just bringing this up as a random feature to think about ;-)
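
To illustrate the bookkeeping this needs, here is a minimal sketch. It is
not how Lily implements it; the class, the field names and the single
level of linking are all assumptions made for the example. The idea is to
keep a reverse map from each row to the rows that embed its data:

```python
from collections import defaultdict


class DenormalizingIndexer:
    def __init__(self):
        self.rows = {}                      # row key -> {"title", "links"}
        self.dependents = defaultdict(set)  # row key -> rows embedding it
        self.documents = {}                 # row key -> index document

    def put(self, row_key, title, links=()):
        """Store a row and return the keys whose documents were rebuilt."""
        old = self.rows.get(row_key)
        if old:
            # Forget the links of the previous version of this row.
            for target in old["links"]:
                self.dependents[target].discard(row_key)
        self.rows[row_key] = {"title": title, "links": list(links)}
        for target in links:
            self.dependents[target].add(row_key)
        # Rebuild this row's document and those that denormalize from it.
        reindexed = [row_key] + sorted(self.dependents[row_key])
        for key in reindexed:
            self._build_document(key)
        return reindexed

    def _build_document(self, row_key):
        row = self.rows[row_key]
        self.documents[row_key] = {
            "title": row["title"],
            # Denormalized copy of the linked rows' titles.
            "linked_titles": [
                self.rows[t]["title"] for t in row["links"] if t in self.rows
            ],
        }
```

For example, if a "book" row links to an "author" row, updating the
author returns both keys, because the book's document carries a copy of
the author's title and has to be rebuilt too.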

-- 
Bruno Dumon
Outerthought
http://outerthought.org/

--0015175cba6623c742049c416098--