hbase-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: HBase and Lucene for realtime search
Date Mon, 14 Feb 2011 20:04:32 GMT
As you like.

My experience is that analyzing a document takes longer than I want to make
the user wait when inserting it.  I almost always prefer write-behind
indexing of some kind.
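
To make the write-behind idea concrete, here is a minimal sketch in Python. It is
not HBase or Lucene code: the in-memory queue stands in for the write log, the
dict stands in for the index, and the names WriteBehindIndexer, analyze, put and
flush are all hypothetical. The point is only that put() returns before any
analysis happens, and that the indexer drains whatever has accumulated, so batches
grow when it falls behind.

```python
import queue
import threading


class WriteBehindIndexer:
    """Accepts writes immediately and indexes them later on a background thread."""

    def __init__(self, analyze, max_batch=1000):
        self._analyze = analyze      # hypothetical (possibly slow) analysis step
        self._max_batch = max_batch
        self._log = queue.Queue()    # stands in for the write log
        self.index = {}              # term -> set of row keys (toy inverted index)
        self.batch_sizes = []        # recorded so batch growth can be inspected
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def put(self, row_key, text):
        """Return as soon as the write is logged; no analysis happens here."""
        self._log.put((row_key, text))

    def _run(self):
        while True:
            batch = [self._log.get()]    # block until at least one write exists
            # Drain whatever else has piled up, so batches grow under load.
            while len(batch) < self._max_batch:
                try:
                    batch.append(self._log.get_nowait())
                except queue.Empty:
                    break
            for row_key, text in batch:
                for term in self._analyze(text):
                    self.index.setdefault(term, set()).add(row_key)
            self.batch_sizes.append(len(batch))
            for _ in batch:
                self._log.task_done()    # mark entries done only after indexing

    def flush(self):
        """Block until everything logged so far has been indexed."""
        self._log.join()
```

A caller would do something like idx = WriteBehindIndexer(lambda t: t.lower().split()),
call idx.put("row1", "some text") on the insert path, and only call idx.flush()
when it actually needs the index to be caught up (in a test, for example).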

On Mon, Feb 14, 2011 at 11:28 AM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> > The analysis can be very slow if you are doing Tika things and named
> > entity extraction and PDF interpretation and so on.
>
> I'd consider those different/separate use cases where likely realtime
> isn't important?  If large [static] documents are being stored in
> HBase why would expediency be required?
>
> On Mon, Feb 14, 2011 at 11:18 AM, Ted Dunning <tdunning@maprtech.com>
> wrote:
> > The analysis can be very slow if you are doing Tika things and named
> > entity extraction and PDF interpretation and so on.
> >
> > On Mon, Feb 14, 2011 at 11:09 AM, Jason Rutherglen <
> > jason.rutherglen@gmail.com> wrote:
> >
> >> The older versions of Lucene NRT indexing are slow; the newer version
> >> with RT will be as fast as Lucene's batch indexing is today, which I'm
> >> guessing will be fast enough for many/most users?  E.g., it's simply
> >> analyzing and throwing the data into a RAM buffer (there's no IO or
> >> segment merging happening).
> >>
> >> On Mon, Feb 14, 2011 at 10:57 AM, Ted Dunning <tdunning@maprtech.com>
> >> wrote:
> >> > I would find that unacceptable for many systems I have worked on.
> >> > Lucene update-behind would be fine, but holding up the insert until
> >> > all of the Lucene stuff happened would not be acceptable.
> >> >
> >> > I would much rather that Lucene update from the write log in batches
> >> > that are as big as needed to catch/keep up.
> >> >
> >> > On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <
> >> > jason.rutherglen@gmail.com> wrote:
> >> >
> >> >> > Yes, that should work. But doesn't it assume that the index is
> >> >> > updated synchronously with the HBase row? I can imagine this will
> >> >> > sometimes be an issue, e.g. if it would involve performing
> >> >> > expensive content extraction (Tika) or analysis.
> >> >>
> >> >> I don't understand here.  You mean that the delay in indexing a
> >> >> document will adversely affect the HBase row insert because it's all
> >> >> in the same transaction?  I think that's fine, e.g., it's just how
> >> >> the system would work?
> >> >
> >>
> >
>
