hbase-user mailing list archives

From Ted Dunning <tdunn...@maprtech.com>
Subject Re: HBase and Lucene for realtime search
Date Sat, 12 Feb 2011 02:20:49 GMT
Go for it!

On Fri, Feb 11, 2011 at 4:44 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> > Michi's stuff uses flexible indexing with a zero lock architecture.  The
> > speed *is* much higher.
>
> The speed's higher, and there isn't much Lucene left there either, as
> I believe it was built specifically for the 140-character use case
> (e.g., not the general use case).  I don't think most indexes can be
> compressed to fit entirely in RAM on a single server?  The Twitter use
> case isn't one that the HBase RT search solution is useful for?
>
> > If you were to store entire posting vectors as values with terms as keys,
> > you might be OK.  Very long posting vectors or add-ons could be added
> > using a key+serial number trick.
>
> This sounds like the right approach to try.  Also, the Lucene terms
> dict is sorted anyway, so moving the terms into HBase's sorted keys
> probably makes sense.
>
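A minimal sketch of the layout discussed above, assuming a plain Python dict stands in for an HBase table (the chunk size, key format, and function names here are illustrative, not HBase client API). The key+serial number trick chunks a long posting vector across several rows whose keys sort contiguously, so a prefix scan reassembles it:

```python
# Illustrative sketch: term -> posting-vector layout on a sorted key space.
# A plain dict stands in for an HBase table; HBase would keep the row
# keys sorted for us, so here we sort explicitly when scanning.

CHUNK = 4  # max doc IDs per row value (tiny, for illustration)

def put_postings(table, term, doc_ids):
    """Store a posting vector for `term`, split into fixed-size chunks
    keyed as term#0000, term#0001, ... so chunks sort contiguously."""
    for serial, start in enumerate(range(0, len(doc_ids), CHUNK)):
        row_key = f"{term}#{serial:04d}"
        table[row_key] = doc_ids[start:start + CHUNK]

def get_postings(table, term):
    """Reassemble the posting vector with a prefix scan, the way an
    HBase client would scan all rows starting with `term#`."""
    prefix = f"{term}#"
    out = []
    for key in sorted(table):  # HBase stores keys sorted; a dict does not
        if key.startswith(prefix):
            out.extend(table[key])
    return out

table = {}
put_postings(table, "hbase", [1, 3, 7, 9, 12, 15])
put_postings(table, "lucene", [2, 3, 9])
```

Because chunk keys share the term prefix, a long posting vector splits naturally across rows, which is also what would let it split along region boundaries.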
> > For updates, speed would only be acceptable if you batch up a
> > lot of updates or possibly if you build in a value-append function as a
> > co-processor.
>
> Hmm... I think the main issue would be the way Lucene implements
> deletes (e.g., today as a BitVector).  I think we'd keep that
> functionality.  The new docs/updates would be added to the
> in-RAM buffer.  I think there'd be a RAM-size-based flush as there is
> today.  Where that'd be flushed to is an open question.
>
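One way to picture the write path Jason describes, as a sketch under assumed names (a Python set stands in for Lucene's BitVector, the size threshold is a document count rather than bytes, and the flush destination is deliberately left abstract, since the thread leaves it open):

```python
# Illustrative sketch of the write path: new docs go to an in-RAM
# buffer, deletes are tracked mark-only in a per-index bit vector
# (as Lucene's BitVector does), and the buffer is flushed once it
# exceeds a threshold.  Where flushed segments land is left open.

class RamBuffer:
    def __init__(self, flush_at=3):
        self.docs = []            # buffered (doc_id, text) pairs
        self.deleted = set()      # stand-in for a BitVector of deleted ids
        self.flushed_segments = []
        self.flush_at = flush_at

    def add(self, doc_id, text):
        self.docs.append((doc_id, text))
        if len(self.docs) >= self.flush_at:
            self.flush()

    def delete(self, doc_id):
        self.deleted.add(doc_id)  # mark-only; space reclaimed at merge time

    def flush(self):
        """Seal the current buffer as an immutable segment."""
        self.flushed_segments.append(list(self.docs))
        self.docs = []

buf = RamBuffer(flush_at=3)
for i in range(5):
    buf.add(i, f"doc {i}")
buf.delete(1)
```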
> I think the key advantage of the RT + HBase architecture is that the
> index would live alongside HBase columns, and so all the other scaling
> problems (especially those related to scaling RT, such as synchronization
> of distributed data and updates) go away.
>
> A distributed query would remain the same, e.g., it'd hit N servers?
>
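A toy scatter-gather sketch of that pattern, with plain closures standing in for the N servers (all names illustrative; a real fan-out would be parallel RPC to region servers, not a loop):

```python
# Illustrative sketch: scatter a term query to N "servers", then gather
# and merge the per-server hits by score.  Each server holds one shard.

def make_server(docs):
    """A 'server' holding a shard; answers term queries locally with a
    trivial term-frequency score."""
    def search(term):
        return [(doc_id, text.count(term)) for doc_id, text in docs
                if term in text]
    return search

shards = [
    make_server([(1, "hbase hbase lucene"), (2, "realtime search")]),
    make_server([(3, "hbase region"), (4, "lucene index lucene")]),
]

def distributed_search(term, servers, top_k=3):
    hits = []
    for search in servers:                   # scatter (RPC in practice)
        hits.extend(search(term))
    hits.sort(key=lambda h: (-h[1], h[0]))   # gather: merge by score
    return hits[:top_k]
```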
> In addition, Lucene offers a wide variety of new query types which
> HBase'd get in realtime for free.
>
> On Fri, Feb 11, 2011 at 4:13 PM, Ted Dunning <tdunning@maprtech.com>
> wrote:
> > On Fri, Feb 11, 2011 at 3:50 PM, Jason Rutherglen <
> > jason.rutherglen@gmail.com> wrote:
> >
> >> > I can't imagine that the speed achieved by using HBase would be even
> >> > within orders of magnitude of what you can do in Lucene 4 (or even 3).
> >>
> >> The indexing speed in Lucene hasn't changed in quite a while; are you
> >> saying HBase would somehow be overloaded?  That doesn't seem to jibe
> >> with the sequential writes HBase performs?
> >>
> >
> > Michi's stuff uses flexible indexing with a zero lock architecture.  The
> > speed *is* much higher.
> >
> > The real problem is that hbase repeats keys.
> >
> > If you were to store entire posting vectors as values with terms as keys,
> > you might be OK.  Very long posting vectors or add-ons could be added
> > using a key+serial number trick.
> >
> > Short queries would involve reading and merging several posting vectors.
> > In that mode, query speeds might be OK, but there isn't a lot of Lucene
> > left at that point.  For updates, speed would only be acceptable if you
> > batch up a lot of updates or possibly if you build in a value-append
> > function as a co-processor.
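For illustration, the read-and-merge step for a short two-term AND query might look like this two-pointer intersection of sorted doc-ID vectors (a sketch, not Lucene or HBase code; the posting data is made up):

```python
# Illustrative sketch of the read path: a short query reads one posting
# vector per term and merges them.  Here an AND query intersects two
# sorted doc-ID vectors with a classic two-pointer walk.

def intersect(a, b):
    """Merge-intersect two sorted posting vectors (AND semantics)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

postings = {
    "hbase":  [1, 3, 7, 9, 12],
    "lucene": [3, 4, 9, 10],
}
```

Each input vector here is what a prefix scan over the term's chunked rows would return, so the merge cost is linear in the posting lengths read.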
> >
> >
> >
> >> The speed of indexing is a function of creating segments; with
> >> flexible indexing, the underlying segment files (and postings) may be
> >> significantly altered from the default file structures, e.g., placed
> >> into HBase in various ways.  The posting lists could even be split
> >> along with HBase regions?
> >>
> >
> > Possibly.  But if you use term + counter and post vectors of limited
> > length you might be OK.
> >
>
