lucene-dev mailing list archives

From "Michael McCandless" <luc...@mikemccandless.com>
Subject Re: Realtime Search
Date Thu, 25 Dec 2008 19:24:14 GMT
I think the necessary low-level changes to Lucene for real-time are
actually already well underway...

The biggest barrier is how we now ask for FieldCache values at the
Multi*Reader level.  This makes reopen cost catastrophic for a large
index.

Once we succeed in making FieldCache usage within Lucene
segment-centric, via LUCENE-1483 (sorting becomes segment-centric) and
LUCENE-831 (deprecate the old FieldCache API in favor of a
segment-centric or iteration API), we are most of the way there.  LUCENE-1231 (column
stride fields) should make initing the per-segment FieldCache much
faster, though I think that's a "nice to have" for real-time search
(because either 1) warming will happen in the BG, or 2) the segment is
tiny).
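To make the reopen-cost point concrete, here is a toy model (plain Java with hypothetical names, not Lucene's real FieldCache API): entries are keyed per segment, so reopening after a tiny flush only builds the cache for the new segment, whereas a cache keyed at the Multi*Reader level would be thrown away and rebuilt wholesale.

```java
import java.util.*;

// Toy model of a segment-centric field cache (hypothetical, not
// Lucene's real API).  Keying entries by segment means a reopen that
// adds one tiny segment only pays to build that segment's values.
class SegmentCentricCache {
    private final Map<String, int[]> bySegment = new HashMap<>();
    int builds = 0; // number of per-segment cache builds performed

    // Returns cached values for a segment, building them on first use.
    int[] getInts(String segment, int[] values) {
        return bySegment.computeIfAbsent(segment, s -> {
            builds++;               // stand-in for the expensive un-invert
            return values.clone();
        });
    }
}
```

After a reopen that adds one tiny segment, only that segment pays the build cost; the large warm segments are untouched.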

So then I think we should start with approach #2 (build real-time on
top of the Lucene core) and iterate from there.  Newly added docs go
into tiny segments, which IndexReader.reopen pulls in.  Replaced or
deleted docs record the delete against the right SegmentReader (and
LUCENE-1314 lets reopen carry those pending deletes forward, in RAM).
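The delete-carrying idea behind LUCENE-1314 can be sketched as a toy model (plain Java, hypothetical names, not the actual patch): pending deletes are recorded per segment as in-RAM bit sets, and reopen hands the same sets to the new reader for segments that did not change, so uncommitted deletes stay visible.

```java
import java.util.*;

// Toy model of reopen carrying in-RAM pending deletes forward (the
// idea behind LUCENE-1314, not its actual implementation).
class ReopenModel {
    // segment name -> deleted doc ids, held in RAM, not yet committed
    final Map<String, BitSet> pendingDeletes = new HashMap<>();

    void deleteDoc(String segment, int docId) {
        pendingDeletes.computeIfAbsent(segment, s -> new BitSet()).set(docId);
    }

    // Reopen: unchanged segments share the same pending-delete bit sets,
    // so deletes recorded before the reopen remain visible afterwards.
    ReopenModel reopen() {
        ReopenModel next = new ReopenModel();
        next.pendingDeletes.putAll(this.pendingDeletes);
        return next;
    }

    boolean isDeleted(String segment, int docId) {
        BitSet b = pendingDeletes.get(segment);
        return b != null && b.get(docId);
    }
}
```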

I would take the simple approach first: use ordinary SegmentReader on
a RAMDirectory for the tiny segments.  If that proves too slow, swap
in Memory/InstantiatedIndex for the tiny segments.  If that proves too
slow, build a reader impl that reads from DocumentsWriter RAM buffer.

One challenge is reopening after a big merge finishes... we'd need a
way to 1) allow the merge to be committed, then 2) start warming a new
reader in the BG, but 3) allow newly flushed segments to use the old
SegmentReaders reading the segments that were merged (because they are
still warm), and 4) once the new reader is warm, we decref the old
segments and use the new reader going forward.

Alternatively, and maybe simpler, a merge is not allowed to commit
until a new SegmentReader has been warmed against the newly merged
segment.
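This hand-off can be sketched with a small refcounting model (plain Java, hypothetical names, not IndexWriter internals): the merged segment's reader is fully warmed before the merge commits, and only then are the old, still-warm SegmentReaders decref'd.

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

// Toy refcounting model of committing a merge only after the new
// SegmentReader is warm (hypothetical names, not IndexWriter internals).
class MergeHandoff {
    static class SegReader {
        final String name;
        final AtomicInteger refCount = new AtomicInteger(1);
        boolean warm = false;
        SegReader(String name) { this.name = name; }
        void warmUp() { warm = true; }        // stand-in for cache warming
        void decRef() { refCount.decrementAndGet(); }
        boolean closed() { return refCount.get() == 0; }
    }

    // Searchers keep using the old (still warm) readers while the merged
    // segment's reader warms; only then do we swap and release the old ones.
    static SegReader commitMerge(List<SegReader> oldReaders, String merged) {
        SegReader newReader = new SegReader(merged);
        newReader.warmUp();                   // warm before the merge commits
        for (SegReader r : oldReaders) {      // release merged-away segments
            r.decRef();
        }
        return newReader;
    }
}
```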

I'm not sure how best to do this... we may need more info in
SegmentInfo[s] to track the genealogy of each segment, or something.
We may need to have IndexWriter give more info when it's modifying
SegmentInfos, eg we'd need the reader to access newly flushed segments
(IndexWriter does not write a new segments_N until commit).  Maybe
IndexWriter needs to warm readers... maybe IndexReader.open/reopen
needs to be given an IndexWriter and then access its un-flushed
in-memory SegmentInfos... not sure.  We'd need to fix
SegmentReader.get to provide single instance for a given segment.

I agree we'd want a specialized merge policy.  EG it should merge RAM
segments w/ higher priority, and probably not merge mixed RAM & disk
segments.
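A minimal sketch of that selection rule (plain Java, hypothetical names, not a real Lucene MergePolicy): RAM segments are always merged first, and a merge never mixes RAM and disk segments.

```java
import java.util.*;

// Toy merge selection that favors RAM segments and never mixes RAM and
// disk segments in one merge (hypothetical model, not Lucene's
// MergePolicy API).
class RealtimeMergeSelect {
    static class Seg {
        final String name; final boolean inRam;
        Seg(String name, boolean inRam) { this.name = name; this.inRam = inRam; }
    }

    // Pick the next merge: all RAM segments first (highest priority);
    // otherwise merge disk segments, but never a mixed set.
    static List<Seg> pickMerge(List<Seg> segs) {
        List<Seg> ram = new ArrayList<>(), disk = new ArrayList<>();
        for (Seg s : segs) (s.inRam ? ram : disk).add(s);
        if (ram.size() >= 2) return ram;   // merge tiny RAM segments eagerly
        if (disk.size() >= 2) return disk; // ordinary background merge
        return Collections.emptyList();    // nothing worth merging
    }
}
```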

Mike
Jason Rutherglen <jason.rutherglen@gmail.com> wrote:

> We've discussed realtime search before, it looks like after the next
> release we can get some sort of realtime search working.  I was going to
> open a new issue but decided it might be best to discuss realtime search on
> the dev list.
>
> Lucene can implement realtime search as the ability to add, update, or
> delete documents with latency in the sub 5 millisecond range.  A couple of
> different options are available.
>
> 1) Expose a rolling set of realtime readers over the memory index used by
> IndexWriter.  Requires incrementally updating field caches and filters, and
> it is somewhat unclear how IndexReader versioning would work (for example
> versions of the term dictionary).
> 2) Implement realtime search by incrementally creating and merging readers
> in memory.  The system would use MemoryIndex or InstantiatedIndex to quickly
> (more quickly than RAMDirectory) create indexes from added documents.  The
> in-memory indexes would be periodically merged in the background and,
> depending on RAM used, written to disk.  Each update would generate a new
> IndexReader or MultiSearcher that includes the new updates.  Field caches
> and filters could be cached per IndexReader according to how Lucene works
> today.  The downside of this approach is that indexing will not be as fast
> as #1 because of the in-memory merging, much like pre-2.3 Lucene, which
> merged in-memory segments using RAMDirectory.
>
> Are there other implementation options?
>
> A new patch would focus on providing in memory indexing as part of the core
> of Lucene.  The work of LUCENE-1483 and LUCENE-1314 would be used.  I am not
> sure if option #2 can become part of core if it relies on a contrib module?
> It makes sense to provide a new realtime oriented merge policy that merges
> segments based on the number of deletes rather than a merge factor.  The
> realtime merge policy would keep the segments within a minimum and maximum
> size in kilobytes to limit the time consumed by merging, which is
> assumed to occur frequently.
>
> LUCENE-1313, which includes a transaction log with rollback and was
> designed for distributed search, may be retired or have its components
> split out.
>
>
