lucene-dev mailing list archives

From "Jason Rutherglen" <jason.rutherg...@gmail.com>
Subject Re: SegmentReader with custom setting of deletedDocs, single reusable FieldsReader
Date Wed, 25 Jun 2008 11:47:09 GMT
I understand what you are saying.  I am not sure it is worth "clearly quite
a bit more work", given how easy it is to simply have more control over the
IndexReader deletedDocs BitVector.  That seems like a feature that should be
in there anyway, perhaps even allowing SortedVIntList to be used.  The other
issue with integrating too closely with IndexWriter is that I am not sure
how to integrate the realtime document additions, which are handled best by
InstantiatedIndex.  When merging needs to happen in Ocean,
IndexWriter.addIndexes(IndexReader[] readers) is used to merge
SegmentReaders and InstantiatedIndexReaders.
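
To make the "more control over deletedDocs" idea concrete, here is a minimal
sketch, not the actual patch.  SketchSegmentReader and setDeletedDocs are
hypothetical names, and java.util.BitSet stands in for Lucene's BitVector;
nothing here is Lucene's real SegmentReader API.

```java
import java.util.BitSet;

// Hypothetical stand-in for a segment reader; NOT Lucene's SegmentReader.
final class SketchSegmentReader {
    private final int maxDoc;
    // volatile so searching threads see a newly installed bit set
    private volatile BitSet deletedDocs;

    SketchSegmentReader(int maxDoc) {
        this.maxDoc = maxDoc;
        this.deletedDocs = new BitSet(maxDoc);
    }

    // Install deletes that were computed elsewhere (assumed entry point).
    // The set is copied so later changes by the caller do not leak in.
    void setDeletedDocs(BitSet deleted) {
        if (deleted.length() > maxDoc) {
            throw new IllegalArgumentException("docId beyond maxDoc");
        }
        this.deletedDocs = (BitSet) deleted.clone();
    }

    boolean isDeleted(int docId) {
        return deletedDocs.get(docId);
    }

    // Live documents = all documents minus the deleted ones.
    int numDocs() {
        return maxDoc - deletedDocs.cardinality();
    }
}
```

Swapping the whole bit set in one assignment means a search in progress sees
either the old deletes or the new ones, never a half-updated set.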

One thing I do not understand about IndexWriter deletes is why it does not
reuse an already open TermInfosReader with the tii loaded.  Isn't this
slower than deleting using an already open IndexReader?

In any case, setting deletedDocs in SegmentReader using the patch given
seems to work quite well in Ocean now.  I think long term there is probably
some way to integrate more with IndexWriter, but really that is something
more in line with removing the concept of IndexReader and IndexWriter and
creating an IndexReaderWriter class.
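
For reference, the in-memory deletes with periodic checkpointing described in
the quoted message below could look roughly like this.  It is a sketch under
assumptions: CheckpointedDeletes is a hypothetical class, an in-memory list
stands in for the on-disk transaction log, a cloned java.util.BitSet stands
in for the BitVector written at a checkpoint, and truncating the log at each
checkpoint is my assumption rather than something Ocean is confirmed to do.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch; none of these names come from Lucene or Ocean.
final class CheckpointedDeletes {
    private final BitSet deleted = new BitSet();          // live in-memory deletes
    private final List<Integer> log = new ArrayList<>();  // stand-in for the transaction log
    private BitSet lastCheckpoint = new BitSet();         // stand-in for the flushed BitVector
    private final int checkpointEvery;
    private int sinceCheckpoint = 0;
    private int checkpointCount = 0;

    CheckpointedDeletes(int checkpointEvery) {
        this.checkpointEvery = checkpointEvery;
    }

    // Record the delete in the log first, then apply it in memory.
    synchronized void delete(int docId) {
        log.add(docId);
        deleted.set(docId);
        if (++sinceCheckpoint >= checkpointEvery) {
            checkpoint();
        }
    }

    // Persist the whole bit set once, instead of once per delete.
    synchronized void checkpoint() {
        lastCheckpoint = (BitSet) deleted.clone();
        log.clear(); // assumption: entries covered by the checkpoint can go
        sinceCheckpoint = 0;
        checkpointCount++;
    }

    // Recovery: start from the last checkpoint and replay the log.
    synchronized BitSet recover() {
        BitSet recovered = (BitSet) lastCheckpoint.clone();
        for (int docId : log) {
            recovered.set(docId);
        }
        return recovered;
    }

    synchronized int checkpointCount() {
        return checkpointCount;
    }
}
```

With checkpointEvery set to 3, four deletes cause one checkpoint write rather
than four, and recover() still returns all four docIDs because the last one
is replayed from the log.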

On Wed, Jun 25, 2008 at 6:29 AM, Michael McCandless <lucene@mikemccandless.com> wrote:

>
> Jason Rutherglen wrote:
>
>> One of the bottlenecks I have noticed testing Ocean realtime search is the
>> delete process which involves writing several files for each possibly single
>> delete of a document in SegmentReader.  The best way to handle the deletes
>> is to simply keep them in memory without flushing them to disk, saving on
>> writing out an entire BitVector per delete.  The deletes are saved in the
>> transaction log which is replayed on recovery.
>>
>> I am not sure of the best way to approach this; perhaps by creating a
>> custom class that inherits from SegmentReader.  It could reuse the existing
>> reopen and also provide a way to set the deletedDocs BitVector.  Also it
>> would be able to reuse FieldsReader by providing locking around FieldsReader
>> for all SegmentReaders of the segment to use.  Otherwise in the current
>> architecture each new SegmentReader opens a new FieldsReader which is
>> non-optimal.  The deletes would be saved to disk but instead of per delete,
>> periodically like a checkpoint.
>>
>
> Or ... maybe you could do the deletes through IndexWriter (somehow, if we
> can get docIDs properly) and then SegmentReaders could somehow tap into the
> buffered deleted docIDs that IndexWriter already maintains.  IndexWriter is
> already doing this buffering and flush/commit anyway.
>
> We've also discussed at one point creating an IndexReader impl that
> searches the RAM buffer that DocumentsWriter writes to when adding
> documents.  I think it's easier than it sounds, on first glance, because
> DocumentsWriter is in fact writing the postings in nearly the same format as
> is used when the segment is flushed.
>
> So if we had this IndexReader impl, plus extended SegmentReader so it could
> tap into pending deletes buffered in IndexWriter, you could get realtime
> search without having to use Directory as an intermediary.  Though, it is
> clearly quite a bit more work :)
>
> Mike
