lucene-dev mailing list archives

From Jeremy Volkman <jvolk...@gmail.com>
Subject Re: Filtering documents out of IndexReader
Date Wed, 15 Apr 2009 01:25:58 GMT
I'll probably end up using a filtered IndexSearcher, but let me try to take
a step back and explain what I'm trying to do since it relates to a lot of
recent development in trunk (this probably belongs under java-user now).

We use Lucene in combination with MySQL to store data in a legacy homegrown
CMS. All data is stored as key/value pairs, much like Lucene's Documents,
and all querying is done through Lucene. One requirement that much of the
CMS has been built with is that the search index provides database-like
write/query consistency (read: as soon as an item is added, updated or
deleted, that is reflected in queries). This is obviously much improved with
NRT search.

So my current strategy is very similar to what Jason has going on in
LUCENE-1313. I've got a disk index, a RAM index, and 0..n indexes in between
waiting to be flushed to disk (indexes are pushed onto the flush queue when
they hit some predefined size -- 5mb).
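
Roughly, the rotation looks like the sketch below. Everything here (ramDir,
ramWriter, flushQueue, the analyzer, the exact size check) is illustrative
rather than our actual code:

// Sketch only: rotate the active RAM index onto the flush queue once it
// grows past ~5mb, and start a fresh one in its place.
private RAMDirectory ramDir;
private IndexWriter ramWriter;                 // writes against ramDir
private final List<IndexWriter> flushQueue = new LinkedList<IndexWriter>();

private synchronized void maybeRotate() throws IOException {
  if (ramDir.sizeInBytes() + ramWriter.ramSizeInBytes() > 5 * 1024 * 1024) {
    flushQueue.add(ramWriter);                 // now waiting to be flushed to disk
    ramDir = new RAMDirectory();
    ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
  }
}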

This is all very pretty and straightforward on the whiteboard, but there
are a number of subtleties that pop up during implementation. Each time the
index is updated, the current IndexReader is marked as needing replacement.
The next time a reader is needed (for a search, generally), I have to gather
up all of the indexes in a correct state and get current readers for them.
Since we're going for consistency, this (currently) means blocking out
writes while I'm creating a reader (there may be some more complicated and
efficient way to go about this). I return a MultiReader including: an
IndexReader for the disk index, IW.getReader() for the current RAM index,
IW.getReader() for all queued RAM indexes (including one, if any, currently
being flushed to disk).
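
Concretely, the assembly looks something like this (diskReader, ramWriter,
and flushQueue are illustrative names for the pieces described above; the
surrounding synchronization is what actually blocks writers):

// Sketch of building the combined view for searching.
IndexReader[] subReaders = new IndexReader[2 + flushQueue.size()];
int i = 0;
subReaders[i++] = diskReader;               // plain reader on the disk index
subReaders[i++] = ramWriter.getReader();    // current RAM index
for (IndexWriter queued : flushQueue) {     // queued RAM indexes, including
  subReaders[i++] = queued.getReader();     // one (if any) currently flushing
}
IndexReader current = new MultiReader(subReaders, true); // closes subs on close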

I was originally using an IndexWriter and IW.getReader() for the disk index
as well, but to maintain consistency I had to block reader creation while
writing an index out to disk. Since IW doesn't support adding and deleting a
set of documents atomically, a searching thread could otherwise call
IW.getReader() while I'm in the middle of adding and removing documents from
the disk writer. I could possibly use addIndexesNoOptimize, but the bit
about it possibly requiring 2x the index space scared me away.
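
To make the race concrete, the flush-to-disk path with the IW approach was
roughly the following (the "id" field and updatedDocs are illustrative);
updateDocument is atomic per call, but the batch as a whole is not:

// Another thread calling diskWriter.getReader() between iterations of this
// loop sees a partially applied batch, which breaks our write/query
// consistency requirement.
for (Document doc : updatedDocs) {
  diskWriter.updateDocument(new Term("id", doc.get("id")), doc);
}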

So now I'm using a plain IndexReader for the disk index. As in LUCENE-1313,
each time something is updated or removed from the primary RAM index, I have
to suppress the same content in all of the queued RAM indexes and the disk
index. For the small RAM indexes, I'm doing this with
IndexWriter.deleteDocuments + IW.getReader(). For the disk index, I'm keeping
a BitVector of filtered docs and searching for docs in the current disk
IndexReader that match the provided updated/deleted Terms. This is why I was
looking for a filterable IndexReader.
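
In case it helps, here is a trimmed-down sketch of the wrapping reader. It
uses a java.util.BitSet instead of the BitVector, only handles termDocs(),
and isn't the actual class (numDocs(), docFreq(), termPositions(), and, per
1483, getSequentialSubReaders() would all need the same treatment):

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;

public class SuppressedIndexReader extends FilterIndexReader {

  // Doc ids to hide, in the wrapped reader's doc id space.
  private final BitSet suppressed;

  public SuppressedIndexReader(IndexReader in, BitSet suppressed) {
    super(in);
    this.suppressed = suppressed;
  }

  public boolean isDeleted(int n) {
    return suppressed.get(n) || in.isDeleted(n);
  }

  public boolean hasDeletions() {
    return true;
  }

  public TermDocs termDocs() throws IOException {
    return new FilterTermDocs(in.termDocs()) {

      public boolean next() throws IOException {
        while (super.next()) {
          if (!suppressed.get(doc())) {
            return true;
          }
        }
        return false;
      }

      public boolean skipTo(int target) throws IOException {
        if (!super.skipTo(target)) {
          return false;
        }
        // If we landed on a suppressed doc, advance to the next visible one.
        return !suppressed.get(doc()) || next();
      }

      public int read(int[] docs, int[] freqs) throws IOException {
        // Iterate one doc at a time so suppressed docs are skipped even in
        // the bulk-read path used by TermScorer.
        int count = 0;
        while (count < docs.length && next()) {
          docs[count] = doc();
          freqs[count] = freq();
          count++;
        }
        return count;
      }
    };
  }
}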

Implementing it this way lets me write RAM indexes out to disk without
blocking readers, blocking them only when I need to remap any filtered docs
that may have been updated or deleted during the flushing process. I think
this may beat using a straight IW for my requirements, but I'm not positive
yet.

So I've currently got a SuppressedIndexReader that extends FilterIndexReader,
but due to LUCENE-1483 and LUCENE-1573 I had to implement
IndexReader.getFieldCacheKey() to get any sort of decent search performance,
which I'd rather not do since I'm aware it's only temporary.
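
For reference, the workaround is just this delegation, added to the
SuppressedIndexReader sketch above (getFieldCacheKey is the temporary hook
from 1483/1573):

// Share the FieldCache key with the wrapped reader so cache entries built
// against the underlying disk reader are reused rather than rebuilt for
// every new wrapping instance.
public Object getFieldCacheKey() {
  return in.getFieldCacheKey();
}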

So, I have a couple of questions:

Is it possible to perform a bunch of adds and deletes on an IW as a single
atomic action? Should I use addIndexesNoOptimize?

If I go the filtered searcher direction, my filter will have to be aware of
the portion of the MultiReader that corresponds to the disk index. Can I
assume that my disk index will populate the lower portion of doc id space if
it comes first in the list passed to the MultiReader constructor? The code
says yes but the docs don't say anything.
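
For what it's worth, my reading of the constructor is the sketch below
(illustrative, not the actual Lucene source): each sub-reader gets a
contiguous block of doc ids in the order the readers are passed in, so with
the disk reader first its docs occupy ids [0, diskReader.maxDoc()).

IndexReader[] subReaders = { diskReader, currentRamReader, queuedRamReader };
int[] starts = new int[subReaders.length];  // doc id base of each sub-reader
int base = 0;
for (int i = 0; i < subReaders.length; i++) {
  starts[i] = base;
  base += subReaders[i].maxDoc();
}
// A filter over the MultiReader could then confine itself to ids < starts[1].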

If you've followed any of what I've said and have some suggestions/comments,
they'd be much appreciated.

-Jeremy

On Thu, Apr 9, 2009 at 8:01 PM, Michael McCandless <lucene@mikemccandless.com> wrote:

> On Thu, Apr 9, 2009 at 7:02 PM, Jeremy Volkman <jvolkman@gmail.com> wrote:
>
> > I'm sure I can extend my wrapping reader to also wrap whatever is returned
> > by getSequentialSubReaders, however all of what I'm writing is already done
> > by IndexReader with respect to deletions. What if, instead of throwing
> > UnsupportedOperationExceptions, a read-only IndexReader did everything it
> > normally does with deletes up to the point of actually writing the .del
> > file. This would allow documents to be removed from the reader for the
> > lifetime of the reader, and seems like it might be a minimal change.
>
> Well... readOnly IR relies on its deletedDocs never being changed, to
> allow isDeleted to be unsynchronized.
>
> Is this only for searching?  Could you just use a Filter with your search?
>
> Or... you could make silly FSDirectory extension that pretends to
> write outputs but never does, and pass it to IR.open?
>
> Or maybe we should open up a way to discard pending changes in an IR
> (like IW.rollback).
>
> Or, with near real-time search (in trunk) you could 1) open IW with
> autoCommit=false, 2) make your pretend deletes, 3) get a near
> real-time reader from the IW (IW.getReader()), 4) do stuff with that
> reader, 5) call IW.rollback() to discard your changes when done, and
> close the reader.
>
> One drawback with using deletes "temporarily" (as your filter) is you
> won't be able to do any real deletes.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
