lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Some thoughts around the use of reader.isDeleted and hasDeletions
Date Tue, 09 Jun 2009 15:52:58 GMT
> I recently read CHANGES to learn more about the readOnly parameter
> IndexReader now supports, and came across LUCENE-1329 with a comment that
> isDeleted was made not synchronized if readOnly=true (e.g.
> ReadOnlyIndexReader), which can affect search code, as it is usually the
> bottleneck for search operations.
>
> I searched the code and was surprised to see isDeleted and hasDeletions are
> not called from any search code. Instead, the code, such as SegmentTermDocs,
> uses the deletedDocs instance directly. So in fact isDeleted wasn't the
> bottleneck (unless this was part of the changes in 1329). Anyway, doesn't
> matter, that's good !

You're right!  (I had also thought it was more widely used). I guess
apps that iterate through many docs (calling document()) might hit
this?  Also, it's only in 2.9 that MatchAllDocsQuery doesn't use
isDeleted.  So I agree the benefit of using readOnly reader is
presumably not much, as of 2.9...

Actually: I think we should also change IndexReader.document to not
check if it's deleted?  (Renaming it to something like rawDocument(),
storedDocument(), something, in the process, and deprecating the old
one).

> However, I did find out some indexing code whic calls these two, like
> SegmentMerger when it merges fields and term vectors. I think that we can
> improve that code by writing some specialized code for merging - if the
> reader has no deletions, there's no point checking for each document if
> there are deletions and/or if the document was deleted. In fact,
> SegmentMerger checks for each document: (1) whether the reader has deletions
> and if the document was deleted, (2) if the reader has a matching reader and
> (3) if checkAbort is not null.
>
> I have a couple of suggestions to simplify that code:
> 1. Create two specialized copyFieldsWithDeletions/copyFieldsNoDeletions to
> get rid of the unnecessary if (hasDeletions) check for each doc.
>
> 2. In these, check if the reader has matching reader or not, and execute the
> appropriate code in each case.

I think this make sense.

> 3. Also, check up front if checkAbort != null.

Be careful: checkAbort needs to be called "fairly" frequently so
IndexWriter.close(false) doesn't take too long.

> 3.1 (3) duplicates the code - so perhaps instead I can create a dummy
> checkAbort, which does nothing? That way we'll always call
> checkAbort.work(). But this adds a method call, so I'm not sure.

I think a dummy impl is fine; it's only null during unit tests and
in IndexWriter.addIndexes(IndexReader[]).

> In addition, I think something can be done w/ SegmentTermDocs. Either create
> a specialized STD based on whether it has deletions or not, or create a
> dummy BitVector which returns false for every check. That way we can
> eliminate the checks in each next(), skipTo(). Dummy BitVector will leave
> the file as-is, but will add a method call, so I think I lean towards the
> specialized STD. This can be as simple as making STD abstract, with a static
> create() method which creates the right instance.
>
> I believe there are other places where we can make such optimizations. If
> the overall direction seems right to you, I can open an issue and start to
> work on a patch. Currently I've patched SegmentMerger, and besides the class
> getting bigger, nothing bad happened (meaning all tests pass), though it
> might be interesting to check how it affects performance.

I think this makes sense.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message