lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Re: Some thoughts around the use of reader.isDeleted and hasDeletions
Date Tue, 09 Jun 2009 18:51:38 GMT
> I searched the code and was surprised to see isDeleted and hasDeletions
are not called from any search code.

It was weeded out over time, MatchAllDocsQuery for example used to call it.
I think it was to offer users (who are using isDeleted) a way to access
deleted docs without a performance hit.

In the future deletedDocs will be a standardized randomaccessfilter with an
modifiable API that can be used across any filter.  Basically genericizing
and modularizing the API.  For example, I believe deleted docs could be
stored in column stride fields (when it's completed)?

On Mon, Jun 8, 2009 at 1:07 PM, Shai Erera <serera@gmail.com> wrote:

> Hi
>
> I recently read CHANGES to learn more about the readOnly parameter
> IndexReader now supports, and came across LUCENE-1329 with a comment that
> isDeleted was made not synchronized if readOnly=true (e.g.
> ReadOnlyIndexReader), which can affect search code, as it is usually the
> bottleneck for search operations.
>
> I searched the code and was surprised to see isDeleted and hasDeletions are
> not called from any search code. Instead, the code, such as SegmentTermDocs,
> uses the deletedDocs instance directly. So in fact isDeleted wasn't the
> bottleneck (unless this was part of the changes in 1329). Anyway, doesn't
> matter, that's good !
>
> However, I did find out some indexing code whic calls these two, like
> SegmentMerger when it merges fields and term vectors. I think that we can
> improve that code by writing some specialized code for merging - if the
> reader has no deletions, there's no point checking for each document if
> there are deletions and/or if the document was deleted. In fact,
> SegmentMerger checks for each document: (1) whether the reader has deletions
> and if the document was deleted, (2) if the reader has a matching reader and
> (3) if checkAbort is not null.
>
> I have a couple of suggestions to simplify that code:
> 1. Create two specialized copyFieldsWithDeletions/copyFieldsNoDeletions to
> get rid of the unnecessary if (hasDeletions) check for each doc.
> 2. In these, check if the reader has matching reader or not, and execute
> the appropriate code in each case.
> 3. Also, check up front if checkAbort != null.
> 3.1 (3) duplicates the code - so perhaps instead I can create a dummy
> checkAbort, which does nothing? That way we'll always call
> checkAbort.work(). But this adds a method call, so I'm not sure.
>
> (same optimizations for mergeVectors()).
>
> In addition, I think something can be done w/ SegmentTermDocs. Either
> create a specialized STD based on whether it has deletions or not, or create
> a dummy BitVector which returns false for every check. That way we can
> eliminate the checks in each next(), skipTo(). Dummy BitVector will leave
> the file as-is, but will add a method call, so I think I lean towards the
> specialized STD. This can be as simple as making STD abstract, with a static
> create() method which creates the right instance.
>
> I believe there are other places where we can make such optimizations. If
> the overall direction seems right to you, I can open an issue and start to
> work on a patch. Currently I've patched SegmentMerger, and besides the class
> getting bigger, nothing bad happened (meaning all tests pass), though it
> might be interesting to check how it affects performance.
>
> What do you think?
>
> Shai
>

Mime
View raw message