lucene-dev mailing list archives

From Christoph Goller <gol...@detego-software.de>
Subject Re: PATCH: IndexWriter
Date Fri, 12 Sep 2003 08:33:10 GMT
I thought things over and I now think there are two possible options
for coping with the indexWriter.docCount() bug. I cannot decide this
alone. Maybe voting is needed.

Problem:

writer.docCount() adds up the docCount values from segmentInfos.
Note that currently only IndexWriter writes segmentInfos ("segments" file).
IndexReader only reads them. The problem is that segmentInfo.docCount
values are updated incorrectly in indexWriter.mergeSegments. Information
about deleted documents is ignored and therefore segmentInfo.docCount
values for new segments become too big and do not reflect the real size
of the new segments. This has two effects. First, writer.docCount()
becomes incorrect. Second, the merge process is controlled by incorrect
values for segment size, because the docCount values from segmentInfos
are what control the merge process.
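
To make the mismatch concrete, here is a minimal toy model of the
bookkeeping described above. It is not Lucene code: the Segment class and
both method names are hypothetical and only illustrate how a count that
ignores deletions drifts away from the real size of a merged segment.

```java
import java.util.Arrays;
import java.util.List;

public class MergeDocCountSketch {

    /** Hypothetical stand-in for one segment; not a Lucene class. */
    static final class Segment {
        final int maxDoc;   // document slots in the segment, deleted ones included
        final int deleted;  // slots that are marked as deleted

        Segment(int maxDoc, int deleted) {
            this.maxDoc = maxDoc;
            this.deleted = deleted;
        }
    }

    /** Bookkeeping as described in the problem: old counts are summed, deletions ignored. */
    static int recordedDocCount(List<Segment> merged) {
        int sum = 0;
        for (Segment s : merged) {
            sum += s.maxDoc;
        }
        return sum;
    }

    /** Real size of the new segment: merging drops the deleted documents. */
    static int actualDocCount(List<Segment> merged) {
        int sum = 0;
        for (Segment s : merged) {
            sum += s.maxDoc - s.deleted;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Ten segments of ten documents each, five deleted in every segment.
        Segment[] segments = new Segment[10];
        Arrays.fill(segments, new Segment(10, 5));
        List<Segment> toMerge = Arrays.asList(segments);

        System.out.println("recorded: " + recordedDocCount(toMerge));
        System.out.println("actual:   " + actualDocCount(toMerge));
    }
}
```

In this toy setup the recorded count is 100 while the merged segment
really holds 50 documents, which is the kind of gap that then feeds into
the merge decisions.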

Option (A)

This is the IndexWriter patch that I submitted. This patch has the effect
that segmentInfo.docCount values represent the real size of the segments.
Even if a document is deleted, it is still there until the segment gets
merged. For every segment the corresponding segmentInfo.docCount value
is the same value that a reader on this segment would deliver with
reader.maxDoc(). Of course this also means that for readers and writers
on the whole index reader.maxDoc() == writer.docCount().

Option (B)

This option leaves IndexWriter as it was. IndexReader has to be changed.
Instead of only reading segmentInfos ("segments" file) IndexReader would
have to write segmentInfos if documents have been deleted. I would do that
in reader.doClose. The effect would be that for every segment
segmentInfo.docCount would deliver the same value that a reader on this
segment would deliver with reader.numDocs(). For readers and writers on the
whole index we would have reader.numDocs() == writer.docCount(). Here
segmentInfo.docCount values represent the number of valid documents of a
segment, those documents that have not been deleted.
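
The difference between the two options can be sketched with a small toy
model. Again this is not Lucene code: Segment, docCountOptionA,
docCountOptionB and mergeAll are hypothetical names, and the scenario
(100 documents added, 50 deleted) is the one from the test discussed in
this thread.

```java
import java.util.ArrayList;
import java.util.List;

public class DocCountOptionsSketch {

    /** Hypothetical stand-in for one segment; not a Lucene class. */
    static final class Segment {
        final int maxDoc;  // document slots, deleted ones included
        int deleted;       // slots that are marked as deleted

        Segment(int maxDoc) {
            this.maxDoc = maxDoc;
        }

        int numDocs() {
            return maxDoc - deleted;
        }
    }

    /** Option (A): each per-segment count equals maxDoc. */
    static int docCountOptionA(List<Segment> index) {
        int sum = 0;
        for (Segment s : index) {
            sum += s.maxDoc;
        }
        return sum;
    }

    /** Option (B): each per-segment count equals numDocs. */
    static int docCountOptionB(List<Segment> index) {
        int sum = 0;
        for (Segment s : index) {
            sum += s.numDocs();
        }
        return sum;
    }

    /** Merging everything into one segment drops the deleted documents. */
    static List<Segment> mergeAll(List<Segment> index) {
        int live = 0;
        for (Segment s : index) {
            live += s.numDocs();
        }
        List<Segment> merged = new ArrayList<>();
        merged.add(new Segment(live));
        return merged;
    }

    /** 100 documents in ten segments, five deleted per segment. */
    static List<Segment> hundredAddedFiftyDeleted() {
        List<Segment> index = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            Segment s = new Segment(10);
            s.deleted = 5;
            index.add(s);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Segment> index = hundredAddedFiftyDeleted();
        System.out.println("A before merge: " + docCountOptionA(index));
        System.out.println("B before merge: " + docCountOptionB(index));

        List<Segment> merged = mergeAll(index);
        System.out.println("A after merge:  " + docCountOptionA(merged));
        System.out.println("B after merge:  " + docCountOptionB(merged));
    }
}
```

Under these assumptions the two options disagree only while deleted
documents are still physically present (100 versus 50); once everything
has been merged, both report 50.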

I am slightly in favour of option (A) since it is less work to do :-) and
it seems reasonable to use the real size of segments for controlling the
merge process. However, I can also implement option (B).

Christoph

Otis Gospodnetic wrote:
> Christoph,
> 
> Thank you for expanding the coverage of the test.
> However, this looks wrong to me:
> 
> -          assertEquals(50, writer.docCount());
> +          assertEquals(100, writer.docCount());
> 
> Aren't you trying to fix IndexWriter so that after adding 100 and
> deleting 50 documents, its docCount() method returns 50?
> The above suggests that the correct behaviour is to return 100, even
> though 50 have been deleted, and only 50 documents are left in the
> index.
> 
> Could you please clarify this for me, before I commit the patches to
> (Test)IndexWriter?
> 
> Thanks,
> Otis
> 
> 
> --- Christoph Goller <goller@detego-software.de> wrote:
> 

-- 
*****************************************************************
* Dr. Christoph Goller       Tel.:   +49 89 203 45734           *
* Detego Software GmbH       Mobile: +49 179 1128469            *
* Keuslinstr. 13             Fax.:   +49 721 151516176          *
* 80798 München, Germany     Email:  goller@detego-software.de  *
*****************************************************************

