lucene-dev mailing list archives

From Doug Cutting <cutt...@lucene.com>
Subject Re: PATCH: IndexWriter
Date Fri, 12 Sep 2003 17:31:44 GMT
I prefer option (A).  SegmentInfo.docCount is the number of documents 
stored in the segment, regardless of deletions, unlike 
IndexReader.docCount() which is the number of non-deleted documents in 
an index.  So perhaps SegmentInfo.docCount should be renamed, but I 
don't think we should change its semantics.  Deletions should not be 
considered when deciding whether to automatically merge segments, which 
is another place where SegmentInfo.docCount is used.

Doug
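
To make the distinction discussed in this thread concrete, here is a minimal Java sketch. It is a toy model, not Lucene code: the Segment class, merge(), and the two docCount helpers are hypothetical names invented for illustration. It assumes, as the thread describes, that deleted documents remain stored in a segment until that segment is merged.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class DocCountSketch {
    /** Toy stand-in for a segment: a count of stored slots plus a deletion bitmap. */
    static final class Segment {
        final int maxDoc;                     // documents stored, deleted or not
        final BitSet deleted = new BitSet();  // marks deleted document ids

        Segment(int maxDoc) { this.maxDoc = maxDoc; }

        void delete(int docId) { deleted.set(docId); }

        /** Live (non-deleted) documents in this segment. */
        int numDocs() { return maxDoc - deleted.cardinality(); }
    }

    /** Assumption: merging drops deleted documents, so only live ones are stored. */
    static Segment merge(List<Segment> segments) {
        int live = 0;
        for (Segment s : segments) live += s.numDocs();
        return new Segment(live);
    }

    /** Option (A): count stored documents, ignoring deletions. */
    static int docCountOptionA(List<Segment> segments) {
        int total = 0;
        for (Segment s : segments) total += s.maxDoc;
        return total;
    }

    /** Option (B): count only documents that have not been deleted. */
    static int docCountOptionB(List<Segment> segments) {
        int total = 0;
        for (Segment s : segments) total += s.numDocs();
        return total;
    }

    public static void main(String[] args) {
        // Add 100 documents to one segment, then delete the first 50.
        Segment seg = new Segment(100);
        for (int i = 0; i < 50; i++) seg.delete(i);
        List<Segment> index = new ArrayList<>();
        index.add(seg);

        System.out.println("A before merge: " + docCountOptionA(index));
        System.out.println("B before merge: " + docCountOptionB(index));

        // After a merge the deleted slots are gone and both counts agree.
        List<Segment> merged = new ArrayList<>();
        merged.add(merge(index));
        System.out.println("A after merge:  " + docCountOptionA(merged));
        System.out.println("B after merge:  " + docCountOptionB(merged));
    }
}
```

In this model, with 100 documents added and 50 deleted, the option (A) count stays at 100 until a merge rewrites the segment, while the option (B) count is 50 throughout. That would be consistent with the assertEquals(100, writer.docCount()) in the patched test quoted below.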

Christoph Goller wrote:
> I thought things over and I now think there are two possible options
> for coping with the indexWriter.docCount() bug. I cannot decide this
> alone. Maybe voting is needed.
> 
> Problem:
> 
> writer.docCount() adds up the docCount values from segmentInfos.
> Note that currently only IndexWriter writes segmentInfos ("segments" file).
> IndexReader only reads them. The problem is that segmentInfo.docCount
> values are updated incorrectly in indexWriter.mergeSegments. Information
> about deleted documents is ignored and therefore segmentInfo.docCount
> values for new segments become too big and do not reflect the real size
> of the new segments. This has two effects. Firstly, writer.docCount()
> becomes incorrect, secondly the merge process is controlled by incorrect
> values about segment size. Note that the docCount values from
> segmentInfos are used to control the merge process.
> 
> Option (A)
> 
> This is the IndexWriter patch that I submitted. This patch has the effect
> that segmentInfo.docCount values represent the real size of the segments.
> Even if a document is deleted, it is still there until the segment gets
> merged. For every segment the corresponding segmentInfo.docCount value
> delivers the same value that a reader on this segment would deliver with
> reader.maxDoc(). Of course this also means that for readers and writers
> on the whole index reader.maxDoc() == writer.docCount().
> 
> Option (B)
> 
> This option leaves IndexWriter as it was. IndexReader has to be changed.
> Instead of only reading segmentInfos ("segments" file) IndexReader would
> have to write segmentInfos if documents have been deleted. I would do that
> in reader.doClose. The effect would be that for every segment
> segmentInfo.docCount would deliver the same value that a reader on this
> segment would deliver with reader.numDocs(). For readers and writers on the
> whole index we would have reader.numDocs() == writer.docCount(). Here
> segmentInfo.docCount values represent the number of valid documents of a
> segment, those documents that have not been deleted.
> 
> I am slightly in favour of option (A) since it is less work to do :-) and
> it seems reasonable to use the real size of segments for controlling the
> merge process. However, I can also implement option (B).
> 
> Christoph
> 
> Otis Gospodnetic wrote:
> 
>> Christoph,
>>
>> Thank you for expanding the coverage of the test.
>> However, this looks wrong to me:
>>
>> -          assertEquals(50, writer.docCount());
>> +          assertEquals(100, writer.docCount());
>>
>> Aren't you trying to fix IndexWriter so that after adding 100 and
>> deleting 50 documents, its docCount() method returns 50?
>> The above suggests that the correct behaviour is to return 100, even
>> though 50 have been deleted, and only 50 documents are left in the
>> index.
>>
>> Could you please clarify this for me, before I commit the patches to
>> (Test)IndexWriter?
>>
>> Thanks,
>> Otis
>>
>>
>> --- Christoph Goller <goller@detego-software.de> wrote:
>>
> 

