lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: PATCH: IndexWriter
Date Mon, 15 Sep 2003 12:39:47 GMT
Christoph,

Option (A) seems to be a better thing to do after all.
Isn't that the IndexWriter patch that you ... ah, yes, you say that
yourself below.

Thanks again, I'll commit the patched IndexWriter now.

Otis

--- Christoph Goller <goller@detego-software.de> wrote:
> I thought things over and I now think there are two possible options
> for coping with the indexWriter.docCount() bug. I cannot decide this
> alone. Maybe voting is needed.
> 
> Problem:
> 
> writer.docCount() adds up the docCount values from segmentInfos.
> Note that currently only IndexWriter writes segmentInfos (the
> "segments" file); IndexReader only reads them. The problem is that
> segmentInfo.docCount values are updated incorrectly in
> indexWriter.mergeSegments. Information about deleted documents is
> ignored, so segmentInfo.docCount values for new segments become too
> big and do not reflect the real size of the new segments. This has
> two effects. First, writer.docCount() becomes incorrect. Second, the
> merge process is controlled by incorrect segment sizes, because the
> docCount values from segmentInfos are what control the merge process.
> 
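The miscount described above can be sketched with a toy model. This is not Lucene code: the Segment class and both helper methods are hypothetical, and the assumption that the buggy count is simply the sum of the old recorded counts is an illustration, not something confirmed by the IndexWriter source.

```java
import java.util.Arrays;
import java.util.List;

// Toy model, not Lucene code: a segment is reduced to two counters.
final class MergeSketch {
    /** Hypothetical stand-in for one segment. */
    static final class Segment {
        final int maxDoc;   // document slots, including deleted documents
        final int deleted;  // documents marked as deleted

        Segment(int maxDoc, int deleted) {
            this.maxDoc = maxDoc;
            this.deleted = deleted;
        }

        int numDocs() {
            return maxDoc - deleted;
        }
    }

    /** Merging drops deleted documents, so the new segment holds only live ones. */
    static Segment merge(List<Segment> segments) {
        int live = 0;
        for (Segment s : segments) {
            live += s.numDocs();
        }
        return new Segment(live, 0);
    }

    /** Assumed buggy bookkeeping: deletions ignored, old counts just added up. */
    static int recordedIgnoringDeletes(List<Segment> segments) {
        int sum = 0;
        for (Segment s : segments) {
            sum += s.maxDoc;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(new Segment(10, 5), new Segment(10, 0));
        Segment merged = merge(segments);
        System.out.println("recorded ignoring deletes: " + recordedIgnoringDeletes(segments)); // 20
        System.out.println("real size after merge:     " + merged.maxDoc);                     // 15
    }
}
```

In this model the recorded count (20) is larger than the merged segment really is (15), which is the kind of gap that would distort both docCount() and any merge decision based on segment size.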
> Option (A)
> 
> This is the IndexWriter patch that I submitted. With this patch,
> segmentInfo.docCount values represent the real size of the segments.
> Even if a document is deleted, it is still there until the segment
> gets merged. For every segment, the corresponding
> segmentInfo.docCount value is the same value that a reader on this
> segment would return from reader.maxDoc(). Of course, this also means
> that for readers and writers on the whole index,
> reader.maxDoc() == writer.docCount().
> 
> Option (B)
> 
> This option leaves IndexWriter as it was. IndexReader has to be
> changed. Instead of only reading segmentInfos ("segments" file),
> IndexReader would have to write segmentInfos if documents have been
> deleted. I would do that in reader.doClose. The effect would be that
> for every segment, segmentInfo.docCount would be the same value that
> a reader on this segment would return from reader.numDocs(). For
> readers and writers on the whole index we would have
> reader.numDocs() == writer.docCount(). Here segmentInfo.docCount
> values represent the number of valid documents of a segment, that is,
> those documents that have not been deleted.
> 
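The difference between the two options can be sketched with the same kind of toy model, using the 100-added, 50-deleted case from the test diff quoted further down. The class and method names are hypothetical and nothing here calls Lucene; it only shows which number each option would report, assuming all 100 documents sit in one unmerged segment.

```java
import java.util.Arrays;
import java.util.List;

// Toy model, not Lucene code: compares what docCount() would mean under each option.
final class DocCountOptions {
    /** Hypothetical stand-in for one segment. */
    static final class Segment {
        final int maxDoc;   // document slots, including deleted documents
        final int deleted;  // documents marked as deleted

        Segment(int maxDoc, int deleted) {
            this.maxDoc = maxDoc;
            this.deleted = deleted;
        }
    }

    /** Option (A): docCount mirrors reader.maxDoc(), so deleted documents still count. */
    static int docCountOptionA(List<Segment> segments) {
        int sum = 0;
        for (Segment s : segments) {
            sum += s.maxDoc;
        }
        return sum;
    }

    /** Option (B): docCount mirrors reader.numDocs(), so only live documents count. */
    static int docCountOptionB(List<Segment> segments) {
        int sum = 0;
        for (Segment s : segments) {
            sum += s.maxDoc - s.deleted;
        }
        return sum;
    }

    public static void main(String[] args) {
        // 100 documents added, 50 deleted, segment not yet merged.
        List<Segment> index = Arrays.asList(new Segment(100, 50));
        System.out.println("option A: " + docCountOptionA(index)); // 100
        System.out.println("option B: " + docCountOptionB(index)); // 50
    }
}
```

Under these assumptions the two options disagree only while deleted documents are still physically present; once a merge has dropped them, both counts coincide.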
> I am slightly in favour of option (A) since it is less work to do :-)
> and it seems reasonable to use the real size of segments for
> controlling the merge process. However, I can also implement
> option (B).
> 
> Christoph
> 
> Otis Gospodnetic wrote:
> > Christoph,
> > 
> > Thank you for expanding the coverage of the test.
> > However, this looks wrong to me:
> > 
> > -          assertEquals(50, writer.docCount());
> > +          assertEquals(100, writer.docCount());
> > 
> > Aren't you trying to fix IndexWriter so that after adding 100 and
> > deleting 50 documents, its docCount() method returns 50?
> > The above suggests that the correct behaviour is to return 100, even
> > though 50 have been deleted, and only 50 documents are left in the
> > index.
> > 
> > Could you please clarify this for me, before I commit the patches to
> > (Test)IndexWriter?
> > 
> > Thanks,
> > Otis
> > 
> > 
> > --- Christoph Goller <goller@detego-software.de> wrote:
> > 
> 
> -- 
> *****************************************************************
> * Dr. Christoph Goller       Tel.:   +49 89 203 45734           *
> * Detego Software GmbH       Mobile: +49 179 1128469            *
> * Keuslinstr. 13             Fax.:   +49 721 151516176          *
> * 80798 München, Germany     Email:  goller@detego-software.de  *
> *****************************************************************

