lucene-dev mailing list archives

From Christoph Goller <gol...@detego-software.de>
Subject Re: PATCH: TestIndexWriter
Date Thu, 11 Sep 2003 14:47:26 GMT
writer.docCount() adds up the docCount values from segmentInfos. The problem
is that these values are currently not updated when documents are deleted, and
that during a merge the values for the new segments are taken from the old
segmentInfos. My patch makes writer.docCount() return the same result as
reader.maxDoc(), which reflects deleted documents only once their segment has
been merged. That is the difference from reader.numDocs(), which is updated
immediately. Look at what is tested after deleting 50 documents:

           writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
           assertEquals(100, writer.docCount());  // <-- still 100 before optimize()
           writer.close();

           reader = IndexReader.open(dir);
           assertEquals(100, reader.maxDoc());    // <-- still 100 before optimize()
           assertEquals(50, reader.numDocs());
           reader.close();

           writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
           writer.optimize();
           assertEquals(50, writer.docCount());
           writer.close();

           reader = IndexReader.open(dir);
           assertEquals(50, reader.maxDoc());
           assertEquals(50, reader.numDocs());
           reader.close();
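
The relationship between the three counters can be sketched without Lucene at
all. What follows is a toy model, not Lucene code: the class DocCountModel, its
Segment type and the merge() helper are hypothetical names invented for this
illustration. It assumes only what is described above: maxDoc counts slots
including deleted ones, numDocs counts live documents, a merge drops deleted
slots, and the patched docCount is the sum of maxDoc over all segments.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the counters discussed above; hypothetical, not Lucene code. */
public class DocCountModel {

    /** Hypothetical stand-in for a segment: one deletion flag per document slot. */
    static final class Segment {
        final boolean[] deleted;

        Segment(int slots) {
            deleted = new boolean[slots];
        }

        /** Number of slots, including deleted ones. */
        int maxDoc() {
            return deleted.length;
        }

        /** Number of live (not deleted) documents. */
        int numDocs() {
            int live = 0;
            for (boolean d : deleted) {
                if (!d) live++;
            }
            return live;
        }
    }

    /** Merging copies only live documents, so deleted slots disappear. */
    static Segment merge(List<Segment> segments) {
        int live = 0;
        for (Segment s : segments) {
            live += s.numDocs();
        }
        return new Segment(live);
    }

    /** docCount in the sense of the patch: the sum of maxDoc over all segments. */
    static int docCount(List<Segment> segments) {
        int total = 0;
        for (Segment s : segments) {
            total += s.maxDoc();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Segment> index = new ArrayList<>();
        index.add(new Segment(100));
        for (int i = 0; i < 50; i++) {
            index.get(0).deleted[i] = true;   // delete docs #0-49
        }
        System.out.println("before merge: docCount=" + docCount(index)
                + " numDocs=" + index.get(0).numDocs());

        Segment merged = merge(index);        // plays the role of optimize()
        index.clear();
        index.add(merged);
        System.out.println("after merge:  docCount=" + docCount(index)
                + " numDocs=" + index.get(0).numDocs());
    }
}
```

In this model the count of live documents drops as soon as a flag is set, while
the slot count only shrinks when merge() builds a new segment, which is the
same pattern the assertions above check.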


Maybe I did not think this through far enough. writer.docCount() could return
the same values as reader.numDocs(). However, that would require changes in
IndexReader: reader.doClose() would have to update the values in segmentInfos.
Currently IndexReader only reads segmentInfos. This also makes sense with
respect to merging, if we keep in mind that the docCount values in
segmentInfos are used to control the merge process. I will think about it and
may submit another patch soon. Please hold off on committing my IndexWriter
patch for a while; it may become obsolete.
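
To illustrate why the choice of counter matters for merging, here is a small
sketch of a size-based merge trigger. It is a simplified, hypothetical rule
written for this example, not Lucene's actual merge code: the class
MergeTriggerSketch, the method firstSegmentToMerge() and the threshold
parameter are all assumptions. The only point is that the decision depends on
whichever per-segment counts are recorded.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical size-based merge trigger; an illustration, not Lucene's algorithm. */
public class MergeTriggerSketch {

    /**
     * Walks backwards over the newest segments whose recorded count is below
     * the threshold and sums their counts. Returns the index of the first
     * segment of that trailing run if the sum reaches the threshold,
     * or -1 if no merge should be triggered.
     */
    static int firstSegmentToMerge(List<Integer> docCounts, int threshold) {
        int sum = 0;
        int start = docCounts.size();
        while (start > 0 && docCounts.get(start - 1) < threshold) {
            start--;
            sum += docCounts.get(start);
        }
        return sum >= threshold ? start : -1;
    }

    public static void main(String[] args) {
        // Counts recorded per segment, ignoring deletions.
        List<Integer> recorded = Arrays.asList(10, 4, 3, 3);
        // The same segments if two deletions in the third one were subtracted.
        List<Integer> live = Arrays.asList(10, 4, 1, 3);

        System.out.println("recorded counts -> merge from segment "
                + firstSegmentToMerge(recorded, 10));
        System.out.println("live counts     -> merge from segment "
                + firstSegmentToMerge(live, 10));
    }
}
```

With the recorded counts the three small segments add up to the threshold and
would be merged; with the live counts they fall short and no merge is
triggered. Which behaviour is preferable is exactly the open question here.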

Christoph


Otis Gospodnetic wrote:
> Christoph,
> 
> Thank you for expanding the coverage of the test.
> However, this looks wrong to me:
> 
> -          assertEquals(50, writer.docCount());
> +          assertEquals(100, writer.docCount());
> 
> Aren't you trying to fix IndexWriter so that after adding 100 and
> deleting 50 documents, its docCount() method returns 50?
> The above suggests that the correct behaviour is to return 100, even
> though 50 have been deleted, and only 50 documents are left in the
> index.
> 
> Could you please clarify this for me, before I commit the patches to
> (Test)IndexWriter?
> 
> Thanks,
> Otis
> 
> 
> --- Christoph Goller <goller@detego-software.de> wrote:
> 
>>Sorry, here is the patch.
>>
>>Otis Gospodnetic wrote:
>>
>>>Christoph,
>>>
>>>The idea looks good, but the test fails for both the pre-patched and the
>>>patched version of IndexWriter.
>>>
>>>I converted your test to a JUnit test and will check it into CVS shortly.
>>>If I made a mistake in it, please point it out.
>>>You can run 'ant test-unit' to see where the test fails.
>>>
>>>Otis
>>>
>>>--- Christoph Goller <goller@detego-software.de> wrote:
>>>
>>>
>>>>IndexWriter implements the method docCount() which reads the number
>>>>of documents from the SegmentInfos of the index. However, it delivers
>>>>incorrect values if documents get deleted from the index. The reason for
>>>>this is that SegmentInfo.docCounts are updated in an incorrect way when
>>>>segments get merged. The new value is taken from the old SegmentInfos.
>>>>It would be better to take the value from the reader instead. In this
>>>>way indexWriter.docCount() would deliver the same value as
>>>>indexReader.maxDoc().
>>>>
>>>>test and patch are attached,
>>>>Christoph
>>>>
>>>>
>>>>-- 
>>>>*****************************************************************
>>>>* Dr. Christoph Goller       Tel.:   +49 89 203 45734           *
>>>>* Detego Software GmbH       Mobile: +49 179 1128469            *
>>>>* Keuslinstr. 13             Fax.:   +49 721 151516176          *
>>>>* 80798 München, Germany     Email:  goller@detego-software.de  *
>>>>*****************************************************************
>>>>
>>>>
>>>>Index: IndexWriter.java
>>>>===================================================================
>>>>RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexWriter.java,v
>>>>retrieving revision 1.14
>>>>diff -u -r1.14 IndexWriter.java
>>>>--- IndexWriter.java	12 Aug 2003 15:05:03 -0000	1.14
>>>>+++ IndexWriter.java	3 Sep 2003 14:55:33 -0000
>>>>@@ -355,7 +355,7 @@
>>>>      if ((reader.directory == this.directory) || // if we own the directory
>>>>          (reader.directory == this.ramDirectory))
>>>>	segmentsToDelete.addElement(reader);	  // queue segment for deletion
>>>>-      mergedDocCount += si.docCount;
>>>>+      mergedDocCount += reader.numDocs();
>>>>    }
>>>>    if (infoStream != null) {
>>>>      infoStream.println();
>>>>
>>>>
>>>>import java.io.IOException;
>>>>
>>>>import org.apache.lucene.analysis.WhitespaceAnalyzer;
>>>>import org.apache.lucene.document.Document;
>>>>import org.apache.lucene.document.Field;
>>>>import org.apache.lucene.index.IndexReader;
>>>>import org.apache.lucene.index.IndexWriter;
>>>>import org.apache.lucene.store.Directory;
>>>>import org.apache.lucene.store.RAMDirectory;
>>>>
>>>>/*
>>>>* Created on 03.09.2003
>>>>*
>>>>* To change the template for this generated file go to
>>>>* Window>Preferences>Java>Code Generation>Code and Comments
>>>>*/
>>>>
>>>>/**
>>>>* 
>>>>* @author goller
>>>>*/
>>>>public class IndexWriterDocCountTest {
>>>>   
>>>>   int docCount = 0;
>>>> 
>>>>     void addDoc(IndexWriter writer)
>>>>     {
>>>>       Document doc = new Document();
>>>>   
>>>>       doc.add(Field.Keyword("id","id" + docCount));
>>>>       doc.add(Field.UnStored("content","aaa"));
>>>>   
>>>>       try {
>>>>         writer.addDocument(doc);
>>>>       }
>>>>       catch (IOException e) {
>>>>         // TODO Auto-generated catch block
>>>>         e.printStackTrace();
>>>>       }
>>>>       docCount++;
>>>>     }
>>>>   
>>>>   
>>>>
>>>>   public static void main(String[] args) {
>>>>       
>>>>       Directory dir = new RAMDirectory();
>>>>       IndexWriterDocCountTest test = new IndexWriterDocCountTest();
>>>>   
>>>>       IndexWriter writer = null;
>>>>       IndexReader reader = null;
>>>>       int i;
>>>>   
>>>>       try {
>>>>         writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
>>>>     
>>>>         for (i = 0; i < 100; i++)
>>>>           test.addDoc(writer);
>>>>     
>>>>         System.out.println("docCount: " + writer.docCount());
>>>>         writer.close();
>>>>         
>>>>         reader = IndexReader.open(dir);
>>>>         for (i = 0; i < 50; i++)
>>>>           reader.delete(i);
>>>>         reader.close();
>>>>         System.out.println("doc #0-49 deleted");
>>>>         
>>>>         writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
>>>>         System.out.println("docCount: " + writer.docCount());
>>>>         
>>>>         writer.optimize();
>>>>         System.out.println("optimize called");
>>>>         System.out.println("docCount: " + writer.docCount());
>>>>         writer.close();
>>>>         
>>>>       }
>>>>       catch (IOException e) {
>>>>         // TODO Auto-generated catch block
>>>>         e.printStackTrace();
>>>>       }
>>>>   }
>>>>}
>>>>
>>>>
>>>
>>---------------------------------------------------------------------
>>
>>>>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>>>>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>>Index: TestIndexWriter.java
>>===================================================================
>>RCS file: /home/cvspublic/jakarta-lucene/src/test/org/apache/lucene/index/TestIndexWriter.java,v
>>retrieving revision 1.1
>>diff -u -r1.1 TestIndexWriter.java
>>--- TestIndexWriter.java	10 Sep 2003 12:58:37 -0000	1.1
>>+++ TestIndexWriter.java	10 Sep 2003 16:29:31 -0000
>>@@ -47,10 +47,23 @@
>>           reader.close();
>> 
>>           writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
>>-          assertEquals(50, writer.docCount());
>>+          assertEquals(100, writer.docCount());
>>+          writer.close();
>>+          
>>+          reader = IndexReader.open(dir);
>>+          assertEquals(100, reader.maxDoc());
>>+          assertEquals(50, reader.numDocs());
>>+          reader.close();
>>+          
>>+          writer  = new IndexWriter(dir, new WhitespaceAnalyzer(), false);
>>           writer.optimize();
>>           assertEquals(50, writer.docCount());
>>           writer.close();
>>+          
>>+          reader = IndexReader.open(dir);
>>+          assertEquals(50, reader.maxDoc());
>>+          assertEquals(50, reader.numDocs());
>>+          reader.close();
>>         }
>>         catch (IOException e) {
>>           e.printStackTrace();
