lucene-dev mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: flushRamSegments() is "over merging"?
Date Tue, 15 Aug 2006 15:28:38 GMT
Yes, that's counter-intuitive... a high merge factor is more likely
to cause a merge with the last disk-based segment.

On the other hand... if you have a high maxBufferedDocs and a normal
mergeFactor (much more likely), you could end up with way too many
segments if you didn't merge.

Hmmm, I'm thinking of another case where you could end up with far too
many segments... if you have a low merge factor and high
maxBufferedDocs (a common scenario), then each small batch of added
docs keeps creating a separate segment on close.

Consider the following settings:
mergeFactor=10
maxBufferedDocs=10000

Now add 11 docs at a time to an existing index, closing in between.
segment sizes: 100000, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, ...
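
To make that concrete, here is a minimal sketch (hypothetical: the index
path "test.index" and the field are made up, and an index is assumed to
already exist there):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ManySmallSegments {
      public static void main(String[] args) throws IOException {
        for (int batch = 0; batch < 20; batch++) {
          // re-open the existing index (create == false)
          IndexWriter iw = new IndexWriter("test.index",
                new StandardAnalyzer(), false);
          iw.setMergeFactor(10);
          iw.setMaxBufferedDocs(10000);
          // 11 docs: more than mergeFactor, far below maxBufferedDocs
          for (int i = 0; i < 11; i++) {
            Document doc = new Document();
            doc.add(new Field("body", "x",
                  Field.Store.YES, Field.Index.UN_TOKENIZED));
            iw.addDocument(doc);
          }
          iw.close();  // leaves one more 11-doc segment on disk
        }
      }
    }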

It seems like the merge logic somewhere should also take into account
the number of segments at a certain level, not just the number of
documents in those segments.
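
Something along those lines might look like this; a rough hypothetical
sketch only, where levelSize is a made-up per-level bound and
mergeSegments() stands in for the existing merge call:

    // merge a trailing run of small segments once there are "too many"
    // of them, even if their combined docCount is still small
    int small = 0;
    int start = segmentInfos.size();
    for (int i = segmentInfos.size() - 1; i >= 0; i--) {
      if (segmentInfos.info(i).docCount > levelSize)
        break;                       // segment too big for this level
      small++;
      start = i;
    }
    if (small >= mergeFactor)
      mergeSegments(start);          // collapse the whole run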

-Yonik

On 8/15/06, Doron Cohen <DORONC@il.ibm.com> wrote:
>
> Hi, I ran into this while reviewing the patch for 565.
>
> It appears that closing an index writer with non-empty ram segments (at
> least one doc was added) causes a merge with the last (most recent)
> on-disk segment.
>
> This seems problematic when an application does a lot of interleaving -
> adding and removing documents, or even switching indexes - and therefore
> closes the IndexWriter often.
>
> The test case below demonstrates this behavior - maxBufferedDocs,
> maxMergeDocs, and mergeFactor are all assigned very large values, and in a
> loop a few documents are added and the IndexWriter is closed and re-opened.
>
> Surprisingly (at least for me), the number of segments on disk remains 1.
> In other words, each time the IndexWriter is closed, the single disk
> segment is merged with the current ram segments and re-written as a new
> single disk segment.
>
> The "blame" is in the second line here:
>     if (minSegment < 0 ||                   // add one FS segment?
>         (docCount + segmentInfos.info(minSegment).docCount) > mergeFactor ||
>         !(segmentInfos.info(segmentInfos.size()-1).dir == ramDirectory))
>
> This code in flushRamSegments() merges the (temporary) ram segments with
> the most recent non-temporary segment.
>
> I can see how this can make sense in some cases. Perhaps an additional
> constraint should be added on the ratio of the size of this non-temp
> segment to that of all temporary segments, or the difference, or both.
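>
> For illustration only, a hypothetical variant of that condition (MAX_RATIO
> is a made-up tuning constant, not existing code):
>
>     if (minSegment < 0 ||                   // add one FS segment?
>         (docCount + segmentInfos.info(minSegment).docCount) > mergeFactor ||
>         segmentInfos.info(minSegment).docCount > MAX_RATIO * docCount || // new
>         !(segmentInfos.info(segmentInfos.size()-1).dir == ramDirectory))
>       minSegment++;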
>
> Here is the test case,
> Thanks,
> Doron
> ------------------------------------
> package org.apache.lucene.index;
>
> import java.io.IOException;
>
> import junit.framework.TestCase;
>
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.store.RAMDirectory;
>
> /**
>  * Test that the number of segments is as expected,
>  * i.e. that there were not too many / too few merges.
>  *
>  * @author Doron Cohen
>  */
> public class TestNumSegments extends TestCase {
>
>       protected int nextDocNum = 0;
>       protected Directory dir = null;
>       protected IndexWriter iw = null;
>       protected IndexReader ir = null;
>
>       /* (non-Javadoc)
>        * @see junit.framework.TestCase#setUp()
>        */
>       protected void setUp() throws Exception {
>             super.setUp();
>             //dir = new RAMDirectory();
>             dir = FSDirectory.getDirectory("test.num.segments",true);
>             iw = new IndexWriter(dir, new StandardAnalyzer(), true);
>             setLimits(iw);
>             addSomeDocs(); // some docs in index
>       }
>
>       // for now, take these limits out of the "game"
>       protected void setLimits(IndexWriter iw) {
>             iw.setMaxBufferedDocs(Integer.MAX_VALUE-1);
>             iw.setMaxMergeDocs(Integer.MAX_VALUE-1);
>             iw.setMergeFactor(Integer.MAX_VALUE-1);
>       }
>
>       /* (non-Javadoc)
>        * @see junit.framework.TestCase#tearDown()
>        */
>       protected void tearDown() throws Exception {
>             closeW();
>             if (dir!=null) {
>                   dir.close();
>             }
>             super.tearDown();
>       }
>
>       // count how many segments are in the directory - the index writer
>       // must be closed first
>       protected int countDirSegments() throws IOException {
>             assertNull(iw);
>             SegmentInfos segmentInfos = new SegmentInfos();
>             segmentInfos.read(dir);
>             int nSegs = segmentInfos.size();
>             segmentInfos.clear();
>             return nSegs;
>       }
>
>       // open writer
>       private void openW() throws IOException {
>             iw = new IndexWriter(dir, new StandardAnalyzer(), false);
>             setLimits(iw);
>       }
>
>       private void closeW() throws IOException {
>             if (iw!=null) {
>                   iw.close();
>                   iw=null;
>             }
>       }
>
>       public void testNumSegments() throws IOException {
>             int numExceptions = 0;
>             for (int i=1; i<30; i++) {
>                   closeW();
>                   try {
>                         assertEquals("Oops - wrong number of segments!",
>                               i, countDirSegments());
>                   } catch (Throwable t) {
>                         numExceptions++;
>                         System.err.println(i+":  "+t.getMessage());
>                   }
>                   openW();
>                   addSomeDocs();
>             }
>             assertEquals("Oops! number of segments was \"wrong\" too many times",
>                   0, numExceptions);
>       }
>
>       private void addSomeDocs() throws IOException {
>             for (int i=0; i<2; i++) {
>                   iw.addDocument(getDoc());
>             }
>       }
>
>       protected Document getDoc() {
>             Document doc = new Document();
>             doc.add(new Field("body", Integer.toString(nextDocNum),
>                   Field.Store.YES, Field.Index.UN_TOKENIZED));
>             doc.add(new Field("all", "x",
>                   Field.Store.YES, Field.Index.UN_TOKENIZED));
>             nextDocNum++;
>             return doc;
>       }
>
> }


-- 
-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


