lucene-dev mailing list archives

From Doron Cohen <DOR...@il.ibm.com>
Subject Re: flushRamSegments() is "over merging"?
Date Wed, 16 Aug 2006 07:03:47 GMT
Thanks Yonik, you're right, I got confused with the merge factor.

My (corrected) interpretation of merge-factor: it is the fan-out (rank) of an
imaginary merge tree, controlling how many segments are merged to create a
larger segment. This way it balances the IO spent merging during indexing
against the IO spent during search.
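
For example - a rough sketch only, reusing the same writer setup as the test
case further down - with mergeFactor=10 and maxBufferedDocs=10, every 10
buffered docs become a level-0 segment, every 10 level-0 segments merge into a
~100-doc segment, every 10 of those into a ~1000-doc segment, and so on:

    // Sketch; 'dir' is any Directory, as in the test case below.
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    writer.setMergeFactor(10);      // fan-out of the merge tree
    writer.setMaxBufferedDocs(10);  // docs per level-0 segment
    // After adding 1000 docs the index ends up as roughly one 1000-doc
    // segment rather than 100 ten-doc segments.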

You are saying (in my words :-) that 'over-merging' is not an issue, because
setting a large merge factor means many documents may be merged at once
anyway, and that you are more worried about too few merges - as in the
mergeFactor=10 / add-11-docs-at-a-time example you give below for
flushRamSegments(), and in the LUCENE-388 discussion you pointed to.

Under-merging would hurt search unless optimize() is called explicitly, but
the index should "behave" without requiring the user to call optimize.
LUCENE-388 deals with this.
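
The explicit workaround today is something like the following (a sketch only,
just to show where optimize() would have to be called):

    // 'dir' is the index Directory; same writer setup as in the test below.
    IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), false);
    // ... add a few documents ...
    iw.optimize();  // collapses all segments into one - costly, and something
                    // the application should not have to remember to do
    iw.close();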

Over-merging - in the current flushRamSegments() code - would merge at most
merge-factor documents prematurely. Since merge-factor is usually not very
large, this might be a minor issue - but still, if an index grows in small
doses, does it make sense to re-merge with the last disk segment each time the
index is closed? Why not let it simply be controlled by maybeMergeSegments?
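
One way to picture the alternative (a sketch only - the names below are made
up, not the actual flushRamSegments() internals): fold the last on-disk
segment into the flush only when it is comparable in size to the buffered ram
segments, along the lines of the ratio constraint suggested in the quoted
message below, and otherwise just write the ram segments out and leave further
merging to maybeMergeSegments():

    // Hypothetical guard inside flushRamSegments(); 'ramDocCount' and
    // 'lastDiskDocCount' are illustrative names, not existing fields.
    boolean foldInLastDiskSegment =
          lastDiskDocCount <= mergeFactor * ramDocCount;  // size-ratio guard
    // When false, the ram segments become a new disk segment on their own,
    // and any further merging is left to maybeMergeSegments().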

Thanks,
Doron

yseeley@gmail.com wrote on 15/08/2006 08:29:53:

> Related to merging more often than one would expect, check out my last
> comment in this bug:
> http://issues.apache.org/jira/browse/LUCENE-388
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server
>
> On 8/15/06, Yonik Seeley <yonik@apache.org> wrote:
> > Yes, that's counter-intuitive.... a high merge factor is more likely
> > to cause a merge with the last disk-based segment.
> >
> > On the other hand... if you have a high maxBufferedDocs and a normal
> > mergeFactor (much more likely), you could end up with way too many
> > segments if you didn't merge.
> >
> > Hmmm, I'm thinking of another case where you could end up with far too
> > many segments... if you have a low merge factor and high
> > maxBufferedDocs (a common scenario), then if you add enough docs it
> > will keep creating a separate segment.
> >
> > Consider the following settings:
> > mergeFactor=10
> > maxBufferedDocs=10000
> >
> > Now add 11 docs at a time to an existing index, closing in between.
> > segment sizes: 100000, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
> > 11, 11, ...
> >
> > It seems like the merge logic somewhere should also take into account
> > the number of segments at a certain level, not just the number of
> > documents in those segments.
> >
> > -Yonik
> >
> > On 8/15/06, Doron Cohen <DORONC@il.ibm.com> wrote:
> > >
> > > Hi, I ran into this while reviewing the patch for 565.
> > >
> > > It appears that closing an index writer with non-empty ram segments (at
> > > least one doc was added) causes a merge with the last (most recent)
> > > on-disk segment.
> > >
> > > This seems problematic to me when an application does a lot of
> > > interleaving - adding / removing documents, or even switching indexes -
> > > so that the IndexWriter is closed often.
> > >
> > > The test case below demonstrates this behavior - maxBufferedDocs,
> > > maxMergeDocs, and mergeFactor are all assigned very large values, and in
> > > a loop a few documents are added and the IndexWriter is closed and
> > > re-opened.
> > >
> > > Surprisingly (at least for me) the number of segments on disk remains 1.
> > > In other words, each time the IndexWriter is closed, the single disk
> > > segment is merged with the current ram segments and re-written as a new
> > > disk segment.
> > >
> > > The "blame" is in the second line here:
> > >     if (minSegment < 0 ||                   // add one FS segment?
> > >         (docCount + segmentInfos.info(minSegment).docCount) > mergeFactor ||
> > >         !(segmentInfos.info(segmentInfos.size()-1).dir == ramDirectory))
> > >
> > > This code in flushRamSegments() merges the (temporary) ram segments with
> > > the most recent non-temporary segment.
> > >
> > > I can see how this can make sense in some cases. Perhaps an additional
> > > constraint should be added on the ratio of the size of this non-temp
> > > segment to that of all temporary segments, or the difference, or both.
> > >
> > > Here is the test case,
> > > Thanks,
> > > Doron
> > > ------------------------------------
> > > package org.apache.lucene.index;
> > >
> > > import java.io.IOException;
> > >
> > > import junit.framework.TestCase;
> > >
> > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > import org.apache.lucene.document.Document;
> > > import org.apache.lucene.document.Field;
> > > import org.apache.lucene.store.Directory;
> > > import org.apache.lucene.store.FSDirectory;
> > > import org.apache.lucene.store.RAMDirectory;
> > >
> > > /**
> > >  * Test that the number of segments is as expected.
> > >  * I.e. that there was not too many / too few merges.
> > >  *
> > >  * @author Doron Cohen
> > >  */
> > > public class TestNumSegments extends TestCase {
> > >
> > >       protected int nextDocNum = 0;
> > >       protected Directory dir = null;
> > >       protected IndexWriter iw = null;
> > >       protected IndexReader ir = null;
> > >
> > >       /* (non-Javadoc)
> > >        * @see junit.framework.TestCase#setUp()
> > >        */
> > >       protected void setUp() throws Exception {
> > >             super.setUp();
> > >             //dir = new RAMDirectory();
> > >             dir = FSDirectory.getDirectory("test.num.segments",true);
> > >             iw = new IndexWriter(dir, new StandardAnalyzer(), true);
> > >             setLimits(iw);
> > >             addSomeDocs(); // some docs in index
> > >       }
> > >
> > >       // for now, take these limits out of the "game"
> > >       protected void setLimits(IndexWriter iw) {
> > >             iw.setMaxBufferedDocs(Integer.MAX_VALUE-1);
> > >             iw.setMaxMergeDocs(Integer.MAX_VALUE-1);
> > >             iw.setMergeFactor(Integer.MAX_VALUE-1);
> > >       }
> > >
> > >       /* (non-Javadoc)
> > >        * @see junit.framework.TestCase#tearDown()
> > >        */
> > >       protected void tearDown() throws Exception {
> > >             closeW();
> > >             if (dir!=null) {
> > >                   dir.close();
> > >             }
> > >             super.tearDown();
> > >       }
> > >
> > >       // count how many segments are on a directory - index writer must be closed
> > >       protected int countDirSegments() throws IOException {
> > >             assertNull(iw);
> > >             SegmentInfos segmentInfos = new SegmentInfos();
> > >             segmentInfos.read(dir);
> > >             int nSegs = segmentInfos.size();
> > >             segmentInfos.clear();
> > >             return nSegs;
> > >       }
> > >
> > >       // open writer
> > >       private void openW() throws IOException {
> > >             iw = new IndexWriter(dir, new StandardAnalyzer(), false);
> > >             setLimits(iw);
> > >       }
> > >
> > >       private void closeW() throws IOException {
> > >             if (iw!=null) {
> > >                   iw.close();
> > >                   iw=null;
> > >             }
> > >       }
> > >
> > >       public void testNumSegments() throws IOException {
> > >             int numExceptions = 0;
> > >             for (int i=1; i<30; i++) {
> > >                   closeW();
> > >                   try {
> > >                         assertEquals("Oops - wrong number of segments!",
> > >                               i, countDirSegments());
> > >                   } catch (Throwable t) {
> > >                         numExceptions++;
> > >                         System.err.println(i+":  "+t.getMessage());
> > >                   }
> > >                   openW();
> > >                   addSomeDocs();
> > >             }
> > >             assertEquals("Oops! - so many times the number of segments was \"wrong\"",
> > >                   0, numExceptions);
> > >       }
> > >
> > >       private void addSomeDocs() throws IOException {
> > >             for (int i=0; i<2; i++) {
> > >                   iw.addDocument(getDoc());
> > >             }
> > >       }
> > >
> > >       protected Document getDoc() {
> > >             Document doc = new Document();
> > >             doc.add(new Field("body", new Integer(nextDocNum).toString(),
> > >                   Field.Store.YES, Field.Index.UN_TOKENIZED));
> > >             doc.add(new Field("all", "x", Field.Store.YES,
> > >                   Field.Index.UN_TOKENIZED));
> > >             nextDocNum ++;
> > >             return doc;
> > >       }
> > >
> > > }
> > >
> > >
> > >
> >
> >
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

