lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1012) Problems with maxMergeDocs parameter
Date Mon, 01 Oct 2007 08:52:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531440 ]

Michael McCandless commented on LUCENE-1012:
--------------------------------------------

> - It seems that DocumentsWriter does not obey the maxMergeDocs
>   parameter. If I don't flush manually, then the index only contains
>   one segment at the end and the test fails.

This bug actually predates DocumentsWriter: the flushing logic has
never respected maxMergeDocs.  I think normally maxMergeDocs is far
larger than maxBufferedDocs.

To fix this we could change the flushing logic to include "# buffered
docs > maxMergeDocs" as one of its flush criteria, if the current
merge policy is a LogMergePolicy.
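
For illustration, that extra criterion might look roughly like the
sketch below.  The names (numBufferedDocs(), doFlush(), mergePolicy)
are stand-ins for the real IndexWriter/DocumentsWriter internals, not
actual API:

{code:java}
// Sketch only: hypothetical flush check inside IndexWriter.
private void maybeFlushBufferedDocs() throws IOException {
  boolean triggerFlush = numBufferedDocs() >= getMaxBufferedDocs();

  // Proposed extra criterion: flush before the in-memory buffer
  // exceeds the largest segment the merge policy allows.
  if (mergePolicy instanceof LogMergePolicy) {
    int maxMergeDocs = ((LogMergePolicy) mergePolicy).getMaxMergeDocs();
    triggerFlush |= numBufferedDocs() >= maxMergeDocs;
  }

  if (triggerFlush) {
    doFlush();
  }
}
{code}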

> - If I flush manually after each addDocument() call, then the index
>   contains more segments. But still, there are segments that contain
>   more docs than maxMergeDocs, e.g. 55 vs. 50.

This behavior also predates the recent changes (MergePolicy, etc.),
e.g. the test fails on 2.1 if you flush every 6 docs (whenever "0 == i%6").

Really the current approach is better described as "any segment with
doc count greater than maxMergeDocs will not be merged".

We could just fix the javadocs to match the current approach?
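
If we went that route, the javadoc might read something like this
(suggested wording only, against the accessor quoted in the issue
below):

{code:java}
/**
 * Returns the largest number of documents allowed in a
 * segment before that segment stops being a candidate
 * for merging.  Note that this does not bound the size
 * of a newly flushed or newly merged segment, so the
 * index may still contain segments with more documents.
 *
 * @see #setMaxMergeDocs
 */
public int getMaxMergeDocs() {
  return getLogDocMergePolicy().getMaxMergeDocs();
}
{code}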

Or, we could change the code to actually work the way the current
javadoc says, i.e. "no segment with > maxMergeDocs will ever be
created".

Though changing the code is somewhat tricky: in order to know whether
a segment will have > maxMergeDocs after the merge is done, you must
know the delete count against each of the segments, which is somewhat
costly to compute now (you have to read the current _X_N.del file for
that segment).
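
Concretely, the check a merge policy would need is something like the
sketch below; countDeletions() is a hypothetical helper that reads the
segment's .del file, which is exactly the costly step:

{code:java}
// Sketch only: would merging these segments produce a segment with
// more than maxMergeDocs live (non-deleted) documents?
boolean mergeWouldExceedMaxMergeDocs(SegmentInfos candidates,
                                     int maxMergeDocs) throws IOException {
  int mergedDocCount = 0;
  for (int i = 0; i < candidates.size(); i++) {
    SegmentInfo info = candidates.info(i);
    // docCount includes deleted docs; subtract deletions to get the
    // number of docs the merged segment would actually contain.
    mergedDocCount += info.docCount - countDeletions(info);
  }
  return mergedDocCount > maxMergeDocs;
}
{code}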

Maybe we should store the deleteCount in the SegmentInfo (and save it
to segments_N); we've discussed this in the past, e.g. you would also
want it when making a merge policy that takes deletes into account
(favors merging segments that have many deletes).
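
A rough sketch of that idea (field names illustrative only; the real
SegmentInfo and the segments_N format would need a version bump):

{code:java}
// Illustrative only: SegmentInfo carrying a persisted delete count,
// so merge policies can consult deletions without opening .del files.
class SegmentInfo {
  String name;
  int docCount;     // total docs written to the segment, incl. deleted
  int deleteCount;  // proposed: updated on delete, saved in segments_N

  int numLiveDocs() {
    return docCount - deleteCount;
  }
}
{code}

A deletes-aware merge policy could then simply rank candidate segments
by deleteCount / docCount without any extra I/O.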

Note also that making the similar change for "maxMergeMB" is not
really feasible: you can't compute how many MB a merged segment will
be from the input segments without just doing the merge and then
checking the resulting size.  Maybe we could make a coarse
approximation by summing the input sizes of the segments (usually an
upper bound on the final segment size), perhaps with a proportional
reduction of this size based on delete count.  Still, it would be
approximate and you could wind up with a segment larger than
maxMergeMB.
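
For what it's worth, that coarse estimate might look like the sketch
below; sizeInBytes() is assumed to report the segment's on-disk size,
and countDeletions() is the same hypothetical helper as above:

{code:java}
// Coarse upper-bound estimate of a merged segment's size: sum the
// input sizes, discounted by each segment's deleted-doc fraction.
// Deliberately approximate; the true size is only known post-merge.
long estimateMergedSizeBytes(SegmentInfos candidates) throws IOException {
  long totalBytes = 0;
  for (int i = 0; i < candidates.size(); i++) {
    SegmentInfo info = candidates.info(i);
    long size = info.sizeInBytes();      // assumed accessor
    int deletes = countDeletions(info);  // hypothetical helper
    if (info.docCount > 0) {
      size = (long) (size * (1.0 - ((double) deletes / info.docCount)));
    }
    totalBytes += size;
  }
  return totalBytes;
}
{code}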


> Problems with maxMergeDocs parameter
> ------------------------------------
>
>                 Key: LUCENE-1012
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1012
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Michael Busch
>            Priority: Minor
>             Fix For: 2.3
>
>
> I found two possible problems regarding IndexWriter's maxMergeDocs value. I'm using the following code to test maxMergeDocs:
> {code:java} 
>   public void testMaxMergeDocs() throws IOException {
>     final int maxMergeDocs = 50;
>     final int numSegments = 40;
>     
>     MockRAMDirectory dir = new MockRAMDirectory();
>     IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
>     writer.setMergePolicy(new LogDocMergePolicy());
>     writer.setMaxMergeDocs(maxMergeDocs);
>     Document doc = new Document();
>     doc.add(new Field("field", "aaa", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>     for (int i = 0; i < numSegments * maxMergeDocs; i++) {
>       writer.addDocument(doc);
>       //writer.flush();      // uncomment to avoid the DocumentsWriter bug
>     }
>     writer.close();
>     
>     new SegmentInfos.FindSegmentsFile(dir) {
>       protected Object doBody(String segmentFileName) throws CorruptIndexException, IOException {
>         SegmentInfos infos = new SegmentInfos();
>         infos.read(directory, segmentFileName);
>         for (int i = 0; i < infos.size(); i++) {
>           assertTrue(infos.info(i).docCount <= maxMergeDocs);
>         }
>         return null;
>       }
>     }.run();
>   }
> {code} 
>   
> - It seems that DocumentsWriter does not obey the maxMergeDocs parameter. If I don't flush manually, then the index only contains one segment at the end and the test fails.
> - If I flush manually after each addDocument() call, then the index contains more segments. But still, there are segments that contain more docs than maxMergeDocs, e.g. 55 vs. 50. The javadoc in IndexWriter says:
> {code:java}
>    /**
>    * Returns the largest number of documents allowed in a
>    * single segment.
>    *
>    * @see #setMaxMergeDocs
>    */
>   public int getMaxMergeDocs() {
>     return getLogDocMergePolicy().getMaxMergeDocs();
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

