Message-ID: <25055766.1191228770843.JavaMail.jira@brutus>
Date: Mon, 1 Oct 2007 01:52:50 -0700 (PDT)
From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1012) Problems with maxMergeDocs parameter
In-Reply-To: <4611634.1191218990862.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531440 ]

Michael McCandless commented on LUCENE-1012:
--------------------------------------------

> - It seems that
DocumentsWriter does not obey the maxMergeDocs
> parameter. If I don't flush manually, then the index only contains
> one segment at the end and the test fails.

This bug actually predates DocumentsWriter: the flushing logic has
never respected maxMergeDocs. I think normally maxMergeDocs is far
larger than maxBufferedDocs.

To fix this we could change the flushing logic to include "# buffered
docs > maxMergeDocs" as one of its flush criteria, if the current
merge policy is a LogMergePolicy.

> - If I flush manually after each addDocument() call, then the index
> contains more segments. But still, there are segments that contain
> more docs than maxMergeDocs, e. g. 55 vs. 50.

This behavior also predates the recent changes (MergePolicy, etc.), eg
the test fails on 2.1 if you flush every 6 docs (whenever "0 == i%6").

Really the current approach is better described as "any segment with
doc count greater than maxMergeDocs will not be merged". We could
just fix the javadocs to match the current approach?

Or, we could change the code to actually work the way the current
javadoc says, ie "no segment with > maxMergeDocs will ever be
created".

Though, changing the code is somewhat tricky: in order to know whether
a segment will have > maxMergeDocs after the merge is done, you must
know the delete count against each of the segments, which is somewhat
costly to compute now (you have to read the current _X_N.del file for
that segment). Maybe we should store the deleteCount in the
SegmentInfo (and save it to segments_N); we've discussed this in the
past, eg, you would also want to do this when making a merge policy
that takes deletes into account (favors merging segments that have
many deletes).

Note also that making the similar change for "maxMergeMB" is not
really feasible: you can't really compute how many MB a merged segment
will be from the input segments without just doing the merge and then
checking the resulting size.
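For the doc-count case, by contrast, the check itself is simple once
per-segment delete counts are at hand. A minimal sketch of that check,
assuming delete counts are already known: SegmentStats and
MergeDocCheck are hypothetical names used only for illustration, not
Lucene classes, and nothing here reads a .del file or a SegmentInfo.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical per-segment stats holder; NOT Lucene's SegmentInfo. */
final class SegmentStats {
    final int docCount;    // docs in the segment, including deleted ones
    final int deleteCount; // docs marked as deleted

    SegmentStats(int docCount, int deleteCount) {
        if (deleteCount < 0 || deleteCount > docCount) {
            throw new IllegalArgumentException("deleteCount out of range");
        }
        this.docCount = docCount;
        this.deleteCount = deleteCount;
    }

    /** Docs that would survive a merge (deleted docs are dropped). */
    int liveDocs() {
        return docCount - deleteCount;
    }
}

/** Hypothetical helper: would merging these segments respect the cap? */
final class MergeDocCheck {
    /** Doc count of the merged segment: the sum of live docs. */
    static int mergedDocCount(List<SegmentStats> segments) {
        int total = 0;
        for (SegmentStats s : segments) {
            total += s.liveDocs();
        }
        return total;
    }

    /** True if the merged segment would hold at most maxMergeDocs docs. */
    static boolean fitsWithin(List<SegmentStats> segments, int maxMergeDocs) {
        return mergedDocCount(segments) <= maxMergeDocs;
    }

    public static void main(String[] args) {
        // Raw doc counts sum to 55, but 5 deletes leave 50 live docs.
        List<SegmentStats> candidates = Arrays.asList(
            new SegmentStats(30, 5),
            new SegmentStats(25, 0));
        System.out.println(mergedDocCount(candidates));
        System.out.println(fitsWithin(candidates, 50));
    }
}
```

The point of the sketch is only that the sum of raw docCount values
overstates the merged size whenever there are deletes, which is why
the delete count would need to be cheap to obtain.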
Maybe we could make a coarse approximation by summing input sizes of
the segments (usually this is an upper bound on final segment size),
maybe doing proportional reduction of this size based on delete count.
Still it would be approximate and you could wind up with a segment
larger than maxMergeMB.

> Problems with maxMergeDocs parameter
> ------------------------------------
>
>                 Key: LUCENE-1012
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1012
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>            Reporter: Michael Busch
>            Priority: Minor
>             Fix For: 2.3
>
>
> I found two possible problems regarding IndexWriter's maxMergeDocs value. I'm using the following code to test maxMergeDocs:
> {code:java}
>   public void testMaxMergeDocs() throws IOException {
>     final int maxMergeDocs = 50;
>     final int numSegments = 40;
>
>     MockRAMDirectory dir = new MockRAMDirectory();
>     IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
>     writer.setMergePolicy(new LogDocMergePolicy());
>     writer.setMaxMergeDocs(maxMergeDocs);
>     Document doc = new Document();
>     doc.add(new Field("field", "aaa", Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
>     for (int i = 0; i < numSegments * maxMergeDocs; i++) {
>       writer.addDocument(doc);
>       //writer.flush();  // uncomment to avoid the DocumentsWriter bug
>     }
>     writer.close();
>
>     new SegmentInfos.FindSegmentsFile(dir) {
>       protected Object doBody(String segmentFileName) throws CorruptIndexException, IOException {
>         SegmentInfos infos = new SegmentInfos();
>         infos.read(directory, segmentFileName);
>         for (int i = 0; i < infos.size(); i++) {
>           assertTrue(infos.info(i).docCount <= maxMergeDocs);
>         }
>         return null;
>       }
>     }.run();
>   }
> {code}
>
> - It seems that DocumentsWriter does not obey the maxMergeDocs parameter. If I don't flush manually, then the index only contains one segment at the end and the test fails.
> - If I flush manually after each addDocument() call, then the index contains more segments. But still, there are segments that contain more docs than maxMergeDocs, e. g. 55 vs. 50. The javadoc in IndexWriter says:
> {code:java}
>   /**
>    * Returns the largest number of documents allowed in a
>    * single segment.
>    *
>    * @see #setMaxMergeDocs
>    */
>   public int getMaxMergeDocs() {
>     return getLogDocMergePolicy().getMaxMergeDocs();
>   }
> {code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org