Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Message-ID: <32219021.45481273856928935.JavaMail.jira@thor>
Date: Fri, 14 May 2010 13:08:48 -0400 (EDT)
From: "Michael McCandless (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2455) Some house cleaning in addIndexes*
In-Reply-To: <10732753.13841273550429349.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867567#action_12867567 ] 

Michael McCandless commented on LUCENE-2455:
--------------------------------------------

bq.  I understand why it's called in addIndexes(Dir), because the local segments are also touched. But now in the Reader version, they aren't. So it looked odd to me that we flush whatever is in RAM. 

Yeah maybe we should no longer flush (but still call start/commitTransaction).  I think there may've been a reason to flush first (besides that we were also merging local segments)... but I can't remember it.  If you comment out that assert (and the corresponding assert for deletions) do any tests fail?

bq. I think you said once that addIndexes should have done the merge outside, adding the new segment when it's done?

Yes, I would love to fix this -- it'd mean we would not need the start/commit/rollbackTransaction code.

Ie, we play a dangerous game now, where addIndexes is allowed to muck with the in-memory SegmentInfos before it's complete.  It'd be better if all merging happened outside of its SegmentInfos, and only when addIndexes finished, it'd atomically commit to SegmentInfos.

This would then allow commit() to run immediately, not having to wait for any running addIndexes to finish first.  And we would not need to block add/updateDocument nor deleteDocuments while addIndexes is running.

So, actually, I think in addIndexes(IR...) you should not use the transaction logic at all?  Just do the merge externally & commit in the end?  (And try not flushing as well.).

> Some house cleaning in addIndexes*
> ----------------------------------
>
>                 Key: LUCENE-2455
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2455
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>            Priority: Trivial
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2455_3x.patch
>
>
> Today, the use of addIndexes and addIndexesNoOptimize is confusing - 
> especially on when to invoke each. Also, addIndexes calls optimize() in 
> the beginning, but only on the target index. It also includes the 
> following jdoc statement, which from how I understand the code, is 
> wrong: _After this completes, the index is optimized._ -- optimize() is 
> called in the beginning and not in the end. 
> On the other hand, addIndexesNoOptimize does not call optimize(), and 
> relies on the MergeScheduler and MergePolicy to handle the merges. 
> After a short discussion about that on the list (Thanks Mike for the 
> clarifications!) I understand that there are really two core differences 
> between the two: 
> * addIndexes supports IndexReader extensions
> * addIndexesNoOptimize performs better
> This issue proposes the following:
> # Clear up the documentation of each, spelling out the pros/cons of 
>   calling them clearly in the javadocs.
> # Rename addIndexesNoOptimize to addIndexes
> # Remove optimize() call from addIndexes(IndexReader...)
> # Document that clearly in both, w/ a recommendation to call optimize() 
>   before on any of the Directories/Indexes if it's a concern. 
> That way, we maintain all the flexibility in the API - 
> addIndexes(IndexReader...) allows for using IR extensions, 
> addIndexes(Directory...) is considered more efficient, by allowing the 
> merges to happen concurrently (depending on MS) and also factors in the 
> MP. So unless you have an IR extension, addDirectories is really the one 
> you should be using. And you have the freedom to call optimize() before 
> each if you care about it, or don't if you don't care. Either way, 
> incurring the cost of optimize() is entirely in the user's hands. 
> BTW, addIndexes(IndexReader...) does not use neither the MergeScheduler 
> nor MergePolicy, but rather call SegmentMerger directly. This might be 
> another place for improvement. I'll look into it, and if it's not too 
> complicated, I may cover it by this issue as well. If you have any hints 
> that can give me a good head start on that, please don't be shy :). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org