lucene-dev mailing list archives

From "John Wang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1634) LogMergePolicy should use the number of deleted docs when deciding which segments to merge
Date Wed, 13 May 2009 16:49:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709002#action_12709002 ]

John Wang commented on LUCENE-1634:
-----------------------------------

RE: implementing custom MergePolicy
Let me describe in detail the problems with implementing a custom MergePolicy:

1) IndexWriter calls certain methods on MergePolicy, e.g. findMergesForOptimize. I believe
these methods are the contract for implementing your own MergePolicy. However, they are
"hidden" from the javadoc, and they are also hidden in the code because they are
package-private. So to implement your own MergePolicy, you have to resort to sneaking
your class into the Lucene package.

2) Not only is set/getUseCompoundFile no longer applicable if LogMergePolicy is not used;
popular methods such as set/getMergeFactor are also applicable only to LogMergePolicy.
(Just to clarify, useCompoundFile is a package-private method on the base MergePolicy
class, so my guess is that set/getUseCompoundFile should be applicable to all implementations
of MergePolicy.)
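
To make the visibility problem in points 1) and 2) concrete, here is a minimal sketch.
HypotheticalPolicy is an invented stand-in, not Lucene's MergePolicy; only the shape of
the problem (an abstract method declared without an access modifier) is taken from the
description above. The reflection check shows that such a method is neither public nor
protected, so a subclass in a different package cannot override it.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

class PackagePrivateCheck {
    // Hypothetical stand-in for an abstract policy class (assumption, not Lucene code).
    static abstract class HypotheticalPolicy {
        // No access modifier: package-private, invisible to subclasses in other packages.
        abstract boolean useCompoundFile();

        // Public: part of the contract that any subclass can see and override.
        public abstract void close();
    }

    // Returns true if the named no-argument method has default (package-private) access.
    static boolean isPackagePrivate(Class<?> type, String methodName) {
        try {
            Method m = type.getDeclaredMethod(methodName);
            int mod = m.getModifiers();
            return !Modifier.isPublic(mod)
                && !Modifier.isProtected(mod)
                && !Modifier.isPrivate(mod);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + methodName, e);
        }
    }
}
```

A subclass placed outside the declaring package cannot implement useCompoundFile() here,
which is why the only workaround is to put the subclass into the same package.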

This brings up another issue with the practice of having to "sneak" classes into a package.
We are looking at making our Lucene code OSGi-compliant, and this becomes a problem because
we cannot have multiple "bundles" exporting the same package. That means I would have to
repackage Lucene to include the classes that I have snuck into some Lucene packages. I would
rather use a standard distribution of the Lucene jar (as suggested/echoed by some Luceners).


> LogMergePolicy should use the number of deleted docs when deciding which segments to merge
> ------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1634
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1634
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Yasuhiro Matsuda
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1634.patch
>
>
> I found that the IndexWriter.optimize(int) method does not pick up large segments with a
> lot of deletes, even when most of the docs are deleted. The existence of such segments
> affected query performance significantly.
> I created an index with 1 million docs, then went over all docs and updated a few thousand
> at a time. I ran optimize(20) occasionally. What I saw were large segments with most of
> their docs deleted. Although these segments did not have valid docs, they remained in the
> directory for a very long time, until more segments of comparable or bigger size were created.
> This is because LogMergePolicy.findMergesForOptimize uses the size of segments but does
> not take the number of deleted documents into consideration when it decides which segments
> to merge. So a simple fix is to use the delete count to calibrate the segment size. I can
> create a patch for this.
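
The calibration idea in the issue description can be sketched as follows. This is not the
attached LUCENE-1634.patch and not Lucene code; the class name DeleteAwareSegmentSize, the
method calibratedSize, and its parameters are assumptions made for illustration. The
sketch scales a segment's size by the fraction of its documents that are still live, so a
large segment that is mostly deletes looks small to a size-based merge selector.

```java
final class DeleteAwareSegmentSize {
    private DeleteAwareSegmentSize() {}

    /**
     * Scales a segment's size by the fraction of its documents that are not deleted.
     *
     * @param sizeInBytes raw on-disk size of the segment
     * @param docCount    total number of documents in the segment, including deleted ones
     * @param delCount    number of deleted documents in the segment
     * @return the size a merge selector would compare, pro-rated by live documents
     */
    static long calibratedSize(long sizeInBytes, int docCount, int delCount) {
        if (docCount <= 0) {
            // Empty segment: nothing to pro-rate, so keep the raw size.
            return sizeInBytes;
        }
        if (delCount < 0 || delCount > docCount) {
            throw new IllegalArgumentException(
                "delCount must be between 0 and docCount, got " + delCount);
        }
        double liveRatio = (double) (docCount - delCount) / docCount;
        return (long) (sizeInBytes * liveRatio);
    }
}
```

With this pro-rating, a segment whose docs are all deleted gets a calibrated size of 0,
so it no longer has to wait for segments of comparable raw size before it is picked up.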

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

