lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Slow merging after upgrading to 3.5
Date Thu, 05 Apr 2012 18:36:42 GMT
I'm assuming this is a "build once and never change" index...?  Else,
it sounds like you should never run forceMerge...

To preserve insertion order you just need to use one of the
Log*MergePolicy (which you are already doing).  Merge factor doesn't
affect this...
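
For example (rough sketch; "iwc" here stands for whatever
IndexWriterConfig you are already building), insertion order survives a
larger merge factor because the Log* policies only ever merge adjacent
segments:

  LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
  mp.setMergeFactor(10);   // higher than 2 is fine; adjacent segments
                           // are merged in order, so documents keep
                           // their insertion order
  iwc.setMergePolicy(mp);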

For the fastest way to get to a single-segment index.... use
NoMergePolicy while indexing the documents, and set the largest RAM
buffer you can afford.  This will create tons of segments in the index
dir, which is fine as long as you will not open a reader on it...
then:

Open a new IW, with Log*MergePolicy, set a highish (maybe 30)
mergeFactor, and call forceMerge(1).  You may need to cutover to
SerialMergeScheduler...
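
Untested sketch of both passes against the 3.5 API (the path, analyzer,
and RAM buffer size below are just placeholders):

  import java.io.File;
  import java.io.IOException;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.LogByteSizeMergePolicy;
  import org.apache.lucene.index.NoMergePolicy;
  import org.apache.lucene.index.SerialMergeScheduler;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class TwoPassBuild {
    public static void main(String[] args) throws IOException {
      Directory dir = FSDirectory.open(new File("/path/to/index"));

      // Pass 1: index with merging disabled and a big RAM buffer.
      IndexWriterConfig buildCfg =
          new IndexWriterConfig(Version.LUCENE_35,
                                new StandardAnalyzer(Version.LUCENE_35));
      buildCfg.setMergePolicy(NoMergePolicy.NO_COMPOUND_FILES);
      buildCfg.setRAMBufferSizeMB(1024.0);  // as large as you can afford
      IndexWriter writer = new IndexWriter(dir, buildCfg);
      // ... addDocument() calls here, in insertion order ...
      writer.close();                       // leaves many small segments

      // Pass 2: reopen with a Log* policy and force-merge to one segment.
      IndexWriterConfig mergeCfg =
          new IndexWriterConfig(Version.LUCENE_35,
                                new StandardAnalyzer(Version.LUCENE_35));
      LogByteSizeMergePolicy lmp = new LogByteSizeMergePolicy();
      lmp.setMergeFactor(30);               // highish merge factor
      mergeCfg.setMergePolicy(lmp);
      mergeCfg.setMergeScheduler(new SerialMergeScheduler()); // if CMS stalls
      IndexWriter merger = new IndexWriter(dir, mergeCfg);
      merger.forceMerge(1);
      merger.close();
    }
  }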

Mike McCandless

http://blog.mikemccandless.com

On Thu, Apr 5, 2012 at 2:22 PM, Ivan Brusic <ivan@brusic.com> wrote:
> I recently migrated a legacy Lucene application from 2.3 to 3.5. The
> code was filled with numerous custom
> filters/analyzers/similarities/collectors. It took about a week to
> convert all the token streams to the new API and remove deprecated
> classes.
> Most importantly, there is a collector that enables faceting, which I
> suspect might be taken from Solr (never looked into the Solr source
> code).
>
> The index is built as a batch process with no searchers using it. The
> index contains 30+ million documents for a total size of around 45 GB.
> The bulk of the indexing time is spent on database calls. The build time
> using Lucene 2.3 was around 10 hours.
>
> The code has a collector similar to TimeLimitingCollector (sadly,
> there is a ton of custom-built code) which collects documents until it
> reaches a limit. Because of the way the current index is created, it is
> essential that the most important documents (based on business rules)
> exist at the beginning of the index (insertion order) to ensure that
> they appear
> even if the collector times out. The first issue we noticed is that
> this distribution (which I admit is a hack) is no longer "correct"
> using the default TieredMergePolicy. We switched back to the existing
> setup of LogByteSizeMergePolicy with a merge factor of
> 2. I am assuming the low merge factor is responsible for creating
> indices that respect the insertion order of documents. Documents are
> now in the correct order, but an optimize (aka forceMerge(1)) takes
> around 5 hours where previously there was no slowdown. If we remove the
> forceMerge, the commit takes just as long.
>
> It is difficult to iterate through different settings since waiting
> 14-15 hours between tests to see the results is too long. What is the
> best way to create an optimized index that places documents at the
> beginning according to insertion order? The answer should be to write
> better queries, but none of the authors of this jumbled legacy code
> base are
> around and we want to avoid rocking the boat on the query side since
> the existing search results are satisfactory.
>
> Cheers,
>
> Ivan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

