Date: Thu, 5 Apr 2012 11:22:07 -0700
Message-ID: 
 <CALY=cQB=GXRLwhcOR1=BSEZ8T2tf9FqPFFk-=uhUcYqwuJR4Qw@mail.gmail.com>
Subject: Slow merging after upgrading to 3.5
From: Ivan Brusic <ivan@brusic.com>
To: java-user@lucene.apache.org

I recently migrated a legacy Lucene application from 2.3 to 3.5. The
code was filled with numerous custom
filters/analyzers/similarities/collectors. It took about a week to
convert all the token streams to the new API and remove deprecated
classes. Most importantly, there is a collector that enables faceting,
which I suspect might be taken from Solr (I never looked into the Solr
source code).

The index is built as a batch process with no searchers using it. The
index contains 30+ million documents for a total size of around 45 GB.
The bulk of the indexing time is spent in the database calls. The build
time using Lucene 2.3 was around 10 hours.

The code has a collector similar to TimeLimitingCollector (sadly,
there is a ton of custom-built code) which collects documents until it
reaches a limit. The way the current index is created, it is essential
that the most important documents (based on business rules) exist at
the beginning of the index (insertion order) to ensure that they appear
even if the collector times out. The first issue we noticed is that
this distribution (which I admit is a hack) is no longer "correct"
using the default TieredMergePolicy. We switched back to the existing
setup of LogByteSizeMergePolicy with a merge factor of 2. I am assuming
the low merge factor is responsible for creating indices that respect
the insertion order of documents. Documents are now in the correct
order, but an optimize (aka forceMerge(1)) takes around 5 hours, where
previously there was no slowdown. If we remove the forceMerge, the
commit takes just as long.
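
To make the ordering dependency concrete, here is a minimal plain-Java
model of the behaviour (no Lucene dependency). It is only a sketch: the
class and method names are hypothetical, and a fixed document budget
stands in for the real time limit so that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a collector that gives up early. A document budget
// replaces the real time limit; all names here are hypothetical.
class BudgetedCollectorSketch {

    // Visits matching doc IDs in the order given (increasing doc ID, i.e.
    // insertion order) and stops once `budget` documents were collected.
    static List<Integer> collect(int[] matchingDocIds, int budget) {
        List<Integer> collected = new ArrayList<Integer>();
        for (int docId : matchingDocIds) {
            if (collected.size() >= budget) {
                break; // "timed out": every later doc ID is silently dropped
            }
            collected.add(docId);
        }
        return collected;
    }
}
```

With matches {0, 3, 7, 12, 40} and a budget of 3, collect returns
[0, 3, 7]; documents 12 and 40 never surface. That is why the important
documents have to keep the lowest doc IDs after merging.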

It is difficult to iterate through different settings since waiting
14-15 hours between tests to see the results is too long. What is the
best way to create an optimized index that keeps documents in
insertion order, with the earliest-inserted documents at the
beginning? The right answer would be to write better queries, but none
of the authors of this jumbled legacy code base are around and we want
to avoid rocking the boat on the query side since the existing search
results are satisfactory.
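
For reference, the writer setup described above amounts to roughly the
following. This is a sketch against the Lucene 3.5 API, not the actual
application code: the index path, the StandardAnalyzer and the class
name are placeholders.

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

class BatchIndexSketch {
    public static void main(String[] args) throws IOException {
        // Placeholder path and analyzer; the real build uses custom analyzers.
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        IndexWriterConfig conf = new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));

        // Replace the 3.5 default (TieredMergePolicy) with the old setup:
        // LogByteSizeMergePolicy and a merge factor of 2.
        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setMergeFactor(2);
        conf.setMergePolicy(mergePolicy);

        IndexWriter writer = new IndexWriter(dir, conf);
        try {
            // ... add documents here, most important ones first ...
            writer.forceMerge(1); // the step that now takes around 5 hours
        } finally {
            writer.close();
        }
    }
}
```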

Cheers,

Ivan
