Subject: Slow merging after upgrading to 3.5
From: Ivan Brusic
To: java-user@lucene.apache.org
Date: Thu, 5 Apr 2012 11:22:07 -0700

I recently migrated a legacy Lucene application from 2.3 to 3.5. The code was filled with numerous custom filters/analyzers/similarities/collectors. It took about a week to convert all the token streams to the new API and remove deprecated classes. Most importantly, there is a collector that enables faceting, which I suspect might be taken from Solr (I never looked into the Solr source code).

The index is built as a batch process with no searchers using it. The index contains 30+ million documents for a total size of around 45 GB. The bulk of the indexing time is spent in the database calls. The build time using Lucene 2.3 was around 10 hours.

The code has a collector similar to TimeLimitingCollector (sadly, there is a ton of custom-built code) which collects documents until it reaches a limit. The way the current index is created, it is essential that the most important documents (based on business rules) exist at the beginning of the index (insertion order) to ensure that they appear even if the collector times out.

The first issue we noticed is that this distribution (which I admit is a hack) is no longer "correct" using the default TieredMergePolicy. We switched back to the existing setup of LogByteSizeMergePolicy with a merge factor of 2. I am assuming the low merge factor is responsible for creating indices that respect the insertion order of documents. Documents are now in the correct order, but an optimize (aka forceMerge(1)) takes around 5 hours, where previously there was no slowdown.
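
To make the ordering question concrete, here is a toy model in plain Java; it is not Lucene code and nothing in it comes from the application described above. A "segment" is just a list of document sequence numbers, and the class and method names (MergeOrderSketch, merge, docOrder, isInsertionOrder) are invented for illustration. The model also assumes that a merged segment takes the slot of the first segment merged away. It shows that merging neighbouring segments keeps documents in insertion order, while merging segments that are not neighbours does not. The TieredMergePolicy documentation states that it may merge non-adjacent segments, whereas the log merge policies merge only adjacent ones, so that difference, rather than the merge factor alone, may be what the observed ordering depends on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only, not Lucene code: a "segment" is a list of document
// sequence numbers, and an "index" is an ordered list of segments.
class MergeOrderSketch {

    // Merge the segments at the given positions into a single segment.
    // Assumption of this model: the merged segment takes the slot of the
    // first (lowest-position) segment that was merged away.
    static List<List<Integer>> merge(List<List<Integer>> index, int... picks) {
        int[] sorted = picks.clone();
        Arrays.sort(sorted);

        // Concatenate the picked segments in index order.
        List<Integer> merged = new ArrayList<>();
        for (int p : sorted) {
            merged.addAll(index.get(p));
        }

        // Rebuild the index: merged segment in the first picked slot,
        // other picked slots dropped, everything else kept in place.
        List<List<Integer>> result = new ArrayList<>();
        for (int i = 0; i < index.size(); i++) {
            if (i == sorted[0]) {
                result.add(merged);
            } else if (Arrays.binarySearch(sorted, i) < 0) {
                result.add(index.get(i));
            }
        }
        return result;
    }

    // Flatten the index into the order a collector would visit documents.
    static List<Integer> docOrder(List<List<Integer>> index) {
        List<Integer> docs = new ArrayList<>();
        for (List<Integer> segment : index) {
            docs.addAll(segment);
        }
        return docs;
    }

    // True if documents appear in strictly increasing insertion order.
    static boolean isInsertionOrder(List<Integer> docs) {
        for (int i = 1; i < docs.size(); i++) {
            if (docs.get(i) <= docs.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<Integer>> index = Arrays.asList(
                Arrays.asList(0, 1), Arrays.asList(2, 3),
                Arrays.asList(4, 5), Arrays.asList(6, 7));

        // Adjacent merge (segments 1 and 2): order is preserved.
        System.out.println(docOrder(merge(index, 1, 2))); // [0, 1, 2, 3, 4, 5, 6, 7]

        // Non-adjacent merge (segments 0 and 2): order is broken.
        System.out.println(docOrder(merge(index, 0, 2))); // [0, 1, 4, 5, 2, 3, 6, 7]
    }
}
```

In this model, any sequence of adjacent merges ends with documents in insertion order regardless of how many segments are merged at once, which is why the merge factor by itself does not appear to be the deciding setting.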

If we remove the forceMerge, the commit takes just as long. It is difficult to iterate through different settings, since waiting 14-15 hours between tests to see the results is too long. What is the best way to create an optimized index that preserves insertion order, so that the earliest-inserted documents stay at the beginning?

The answer should be to write better queries, but none of the authors of this legacy jumbled code base are around, and we want to avoid rocking the boat on the query side since the existing search results are satisfactory.

Cheers,

Ivan