Subject: Re: Should Lucene index file size reduce when items are deleted?
From: Marcel Reutegger <mreutegg@day.com>
To: users@jackrabbit.apache.org
Date: Thu, 11 Jun 2009 14:07:52 +0200

Hi,

On Thu, Jun 11, 2009 at 10:58, Shaun Barriball wrote:
> Hi Marcel,
>
> Marcel wrote:
> "In general short living content is very well purged...."
> I guess it depends on what constitutes "short lived" as that's a relative
> term. I'm guessing minutes, hours or a few days = "short lived".

It is certainly a relative term, though not so much in terms of time as
relative to the other modifications that happen on the workspace. That is,
it doesn't matter whether an item lives for a few days or only a few
seconds before it gets removed; what matters more is what else happens
during that time. If nothing else is changed in the workspace, the space
the node occupies in the index is quickly freed in both cases. However, if
lots of other changes happen during that time, it will take longer until
the node is actually purged from the index.
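To illustrate, this is roughly how you could check the deleted-to-total
document ratio of an index segment yourself with the plain Lucene 2.x API.
The path and the directory layout are assumptions about your setup: a
Jackrabbit workspace index consists of several Lucene sub-indexes, so you
would point this at one of the sub-index directories, ideally on a copy
taken while the repository is stopped.

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class DeletedRatio {

    public static void main(String[] args) throws Exception {
        // hypothetical location: one sub-index of a workspace, e.g.
        // <repository>/workspaces/default/index/_0 -- adjust to your layout
        File indexDir = new File(args[0]);

        IndexReader reader = IndexReader.open(FSDirectory.getDirectory(indexDir));
        try {
            int maxDoc = reader.maxDoc();    // live docs plus deleted docs still taking up space
            int numDocs = reader.numDocs();  // live (searchable) docs only
            int deleted = maxDoc - numDocs;
            double ratio = maxDoc == 0 ? 0.0 : (double) deleted / maxDoc;

            System.out.println("live=" + numDocs + ", deleted=" + deleted
                    + ", deleted/total=" + ratio);
            // a persistently high ratio means the segments still carry a lot
            // of dead space that only a merge/optimize will reclaim
        } finally {
            reader.close();
        }
    }
}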
> As a real world example, much of our content is editorial which lives for
> 4, 12, maybe 24 weeks in some cases. We recently decreased the time to
> live for the archiving (deletion) for larger repositories by 50% (based on
> usage analysis). In one case we went from 200,000 editorial items
> (composites of tens of JCR nodes) down to 70,000 editorial items. The
> Lucene indexes stayed around the same physical size pre and post archive,
> at 780MB on disk, hence the original post.
>
> Marcel wrote:
> "- introduce a method that lets you trigger an index optimization (as you
> suggested)
> - introduce a threshold for the deleted-to-live node ratio at which an
> index segment is automatically optimized
>
> at the moment I prefer the latter because it does not require manual
> interaction. WDYT?"
>
> We'd love to have some insight into the state of the Lucene indexes as
> well as the ability to influence that state in terms of housekeeping. JMX,
> as suggested by James, would seem to be the natural way to do that, as it
> integrates nicely with enterprise monitoring solutions. I think this could
> be part of a wider instrumentation strategy discussion on Jackrabbit,
> looking at caching et al.

OK, thanks for the feedback. I'll create a JIRA issue for that.

> Automated optimization based on a configured threshold is very useful
> provided that it has a low overhead - we know that things like Java
> garbage collection can hurt performance if not configured correctly. So
> definitely "yes" to your "introduce a method" question and "possibly" to
> the automated solution if we know it will be light.

The solution I have in mind would integrate with the existing background
merging of the indexes. That merging already runs in a background thread
and does not have a significant effect on performance.

regards
 marcel
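P.S. To make the JMX idea a bit more concrete, here is a purely
hypothetical sketch of the kind of MBean such instrumentation could
expose. None of these names exist in Jackrabbit; they only illustrate the
statistics and the manual optimize trigger discussed in this thread.

/**
 * Hypothetical management interface for a workspace search index.
 * Illustrative only -- not part of any Jackrabbit release.
 */
public interface SearchIndexStatsMBean {

    /** number of live (searchable) documents in the index */
    int getNumDocs();

    /** documents that are deleted but still occupy space in the segments */
    int getNumDeletedDocs();

    /** deleted / (live + deleted); an automatic optimize could act on this */
    double getDeletedRatio();

    /** size of the index directory on disk, in bytes */
    long getIndexSizeBytes();

    /** manually trigger an optimize/merge of the index segments */
    void optimize();
}

An implementation would be registered with the platform MBeanServer under
an object name such as org.apache.jackrabbit:type=SearchIndex,workspace=default,
so the statistics show up in JConsole or any enterprise monitoring tool.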