Subject: Re: Should Lucene index file size reduce when items are deleted?
From: Marcel Reutegger <mreutegg@day.com>
To: users@jackrabbit.apache.org
Date: Thu, 11 Jun 2009 14:07:52 +0200

Hi,

On Thu, Jun 11, 2009 at 10:58, Shaun Barriball wrote:
> Hi Marcel,
>
> Marcel wrote:
> "In general short living content is very well purged...."
> I guess it depends on what constitutes "short lived" as that's a relative
> term. I'm guessing minutes, hours or a few days = "short lived".

It is certainly a relative term, though not so much in terms of time as
relative to the other modifications that happen on the workspace. That is,
it doesn't matter whether an item lives for a few days or only a few
seconds before it gets removed; what matters more is what else happens
during that time. If nothing else is changed in the workspace, the space
the node occupies in the index is quickly freed in both cases. However, if
lots of other changes happen during that time, it will take longer until
the node is actually purged from the index.
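To illustrate, this is roughly how you could check the deleted-to-total
document ratio of an index segment yourself with the plain Lucene 2.x API.
The path and the directory layout are assumptions about your setup: a
Jackrabbit workspace index consists of several Lucene sub-indexes, so you
would point this at one of the sub-index directories, ideally on a copy
taken while the repository is stopped.

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class DeletedRatio {

    public static void main(String[] args) throws Exception {
        // hypothetical location: one sub-index of a workspace, e.g.
        // <repository>/workspaces/default/index/_0 -- adjust to your layout
        File indexDir = new File(args[0]);

        IndexReader reader = IndexReader.open(FSDirectory.getDirectory(indexDir));
        try {
            int maxDoc = reader.maxDoc();    // live docs plus deleted docs still taking up space
            int numDocs = reader.numDocs();  // live (searchable) docs only
            int deleted = maxDoc - numDocs;
            double ratio = maxDoc == 0 ? 0.0 : (double) deleted / maxDoc;

            System.out.println("live=" + numDocs + ", deleted=" + deleted
                    + ", deleted/total=" + ratio);
            // a persistently high ratio means the segments still carry a lot
            // of dead space that only a merge/optimize will reclaim
        } finally {
            reader.close();
        }
    }
}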
> As a real world example, much of our content is editorial which lives for
> 4, 12, maybe 24 weeks in some cases. We recently decreased the time to
> live for the archiving (deletion) for larger repositories by 50% (based on
> usage analysis). In one case we went from 200,000 editorial items
> (composites of tens of JCR nodes) down to 70,000 editorial items. The
> Lucene indexes stayed around the same physical size pre and post archive,
> at 780MB on disk, hence the original post.
>
> Marcel wrote:
> "- introduce a method that lets you trigger an index optimization (as you
> suggested)
> - introduce a threshold for the deleted-to-live node ratio at which an
> index segment is automatically optimized
>
> at the moment I prefer the latter because it does not require manual
> interaction. WDYT?"
>
> We'd love to have some insight into the state of the Lucene indexes as
> well as the ability to influence that state in terms of housekeeping. JMX,
> as suggested by James, would seem to be the natural way to do that, as it
> integrates nicely with enterprise monitoring solutions. I think this could
> be part of a wider instrumentation strategy discussion on Jackrabbit,
> looking at caching et al.

OK, thanks for the feedback. I'll create a JIRA issue for that.

> Automated optimization based on a configured threshold is very useful
> provided that it has a low overhead - we know that things like Java
> garbage collection can hurt performance if not configured correctly. So
> definitely "yes" to your "introduce a method" question and "possibly" to
> the automated solution if we know it will be light.

The solution I have in mind would integrate with the existing background
merging of the indexes. That merging already runs in a background thread
and does not have a significant effect on performance.

regards
 marcel
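P.S. To make the JMX idea a bit more concrete, here is a purely
hypothetical sketch of the kind of MBean such instrumentation could
expose. None of these names exist in Jackrabbit; they only illustrate the
statistics and the manual optimize trigger discussed in this thread.

/**
 * Hypothetical management interface for a workspace search index.
 * Illustrative only -- not part of any Jackrabbit release.
 */
public interface SearchIndexStatsMBean {

    /** number of live (searchable) documents in the index */
    int getNumDocs();

    /** documents that are deleted but still occupy space in the segments */
    int getNumDeletedDocs();

    /** deleted / (live + deleted); an automatic optimize could act on this */
    double getDeletedRatio();

    /** size of the index directory on disk, in bytes */
    long getIndexSizeBytes();

    /** manually trigger an optimize/merge of the index segments */
    void optimize();
}

An implementation would be registered with the platform MBeanServer under
an object name such as org.apache.jackrabbit:type=SearchIndex,workspace=default,
so the statistics show up in JConsole or any enterprise monitoring tool.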