From "Shaun Barriball" <sbarr...@yahoo.co.uk>
Subject RE: Should Lucene index file size reduce when items are deleted?
Date Thu, 11 Jun 2009 08:58:48 GMT
Hi Marcel,

Marcel wrote:
"In general short living content is very well purged...."
I guess it depends on the what constitutes "short lived" as that's a
relative term. I'm guessing minutes, hours or a few days = "short lived". 

As a real-world example, much of our content is editorial, which lives for
4, 12, or in some cases 24 weeks. Based on usage analysis, we recently
halved the time-to-live before archiving (deletion) on our larger
repositories.
In one case we went from 200,000 editorial items (each a composite of tens
of JCR nodes) down to 70,000. The Lucene indexes stayed at roughly the same
physical size, 780MB on disk, before and after the archive run... hence the
original post.

Marcel wrote:
"- introduce a method that lets you trigger an index optimization (as
you suggested)
- introduce a threshold for deleted nodes to live nodes ratio where an
index segment is automatically optimized

at the moment I prefer the latter because it does not require manual
interaction. WDYT?"

We'd love to have some insight into the state of the Lucene indexes, as
well as the ability to influence that state for housekeeping purposes.
JMX, as suggested by James, would seem the natural way to do that, as it
integrates nicely with enterprise monitoring solutions. This could be part
of a wider instrumentation strategy discussion for Jackrabbit, covering
caching et al.
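
To make that concrete, a bare-bones stats bean might look like the sketch
below. This is purely illustrative - the class and object names are made
up, it is not an existing Jackrabbit API, and it assumes the path points at
a single Lucene index (Jackrabbit actually keeps several sub-indexes under
the workspace "index" directory):

// IndexStatsMBean.java - hypothetical management interface
public interface IndexStatsMBean {
    int getLiveDocs();     // documents still visible to queries
    int getDeletedDocs();  // documents marked deleted but not yet purged
}

// IndexStats.java - reads the counts straight off a Lucene index
import java.io.IOException;
import java.lang.management.ManagementFactory;
import javax.management.ObjectName;
import org.apache.lucene.index.IndexReader;

public class IndexStats implements IndexStatsMBean {
    private final String indexPath; // a single Lucene index directory

    public IndexStats(String indexPath) {
        this.indexPath = indexPath;
    }

    public int getLiveDocs() {
        return count(false);
    }

    public int getDeletedDocs() {
        return count(true);
    }

    private int count(boolean deleted) {
        try {
            IndexReader reader = IndexReader.open(indexPath);
            try {
                return deleted ? reader.maxDoc() - reader.numDocs()
                               : reader.numDocs();
            } finally {
                reader.close();
            }
        } catch (IOException e) {
            return -1; // index not readable right now
        }
    }

    public static void register(String indexPath) throws Exception {
        ManagementFactory.getPlatformMBeanServer().registerMBean(
            new IndexStats(indexPath),
            new ObjectName("org.apache.jackrabbit:type=IndexStats"));
    }
}

Registered like that, the counts would show up in JConsole or any JMX-aware
monitoring tool alongside the usual JVM metrics.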

Automated optimization based on a configured threshold is very useful,
provided it has low overhead - we know that things like Java garbage
collection can hurt performance if not configured correctly. So a definite
"yes" to your "introduce a method" suggestion, and a "possibly" to the
automated solution if we know it will be light.

Regards,
Shaun 


-----Original Message-----
From: mreutegg@day.com [mailto:mreutegg@day.com] On Behalf Of Marcel
Reutegger
Sent: 10 June 2009 08:12
To: users
Subject: Re: Should Lucene index file size reduce when items are deleted?

Hi,

2009/6/9 Shaun Barriball <sbarriba@yahoo.co.uk>:
> Hi Alex et al,
> Noted on the performance comment which prompts the question:
>  * what's the best way to monitor Lucene memory usage and performance to
> spot bad queries or bloated indexes - in a MySQL world you would use the
> slow query log?

There's a debug log message from org.apache.jackrabbit.core.query.QueryImpl
that includes the statement and the time it took to execute. If you direct
that into a separate log file and apply some tail/grep magic, you should
end up with a log that shows the slow queries.
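
With plain log4j that could look something like this (the appender name,
file name, and pattern are just examples):

# route the QueryImpl debug output into its own file (log4j 1.x syntax)
log4j.logger.org.apache.jackrabbit.core.query.QueryImpl=DEBUG, queries
log4j.additivity.org.apache.jackrabbit.core.query.QueryImpl=false
log4j.appender.queries=org.apache.log4j.FileAppender
log4j.appender.queries.File=queries.log
log4j.appender.queries.layout=org.apache.log4j.PatternLayout
log4j.appender.queries.layout.ConversionPattern=%d %m%n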

> And following up on the Lucene index size question:
> * Is there a way to force Jackrabbit to clean up the Lucene indexes -
> assume we're looking to reclaim disk space, for example - rather than
> just waiting for the segments to merge?

No, there's currently no such tool. However, I consider it a useful
enhancement.

> For example:
> * Is there a way to ask Jackrabbit to call
> http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/index/IndexWriter.html#optimize()?

No, there isn't.

> * If we delete the "index" directory, will Jackrabbit happily reconstruct
> a consolidated index from scratch?

Yes, it will. That's currently the only way to get an index with all
segments optimized.
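
If you want to script that, something along these lines would do (a sketch
only: the path assumes the default repository layout, and the repository
must be shut down before you delete anything):

import java.io.File;

public class DropSearchIndex {
    public static void main(String[] args) {
        // default layout assumed: ${rep.home}/workspaces/<name>/index
        File indexDir = new File("repository/workspaces/default/index");
        delete(indexDir);
        // on the next startup Jackrabbit re-indexes the whole workspace,
        // which can take a while on a large repository
    }

    private static void delete(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                delete(child);
            }
        }
        f.delete();
    }
}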

> Some of the content in our Jackrabbit repository is high-volume and
> fairly transient, lasting only a few weeks before being deleted, hence
> the index question is more relevant for us.

In general, short-lived content is purged quite well (not just marked as
deleted) from the index, because the merge policy is generational. The
longer an item lives, the harder it becomes to purge it from the index.
It's somewhat similar to garbage collection in Java: once an object is in
the permanent generation, it is more expensive to collect.
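
For reference, that merge behaviour is tuned through the SearchIndex
parameters in workspace.xml, roughly like this (the values shown are
illustrative, not recommendations):

<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- segments per generation before they are merged up a level -->
  <param name="mergeFactor" value="10"/>
  <!-- smallest segment size considered for merging -->
  <param name="minMergeDocs" value="100"/>
  <!-- segments larger than this are left out of further merges -->
  <param name="maxMergeDocs" value="100000"/>
</SearchIndex>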

I currently see two options for how Jackrabbit could better handle your
case:

- introduce a method that lets you trigger an index optimization (as
you suggested)
- introduce a threshold for the ratio of deleted to live nodes at which
an index segment is automatically optimized

At the moment I prefer the latter because it does not require manual
interaction. WDYT?
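
Roughly, the threshold check would boil down to something like the
following sketch (the constant and the method are made up for
illustration, and the reader would have to come from one of the persistent
sub-indexes):

import org.apache.lucene.index.IndexReader;

public class OptimizeHeuristic {
    // made-up tuning knob: optimize once deleted docs reach 25% of live
    private static final double DELETED_TO_LIVE_THRESHOLD = 0.25;

    // true if the index behind "reader" looks worth optimizing
    public static boolean needsOptimize(IndexReader reader) {
        int live = reader.numDocs();          // documents visible to queries
        int deleted = reader.maxDoc() - live; // tombstoned documents
        if (live == 0) {
            return deleted > 0;               // segment is all tombstones
        }
        return (double) deleted / live > DELETED_TO_LIVE_THRESHOLD;
    }
}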

regards
 marcel

> Regards,
> Shaun
>
>
> -----Original Message-----
> From: Alexander Klimetschek [mailto:aklimets@day.com]
> Sent: 08 June 2009 13:12
> To: users@jackrabbit.apache.org
> Subject: Re: Should Lucene index file size reduce when items are deleted?
>
> On Mon, Jun 8, 2009 at 1:41 PM, Shaun Barriball <sbarriba@yahoo.co.uk> wrote:
>> Thanks Marcel.
>>
>> From a performance and memory usage perspective, should we see the
>> benefits of the deletion immediately, or is Lucene performance linked to
>> the index file sizes (and therefore reliant on the merge happening)?
>
> Indexing structures such as the Lucene fulltext index tend to use more
> disk space to drastically enhance access (query) performance.
>
> space performance != processing time performance
>
> Regards,
> Alex
>
> --
> Alexander Klimetschek
> alexander.klimetschek@day.com
>
>

