jackrabbit-users mailing list archives

From "Shaun Barriball" <sbarr...@yahoo.co.uk>
Subject RE: Should Lucene index file size reduce when items are deleted?
Date Thu, 11 Jun 2009 11:37:15 GMT
Hi Ard,
Firstly - hands up - I'm ONLY familiar with the purpose of Lucene within
JackRabbit, not with the key factors which affect its performance.

I should probably have provided some additional background. Operationally
we've been seeing:
 * some JCR queries gradually slowing over time as the volume of content
increases
 * increased locking contention within ItemManager when manipulating large
numbers of nodes
 * increased disk IO and CPU IO wait in some scenarios, which correlates
with lots of threads reading the Lucene indexes

Our theory, based on the symptoms and some searching, was that the memory
footprint of Lucene for our application was increasingly contributing to the
above... and we naively used the Lucene size on disk as a relative measure
of the memory footprint that a particular JackRabbit workspace would have.

Disk Space Issue
-----------------
From your comments and Marcel's response I'm now clear that the size on disk
is not 'necessarily' a concern for performance, except that it IS:
 a) a potential consideration in terms of the amount of disk IO for 'bloated'
index files and OS resources, and
 b) an issue for disk space when you're deploying 100 workspaces and the
index sizes keep growing regardless of the archiving strategy. (I'll ignore
explaining why we may have that many workspaces.)

Performance Issue
------------------
So going back to the original background issue: you've provided great
insight into the known performance issues with the JackRabbit/Lucene
pairing, and you've referenced JackRabbit 1.4.5. The key question is, short
of us reviewing the Lucene-related commits to JackRabbit, does the latest
1.5.x release contain significant improvements over 1.4.5 in this area?

Regards,
Shaun

-----Original Message-----
From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com] 
Sent: 11 June 2009 12:00
To: users@jackrabbit.apache.org
Subject: Re: Should Lucene index file size reduce when items are deleted?

Hello,

780 MB for a Lucene index is not really big. Obviously, a larger filesystem
index won't make Lucene faster, but at the same time performance should not
be affected that much either. Why do you think optimizing would save you
that much?

Also this part:

"The underlying concern primarily is performance and keeping the licence
indexes small enough to fit in 100% in memory, disk space being a secondary
consideration."

I do not understand what you mean by keeping the 'licence indexes' in
memory - the Lucene indexes, presumably?

Furthermore, there certainly are memory issues. These, though, are IMO
general Lucene issues: Solr has similar ones.

For example: if you have 1,000,000 documents and run a query that sorts on
the title, and a title String is on average around 1 KB in memory, then
when sorting on the title, first *all* title terms are read into an array
in memory: 1M * 1 KB = 1 GB of memory... and as Lucene readers are never
reopened in Jackrabbit, this memory is *not* released (the only exception
is when index segments are merged).

Similar reasoning holds for the memory usage of some other parts. I think
in version 1.4.5 you can even exhaust all memory by sorting on non-existent
properties, as String arrays the length of the Lucene maxDoc would be
created containing only null values... something like
String[] s = new String[10000000] (suppose you have 1,000,000 documents,
where on average one document results in 10 nodes (versioning, subtree
nodes, etc.)).
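
To make that concrete, here is a minimal sketch (Lucene 2.x API; the index
path and field names are hypothetical, and string sorting actually goes
through FieldCache.getStringIndex, but getStrings shows the same
per-document allocation) of the arrays a sorted search populates:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.store.FSDirectory;

    public class SortMemorySketch {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                    FSDirectory.getDirectory("/path/to/workspace/index"));

            // Sorting on a string field fills a FieldCache entry with one
            // value per document; at ~1M docs and ~1 KB per title this
            // pins roughly 1 GB for the lifetime of the reader.
            String[] titles = FieldCache.DEFAULT.getStrings(reader, "title");

            // Sorting on a property that no document has still allocates
            // a String[maxDoc] full of nulls, as described above.
            String[] missing = FieldCache.DEFAULT.getStrings(reader, "noSuchProp");

            System.out.println("maxDoc=" + reader.maxDoc()
                    + ", titles=" + titles.length
                    + ", missing=" + missing.length);
            reader.close();
        }
    }

The cache is keyed on the reader, so the memory only goes away when the
reader is closed or collected, which in Jackrabbit effectively means never.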

I hope to have time to work on this in a couple of months. This is IMO the
actual issue: 780 MB is not big, and optimizing won't give you what you
would expect... at least, that is what I think, and am convinced of.

Ard

PS: we have Lucene indexes with JackRabbit of up to 10 GB. The size is not
the issue.

On Thu, Jun 11, 2009 at 10:58 AM, Shaun Barriball <sbarriba@yahoo.co.uk> wrote:
> Hi Marcel,
>
> Marcel wrote:
> "In general short living content is very well purged...."
> I guess it depends on the what constitutes "short lived" as that's a
> relative term. I'm guessing minutes, hours or a few days = "short lived".
>
> As a real world example, much of our content is editorial which lives for
> 4, 12, maybe 24 weeks in some cases. We recently decreased the time to
> live for archiving (deletion) on larger repositories by 50% (based on
> usage analysis). In one case we went from 200,000 editorial items
> (composites of 10s of JCR nodes) down to 70,000 editorial items. The
> Lucene indexes stayed around the same physical size pre and post archive
> at 780MB on disk... hence the original post.
>
> Marcel wrote:
> "- introduce a method that lets you trigger an index optimization (as
> you suggested)
> - introduce a threshold for deleted nodes to live nodes ratio where an
> index segment is automatically optimized
>
> at the moment I prefer the latter because it does not require manual
> interaction. WDYT?"
>
> We'd love to have some insight into the state of the Lucene indexes as
> well as the ability to influence that state in terms of housekeeping.
> JMX, as suggested by James, would seem to be the natural way to do that
> (as it integrates nicely with enterprise monitoring solutions). I think
> this could be part of a wider instrumentation strategy discussion on
> Jackrabbit, looking at caching et al.
>
> Automated optimization based on a configured threshold is very useful
> provided that it has a low overhead - we know that things like Java
> garbage collection can hurt performance if not configured correctly. So
> definitely "yes" to your "introduce a method" question and "possibly" to
> the automated solution if we know it will be light.
>
> Regards,
> Shaun
>
>
> -----Original Message-----
> From: mreutegg@day.com [mailto:mreutegg@day.com] On Behalf Of Marcel
> Reutegger
> Sent: 10 June 2009 08:12
> To: users
> Subject: Re: Should Lucene index file size reduce when items are deleted?
>
> Hi,
>
> 2009/6/9 Shaun Barriball <sbarriba@yahoo.co.uk>:
>> Hi Alex et al,
>> Noted on the performance comment which prompts the question:
>>  * what's the best way to monitor Lucene memory usage and performance to
>> determine bad queries or bloated indexes - in a MySQL world you would use
>> the slow query log?
>
> there's a debug log message for
> org.apache.jackrabbit.core.query.QueryImpl that includes the statement
> and the time it took to execute it. if you direct that into a separate
> log file and apply some tail/grep magic you should be able to get a log
> that shows slow queries.
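>
> for illustration, a minimal log4j sketch (the appender name and log file
> path are just examples, not Jackrabbit defaults):
>
>   # send QueryImpl debug output (statement + execution time) to its own file
>   log4j.logger.org.apache.jackrabbit.core.query.QueryImpl=DEBUG, queries
>   log4j.additivity.org.apache.jackrabbit.core.query.QueryImpl=false
>   log4j.appender.queries=org.apache.log4j.FileAppender
>   log4j.appender.queries.File=logs/jcr-queries.log
>   log4j.appender.queries.layout=org.apache.log4j.PatternLayout
>   log4j.appender.queries.layout.ConversionPattern=%d %m%n
>
> tail/grep on that file then gives you a poor man's slow query log.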
>
>> And following up on the Lucene index size question.
>> * Is there a way to force Jackrabbit to clean up the Lucene indexes -
>> assume we're looking to consolidate disk space for example - rather than
>> just waiting for the nodes to merge?
>
> no, there's currently no such tool. however I consider this a useful
> enhancement.
>
>> For example:
>> * Is there a way to ask JackRabbit to call
>> http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/index/IndexWriter.html#optimize()?
>
> no, there isn't.
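>
> (for illustration only, a hedged sketch of what such a tool would do with
> plain Lucene 2.x. note that jackrabbit's index directory actually holds
> several sub-indexes plus a redo log, so this is not something to run
> against a live repository; the path is hypothetical and assumes the
> repository is shut down and the index backed up first:)
>
>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>   import org.apache.lucene.index.IndexWriter;
>
>   public class OfflineOptimize {
>       public static void main(String[] args) throws Exception {
>           // open one existing sub-index (create = false) and merge all
>           // of its segments into one, purging docs marked as deleted
>           IndexWriter writer = new IndexWriter(
>                   "/path/to/workspace/index/_0",
>                   new StandardAnalyzer(), false);
>           writer.optimize();
>           writer.close();
>       }
>   }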
>
>> * If we delete the "index" directory will JackRabbit happily reconstruct
>> a consolidated index from scratch?
>
> yes, it will. that's currently the only way to get an index with all
> segments optimized.
>
>> Some of the content in our JackRabbit repository is high volume and
>> fairly transient, lasting only a few weeks before being deleted, hence
>> the index question is more relevant for us.
>
> In general short-lived content is purged very well (not just marked
> as deleted) from the index because the merge policy is generational.
> the longer an item lives, the harder it gets to purge it from the
> index. it's somewhat similar to garbage collection in java: once an
> object is in perm space it is more expensive to collect it.
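>
> if you want to experiment, the merge behaviour can be tuned on the
> SearchIndex element in workspace.xml; a hedged sketch (the values are
> illustrative, not recommendations):
>
>   <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>     <param name="path" value="${wsp.home}/index"/>
>     <!-- a lower mergeFactor merges segments more often, so deleted
>          documents are purged sooner, at the cost of more indexing IO -->
>     <param name="mergeFactor" value="5"/>
>     <param name="minMergeDocs" value="100"/>
>   </SearchIndex>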
>
> I currently see two options for how jackrabbit could better handle your case.
>
> - introduce a method that lets you trigger an index optimization (as
> you suggested)
> - introduce a threshold for deleted nodes to live nodes ratio where an
> index segment is automatically optimized
>
> at the moment I prefer the latter because it does not require manual
> interaction. WDYT?
>
> regards
>  marcel
>
>> Regards,
>> Shaun
>>
>>
>> -----Original Message-----
>> From: Alexander Klimetschek [mailto:aklimets@day.com]
>> Sent: 08 June 2009 13:12
>> To: users@jackrabbit.apache.org
>> Subject: Re: Should Lucene index file size reduce when items are deleted?
>>
>> On Mon, Jun 8, 2009 at 1:41 PM, Shaun Barriball <sbarriba@yahoo.co.uk> wrote:
>>> Thanks Marcel.
>>>
>>> From a performance and memory usage perspective, should we see the
>>> benefits of the deletion immediately or is the Lucene performance
>>> linked to the index file sizes (and therefore reliant on the merge
>>> happening)?
>>
>> Indexing structures such as the Lucene fulltext index tend to use more
>> disk space to drastically enhance access (query) performance.
>>
>> space performance != processing time performance
>>
>> Regards,
>> Alex
>>
>> --
>> Alexander Klimetschek
>> alexander.klimetschek@day.com
>>
>>
>
>

