jackrabbit-users mailing list archives

From Ard Schrijvers <a.schrijv...@onehippo.com>
Subject Re: Should Lucene index file size reduce when items are deleted?
Date Thu, 11 Jun 2009 12:49:41 GMT
Hello Shaun et al,

Quite some time ago I wrote a document (actually the mail being
referenced) on why queries can slow down; see:

http://wiki.apache.org/jackrabbit/Performance

This should give you quite some insight into what might be slow.
Certainly consider queries with an initial path constraint and,
depending on your Jackrabbit version, I think also the
respectDocumentOrder setting.

Now, that document only describes query performance. Another area is
memory usage. There is obviously some correlation with how large your
Lucene index is on disk, but as an indication of Lucene's memory usage
it is not really reliable (not at all, actually) :-)

It is possible to trigger an OOM in the Lucene implementation with just
1,000 nodes if you want. Just add to each node a property mylongtext of
1 MB a piece.

Sort on that property, and watch your memory usage jump to > 1 GB.

Obviously a contrived example, but unfortunately there is not really
much you can do about it except avoid sorting on large property fields.
If you need to sort on the title of 200,000 docs, you are better off
sorting on a short_title (which I would prefer to be an index-only
property defined in indexing_configuration, but I think people have
different opinions on this).
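
A minimal sketch of that contrived scenario against the plain JCR API
(the /oomtest path, the doc node names and the mylongtext property are
made up for illustration; the numbers are only an order of magnitude):

    import javax.jcr.Node;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;

    public class SortOomSketch {

        // Create 1,000 nodes, each carrying a ~1 MB string property.
        static void createNodes(Session session) throws Exception {
            char[] chunk = new char[1024 * 1024];
            java.util.Arrays.fill(chunk, 'x');
            String bigText = new String(chunk);       // roughly 1 MB of text
            Node root = session.getRootNode().addNode("oomtest");
            for (int i = 0; i < 1000; i++) {
                root.addNode("doc" + i).setProperty("mylongtext", bigText);
            }
            session.save();
        }

        // Sorting on the large property makes Lucene pull every value of
        // that field into memory at once.
        static void sortOnBigProperty(Session session) throws Exception {
            QueryManager qm = session.getWorkspace().getQueryManager();
            Query q = qm.createQuery(
                    "/jcr:root/oomtest/* order by @mylongtext",
                    Query.XPATH);
            q.execute();   // heap usage heads towards > 1 GB here
        }
    }

The same sort against a small property such as short_title keeps the
in-memory term array correspondingly small.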

Furthermore, doing date range queries when there are a lot of unique
dates for the property can result in quite some pain.
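
For instance, a query along these lines (the publishDate property and
the date values are again only illustrative) has one index term per
distinct date value in the range to cope with:

    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;

    public class DateRangeSketch {

        // Every distinct value of @publishDate is a separate term in the
        // index, so a wide range over many unique dates gets expensive.
        static void dateRangeQuery(Session session) throws Exception {
            QueryManager qm = session.getWorkspace().getQueryManager();
            Query q = qm.createQuery(
                    "//element(*, nt:unstructured)"
                        + "[@publishDate >= xs:dateTime('2009-01-01T00:00:00.000Z')"
                        + " and @publishDate < xs:dateTime('2009-06-01T00:00:00.000Z')]",
                    Query.XPATH);
            q.execute();
        }
    }

A common mitigation is to store such dates at a coarser resolution (for
example per day), so that far fewer unique terms end up in the index.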

Anyway, I am confident that your gain won't lie in trying to reduce the
Lucene index size: 780 MB is not much. Perhaps the points above, and
the mail linked from http://wiki.apache.org/jackrabbit/Performance,
will give you enough pointers.

Regards Ard

On Thu, Jun 11, 2009 at 1:37 PM, Shaun Barriball<sbarriba@yahoo.co.uk> wrote:
> Hi Ard,
> Firstly - hands-up - I'm ONLY familiar with the purpose of Lucene within
> JackRabbit, not with the key factors which affect its performance.
>
> I should have probably provided some additional background. We've been
> operationally seeing:
>  * some JCR queries gradually slowing over time as volume of content
> increases
>  * increased locking contention within ItemManager manipulating large
> numbers of nodes
>  * increased disk IO and CPU IO wait in some scenarios which correlates with
> lots of threads reading Lucene indexes
>
> Our theory, based on the symptoms and some searching, was that the memory
> footprint of Lucene for our application was increasingly contributing to the
> above....and we naively used the Lucene size on disk as a relative measure
> of the memory footprint that a particular JackRabbit workspace would have.
>
> Disk Space issue
> -----------------
> From your comments and Marcel's response I'm now clear that the size on disk
> is not 'necessarily' a concern for performance except that it IS:
>  a) a potential consideration in terms of amount of disk IO for 'bloated'
> index files and OS resources, and
>  b) it is an issue for disk space when you're deploying 100 workspaces and
> the index sizes are growing regardless of the archiving strategy. (I'll
> ignore explaining why we may have that many workspaces.)
>
> Performance Issue
> ------------------
> So going back to the original background issue, you've provided a great
> insight into known performance issues with the Jackrabbit/Lucene pairing, and
> you've referenced Jackrabbit 1.4.5. So the key question is, short of us
> reviewing any Lucene-related commits to Jackrabbit, does the latest 1.5.*
> release contain significant improvements over 1.4.5 in this area?
>
> Regards,
> Shaun
>
> -----Original Message-----
> From: Ard Schrijvers [mailto:a.schrijvers@onehippo.com]
> Sent: 11 June 2009 12:00
> To: users@jackrabbit.apache.org
> Subject: Re: Should Lucene index file size reduce when items are deleted?
>
> Hello,
>
> 780 MB for a Lucene index is not really big.  Obviously, a larger FS
> index won't make Lucene faster, but at the same time, performance
> should not be affected that much either. Why do you think optimizing
> would save you that much?
>
> Also this part:
>
> "The underlying concern primarily is performance and keeping the licence
> indexes small enough to fit in 100% in memory, disk space being a secondary
> consideration."
>
> I do not understand what you mean by keeping the licence indexes in memory?
>
> Furthermore, there certainly are memory issues, though these are also
> related to general Lucene issues IMO: Solr has similar issues.
>
> For example: if you have 1,000,000 documents, do a query and sort on
> the title. Suppose a title String takes on average around 1 kB in memory.
> When sorting on the title, first *all* title terms are read into an
> array in memory: 1 M * 1 kB = 1 GB of memory. As Lucene readers are never
> reopened in Jackrabbit, this memory is *not* returned (the only
> exception is when index segments merge).
>
> Similar reasoning holds for the memory usage of some other parts. I
> think for version 1.4.5 you can even fill all memory by sorting on
> non-existing properties, as String arrays would be created with the
> length of the Lucene maxDoc, containing only null values: something
> like String[] s = new String[10000000] (suppose you have 1,000,000
> documents, where on average one document results in 10 nodes
> (versioning, subtree nodes, etc.)).
>
> I hope to have time to work on this in a couple of months. This is IMO
> the actual issue: 780 MB is not big, and optimizing won't give you what
> you would expect - at least, that is what I think, and am convinced of.
>
> Ard
>
> PS: We have Lucene indexes with Jackrabbit of up to 10 GB. The size is
> not the issue.
>
> On Thu, Jun 11, 2009 at 10:58 AM, Shaun Barriball<sbarriba@yahoo.co.uk> wrote:
>> Hi Marcel,
>>
>> Marcel wrote:
>> "In general short living content is very well purged...."
>> I guess it depends on the what constitutes "short lived" as that's a
>> relative term. I'm guessing minutes, hours or a few days = "short lived".
>>
>> As a real-world example, much of our content is editorial which lives for 4,
>> 12, maybe 24 weeks in some cases. We recently decreased the time to live for
>> the archiving (deletion) for larger repositories by 50% (based on usage
>> analysis).
>> In one case we went from 200,000 editorial items (composites of 10s of JCR
>> nodes) down to 70,000 editorial items. The Lucene indexes stayed around the
>> same physical size pre and post archive at 780 MB on disk... hence the
>> original post.
>>
>> Marcel wrote:
>> "- introduce a method that lets you trigger an index optimization (as
>> you suggested)
>> - introduce a threshold for deleted nodes to live nodes ratio where an
>> index segment is automatically optimized
>>
>> at the moment I prefer the latter because it does not require manual
>> interaction. WDYT?"
>>
>> We'd love to have some insight into the state of the Lucene indexes as well
>> as the ability to influence that state in terms of housekeeping.
>> JMX, as suggested by James, would seem to be the natural way to do that (as
>> it integrates nicely with enterprise monitoring solutions). I think this
>> could be part of a wider instrumentation strategy discussion on Jackrabbit
>> looking at caching et al.
>>
>> Automated optimization based on a configured threshold is very useful
>> provided that it has a low overhead - we know that things like Java garbage
>> collection can hurt performance if not configured correctly. So definitely
>> "yes" to your "introduce a method" question and "possibly" to the automated
>> solution if we know it will be light.
>>
>> Regards,
>> Shaun
>>
>>
>> -----Original Message-----
>> From: mreutegg@day.com [mailto:mreutegg@day.com] On Behalf Of Marcel
>> Reutegger
>> Sent: 10 June 2009 08:12
>> To: users
>> Subject: Re: Should Lucene index file size reduce when items are deleted?
>>
>> Hi,
>>
>> 2009/6/9 Shaun Barriball <sbarriba@yahoo.co.uk>:
>>> Hi Alex et al,
>>> Noted on the performance comment which prompts the question:
>>>  * what's the best way to monitor Lucene memory usage and performance to
>>> determine bad queries or bloated indexes - in a MySQL world you could use
>>> the slow query log?
>>
>> there's a debug log message for
>> org.apache.jackrabbit.core.query.QueryImpl that includes the statement
>> and the time it took to execute it. If you direct that into a separate
>> log file and apply some tail/grep magic you should be able to get a log
>> that shows the slow queries.
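>>
>> a minimal sketch of one way to wire that up programmatically, assuming
>> a log4j backend (the file name and layout pattern are just examples):
>>
>>   import org.apache.log4j.FileAppender;
>>   import org.apache.log4j.Level;
>>   import org.apache.log4j.Logger;
>>   import org.apache.log4j.PatternLayout;
>>
>>   public class QueryDebugLog {
>>       // route DEBUG output of QueryImpl into its own file so it can be
>>       // tailed/grepped for slow statements
>>       public static void enable() throws java.io.IOException {
>>           Logger log = Logger.getLogger(
>>                   "org.apache.jackrabbit.core.query.QueryImpl");
>>           log.setLevel(Level.DEBUG);
>>           log.addAppender(new FileAppender(
>>                   new PatternLayout("%d %m%n"), "query-debug.log"));
>>       }
>>   }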
>>
>>> And following up on the Lucene index size question.
>>> * Is there a way to force Jackrabbit to clean up the Lucene indexes - assume
>>> we're looking to consolidate disk space for example - rather than just
>>> waiting for the nodes to merge?
>>
>> no, there's currently no such tool. however I consider this a useful
>> enhancement.
>>
>>> For example:
>>> * Is there a way to ask JackRabbit to call
>>> http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/index/IndexWriter.html#optimize()?
>>
>> no, there isn't.
>>
>>> * If we delete the "index" directory will JackRabbit happily reconstruct a
>>> consolidated index from scratch?
>>
>> yes, it will. that's currently the only way to get an index with all
>> segments optimized.
>>
>>> Some of the content in our JackRabbit repository is high volume and fairly
>>> transient, lasting only a few weeks before being deleted, hence the index
>>> question is more relevant for us.
>>
>> In general short-lived content is very well purged (not just marked
>> as deleted) from the index because the merge policy is generational.
>> The longer an item lives, the harder it gets to purge it from the
>> index. It's somewhat similar to garbage collection in Java: once an
>> object is in perm space it is more expensive to collect it.
>>
>> I currently see two options for how jackrabbit could better handle your case.
>>
>> - introduce a method that lets you trigger an index optimization (as
>> you suggested)
>> - introduce a threshold for deleted nodes to live nodes ratio where an
>> index segment is automatically optimized
>>
>> at the moment I prefer the latter because it does not require manual
>> interaction. WDYT?
>>
>> regards
>>  marcel
>>
>>> Regards,
>>> Shaun
>>>
>>>
>>> -----Original Message-----
>>> From: Alexander Klimetschek [mailto:aklimets@day.com]
>>> Sent: 08 June 2009 13:12
>>> To: users@jackrabbit.apache.org
>>> Subject: Re: Should Lucene index file size reduce when items are deleted?
>>>
>>> On Mon, Jun 8, 2009 at 1:41 PM, Shaun Barriball<sbarriba@yahoo.co.uk> wrote:
>>>> Thanks Marcel.
>>>>
>>>> From a performance and memory usage perspective, should we see the benefits
>>>> of the deletion immediately or is the Lucene performance linked to the index
>>>> file sizes (and therefore reliant on the merge happening)?
>>>
>>> Indexing structures such as the Lucene fulltext index tend to use more
>>> disk space to drastically enhance access (query) performance.
>>>
>>> space performance != processing time performance
>>>
>>> Regards,
>>> Alex
>>>
>>> --
>>> Alexander Klimetschek
>>> alexander.klimetschek@day.com
>>>
>>>
>>
>>
>
>
