lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ard Schrijvers" <a.schrijv...@onehippo.com>
Subject RE: Performance of never optimizing
Date Mon, 03 Nov 2008 10:43:10 GMT

Hello Justus, Chris and Otis,

IIRC Ocean [1] by Jason Rutherglen addresses the issue for real time
searches on large data sets. A conceptually comparable implementation is
done for Jackrabbit, where you can see an enlighting picture over here
[2]. In short: 

1) IndexReaders are opened only once and *never* reopened
(ReadOnlyIndexReader)
2) Deletions are persisted in the CommittableIndexReader and reflected
in a in memory BitSet (which is combined with the ReadOnlyIndexReader to
reflect deletions: note that deletions are thus reflected without
re-openening an index reader)
3) New documents are added to an in memory index
4) Searching is done in the CombinedIndexReader combining (1), (2) and
(3)
5) Index merging works similar to normal segment merging within one
single lucene index.

This mechanism helps you with instant reflection of changes without
having to reopen index readers.

Hope this helps.

Regards Ard

[1]
http://wiki.apache.org/lucene-java/OceanRealtimeSearch?highlight=(GData)
[2] http://jackrabbit.apache.org/index-readers.html

> 
> Hi, Justus,
> 
> I had met with very similar problems as JIRA has, which has 
> high modification and on a large data volume. It's a pretty 
> common use case for Lucene.
> 
> The way I dealt with high rate of modification is to create a 
> secondary in-memory index. And only persist documents older 
> than a period of time.
> So searching will need to combine results from two indexes. 
> It's a bit complicated when creating the index, but it's 
> worth well to save the extra IO-heavy merging and to improve 
> response time, especially the ability to search right away 
> with just added documents.
> 
> BTW: JIRA is great!
> 
> --
> Chris Lu
> -------------------------
> Instant Scalable Full-Text Search On Any Database/Application
> site: http://www.dbsight.net
> demo: http://search.dbsight.com
> Lucene Database Search in 3 minutes: 
> http://wiki.dbsight.com/index.php?title=Create_Lucene_Database
> _Search_in_3_minutes
> DBSight customer, a shopping comparison site, (anonymous per 
> request) got 2.6 Million Euro funding!
> 
> Justus Pendleton wrote:
> > Howdy,
> >
> > I have a couple of questions regarding some Lucene benchmarking and 
> > what the results mean[3]. (Skip to the numbered list at the 
> end if you 
> > don't want to read the lengthy exegesis :)
> >
> > I'm a developer for JIRA[1]. We are currently trying to get 
> a better 
> > understanding of Lucene, and our use of it, to cope with 
> the needs of 
> > our larger customers. These "large" indexes are only a 
> couple hundred 
> > thousand documents but our problem is compounded by the 
> fact that they 
> > have a relatively high rate of modification (=delete+insert of new
> > document) and our users expect these modification to show 
> up in query 
> > results pretty much instantly.
> >
> > Our current default behaviour is a merge factor of 4. We perform an 
> > optimization on the index every 4000 additions. We also perform an 
> > optimize at midnight. Our fundamental problem is that these 
> > optimizations are locking the index for unacceptably long 
> periods of 
> > time, something that we want to resolve for our next major release, 
> > hopefully without undermining search performance too badly.
> >
> > In the Lucene javadoc there is a comment, and a link to a 
> mailing list 
> > discussion[2], that suggests applications such as JIRA should never 
> > perform optimize but should instead set their merge factor very low.
> >
> > In an attempt to understand the impact of a) lowering the 
> merge factor 
> > from 4 to 2 and b) never, ever optimizing on an index (over 
> the course 
> > of years and millions of additions/updates) I wanted to try to 
> > benchmark Lucene.
> >
> > I used the contrib/benchmark framework and wrote a small algorithm 
> > that adds documents to an index (using the Reuters doc generator), 
> > does a search, does an optimize, then does another search. All the 
> > pretty pictures can be seen at:
> >
> >   http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
> >
> > I have several questions, hopefully they aren't 
> overwhelming in their 
> > quantity :-/
> >
> > 1. Why does the merge factor of 4 appear to be faster than 
> the merge 
> > factor of 2?
> >
> > 2. Why does non-optimized searching appear to be faster 
> than optimized 
> > searching once the index hits ~500,000 documents?
> >
> > 3. There appears to be a fairly sizable performance drop across the 
> > board around 450,000 documents. Why is that?
> >
> > 4. Searching performance appears to decrease towards a fairly 
> > pessimistic 20 searches per second (for a relatively simple search).
> > Is this really what we should expect long-term from Lucene?
> >
> > 5. Does my benchmark even make sense? I am far from an expert on 
> > benchmarking so it is possible I'm not measuring what I think I am 
> > measuring.
> >
> > Thanks in advance for any insight you can provide. This is an area 
> > that we very much want to understand better as Lucene is a 
> key part of 
> > JIRA's success,
> >
> > Cheers,
> > Justus
> > JIRA Developer
> >
> > [1]: http://www.atlassian.com
> > [2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895
> > [3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs
> >
> > 
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message