lucene-java-user mailing list archives

From Daniel Shane <sha...@LEXUM.UMontreal.CA>
Subject PhraseQuery Performance Issues [Lucene 2.9.0]
Date Fri, 19 Mar 2010 20:56:27 GMT
I'm running a medium-sized web search with an index just shy of 9GB holding 800,000 docs.

We are using Lucene version 2.9.0 (we have not yet checked whether this also applies to older
versions).

Looking at my logs, I'm finding that phrase queries take especially long to run. In
our index, we do not remove stopwords, so words like "the" and "is" are indexed on
purpose.

If I try a phrase search like "The The", it takes about 10 seconds in Luke to get some
results back, and a bit less on subsequent runs (7 sec).

More complete phrases that match maybe only 1 document can also take >10 secs if they have
many stopwords in them.
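
To illustrate why I suspect the stopwords, here is a minimal sketch in plain Java. It is not Lucene's implementation; the class PhraseCostSketch, its methods, and the toy documents are all made up for illustration. It builds a tiny in-memory positional index and counts how many term positions a two-term phrase match has to read. The assumption is that the cost of a phrase query grows with the number of positions stored for its terms, which is large for words like "the".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a positional index; not Lucene code.
class PhraseCostSketch {

    // Builds term -> (docId -> sorted list of token positions).
    static Map<String, Map<Integer, List<Integer>>> index(String[] docs) {
        Map<String, Map<Integer, List<Integer>>> idx = new HashMap<>();
        for (int d = 0; d < docs.length; d++) {
            String[] tokens = docs[d].toLowerCase().split("\\s+");
            for (int p = 0; p < tokens.length; p++) {
                idx.computeIfAbsent(tokens[p], k -> new HashMap<>())
                   .computeIfAbsent(d, k -> new ArrayList<>())
                   .add(p);
            }
        }
        return idx;
    }

    // Returns {matching docs, positions read} for the exact phrase "first second".
    static long[] phraseSearch(Map<String, Map<Integer, List<Integer>>> idx,
                               String first, String second) {
        Map<Integer, List<Integer>> a = idx.getOrDefault(first, Collections.emptyMap());
        Map<Integer, List<Integer>> b = idx.getOrDefault(second, Collections.emptyMap());
        long matches = 0;
        long positionsRead = 0;
        for (Map.Entry<Integer, List<Integer>> e : a.entrySet()) {
            List<Integer> pb = b.get(e.getKey());
            if (pb == null) {
                continue; // the document does not contain both terms
            }
            List<Integer> pa = e.getValue();
            // Simplification: charge for every position of both terms in this doc.
            positionsRead += pa.size() + pb.size();
            // Merge the two sorted lists, looking for a position p in pa
            // such that p + 1 appears in pb.
            int i = 0, j = 0;
            while (i < pa.size() && j < pb.size()) {
                int want = pa.get(i) + 1;
                int have = pb.get(j);
                if (have == want) {
                    matches++;
                    break;
                } else if (have < want) {
                    j++;
                } else {
                    i++;
                }
            }
        }
        return new long[] { matches, positionsRead };
    }

    public static void main(String[] args) {
        String[] docs = {
            "the cat is on the the mat",
            "the dog is the dog",
            "rare phrase here"
        };
        Map<String, Map<Integer, List<Integer>>> idx = index(docs);
        long[] common = phraseSearch(idx, "the", "the");
        long[] rare = phraseSearch(idx, "rare", "phrase");
        System.out.println("the the:     docs=" + common[0] + " positions=" + common[1]);
        System.out.println("rare phrase: docs=" + rare[0] + " positions=" + rare[1]);
    }
}
```

Both phrases match a single document in this toy data, but the stopword phrase reads several times more positions to get there. If Lucene behaves similarly, that would explain why a phrase matching only 1 document can still be slow.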

I was wondering if this is normal behavior, considering that we do not remove stopwords?


Also, on some phrase queries (not all), the difference between the first call and any subsequent
calls can be very big. For example, it could take 5 seconds to do one query and then less
than 1 second to perform it again. 

Does Lucene, by default, cache anything when a (phrase) query is made or is this simply file
system caching at work?

If this is normal behavior, I assume that the solution is either to remove stopwords from
the index or to shard it and search the shards with ParallelMultiSearcher.

What do you think?

Daniel Shane



