lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: PhraseQuery Performance Issues [Lucene 2.9.0]
Date Fri, 19 Mar 2010 23:14:06 GMT
Nutch/Solr's CommonGrams is the right way to solve this.  It combines
frequent terms (eg stopwords) with adjacent terms.  So "the wizard of
oz" will be indexed eg as the_wizard wizard_of of_oz.  It'll require a
full re-index though, and you have to fixup searching so that the same
term expansion works.

Lucene does no caching that'd speed up PhraseQuery (unless that was
the very first query against the field since you opened the reader) --
this is probably the OS's IO cache.

Mike

On Fri, Mar 19, 2010 at 3:56 PM, Daniel Shane <shaned@lexum.umontreal.ca> wrote:
> I'm running a medium size web search with a index size just shy of 9GB with 800000 docs
in it.
>
> We are suing Lucene version 2.9.0 (we have not checked yet to see if this applies to
older versions as well).
>
> By looking at my logs, I'm finding that phrase queries are especially long to perform.
In our index, we do not remove stopwords, so things like "the" and "is" are getting indexed
on purpose.
>
> If I try a phrase search like "The The" it will take about 10 seconds in Luke to get
some results back, and a bit less afterwards (7sec).
>
> More complete phrases that match maybe only 1 document can also take >10 secs if they
have many stopwords in them.
>
> I was wondering if this a normal behavior considering the fact that we do not remove
stopwords?
>
> Also, on some phrase queries (not all), the difference between the first call and any
subsequent calls can be very big. For example, it could take 5 seconds to do one query and
then less than 1 second to perform it again.
>
> Does Lucene, by default, cache anything when a (phrase) query is made or is this simply
file system caching at work?
>
> If this is a normal behavior, I assume that the solution is either to remove stopwords
from the index or shard it and ParallelMultiSearch it.
>
> What do you think?
>
> Daniel Shane
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message