lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Search speed
Date Tue, 02 Nov 2004 08:04:44 GMT
On Monday 01 November 2004 21:02, Jeff Munson wrote:
> I'm looking for tips on speeding up searches since I am a relatively new
> user of Lucene.  
> 
> I've created a single index with 4.5 million documents.  The index has
> about 22 fields and one of those fields is the contents of the body tag
> which can range from 5K to 35K.  When I create the field (named
> "contents") that houses the contents of the body tag, the field is
> stored, indexed, and tokenized.  The term position vectors are not
> stored.  
> 
> Single word searches return pretty fast, but when I try phrases,
> searching seems to slow considerably.  When constructing the query I am
> using the standard query object where analyzer is the StandardAnalyzer:
> 
> Code Example:
> Query objQuery = QueryParser.parse(sSearchString, "contents", analyzer);
> 
> For example, the following query,  contents:Zanesville, it returns over
> 163,000 hits in 78 milliseconds.  
> 
> However, if I use this query, contents:"all parts including picture tube
> guaranteed", it returns hits in 2890 millseconds.  Other phrases take
> longer as well.  
> 
> My question is, are there any indexing tips (storing term vectors?) or
> query tips that I can use to speed up the searching of phrases?

Term vectors should not influence search times for phrases.

What you're seeing is this: for each term in your query Lucene
has to walk all the documents containing the term. For a single
term there is no speed problem because the document set for the term
is stored in a compact way on disk.
For multiple terms with large document sets the disk head needs to
move between the document sets of the terms because all sets
need to be walked synchronously over the documents to compute
the document scores.
For phrases even more disk accesses are needed to access the
term positions within the documents.
Normally the disk head seeks are degrading the performance.

One way to avoid the disk head seeks is to use fewer terms in the phrases.
Another way is to avoid using the term positions by querying for words
instead of phrases.

In case you have hardware/resources there are more options
like using faster disks and/or using RAM for critical parts of the index.
Lucene can use extra RAM in various ways. To configure that one may have
to do some java coding. Profiling can guide you there.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message